upstage-document-parse

Introduction

# Upstage Document Parse

Extract structured content from documents with Upstage's Document Parse API.

## Supported Formats

PDF (up to 1000 pages in async mode), PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP, DOCX, PPTX, XLSX, HWP

## Installation

```bash
openclaw install upstage-document-parse
```

## API Key Setup

1. Get your API key from the [Upstage Console](https://console.upstage.ai).
2. Configure the API key:

```bash
openclaw config set skills.entries.upstage-document-parse.apiKey "your-api-key"
```

Or add it to `~/.openclaw/openclaw.json`:

```json5
{
  "skills": {
    "entries": {
      "upstage-document-parse": {
        "apiKey": "your-api-key"
      }
    }
  }
}
```

## Usage Examples

Just ask the agent to parse your document:

```
"Parse this PDF: ~/Documents/report.pdf"
"Parse: ~/Documents/report.jpg"
```

---

## Sync API (Small Documents)

Use for small documents (fewer than 20 pages recommended).

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | string | required | Use `document-parse` (latest) or `document-parse-nightly` |
| `document` | file | required | Document file to parse |
| `mode` | string | `standard` | `standard` (text-focused), `enhanced` (complex tables/images), `auto` |
| `ocr` | string | `auto` | `auto` (images only) or `force` (always OCR) |
| `output_formats` | string | `['html']` | `text`, `html`, `markdown` (array format) |
| `coordinates` | boolean | `true` | Include bounding-box coordinates |
| `base64_encoding` | string | `[]` | Elements to convert to base64: `["table"]`, `["figure"]`, etc. |
| `chart_recognition` | boolean | `true` | Convert charts to tables (Beta) |
| `merge_multipage_tables` | boolean | `false` | Merge tables that span pages (Beta; max 20 pages when enabled) |
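The parameters above are sent as multipart form fields. The sketch below shows one way to assemble and validate them in Python before sending. `build_parse_fields` is a hypothetical helper, not part of any Upstage SDK, and the bracketed string used for array values is an assumption that mirrors the `output_formats=['markdown']` style of the curl examples in this document.

```python
ALLOWED_MODES = {"standard", "enhanced", "auto"}
ALLOWED_OCR = {"auto", "force"}
ALLOWED_FORMATS = {"text", "html", "markdown"}


def build_parse_fields(model="document-parse", mode="standard", ocr="auto",
                       output_formats=("html",)):
    """Return form fields for a Document Parse request (hypothetical helper)."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    if ocr not in ALLOWED_OCR:
        raise ValueError(f"unsupported ocr setting: {ocr!r}")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    # Assumption: array values are encoded as a bracketed string,
    # e.g. "['markdown']", as in the curl examples.
    formats = "[" + ", ".join(f"'{name}'" for name in output_formats) + "]"
    return {"model": model, "mode": mode, "ocr": ocr, "output_formats": formats}
```

Validating locally catches a misspelled `mode` or `ocr` value before a file is uploaded.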

### Basic Parsing

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse"
```

### Extract Markdown

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "output_formats=['markdown']"
```

### Enhanced Mode for Complex Documents
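A minimal Python sketch of an enhanced-mode request: it sends the same form fields as a basic call plus `mode=enhanced`. It assumes the third-party `requests` package is installed. `enhanced_fields` and `parse_enhanced` are illustrative names, and the 300-second client timeout is a choice made here to match the 5-minute server-side limit this document mentions.

```python
API_URL = "https://api.upstage.ai/v1/document-digitization"


def enhanced_fields():
    """Form fields for an enhanced-mode request (illustrative helper)."""
    return {"model": "document-parse", "mode": "enhanced"}


def parse_enhanced(path, api_key):
    """Upload `path` in enhanced mode and return the decoded JSON response."""
    # Third-party dependency, imported here so enhanced_fields() has none.
    import requests

    with open(path, "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"document": f},
            data=enhanced_fields(),
            timeout=300,  # assumption: mirror the 5-minute server-side limit
        )
    response.raise_for_status()
    return response.json()
```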

### Force OCR for Scanned Documents

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "ocr=force"
```

### Extract Table Images as Base64
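A hedged Python sketch of how base64 table images might be requested and decoded. The `base64_encoding` form field comes from the parameter table. Where the encoded image appears in the response is an assumption: the code below expects a `base64_encoding` field on each requested element, which this document does not confirm, so inspect a real response before relying on it.

```python
import base64

# Form fields asking the API to return table elements as base64 images.
# The "['table']" string encoding follows the curl examples' array style.
BASE64_FIELDS = {"model": "document-parse", "base64_encoding": "['table']"}


def decode_table_images(response_json):
    """Return raw image bytes for every table element.

    Assumption: each requested element carries its image in a
    `base64_encoding` field of the parsed response.
    """
    images = []
    for element in response_json.get("elements", []):
        encoded = element.get("base64_encoding")
        if element.get("category") == "table" and encoded:
            images.append(base64.b64decode(encoded))
    return images
```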

---

## Response Structure

```json
{
  "api": "2.0",
  "model": "document-parse-251217",
  "content": {
    "html": "<h1>...</h1>",
    "markdown": "# ...",
    "text": "..."
  },
  "elements": [
    {
      "id": 0,
      "category": "heading1",
      "content": { "html": "...", "markdown": "...", "text": "..." },
      "page": 1,
      "coordinates": [{"x": 0.06, "y": 0.05}, ...]
    }
  ],
  "usage": { "pages": 1 }
}
```
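The `coordinates` values in the sample (`0.06`, `0.05`) look like fractions of the page size rather than pixel positions. Assuming that reading is correct, the sketch below converts an element's corner points into a pixel bounding box for a page image of known size. `bounding_box` is an illustrative helper, not an API function.

```python
def bounding_box(coordinates, page_width, page_height):
    """Convert relative corner points to a pixel box (left, top, right, bottom).

    Assumption: x and y are fractions of the page width and height.
    """
    xs = [point["x"] * page_width for point in coordinates]
    ys = [point["y"] * page_height for point in coordinates]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))
```

The resulting tuple uses the left, top, right, bottom order that most image libraries expect for cropping.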

### Element Categories

`paragraph`, `heading1`, `heading2`, `heading3`, `list`, `table`, `figure`, `chart`, `equation`, `caption`, `header`, `footer`, `index`, `footnote`
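Because every element carries a `category`, a parsed response can be filtered or regrouped without re-parsing the HTML. The following sketch groups element text by category and builds a body text that skips running headers and footers. Both function names are illustrative, and the choice to treat only `header` and `footer` as page furniture is an assumption.

```python
from collections import defaultdict

PAGE_FURNITURE = {"header", "footer"}  # assumption: categories to drop from body text


def group_by_category(response_json):
    """Map each element category to the plain text of its elements, in order."""
    grouped = defaultdict(list)
    for element in response_json.get("elements", []):
        text = element.get("content", {}).get("text", "")
        grouped[element["category"]].append(text)
    return dict(grouped)


def body_text(response_json):
    """Join element text in document order, skipping headers and footers."""
    return "\n".join(
        element.get("content", {}).get("text", "")
        for element in response_json.get("elements", [])
        if element["category"] not in PAGE_FURNITURE
    )
```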

---

## Async API (Large Documents)

Use for documents of up to 1000 pages. Documents are processed in batches of 10 pages each.
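To make the batching arithmetic concrete, this sketch computes how a document of a given length would divide into 10-page batches. It only illustrates the numbers stated above. That the service splits on consecutive page ranges in exactly this way is an assumption.

```python
BATCH_PAGES = 10   # batch size stated above
MAX_PAGES = 1000   # async page limit stated above


def batch_page_ranges(total_pages):
    """Return 1-based inclusive (first, last) page ranges, one per batch."""
    if not 1 <= total_pages <= MAX_PAGES:
        raise ValueError(f"page count must be between 1 and {MAX_PAGES}")
    return [
        (start, min(start + BATCH_PAGES - 1, total_pages))
        for start in range(1, total_pages + 1, BATCH_PAGES)
    ]
```

A 25-page document yields three batches, and a full 1000-page document yields 100.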

### Submit a Request

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization/async" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "output_formats=['markdown']"
```

Response:

```json
{"request_id": "uuid-here"}
```

### Check Status and Get Results

```bash
curl "https://api.upstage.ai/v1/document-digitization/requests/{request_id}" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY"
```

The response includes a `download_url` for each batch. Results remain available for 30 days.
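Once a request has completed, the per-batch URLs can be pulled out of the status payload. The shape of that payload is not shown in this document, so the sketch below assumes a top-level `batches` list whose items each hold a `download_url`. Treat both field names as hypothetical until checked against a real response.

```python
def download_urls(status_json):
    """Return the download URL of every batch, in batch order.

    Assumption: the status payload has a `batches` list and each batch
    has a `download_url` field once it is ready.
    """
    return [
        batch["download_url"]
        for batch in status_json.get("batches", [])
        if batch.get("download_url")
    ]
```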

### List All Requests

```bash
curl "https://api.upstage.ai/v1/document-digitization/requests" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY"
```

### Status Values

- `submitted`: request received
- `started`: processing in progress
- `completed`: ready for download
- `failed`: an error occurred (check `failure_message`)
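The status values above suggest a polling loop with two exits: return on `completed` and raise on `failed`. The sketch below adds a client-side deadline as well. `wait_for_result` is an illustrative helper. It takes the status fetcher as a callable so the loop can be exercised without network access, and the interval and timeout defaults are arbitrary choices, not API recommendations.

```python
import time


def wait_for_result(fetch_status, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the request completes, fails, or times out.

    `fetch_status` must return the decoded status payload as a dict.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        state = status["status"]
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(status.get("failure_message", "document parsing failed"))
        if clock() >= deadline:
            raise TimeoutError(f"request still {state!r} after {timeout} seconds")
        sleep(interval)  # `submitted` or `started`: wait and try again
```

In real use, `fetch_status` would wrap a GET to the `requests/{request_id}` endpoint shown earlier.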

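### Notes

- Results are stored for 30 days
- Download URLs expire after 15 minutes (fetch the status again to get fresh URLs)
- Documents are split into batches of up to 10 pages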
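Because download URLs are short-lived, a client that downloads many batches may want to check a URL's age before using it. This sketch assumes the 15-minute window starts when the status response was fetched, which the notes above do not state. The one-minute safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone

URL_LIFETIME = timedelta(minutes=15)  # expiry stated in the notes above


def needs_refresh(fetched_at, now=None, margin=timedelta(minutes=1)):
    """Return True when a URL fetched at `fetched_at` should be re-requested.

    Assumption: the lifetime is counted from the time the status was fetched.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at >= URL_LIFETIME - margin
```

When `needs_refresh` returns `True`, request the status again and use the new URLs instead of retrying the old ones.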

---

## Python Usage

```python
import time

import requests

api_key = "up_xxx"
headers = {"Authorization": f"Bearer {api_key}"}

# Sync: parse a small document and print the Markdown output
with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://api.upstage.ai/v1/document-digitization",
        headers=headers,
        files={"document": f},
        data={"model": "document-parse", "output_formats": "['markdown']"},
    )
response.raise_for_status()
print(response.json()["content"]["markdown"])

# Async for large docs: submit the file and keep the request ID
with open("large.pdf", "rb") as f:
    r = requests.post(
        "https://api.upstage.ai/v1/document-digitization/async",
        headers=headers,
        files={"document": f},
        data={"model": "document-parse"},
    )
r.raise_for_status()
request_id = r.json()["request_id"]

# Poll for results; stop on failure instead of looping forever
while True:
    status = requests.get(
        f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}",
        headers=headers,
    ).json()
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(status.get("failure_message", "parsing failed"))
    time.sleep(5)
```

## LangChain Integration

```python
from langchain_upstage import UpstageDocumentParseLoader

loader = UpstageDocumentParseLoader(
    file_path="document.pdf",
    output_format="markdown",
    ocr="auto",
)
docs = loader.load()
```

---

## Environment Variable (Alternative)

You can also set the API key as an environment variable:

```bash
export UPSTAGE_API_KEY="your-api-key"
```
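A script that calls the API directly needs to find the key in one of the places this document describes. The sketch below checks the `UPSTAGE_API_KEY` environment variable first and then falls back to `~/.openclaw/openclaw.json`. The precedence order is an assumption, since which source OpenClaw itself prefers is not stated here, and the file is read with the standard `json` module, so it only works if the config contains strict JSON without JSON5 extras such as comments. `resolve_api_key` is a hypothetical helper.

```python
import json
import os
from pathlib import Path

DEFAULT_CONFIG = Path.home() / ".openclaw" / "openclaw.json"


def resolve_api_key(environ=None, config_path=DEFAULT_CONFIG):
    """Return the API key from the environment or the config file, else None."""
    environ = os.environ if environ is None else environ
    key = environ.get("UPSTAGE_API_KEY")
    if key:
        return key
    try:
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, or content that is not strict JSON.
        return None
    entry = config.get("skills", {}).get("entries", {}).get("upstage-document-parse", {})
    return entry.get("apiKey")
```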

---

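## Tips

- Use `mode=enhanced` for complex tables, charts, and images
- Use `mode=auto` to let the API decide page by page
- Use the async API for documents longer than 20 pages
- Use `ocr=force` for scanned PDFs or images
- `merge_multipage_tables=true` merges split tables (max 20 pages in enhanced mode)
- Async API results are retained for 30 days
- Server-side timeout: 5 minutes per request (sync API)
- Standard documents take about 3 seconds to process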
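Several of the tips above amount to a small decision rule, sketched here as a function that picks the endpoint and form fields for a document. `plan_request` and its `scanned` and `complex_layout` flags are hypothetical. The caller has to supply those judgments, since nothing here inspects the file.

```python
SYNC_URL = "https://api.upstage.ai/v1/document-digitization"
ASYNC_URL = SYNC_URL + "/async"
SYNC_PAGE_LIMIT = 20  # above this page count the tips recommend the async API


def plan_request(pages, scanned=False, complex_layout=False):
    """Return (url, form_fields) following the tips above."""
    fields = {"model": "document-parse"}
    if complex_layout:
        fields["mode"] = "enhanced"  # complex tables, charts, and images
    if scanned:
        fields["ocr"] = "force"      # scanned PDFs or images
    url = ASYNC_URL if pages > SYNC_PAGE_LIMIT else SYNC_URL
    return url, fields
```

For example, a 120-page scanned report would be routed to the async endpoint with `ocr=force`, while a 3-page text PDF would use the sync endpoint with default settings.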
