# Upstage Document Parse
Extract structured content from documents with Upstage's Document Parse API.

## Supported Formats

PDF (up to 1000 pages in async mode), PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP, DOCX, PPTX, XLSX, HWP
## Installation

```bash
openclaw install upstage-document-parse
```
## API Key Setup

1. Get your API key from the [Upstage Console](https://console.upstage.ai)
2. Configure the API key:

```bash
openclaw config set skills.entries.upstage-document-parse.apiKey "your-api-key"
```

Or add it to `~/.openclaw/openclaw.json`:

```json5
{
  "skills": {
    "entries": {
      "upstage-document-parse": {
        "apiKey": "your-api-key"
      }
    }
  }
}
```
## Usage Examples

Just ask the agent to parse your document:

```
"Parse this PDF: ~/Documents/report.pdf"
"Parse: ~/Documents/report.jpg"
```
---
## Sync API (Small Documents)

Best for small documents (fewer than 20 pages recommended).

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model` | string | required | Use `document-parse` (latest) or `document-parse-nightly` |
| `document` | file | required | The document file to parse |
| `mode` | string | `standard` | `standard` (text-focused), `enhanced` (complex tables/images), `auto` |
| `ocr` | string | `auto` | `auto` (images only) or `force` (always OCR) |
| `output_formats` | string | `['html']` | `text`, `html`, `markdown` (array format) |
| `coordinates` | boolean | `true` | Include bounding-box coordinates |
| `base64_encoding` | string | `[]` | Elements to convert to base64: `["table"]`, `["figure"]`, etc. |
| `chart_recognition` | boolean | `true` | Convert charts to tables (Beta) |
| `merge_multipage_tables` | boolean | `false` | Merge tables that span pages (Beta; limited to 20 pages when enabled) |
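
The array-valued parameters in the table are sent as plain string form fields, as in the `output_formats=['markdown']` curl calls in this document. The following is a minimal sketch of a helper that assembles those fields. `build_form_data` and `_format_list` are hypothetical names, and both the single-quoted list serialization and the lowercase `true`/`false` booleans are assumptions inferred from the examples here, not documented rules.

```python
def _format_list(values):
    """Render ["html", "markdown"] as "['html', 'markdown']" (assumed wire format)."""
    return "[" + ", ".join(f"'{v}'" for v in values) + "]"


def build_form_data(model="document-parse", mode=None, ocr=None,
                    output_formats=None, base64_encoding=None,
                    coordinates=None, chart_recognition=None,
                    merge_multipage_tables=None):
    """Build the non-file form fields for a parse request.

    Options left as None are omitted so the API defaults from the table apply.
    """
    data = {"model": model}

    if mode is not None:
        if mode not in ("standard", "enhanced", "auto"):
            raise ValueError(f"unsupported mode: {mode!r}")
        data["mode"] = mode

    if ocr is not None:
        if ocr not in ("auto", "force"):
            raise ValueError(f"unsupported ocr setting: {ocr!r}")
        data["ocr"] = ocr

    if output_formats is not None:
        data["output_formats"] = _format_list(output_formats)
    if base64_encoding is not None:
        data["base64_encoding"] = _format_list(base64_encoding)

    # Booleans are sent as lowercase strings (assumption).
    for name, value in (
        ("coordinates", coordinates),
        ("chart_recognition", chart_recognition),
        ("merge_multipage_tables", merge_multipage_tables),
    ):
        if value is not None:
            data[name] = "true" if value else "false"

    return data
```

The returned dictionary could be passed as the `data=` argument of `requests.post`, alongside `files={"document": f}`.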
### Basic Parsing

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse"
```
### Extract Markdown

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "output_formats=['markdown']"
```
### Enhanced Mode for Complex Documents

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "mode=enhanced" \
  -F "output_formats=['html', 'markdown']"
```
### Force OCR for Scanned Documents

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "ocr=force"
```
### Extract Table Images as Base64

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "base64_encoding=['table']"
```
---
## Response Structure

```json
{
  "api": "2.0",
  "model": "document-parse-251217",
  "content": {
    "html": "<h1>...</h1>",
    "markdown": "# ...",
    "text": "..."
  },
  "elements": [
    {
      "id": 0,
      "category": "heading1",
      "content": { "html": "...", "markdown": "...", "text": "..." },
      "page": 1,
      "coordinates": [{"x": 0.06, "y": 0.05}, ...]
    }
  ],
  "usage": { "pages": 1 }
}
```
### Element Categories

`paragraph`, `heading1`, `heading2`, `heading3`, `list`, `table`, `figure`, `chart`, `equation`, `caption`, `header`, `footer`, `index`, `footnote`
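
Because every entry in `elements` carries a `category` and a `page`, a parsed response can be regrouped by element type. The sketch below assumes the response shape shown above; `elements_by_category` and the `sample` data are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def elements_by_category(response):
    """Group the response's elements by their `category`, keeping document order."""
    grouped = defaultdict(list)
    for element in response.get("elements", []):
        grouped[element["category"]].append(element)
    return dict(grouped)


# Hand-written sample in the shape of the response structure above.
sample = {
    "elements": [
        {"id": 0, "category": "heading1", "content": {"text": "Report"}, "page": 1},
        {"id": 1, "category": "paragraph", "content": {"text": "Summary."}, "page": 1},
        {"id": 2, "category": "table", "content": {"html": "<table></table>"}, "page": 2},
        {"id": 3, "category": "paragraph", "content": {"text": "Details."}, "page": 2},
    ]
}

groups = elements_by_category(sample)
table_pages = [element["page"] for element in groups.get("table", [])]
paragraph_texts = [element["content"]["text"] for element in groups.get("paragraph", [])]
```

Using `.get("table", [])` keeps the lookup safe for documents that contain no element of a given category.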
---
## Async API (Large Documents)

Supports documents of up to 1000 pages. Documents are processed in batches of up to 10 pages each.
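
To get a feel for how many batches a document produces, the page limits can be turned into a small calculation. This is a rough sketch: `batch_page_ranges` is a hypothetical helper, and it assumes that batches are consecutive and that every batch except the last is full, which is not stated here.

```python
MAX_ASYNC_PAGES = 1000  # async page limit for PDFs
BATCH_SIZE = 10         # maximum pages per batch


def batch_page_ranges(total_pages):
    """Return 1-based inclusive (start, end) page ranges, one per assumed batch."""
    if not 1 <= total_pages <= MAX_ASYNC_PAGES:
        raise ValueError(f"total_pages must be between 1 and {MAX_ASYNC_PAGES}")
    return [
        (start, min(start + BATCH_SIZE - 1, total_pages))
        for start in range(1, total_pages + 1, BATCH_SIZE)
    ]


print(batch_page_ranges(25))  # → [(1, 10), (11, 20), (21, 25)]
```

Under these assumptions a 1000-page document yields 100 batches, so expect up to 100 result files to download.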
### Submit a Request

```bash
curl -X POST "https://api.upstage.ai/v1/document-digitization/async" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY" \
  -F "document=@/path/to/file.pdf" \
  -F "model=document-parse" \
  -F "output_formats=['markdown']"
```

Response:

```json
{"request_id": "uuid-here"}
```
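### Check Status and Get Results

```bash
curl "https://api.upstage.ai/v1/document-digitization/requests/{request_id}" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY"
```

The response includes a `download_url` for each batch. Results are retained for 30 days, but each download URL expires after 15 minutes.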
### List All Requests

```bash
curl "https://api.upstage.ai/v1/document-digitization/requests" \
  -H "Authorization: Bearer $UPSTAGE_API_KEY"
```
### Status Values

- `submitted`: request received
- `started`: processing in progress
- `completed`: ready for download
- `failed`: an error occurred (check `failure_message`)
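
The four status values map naturally onto a polling loop: keep waiting on `submitted` and `started`, return on `completed`, and raise on `failed`. A minimal sketch follows, with a timeout added as a safeguard. `wait_for_completion` is a hypothetical helper, and the status is fetched through an injected `fetch_status` callable so the loop stays independent of any HTTP library.

```python
import time


def wait_for_completion(fetch_status, interval=5.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status()` until the request completes.

    `fetch_status` must return the parsed JSON of the status endpoint as a dict.
    Raises RuntimeError on `failed` and TimeoutError once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        state = status["status"]
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(status.get("failure_message", "request failed"))
        if state not in ("submitted", "started"):
            raise ValueError(f"unexpected status: {state!r}")
        if clock() >= deadline:
            raise TimeoutError(f"request still {state!r} after {timeout} seconds")
        sleep(interval)
```

In practice `fetch_status` could wrap a `requests.get` call against the status endpoint and return its `.json()` result. Injecting `sleep` and `clock` also makes the loop testable without real waiting.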
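### Notes

- Results are stored for 30 days
- Download URLs expire after 15 minutes (fetch the status again to get fresh URLs)
- Documents are split into batches of up to 10 pages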
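
Since download URLs are short-lived while the results themselves persist, a client can track when it last fetched the status and refresh before the URLs go stale. The sketch below is illustrative only: it assumes the status response lists batches under a `batches` key, each with a `download_url`, and that the 15-minute window starts when the status is fetched. Neither detail is confirmed here, and both helper names are hypothetical.

```python
URL_LIFETIME_SECONDS = 15 * 60  # download URLs expire after 15 minutes


def urls_need_refresh(fetched_at, now, safety_margin=60):
    """Return True when URLs fetched at `fetched_at` should be requested again.

    Both timestamps are in seconds; `safety_margin` leaves headroom so that a
    download does not start on a URL that is about to expire.
    """
    return now - fetched_at >= URL_LIFETIME_SECONDS - safety_margin


def batch_download_urls(status):
    """Collect the per-batch download URLs (assumes a `batches` list in the response)."""
    return [batch["download_url"] for batch in status.get("batches", [])]
```

A download loop could call `urls_need_refresh` before each batch and re-query the status endpoint whenever it returns `True`.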
---
## Python Usage

```python
import time

import requests

api_key = "up_xxx"  # replace with your Upstage API key
headers = {"Authorization": f"Bearer {api_key}"}

# Sync: parse a small document and print its Markdown
with open("doc.pdf", "rb") as f:
    response = requests.post(
        "https://api.upstage.ai/v1/document-digitization",
        headers=headers,
        files={"document": f},
        data={"model": "document-parse", "output_formats": "['markdown']"},
    )
response.raise_for_status()
print(response.json()["content"]["markdown"])

# Async: submit a large document
with open("large.pdf", "rb") as f:
    r = requests.post(
        "https://api.upstage.ai/v1/document-digitization/async",
        headers=headers,
        files={"document": f},
        data={"model": "document-parse"},
    )
r.raise_for_status()
request_id = r.json()["request_id"]

# Poll until the request finishes; stop on failure instead of looping forever
while True:
    status = requests.get(
        f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}",
        headers=headers,
    ).json()
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(status.get("failure_message", "document parse failed"))
    time.sleep(5)
```
## LangChain Integration

```python
from langchain_upstage import UpstageDocumentParseLoader

# Load a document as Markdown; OCR is applied automatically where needed
loader = UpstageDocumentParseLoader(
    file_path="document.pdf",
    output_format="markdown",
    ocr="auto",
)
docs = loader.load()
```
---
## Environment Variable (Alternative)

You can also provide the API key through an environment variable:

```bash
export UPSTAGE_API_KEY="your-api-key"
```
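
With the variable exported, Python code can read the key at runtime instead of hard-coding it. A minimal sketch, in which `get_api_key` and `auth_headers` are hypothetical helper names:

```python
import os


def get_api_key(env=None):
    """Read the API key from UPSTAGE_API_KEY, failing early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("UPSTAGE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("UPSTAGE_API_KEY is not set")
    return key


def auth_headers(env=None):
    """Build the Authorization header used by the API calls in this document."""
    return {"Authorization": f"Bearer {get_api_key(env)}"}
```

Passing a plain dictionary as `env` makes the helpers easy to test without touching the real environment.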
---
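## Tips

- Use `mode=enhanced` for complex tables, charts, and images
- Use `mode=auto` to let the API decide page by page
- Use the async API for documents over 20 pages
- Use `ocr=force` for scanned PDFs or images
- `merge_multipage_tables=true` merges split tables (up to 20 pages in enhanced mode)
- Async API results are retained for 30 days
- Server-side timeout: 5 minutes per request (sync API)
- Typical documents are processed in about 3 seconds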
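
Several of these tips can be combined into a small decision helper. The sketch below is one possible reading of them, not an official rule set: `plan_request` and its `scanned` and `complex_layout` flags are hypothetical, and the cutoff treats a document of exactly 20 pages as still suitable for the sync API.

```python
SYNC_URL = "https://api.upstage.ai/v1/document-digitization"
ASYNC_URL = "https://api.upstage.ai/v1/document-digitization/async"

SYNC_PAGE_LIMIT = 20     # tip: use the async API above 20 pages
ASYNC_PAGE_LIMIT = 1000  # async limit for PDFs


def plan_request(page_count, scanned=False, complex_layout=False):
    """Pick an endpoint and form fields for a document, following the tips above."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > ASYNC_PAGE_LIMIT:
        raise ValueError(f"documents over {ASYNC_PAGE_LIMIT} pages are not supported")

    url = ASYNC_URL if page_count > SYNC_PAGE_LIMIT else SYNC_URL
    data = {"model": "document-parse"}
    if complex_layout:
        data["mode"] = "enhanced"  # complex tables, charts, and images
    if scanned:
        data["ocr"] = "force"      # scanned PDFs or images
    return url, data
```

For example, `plan_request(45, scanned=True)` selects the async endpoint and adds `ocr=force`, while `plan_request(5)` stays on the sync endpoint with default settings.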