AnyCrawl-API

Introduction

# AnyCrawl Skill

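AnyCrawl API integration for OpenClaw: scrape, crawl, and search web content using high-performance multi-threaded crawling.

## Setup

### Method 1: Environment variable (recommended)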

```bash
export ANYCRAWL_API_KEY="your-api-key"
```

Add it to `~/.bashrc` or `~/.zshrc` to make it permanent:

```bash
echo 'export ANYCRAWL_API_KEY="your-api-key"' >> ~/.bashrc
source ~/.bashrc
```

Get your API key at: https://anycrawl.dev

### Method 2: OpenClaw gateway configuration

```bash
openclaw config.patch --set ANYCRAWL_API_KEY="your-api-key"
```
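If you also call the API from your own Node.js scripts, it can help to fail early when the key is missing. The sketch below is only an illustration: `requireApiKey` is a hypothetical helper, not part of AnyCrawl or OpenClaw, and it assumes the key was exported as in Method 1.

```javascript
// Hypothetical helper (not part of AnyCrawl or OpenClaw):
// reads the key exported in Method 1 and fails fast if it is absent.
function requireApiKey(env = process.env) {
  const key = (env.ANYCRAWL_API_KEY || "").trim();
  if (!key) {
    throw new Error("ANYCRAWL_API_KEY is not set; export it or configure the OpenClaw gateway");
  }
  return key;
}
```

Passing `env` as a parameter keeps the helper easy to exercise without touching the real environment.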

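## Functions

### 1. anycrawl_scrape

Scrape a single URL and convert it into LLM-ready structured data.

**Parameters:**

- `url` (string, required): URL to scrape
- `engine` (string, optional): scraping engine - `"cheerio"` (default), `"playwright"`, `"puppeteer"`
- `formats` (array, optional): output formats - `["markdown"]`, `["html"]`, `["text"]`, `["json"]`, `["screenshot"]`
- `timeout` (number, optional): timeout in milliseconds (default: 30000)
- `wait_for` (number, optional): delay before extraction, in milliseconds (browser engines only)
- `wait_for_selector` (string/object/array, optional): CSS selector to wait for
- `include_tags` (array, optional): include only these HTML tags (e.g. `["h1", "p", "article"]`)
- `exclude_tags` (array, optional): exclude these HTML tags
- `proxy` (string, optional): proxy URL (e.g. `"http://proxy:port"`)
- `json_options` (object, optional): JSON extraction using a schema/prompt
- `extract_source` (string, optional): `"markdown"` (default) or `"html"`

**Examples:**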

```javascript
// Basic scrape with default cheerio
anycrawl_scrape({ url: "https://example.com" })

// Scrape SPA with Playwright
anycrawl_scrape({
  url: "https://spa-example.com",
  engine: "playwright",
  formats: ["markdown", "screenshot"]
})

// Extract structured JSON
anycrawl_scrape({
  url: "https://product-page.com",
  engine: "cheerio",
  json_options: {
    schema: {
      type: "object",
      properties: {
        product_name: { type: "string" },
        price: { type: "number" },
        description: { type: "string" }
      },
      required: ["product_name", "price"]
    },
    user_prompt: "Extract product details from this page"
  }
})
```
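Because several scrape parameters accept only a fixed set of values, a small pre-flight check can catch typos before a request is sent. This is a minimal sketch, not the service's own validation: `validateScrapeParams` is a hypothetical name, and the allowed values are copied from the parameter list above.

```javascript
// Allowed values as listed in the anycrawl_scrape parameter list.
const ENGINES = ["cheerio", "playwright", "puppeteer"];
const FORMATS = ["markdown", "html", "text", "json", "screenshot"];

// Hypothetical client-side check; returns a list of human-readable problems.
function validateScrapeParams(params) {
  const problems = [];
  const engine = params.engine ?? "cheerio"; // documented default
  if (typeof params.url !== "string" || params.url.length === 0) {
    problems.push("url is required");
  }
  if (!ENGINES.includes(engine)) {
    problems.push(`unknown engine: ${engine}`);
  }
  for (const format of params.formats ?? []) {
    if (!FORMATS.includes(format)) problems.push(`unknown format: ${format}`);
  }
  // wait_for is documented as browser-engine only, so it has no effect with cheerio.
  if (params.wait_for !== undefined && engine === "cheerio") {
    problems.push("wait_for only applies to browser engines");
  }
  return problems;
}
```

An empty array means the parameters look consistent with the documented options; the server may still apply further checks.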

### 2. anycrawl_search

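Search Google and return structured results.

**Parameters:**

- `query` (string, required): search query
- `engine` (string, optional): search engine - `"google"` (default)
- `limit` (number, optional): maximum results per page (default: 10)
- `offset` (number, optional): number of results to skip (default: 0)
- `pages` (number, optional): number of pages to fetch (default: 1, max: 20)
- `lang` (string, optional): language locale (e.g. `"en"`, `"zh"`, `"vi"`)
- `safe_search` (number, optional): 0 (off), 1 (moderate), 2 (strict)
- `scrape_options` (object, optional): scrape each result URL with these options

**Examples:**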

```javascript
// Basic search
anycrawl_search({ query: "OpenAI ChatGPT" })

// Multi-page search in Vietnamese
anycrawl_search({
  query: "hướng dẫn Node.js",
  pages: 3,
  lang: "vi"
})

// Search and auto-scrape results
anycrawl_search({
  query: "best AI tools 2026",
  limit: 5,
  scrape_options: {
    engine: "cheerio",
    formats: ["markdown"]
  }
})
```
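Since `limit` is a per-page cap and `pages` sets how many pages are fetched, the upper bound on returned results is their product. The helper below is a hypothetical illustration of that arithmetic; it assumes every page can return up to `limit` results, and the real count may be lower.

```javascript
// Hypothetical helper: upper bound on results for a search call,
// using the documented defaults (limit 10, pages 1) and the pages cap of 20.
function maxSearchResults({ limit = 10, pages = 1 } = {}) {
  if (!Number.isInteger(pages) || pages < 1 || pages > 20) {
    throw new RangeError("pages must be an integer from 1 to 20");
  }
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  return limit * pages;
}
```

For the multi-page example above (`pages: 3` with the default `limit`), this gives an upper bound of 30 results.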

### 3. anycrawl_crawl_start

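Start crawling an entire website (asynchronous job).

**Parameters:**

- `url` (string, required): seed URL to start crawling from
- `engine` (string, optional): `"cheerio"` (default), `"playwright"`, `"puppeteer"`
- `strategy` (string, optional): `"all"`, `"same-domain"` (default), `"same-hostname"`, `"same-origin"`
- `max_depth` (number, optional): maximum depth from the seed URL (default: 10)
- `limit` (number, optional): maximum number of pages to crawl (default: 100)
- `include_paths` (array, optional): path patterns to include (e.g. `["/blog/*"]`)
- `exclude_paths` (array, optional): path patterns to exclude (e.g. `["/admin/*"]`)
- `scrape_paths` (array, optional): scrape only URLs matching these patterns
- `scrape_options` (object, optional): per-page scrape options

**Examples:**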

```javascript
// Crawl entire website
anycrawl_crawl_start({
  url: "https://docs.example.com",
  engine: "cheerio",
  max_depth: 5,
  limit: 50
})

// Crawl only blog posts
anycrawl_crawl_start({
  url: "https://example.com",
  strategy: "same-domain",
  include_paths: ["/blog/*"],
  exclude_paths: ["/blog/tags/*"],
  scrape_options: { formats: ["markdown"] }
})

// Crawl product pages only
anycrawl_crawl_start({
  url: "https://shop.example.com",
  strategy: "same-domain",
  scrape_paths: ["/products/*"],
  limit: 200
})
```
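To reason about which pages a combination of `include_paths` and `exclude_paths` would keep, a local approximation can be useful. The sketch below is an assumption, not the crawler's documented matching rules: it treats `*` as "any run of characters", lets exclusions win over inclusions, and uses hypothetical helper names.

```javascript
// Assumption: "*" matches any run of characters; everything else is literal.
function patternToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

function matchesAny(path, patterns) {
  return patterns.some((pattern) => patternToRegExp(pattern).test(path));
}

// Hypothetical local filter: exclusions take priority, and an empty
// include list means "include everything not excluded".
function shouldVisit(path, { include_paths = [], exclude_paths = [] } = {}) {
  if (matchesAny(path, exclude_paths)) return false;
  return include_paths.length === 0 || matchesAny(path, include_paths);
}
```

With the blog example above, `/blog/post-1` would be kept while `/blog/tags/js` and `/about` would be skipped under these assumptions.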

### 4. anycrawl_crawl_status

Check the status of a crawl job.

**Parameters:**

- `job_id` (string, required): crawl job ID

**Example:**

```javascript
anycrawl_crawl_status({ job_id: "7a2e165d-8f81-4be6-9ef7-23222330a396" })
```

### 5. anycrawl_crawl_results

Get crawl results (paginated).

**Parameters:**

- `job_id` (string, required): crawl job ID
- `skip` (number, optional): number of results to skip (default: 0)

**Examples:**

```javascript
// Get first 100 results
anycrawl_crawl_results({ job_id: "xxx", skip: 0 })

// Get next 100 results
anycrawl_crawl_results({ job_id: "xxx", skip: 100 })
```
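The `skip` parameter lends itself to a simple paging loop. The following is a sketch under stated assumptions: `fetchPage` is a hypothetical function you supply that resolves to an array of results for a given `{ job_id, skip }`, and the page size of 100 is inferred from the examples above rather than guaranteed by the API.

```javascript
// Hypothetical paging loop. fetchPage({ job_id, skip }) is caller-supplied
// and is assumed to resolve to an array of results.
async function collectCrawlResults(fetchPage, jobId, pageSize = 100) {
  const all = [];
  for (let skip = 0; ; skip += pageSize) {
    const page = await fetchPage({ job_id: jobId, skip });
    all.push(...page);
    // A short (or empty) page is taken to mean there is nothing left.
    if (page.length < pageSize) break;
  }
  return all;
}
```

Injecting `fetchPage` keeps the loop independent of how results are actually retrieved, so it can be tried against a stub before wiring in real calls.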

### 6. anycrawl_crawl_cancel

Cancel a running crawl job.

**Parameters:**

- `job_id` (string, required): crawl job ID

### 7. anycrawl_search_and_scrape

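Quick helper: search Google, then scrape the top results.

**Parameters:**

- `query` (string, required): search query
- `max_results` (number, optional): maximum number of results to scrape (default: 3)
- `scrape_engine` (string, optional): scraping engine (default: `"cheerio"`)
- `formats` (array, optional): output formats (default: `["markdown"]`)
- `lang` (string, optional): search language

**Example:**

```javascript
anycrawl_search_and_scrape({
  query: "latest AI news",
  max_results: 5,
  formats: ["markdown"]
})
```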

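## Engine Selection Guide

| Engine | Best for | Speed | JS rendering |
|--------|----------|-------|--------------|
| `cheerio` | Static HTML, news, blogs | ⚡ Fastest | ❌ No |
| `playwright` | Single-page applications (SPAs), complex web apps | 🐢 Slower | ✅ Yes |
| `puppeteer` | Chrome-specific features, metrics | 🐢 Slower | ✅ Yes |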
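The guide above can be read as a small decision rule. The function below is one possible encoding of it, not an official recommendation: `pickEngine` and its option names are hypothetical.

```javascript
// Hypothetical encoding of the engine guide:
// no JS rendering needed -> cheerio (fastest);
// Chrome-specific features or metrics -> puppeteer;
// otherwise a JS-rendered page -> playwright.
function pickEngine({ needsJavaScript = false, needsChromeFeatures = false } = {}) {
  if (needsChromeFeatures) return "puppeteer";
  if (needsJavaScript) return "playwright";
  return "cheerio";
}
```

Starting with `cheerio` and switching to a browser engine only when content is missing keeps crawls fast for static sites.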

## Response Format

All responses follow this structure:

```json
{
  "success": true,
  "data": { ... },
  "message": "Optional message"
}
```

Error responses:

```json
{
  "success": false,
  "error": "Error type",
  "message": "Human-readable message"
}
```
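Given the two shapes above, callers can funnel every response through one unwrapping step. This is a minimal sketch: `unwrapResponse` is a hypothetical helper, and it relies only on the `success`, `data`, `error`, and `message` fields shown above.

```javascript
// Hypothetical helper: returns `data` on success, throws on failure.
function unwrapResponse(response) {
  if (response && response.success === true) {
    return response.data;
  }
  const type = response?.error ?? "Unknown error";
  const message = response?.message ?? "no message provided";
  const err = new Error(`${type}: ${message}`);
  err.type = type; // keep the machine-readable error type for callers
  throw err;
}
```

Throwing on `success: false` means the rest of the calling code only ever handles the `data` payload.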

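## Common Error Codes

- `400` - Bad request (validation error)
- `401` - Unauthorized (invalid API key)
- `402` - Payment required (insufficient credits)
- `404` - Not found
- `429` - Rate limit exceeded
- `500` - Internal server error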
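A lookup table makes these codes easier to act on in client code. In the sketch below the meanings come from the list above, while the retry classification (treating `429` and `500` as worth retrying) is an assumption of this example, not something the API documents here; `describeStatus` is a hypothetical name.

```javascript
// Meanings taken from the error code list above.
const ERROR_MEANINGS = {
  400: "Bad request (validation error)",
  401: "Unauthorized (invalid API key)",
  402: "Payment required (insufficient credits)",
  404: "Not found",
  429: "Rate limit exceeded",
  500: "Internal server error",
};

// Assumption: only rate-limit and server errors are worth retrying.
const RETRYABLE = new Set([429, 500]);

// Hypothetical helper that summarizes how to treat a status code.
function describeStatus(status) {
  return {
    status,
    meaning: ERROR_MEANINGS[status] ?? "Unlisted status",
    retryable: RETRYABLE.has(status),
  };
}
```

Client errors such as `400`, `401`, and `402` are left non-retryable because repeating the same request would not change the outcome.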

## API Limits

- Rate limits apply according to your plan
- Crawl jobs expire after 24 hours
- Maximum crawl limit: depends on your credits

## Links

- API documentation: https://docs.anycrawl.dev
- Website: https://anycrawl.dev
- Playground: https://anycrawl.dev/playground
