ClawSkills logoClawSkills

AnyCrawl-API

Perform high-performance web scraping, crawling, and Google search with multi-engine support and structured data extraction via AnyCrawl API.

Introduction

# AnyCrawl Skill

AnyCrawl API integration for OpenClaw - Scrape, Crawl, and Search web content with high-performance multi-threaded crawling.

## Setup

### Method 1: Environment variable (Recommended)

```bash export ANYCRAWL_API_KEY="your-api-key" ```

Make it permanent by adding to `~/.bashrc` or `~/.zshrc`: ```bash echo 'export ANYCRAWL_API_KEY="your-api-key"' >> ~/.bashrc source ~/.bashrc ```

Get your API key at: https://anycrawl.dev

### Method 2: OpenClaw gateway config

```bash openclaw config.patch --set ANYCRAWL_API_KEY="your-api-key" ```

## Functions

### 1. anycrawl_scrape

Scrape a single URL and convert to LLM-ready structured data.

**Parameters:** - `url` (string, required): URL to scrape - `engine` (string, optional): Scraping engine - `"cheerio"` (default), `"playwright"`, `"puppeteer"` - `formats` (array, optional): Output formats - `["markdown"]`, `["html"]`, `["text"]`, `["json"]`, `["screenshot"]` - `timeout` (number, optional): Timeout in milliseconds (default: 30000) - `wait_for` (number, optional): Delay before extraction in ms (browser engines only) - `wait_for_selector` (string/object/array, optional): Wait for CSS selectors - `include_tags` (array, optional): Include only these HTML tags (e.g., `["h1", "p", "article"]`) - `exclude_tags` (array, optional): Exclude these HTML tags - `proxy` (string, optional): Proxy URL (e.g., `"http://proxy:port"`) - `json_options` (object, optional): JSON extraction with schema/prompt - `extract_source` (string, optional): `"markdown"` (default) or `"html"`

**Examples:**

```javascript // Basic scrape with default cheerio anycrawl_scrape({ url: "https://example.com" })

// Scrape SPA with Playwright anycrawl_scrape({ url: "https://spa-example.com", engine: "playwright", formats: ["markdown", "screenshot"] })

// Extract structured JSON anycrawl_scrape({ url: "https://product-page.com", engine: "cheerio", json_options: { schema: { type: "object", properties: { product_name: { type: "string" }, price: { type: "number" }, description: { type: "string" } }, required: ["product_name", "price"] }, user_prompt: "Extract product details from this page" } }) ```

### 2. anycrawl_search

Search Google and return structured results.

**Parameters:** - `query` (string, required): Search query - `engine` (string, optional): Search engine - `"google"` (default) - `limit` (number, optional): Max results per page (default: 10) - `offset` (number, optional): Number of results to skip (default: 0) - `pages` (number, optional): Number of pages to retrieve (default: 1, max: 20) - `lang` (string, optional): Language locale (e.g., `"en"`, `"zh"`, `"vi"`) - `safe_search` (number, optional): 0 (off), 1 (medium), 2 (high) - `scrape_options` (object, optional): Scrape each result URL with these options

**Examples:**

```javascript // Basic search anycrawl_search({ query: "OpenAI ChatGPT" })

// Multi-page search in Vietnamese anycrawl_search({ query: "hướng dẫn Node.js", pages: 3, lang: "vi" })

// Search and auto-scrape results anycrawl_search({ query: "best AI tools 2026", limit: 5, scrape_options: { engine: "cheerio", formats: ["markdown"] } }) ```

### 3. anycrawl_crawl_start

Start crawling an entire website (async job).

**Parameters:** - `url` (string, required): Seed URL to start crawling - `engine` (string, optional): `"cheerio"` (default), `"playwright"`, `"puppeteer"` - `strategy` (string, optional): `"all"`, `"same-domain"` (default), `"same-hostname"`, `"same-origin"` - `max_depth` (number, optional): Max depth from seed URL (default: 10) - `limit` (number, optional): Max pages to crawl (default: 100) - `include_paths` (array, optional): Path patterns to include (e.g., `["/blog/*"]`) - `exclude_paths` (array, optional): Path patterns to exclude (e.g., `["/admin/*"]`) - `scrape_paths` (array, optional): Only scrape URLs matching these patterns - `scrape_options` (object, optional): Per-page scrape options

**Examples:**

```javascript // Crawl entire website anycrawl_crawl_start({ url: "https://docs.example.com", engine: "cheerio", max_depth: 5, limit: 50 })

// Crawl only blog posts anycrawl_crawl_start({ url: "https://example.com", strategy: "same-domain", include_paths: ["/blog/*"], exclude_paths: ["/blog/tags/*"], scrape_options: { formats: ["markdown"] } })

// Crawl product pages only anycrawl_crawl_start({ url: "https://shop.example.com", strategy: "same-domain", scrape_paths: ["/products/*"], limit: 200 }) ```

### 4. anycrawl_crawl_status

Check crawl job status.

**Parameters:** - `job_id` (string, required): Crawl job ID

**Example:** ```javascript anycrawl_crawl_status({ job_id: "7a2e165d-8f81-4be6-9ef7-23222330a396" }) ```

### 5. anycrawl_crawl_results

Get crawl results (paginated).

**Parameters:** - `job_id` (string, required): Crawl job ID - `skip` (number, optional): Number of results to skip (default: 0)

**Example:** ```javascript // Get first 100 results anycrawl_crawl_results({ job_id: "xxx", skip: 0 })

// Get next 100 results anycrawl_crawl_results({ job_id: "xxx", skip: 100 }) ```

### 6. anycrawl_crawl_cancel

Cancel a running crawl job.

**Parameters:** - `job_id` (string, required): Crawl job ID

### 7. anycrawl_search_and_scrape

Quick helper: Search Google then scrape top results.

**Parameters:** - `query` (string, required): Search query - `max_results` (number, optional): Max results to scrape (default: 3) - `scrape_engine` (string, optional): Engine for scraping (default: `"cheerio"`) - `formats` (array, optional): Output formats (default: `["markdown"]`) - `lang` (string, optional): Search language

**Example:** ```javascript anycrawl_search_and_scrape({ query: "latest AI news", max_results: 5, formats: ["markdown"] }) ```

## Engine Selection Guide

| Engine | Best For | Speed | JS Rendering | |--------|----------|-------|--------------| | `cheerio` | Static HTML, news, blogs | ⚡ Fastest | ❌ No | | `playwright` | SPAs, complex web apps | 🐢 Slower | ✅ Yes | | `puppeteer` | Chrome-specific, metrics | 🐢 Slower | ✅ Yes |

## Response Format

All responses follow this structure:

```json { "success": true, "data": { ... }, "message": "Optional message" } ```

Error response: ```json { "success": false, "error": "Error type", "message": "Human-readable message" } ```

## Common Error Codes

- `400` - Bad Request (validation errors) - `401` - Unauthorized (invalid API key) - `402` - Payment Required (insufficient credits) - `404` - Not Found - `429` - Rate limit exceeded - `500` - Internal server error

## API Limits

- Rate limits apply based on your plan - Crawl jobs expire after 24 hours - Max crawl limit: depends on credits

## Links

- API Docs: https://docs.anycrawl.dev - Website: https://anycrawl.dev - Playground: https://anycrawl.dev/playground

More Products