ClawSkills logoClawSkills

Firecrawl Skills

用于网页抓取、爬取和搜索的 Firecrawl CLI。抓取单个页面或整个网站,映射站点 URL,并完全提取内容进行网络搜索。Re

介绍

# Firecrawl CLI

使用 `firecrawl` CLI 来抓取和搜索网络。Firecrawl 返回针对 LLM 上下文窗口优化的干净 Markdown,处理 JavaScript 渲染,绕过常见拦截,并提供结构化数据。

## 安装

检查状态、身份验证和速率限制:

```bash firecrawl --status ```

就绪后的输出:

``` 🔥 firecrawl cli v1.0.2

● Authenticated via FIRECRAWL_API_KEY Concurrency: 0/100 jobs (parallel scrape limit) Credits: 500,000 remaining ```

- **Concurrency(并发数)**:最大并行任务数。运行接近但不高于此限制的并行操作。 - **Credits(额度)**:剩余 API 额度。每次抓取/爬取都会消耗额度。

如果未安装:`npm install -g firecrawl-cli`

如果用户未登录,请务必参阅 [rules/install.md](rules/install.md) 中的安装规则以获取更多信息。

## 身份验证

如果未通过身份验证,请运行:

```bash firecrawl login --browser ```

`--browser` 标志会自动打开浏览器进行身份验证,而无需提示。

## 组织结构

在工作目录中创建一个 `.firecrawl/` 文件夹(如果尚不存在)以存储结果。如果 `.gitignore` 文件中还没有 `.firecrawl/`,请将其添加进去。始终使用 `-o` 直接写入文件(避免充斥上下文):

```bash # Search the web (most common operation) firecrawl search "your query" -o .firecrawl/search-{query}.json

# Search with scraping enabled firecrawl search "your query" --scrape -o .firecrawl/search-{query}-scraped.json

# Scrape a page firecrawl scrape https://example.com -o .firecrawl/{site}-{path}.md ```

示例:

``` .firecrawl/search-react_server_components.json .firecrawl/search-ai_news-scraped.json .firecrawl/docs.github.com-actions-overview.md .firecrawl/firecrawl.dev.md ```

## 命令

### Search - 带有可选抓取的网络搜索

```bash # Basic search (human-readable output) firecrawl search "your query" -o .firecrawl/search-query.txt

# JSON output (recommended for parsing) firecrawl search "your query" -o .firecrawl/search-query.json --json

# Limit results firecrawl search "AI news" --limit 10 -o .firecrawl/search-ai-news.json --json

# Search specific sources firecrawl search "tech startups" --sources news -o .firecrawl/search-news.json --json firecrawl search "landscapes" --sources images -o .firecrawl/search-images.json --json firecrawl search "machine learning" --sources web,news,images -o .firecrawl/search-ml.json --json

# Filter by category (GitHub repos, research papers, PDFs) firecrawl search "web scraping python" --categories github -o .firecrawl/search-github.json --json firecrawl search "transformer architecture" --categories research -o .firecrawl/search-research.json --json

# Time-based search firecrawl search "AI announcements" --tbs qdr:d -o .firecrawl/search-today.json --json # Past day firecrawl search "tech news" --tbs qdr:w -o .firecrawl/search-week.json --json # Past week

# Location-based search firecrawl search "restaurants" --location "San Francisco,California,United States" -o .firecrawl/search-sf.json --json firecrawl search "local news" --country DE -o .firecrawl/search-germany.json --json

# Search AND scrape content from results firecrawl search "firecrawl tutorials" --scrape -o .firecrawl/search-scraped.json --json firecrawl search "API docs" --scrape --scrape-formats markdown,links -o .firecrawl/search-docs.json --json ```

**Search 选项:**

| 选项 | 描述 | |--------|-------------| | `--limit <n>` | 最大结果数(默认:5,最大:100) | | `--sources <sources>` | 逗号分隔:web、images、news(默认:web) | | `--categories <categories>` | 逗号分隔:github、research、pdf | | `--tbs <value>` | 时间过滤器:qdr:h(小时)、qdr:d(天)、qdr:w(周)、qdr:m(月)、qdr:y(年) | | `--location <location>` | 地理位置定位(例如,“Germany”) | | `--country <code>` | ISO 国家代码(默认:US) | | `--scrape` | 启用搜索结果抓取 | | `--scrape-formats <formats>` | 启用 --scrape 时的抓取格式(默认:markdown) | | `-o, --output <path>` | 保存到文件 |

### Scrape - 单页内容提取

```bash # Basic scrape (markdown output) firecrawl scrape https://example.com -o .firecrawl/example.md

# Get raw HTML firecrawl scrape https://example.com --html -o .firecrawl/example.html

# Multiple formats (JSON output) firecrawl scrape https://example.com --format markdown,links -o .firecrawl/example.json

# Main content only (removes nav, footer, ads) firecrawl scrape https://example.com --only-main-content -o .firecrawl/example.md

# Wait for JS to render firecrawl scrape https://spa-app.com --wait-for 3000 -o .firecrawl/spa.md

# Extract links only firecrawl scrape https://example.com --format links -o .firecrawl/links.json

# Include/exclude specific HTML tags firecrawl scrape https://example.com --include-tags article,main -o .firecrawl/article.md firecrawl scrape https://example.com --exclude-tags nav,aside,.ad -o .firecrawl/clean.md ```

**Scrape 选项:**

| 选项 | 描述 | |--------|-------------| | `-f, --format <formats>` | 输出格式:markdown、html、rawHtml、links、screenshot、json | | `-H, --html` | `--format html` 的快捷方式 | | `--only-main-content` | 仅提取主要内容 | | `--wait-for <ms>` | 抓取前等待(用于 JS 内容) | | `--include-tags <tags>` | 仅包含特定的 HTML 标签 | | `--exclude-tags <tags>` | 排除特定的 HTML 标签 | | `-o, --output <path>` | 保存到文件 |

### Crawl - 爬取整个网站

```bash # Start a crawl (returns job ID) firecrawl crawl https://example.com

# Wait for crawl to complete firecrawl crawl https://example.com --wait

# With progress indicator firecrawl crawl https://example.com --wait --progress

# Check crawl status firecrawl crawl <job-id>

# Limit pages firecrawl crawl https://example.com --limit 100 --max-depth 3

# Crawl blog section only firecrawl crawl https://example.com --include-paths /blog,/posts

# Exclude admin pages firecrawl crawl https://example.com --exclude-paths /admin,/login

# Crawl with rate limiting firecrawl crawl https://example.com --delay 1000 --max-concurrency 2

# Save results firecrawl crawl https://example.com --wait -o crawl-results.json --pretty ```

**Crawl 选项:**

| 选项 | 描述 | |--------|-------------| | `--wait` | 等待爬取完成 | | `--progress` | 等待时显示进度 | | `--limit <n>` | 最大爬取页数 | | `--max-depth <n>` | 最大爬取深度 | | `--include-paths <paths>` | 仅爬取匹配的路径 | | `--exclude-paths <paths>` | 跳过匹配的路径 | | `--delay <ms>` | 请求之间的延迟 | | `--max-concurrency <n>` | 最大并发请求数 |

### Map - 发现网站上的所有 URL

```bash # List all URLs (one per line) firecrawl map https://example.com -o .firecrawl/urls.txt

# Output as JSON firecrawl map https://example.com --json -o .firecrawl/urls.json

# Search for specific URLs firecrawl map https://example.com --search "blog" -o .firecrawl/blog-urls.txt

# Limit results firecrawl map https://example.com --limit 500 -o .firecrawl/urls.txt

# Include subdomains firecrawl map https://example.com --include-subdomains -o .firecrawl/all-urls.txt ```

**Map 选项:**

| 选项 | 描述 | |--------|-------------| | `--limit <n>` | 要发现的最大 URL 数 | | `--search <query>` | 按搜索查询过滤 URL | | `--sitemap <mode>` | include、skip 或 only | | `--include-subdomains` | 包含子域名 | | `--json` | 输出为 JSON | | `-o, --output <path>` | 保存到文件 |

### Credit Usage(额度使用情况)

```bash # Show credit usage firecrawl credit-usage

# Output as JSON firecrawl credit-usage --json --pretty ```

## 读取抓取的文件

除非明确要求,否则切勿一次性读取整个 firecrawl 输出文件——它们可能超过 1000 行。相反,请使用 grep、head 或增量读取:

```bash # Check file size and preview structure wc -l .firecrawl/file.md && head -50 .firecrawl/file.md

# Use grep to find specific content grep -n "keyword" .firecrawl/file.md grep -A 10 "## Section" .firecrawl/file.md ```

## 并行化

使用 `&` 和 `wait` 并行运行多个抓取任务:

```bash # Parallel scraping (fast) firecrawl scrape https://site1.com -o .firecrawl/1.md & firecrawl scrape https://site2.com -o .firecrawl/2.md & firecrawl scrape https://site3.com -o .firecrawl/3.md & wait ```

对于大量 URL,请使用带有 `-P` 参数的 xargs 进行并行执行:

```bash cat urls.txt | xargs -P 10 -I {} sh -c 'firecrawl scrape "{}" -o ".firecrawl/$(echo {} | md5).md"' ```

## 与其他工具结合使用

```bash # Extract URLs from search results jq -r '.data.web[].url' .firecrawl/search-query.json

# Get titles from search results jq -r '.data.web[] | "\(.title): \(.url)"' .firecrawl/search-query.json

# Count URLs from map firecrawl map https://example.com | wc -l ```

更多产品