Welcome to V1

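Firecrawl V1 is here! It introduces a more reliable and more developer-friendly API. Here is what's new:

Output formats for /scrape. Choose the formats you want your output in.
New /map endpoint for getting most of the URLs of a website.
Developer-friendly /crawl/{id} status API.
2x rate limits for all plans.
Go SDK and Rust SDK.
Teams support.
API key management in the dashboard.
onlyMainContent now defaults to true.
Webhook and WebSocket support for /crawl.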

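Scrape Formats

You can now choose the formats you want your output in, and you can request several at once. Supported formats are:

Markdown (markdown)
HTML (html)
Raw HTML (rawHtml) (with no modifications)
Screenshot (screenshot or screenshot@fullPage)
Links (links)

The output keys match the formats you choose.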

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)

Response

SDKs return the data object directly. cURL returns the full payload, as shown below.

{
  "success": true,
  "data" : {
    "markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
    "html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev",
      "statusCode": 200
    }
  }
}
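
Because the output keys match the requested formats, a raw cURL payload can be unpacked by looping over those formats. The following is a minimal sketch using only the standard library; the `unpack_scrape_payload` helper and the shortened sample payload are illustrative assumptions, not part of the SDK.

```python
import json


def unpack_scrape_payload(raw: str, formats: list[str]) -> dict:
    """Return {format: content} for each requested format found in the payload.

    Hypothetical helper: assumes the full cURL payload shape shown above,
    i.e. {"success": ..., "data": {...}}.
    """
    payload = json.loads(raw)
    if not payload.get("success"):
        raise ValueError("scrape did not succeed")
    data = payload["data"]
    # Output keys match the requested formats, e.g. "markdown" -> data["markdown"].
    return {fmt: data[fmt] for fmt in formats if fmt in data}


# Shortened sample payload, modelled on the response above.
sample = json.dumps({
    "success": True,
    "data": {
        "markdown": "Launch Week I is here!",
        "html": "<!DOCTYPE html><html lang=\"en\">",
        "metadata": {"title": "Home - Firecrawl", "statusCode": 200},
    },
})

result = unpack_scrape_payload(sample, ["markdown", "html"])
print(sorted(result))
```

Formats that were not requested, or that the payload does not contain, are simply left out of the returned dictionary.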

Introducing /map (Alpha)

The easiest way to go from a single URL to a map of the entire website.

Usage

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Map a website:
map_result = app.map_url('https://firecrawl.dev')
print(map_result)

Response

SDKs return the data object directly. cURL returns the full payload, as shown below.

{
  "status": "success",
  "links": [
    "https://firecrawl.dev",
    "https://www.firecrawl.dev/pricing",
    "https://www.firecrawl.dev/blog",
    "https://www.firecrawl.dev/playground",
    "https://www.firecrawl.dev/smart-crawl",
    ...
  ]
}
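
The links list can be post-processed with the standard library, for example to keep only one section of the site. This is a minimal sketch; `links_under` is a hypothetical helper, and the sample list copies the links shown in the response above.

```python
from urllib.parse import urlparse


def links_under(links: list[str], prefix: str) -> list[str]:
    """Keep links whose URL path starts with the given prefix (hypothetical helper)."""
    return [link for link in links if urlparse(link).path.startswith(prefix)]


# Sample links taken from the /map response above.
map_links = [
    "https://firecrawl.dev",
    "https://www.firecrawl.dev/pricing",
    "https://www.firecrawl.dev/blog",
    "https://www.firecrawl.dev/playground",
    "https://www.firecrawl.dev/smart-crawl",
]

blog_links = links_under(map_links, "/blog")
print(blog_links)
```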

WebSockets

To crawl a website with WebSockets, use the Crawl URL and Watch method.

import nest_asyncio
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Inside an async function...
nest_asyncio.apply()

# Define event handlers
def on_document(detail):
    print("DOC", detail)

def on_error(detail):
    print("ERR", detail['error'])

def on_done(detail):
    print("DONE", detail['status'])

# Function to start the crawl and watch process
async def start_crawl_and_watch():
    # Initiate the crawl job and get the watcher
    watcher = app.crawl_url_and_watch('firecrawl.dev', { 'excludePaths': ['blog/*'], 'limit': 5 })

    # Add event listeners
    watcher.add_event_listener("document", on_document)
    watcher.add_event_listener("error", on_error)
    watcher.add_event_listener("done", on_done)

    # Start the watcher
    await watcher.connect()

# Run the event loop
await start_crawl_and_watch()
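
Printing is enough for a demo, but handlers often need to keep state. The sketch below shows one way to collect events in an object whose methods can be registered as listeners. `CrawlCollector` is a hypothetical class, the events are simulated so the code runs without network access, and the shape of the document detail is an assumption.

```python
class CrawlCollector:
    """Accumulates watcher events so they can be inspected after the crawl."""

    def __init__(self):
        self.documents = []
        self.errors = []
        self.status = None

    def on_document(self, detail):
        self.documents.append(detail)

    def on_error(self, detail):
        # Uses the same detail['error'] key as the handler above.
        self.errors.append(detail["error"])

    def on_done(self, detail):
        # Uses the same detail['status'] key as the handler above.
        self.status = detail["status"]


collector = CrawlCollector()

# With a real watcher these would be registered as, for example:
#   watcher.add_event_listener("document", collector.on_document)
# Here the events are simulated with made-up payloads.
collector.on_document({"markdown": "# Page 1"})
collector.on_document({"markdown": "# Page 2"})
collector.on_done({"status": "completed"})

print(len(collector.documents), collector.status)
```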

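Extract Format

LLM extraction is now available in v1 as the extract format. To extract structured data from a page, you can pass a schema to the endpoint or just provide a prompt.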

from firecrawl import FirecrawlApp
from pydantic import BaseModel

# Initialize FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])

Output (JSON):

{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
        "supports_sso": true,
        "is_open_source": false,
        "is_in_yc": true
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      }
    }
}

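Extracting without a schema (New)

You can now extract without a schema by passing only a prompt to the endpoint. The LLM chooses the structure of the data.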

curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["extract"],
      "extract": {
        "prompt": "Extract the company mission from the page."
      }
    }'

Output (JSON):

{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      }
    }
}
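
Both extraction styles share one request shape: the extract format plus an extract object that holds a schema or a prompt. The sketch below builds that body with the standard library. `build_extract_body` is a hypothetical helper, and the hand-written JSON Schema is only an illustration of what a schema could look like.

```python
import json


def build_extract_body(url, schema=None, prompt=None):
    """Build a /v1/scrape request body for the extract format (hypothetical helper)."""
    if schema is None and prompt is None:
        raise ValueError("provide a schema or a prompt")
    extract = {}
    if schema is not None:
        extract["schema"] = schema
    if prompt is not None:
        extract["prompt"] = prompt
    return {"url": url, "formats": ["extract"], "extract": extract}


# Schema-less request: mirrors the shape of the cURL example above.
prompt_body = build_extract_body(
    "https://docs.firecrawl.dev/",
    prompt="Extract the company mission from the page.",
)

# Schema-based request, using a hand-written JSON Schema for illustration.
schema_body = build_extract_body(
    "https://docs.firecrawl.dev/",
    schema={
        "type": "object",
        "properties": {"company_mission": {"type": "string"}},
        "required": ["company_mission"],
    },
)

print(json.dumps(prompt_body, indent=2))
```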

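New Crawl Webhook

You can now pass a webhook parameter to the /crawl endpoint. Firecrawl then sends a POST request to the URL you specify when the crawl starts, updates, and completes. The webhook now fires for every page crawled, rather than only once at the end with the whole result.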

cURL

# Send a POST request to the Firecrawl API, with a webhook that receives the crawl results
curl -X POST https://api.firecrawl.dev/v1/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 100,
      "webhook": "https://example.com/webhook"
    }'
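
The same request can be assembled in Python. This is a minimal sketch with the standard library that builds the request without sending it; `build_crawl_request` is a hypothetical helper, and the endpoint, headers, and body fields are copied from the cURL command above.

```python
import json
import urllib.request


def build_crawl_request(api_key, url, webhook, limit=100):
    """Build (but do not send) the POST request from the cURL example."""
    body = {"url": url, "limit": limit, "webhook": webhook}
    return urllib.request.Request(
        "https://api.firecrawl.dev/v1/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_crawl_request(
    "YOUR_API_KEY", "https://docs.firecrawl.dev", "https://example.com/webhook"
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```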

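Webhook Events

There are now 4 types of events:

crawl.started - Triggered when the crawl starts.
crawl.page - Triggered for every page crawled.
crawl.completed - Triggered when the crawl completes, to let you know it is done.
crawl.failed - Triggered when the crawl fails.

Webhook Response

success - Whether the webhook succeeded in crawling the page correctly.
type - The type of event that occurred.
id - The ID of the crawl.
data - The scraped data (array). It is non-empty only on crawl.page, where it contains 1 item if the page was scraped successfully. The item has the same shape as the /scrape endpoint response.
error - If the webhook failed, this contains the error message.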
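
A receiver usually dispatches on the type field. The sketch below shows one way to handle a decoded webhook body, assuming it carries exactly the fields listed above. `handle_webhook_event` is a hypothetical function, and the crawl ID and page content in the simulated events are made up.

```python
def handle_webhook_event(event: dict) -> str:
    """Summarise one webhook POST body (hypothetical handler).

    Assumes the body carries the fields listed above:
    success, type, id, data and error.
    """
    crawl_id = event.get("id")
    event_type = event.get("type")
    if not event.get("success"):
        return f"{crawl_id}: {event_type} reported error: {event.get('error')}"
    if event_type == "crawl.page":
        # data is an array holding one scraped page on crawl.page events.
        pages = event.get("data") or []
        return f"{crawl_id}: received {len(pages)} page(s)"
    if event_type in ("crawl.started", "crawl.completed", "crawl.failed"):
        return f"{crawl_id}: {event_type}"
    return f"{crawl_id}: unknown event {event_type!r}"


# Simulated events for illustration only.
events = [
    {"success": True, "type": "crawl.started", "id": "abc123", "data": []},
    {"success": True, "type": "crawl.page", "id": "abc123",
     "data": [{"markdown": "# Docs", "metadata": {"statusCode": 200}}]},
    {"success": True, "type": "crawl.completed", "id": "abc123", "data": []},
]
for event in events:
    print(handle_webhook_event(event))
```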

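Migrating from V0

/scrape endpoint

The updated /scrape endpoint has been redesigned for better reliability and ease of use. The new /scrape request body is structured as follows: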

{
  "url": "<string>",
  "formats": ["markdown", "html", "rawHtml", "links", "screenshot"],
  "includeTags": ["<string>"],
  "excludeTags": ["<string>"],
  "headers": { "<key>": "<value>" },
  "waitFor": 123,
  "timeout": 123
}

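Formats

You can now choose the formats you want your output in, and you can request several at once. Supported formats are:

Markdown (markdown)
HTML (html)
Raw HTML (rawHtml) (with no modifications)
Screenshot (screenshot or screenshot@fullPage)
Links (links)

By default, the output includes only the markdown format.

Details on the new request body

The table below outlines the changes to the request body parameters of the /scrape endpoint in V1.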

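Parameter	Change	Description
`onlyIncludeTags`	Moved and renamed	Moved to root level and renamed to `includeTags`.
`removeTags`	Moved and renamed	Moved to root level and renamed to `excludeTags`.
`onlyMainContent`	Moved	Moved to root level. Defaults to `true`.
`waitFor`	Moved	Moved to root level.
`headers`	Moved	Moved to root level.
`parsePDF`	Moved	Moved to root level.
`timeout`	No change
`pageOptions`	Removed	The `pageOptions` parameter is no longer needed. The scrape options were moved to root level.
`replaceAllPathsWithAbsolutePaths`	Removed	`replaceAllPathsWithAbsolutePaths` is no longer needed. Every path is now absolute by default.
`includeHtml`	Removed	Add `"html"` to `formats` instead.
`includeRawHtml`	Removed	Add `"rawHtml"` to `formats` instead.
`screenshot`	Removed	Add `"screenshot"` to `formats` instead.
`fullPageScreenshot`	Removed	Add `"screenshot@fullPage"` to `formats` instead.
`extractorOptions`	Removed	Use the `"extract"` format with an `extract` object instead.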

The new extract format is described in the llm-extract section.
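
The table can be read as a mechanical transformation. The sketch below applies it to a V0 body; `migrate_scrape_body` is a hypothetical helper, and it assumes the V0 options were nested under `pageOptions` with `timeout` at root level. It does not convert `extractorOptions`, which has to be rewritten by hand as an `extract` object.

```python
def migrate_scrape_body(v0: dict) -> dict:
    """Convert a V0 /scrape body to the V1 shape (illustrative sketch).

    Assumption: the V0 options in the table were nested under "pageOptions".
    """
    page = v0.get("pageOptions", {})
    v1 = {"url": v0["url"]}

    # Options moved to root level; the first two are also renamed.
    moved = {
        "onlyIncludeTags": "includeTags",
        "removeTags": "excludeTags",
        "onlyMainContent": "onlyMainContent",
        "waitFor": "waitFor",
        "headers": "headers",
        "parsePDF": "parsePDF",
    }
    for old, new in moved.items():
        if old in page:
            v1[new] = page[old]

    # Boolean flags become entries in "formats"; markdown is the default output.
    formats = ["markdown"]
    if page.get("includeHtml"):
        formats.append("html")
    if page.get("includeRawHtml"):
        formats.append("rawHtml")
    if page.get("fullPageScreenshot"):
        formats.append("screenshot@fullPage")
    elif page.get("screenshot"):
        formats.append("screenshot")
    v1["formats"] = formats

    if "timeout" in v0:
        v1["timeout"] = v0["timeout"]  # unchanged in V1
    # replaceAllPathsWithAbsolutePaths is dropped: paths are absolute by default.
    return v1


old_body = {
    "url": "https://firecrawl.dev",
    "pageOptions": {
        "onlyIncludeTags": ["article"],
        "removeTags": ["nav"],
        "includeHtml": True,
        "fullPageScreenshot": True,
        "replaceAllPathsWithAbsolutePaths": True,
    },
    "timeout": 30000,
}
new_body = migrate_scrape_body(old_body)
print(new_body)
```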

/crawl endpoint

We have also updated the /crawl endpoint in v1. The improved request body is shown below:

{
  "url": "<string>",
  "excludePaths": ["<string>"],
  "includePaths": ["<string>"],
  "maxDepth": 2,
  "ignoreSitemap": true,
  "limit": 10,
  "allowBackwardLinks": true,
  "allowExternalLinks": true,
  "scrapeOptions": {
    // same options as in /scrape
    "formats": ["markdown", "html", "rawHtml", "screenshot", "links"],
    "headers": { "<key>": "<value>" },
    "includeTags": ["<string>"],
    "excludeTags": ["<string>"],
    "onlyMainContent": true,
    "waitFor": 123
  }
}

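Details on the new request body

The table below outlines the changes to the request body parameters of the /crawl endpoint in V1.

Parameter	Change	Description
`pageOptions`	Renamed	Renamed to `scrapeOptions`.
`includes`	Moved and renamed	Moved to root level and renamed to `includePaths`.
`excludes`	Moved and renamed	Moved to root level and renamed to `excludePaths`.
`allowBackwardCrawling`	Moved and renamed	Moved to root level and renamed to `allowBackwardLinks`.
`allowExternalLinks`	Moved	Moved to root level.
`maxDepth`	Moved	Moved to root level.
`ignoreSitemap`	Moved	Moved to root level.
`limit`	Moved	Moved to root level.
`crawlerOptions`	Removed	The `crawlerOptions` parameter is no longer needed. The crawl options were moved to root level.
`timeout`	Removed	Use `timeout` in `scrapeOptions` instead.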
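
As with /scrape, these changes can be applied mechanically. The sketch below is a hypothetical `migrate_crawl_body` helper. It assumes the V0 crawl options were nested under `crawlerOptions` and that the V0 `timeout` sat at root level. The contents of `pageOptions` are copied unchanged, so they still need the /scrape migration described earlier.

```python
def migrate_crawl_body(v0: dict) -> dict:
    """Convert a V0 /crawl body to the V1 shape (illustrative sketch).

    Assumption: the V0 crawl options were nested under "crawlerOptions".
    """
    crawler = v0.get("crawlerOptions", {})
    v1 = {"url": v0["url"]}

    # Options moved to root level; the first three are also renamed.
    moved = {
        "includes": "includePaths",
        "excludes": "excludePaths",
        "allowBackwardCrawling": "allowBackwardLinks",
        "allowExternalLinks": "allowExternalLinks",
        "maxDepth": "maxDepth",
        "ignoreSitemap": "ignoreSitemap",
        "limit": "limit",
    }
    for old, new in moved.items():
        if old in crawler:
            v1[new] = crawler[old]

    # pageOptions is renamed to scrapeOptions; inner keys are copied as-is here.
    scrape_options = dict(v0.get("pageOptions", {}))
    if "timeout" in v0:
        # The crawl-level timeout now lives inside scrapeOptions.
        scrape_options["timeout"] = v0["timeout"]
    if scrape_options:
        v1["scrapeOptions"] = scrape_options
    return v1


old_crawl = {
    "url": "https://docs.firecrawl.dev",
    "crawlerOptions": {
        "includes": ["docs/*"],
        "excludes": ["blog/*"],
        "allowBackwardCrawling": True,
        "limit": 10,
    },
    "pageOptions": {"onlyMainContent": True},
    "timeout": 15000,
}
new_crawl = migrate_crawl_body(old_crawl)
print(new_crawl)
```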
