Skip to main content
Firecrawl将网页转换为markdown,非常适合LLM应用。
  • 它管理复杂性:代理、缓存、速率限制、js阻止的内容
  • 处理动态内容:动态网站、js渲染的网站、PDF、图像
  • 输出干净的markdown、结构化数据、截图或html。
详情请参阅抓取端点API参考

使用Firecrawl抓取URL

/scrape 端点

用于抓取URL并获取其内容。

安装

# 安装firecrawl-py包
pip install firecrawl-py

使用方法

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)
有关参数的更多详细信息,请参阅API参考

响应

SDK将直接返回数据对象。cURL将返回与下面完全相同的有效负载。
{
  "success": true,
  "data" : {
    "markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
    "html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "https://firecrawl.dev",
      "statusCode": 200
    }
  }
}

提取结构化数据

/scrape(带extract)端点

用于从抓取的页面中提取结构化数据。
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# 使用你的API密钥初始化FirecrawlApp
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])
输出:
JSON
{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
        "supports_sso": true,
        "is_open_source": false,
        "is_in_yc": true
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      },
    }
}

无模式提取(新功能)

现在您可以通过仅向端点传递prompt来进行无模式提取。LLM会选择数据的结构。
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["extract"],
      "extract": {
        "prompt": "从页面中提取公司使命。"
      }
    }'
输出:
JSON
{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      },
    }
}

Extract对象

extract对象接受以下参数:
  • schema:用于提取的模式。
  • systemPrompt:用于提取的系统提示。
  • prompt:用于无模式提取的提示。

使用Actions与页面交互

Firecrawl允许您在抓取内容之前对网页执行各种操作。这对于与动态内容交互、在页面之间导航或访问需要用户交互的内容特别有用。 以下是如何使用actions导航到google.com,搜索Firecrawl,点击第一个结果并截图的示例。 在执行其他操作之前/之后使用wait操作几乎总是很重要的,以便给页面足够的加载时间。

示例

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', 
    params={
        'formats': ['markdown', 'html'], 
        'actions': [
            {"type": "wait", "milliseconds": 2000},
            {"type": "click", "selector": "textarea[title=\"Search\"]"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "write", "text": "firecrawl"},
            {"type": "wait", "milliseconds": 2000},
            {"type": "press", "key": "ENTER"},
            {"type": "wait", "milliseconds": 3000},
            {"type": "click", "selector": "h3"},
            {"type": "wait", "milliseconds": 3000},
            {"type": "scrape"},
            {"type": "screenshot"}
        ]
    }
)
print(scrape_result)

输出

{
  "success": true,
  "data": {
    "markdown": "Our first Launch Week is over! [See the recap 🚀](blog/firecrawl-launch-week-1-recap)...",
    "actions": {
      "screenshots": [
        "https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png"
      ],
      "scrapes": [
        {
          "url": "https://www.firecrawl.dev/",
          "html": "<html><body><h1>Firecrawl</h1></body></html>"
        }
      ]
    },
    "metadata": {
      "title": "Home - Firecrawl",
      "description": "Firecrawl crawls and converts any website into clean markdown.",
      "language": "en",
      "keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
      "robots": "follow, index",
      "ogTitle": "Firecrawl",
      "ogDescription": "Turn any website into LLM-ready data.",
      "ogUrl": "https://www.firecrawl.dev/",
      "ogImage": "https://www.firecrawl.dev/og.png?123",
      "ogLocaleAlternate": [],
      "ogSiteName": "Firecrawl",
      "sourceURL": "http://google.com",
      "statusCode": 200
    }
  }
}
有关actions参数的更多详细信息,请参阅API参考

批量抓取多个URL

您现在可以同时批量抓取多个URL。它接受起始URL和可选参数作为参数。params参数允许您为批量抓取任务指定其他选项,例如输出格式。

工作原理

它与/crawl端点的工作方式非常相似。它提交一个批量抓取任务并返回一个任务ID,用于检查批量抓取的状态。 SDK提供了2种方法,同步和异步。同步方法将返回批量抓取任务的结果,而异步方法将返回一个任务ID,您可以使用它来检查批量抓取的状态。

使用方法

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# 抓取多个网站:
batch_scrape_result = app.batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
print(batch_scrape_result)

# 或者,你可以使用异步方法:
batch_scrape_job = app.async_batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
print(batch_scrape_job)

# (异步) 然后你可以使用任务ID来检查批量抓取的状态:
batch_scrape_status = app.check_batch_scrape_status(batch_scrape_job['id'])
print(batch_scrape_status)

响应

如果您使用SDK的同步方法,它将返回批量抓取任务的结果。否则,它将返回一个任务ID,您可以使用它来检查批量抓取的状态。

同步

Completed
{
  "status": "completed",
  "total": 36,
  "completed": 36,
  "creditsUsed": 36,
  "expiresAt": "2024-00-00T00:00:00.000Z",
  "next": "https://api.firecrawl.dev/v1/crawl/123-456-789?skip=26",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    },
    ...
  ]
}

异步

然后,您可以使用任务ID通过调用/batch/scrape/{id}端点来检查批量抓取的状态。此端点旨在在任务仍在运行或刚刚完成时使用**,因为批量抓取任务在24小时后过期**。
{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v1/batch/scrape/123-456-789"
}

位置和语言

指定国家和首选语言,根据您的目标位置和语言偏好获取相关内容。

工作原理

当您指定位置设置时,Firecrawl将使用适当的代理(如果可用)并模拟相应的语言和时区设置。默认情况下,如果未指定,位置设置为’US’。

使用方法

要使用位置和语言设置,请在请求正文中包含具有以下属性的location对象:
  • country:ISO 3166-1 alpha-2国家代码(例如,‘US’、‘AU’、‘DE’、‘JP’)。默认为’US’。
  • languages:请求的首选语言和区域设置数组,按优先级排序。默认为指定位置的语言。
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# 爬取一个网站:
scrape_result = app.scrape_url('airbnb.com', 
    params={
        'formats': ['markdown', 'html'], 
        'location': {
            'country': 'BR',
            'languages': ['pt-BR']
        }
    }
)
print(scrape_result)