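Firecrawl turns web pages into markdown, which makes it a good fit for LLM applications.
- It manages the complexities: proxies, caching, rate limits, and JS-blocked content
- It handles dynamic content: dynamic websites, JS-rendered sites, PDFs, and images
- It outputs clean markdown, structured data, screenshots, or HTML.
For details, see the Scrape Endpoint API Reference.
Scraping a URL with Firecrawl
/scrape endpoint
Used to scrape a URL and get its content.
# Install the firecrawl-py package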
pip install firecrawl-py
Usage
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)
For more details about the parameters, see the API Reference.
The SDKs return the data object directly. cURL returns the payload exactly as shown below.
{
"success": true,
"data" : {
"markdown": "Launch Week I is here! [See our Day 2 Release 🚀](https://www.firecrawl.dev/blog/launch-week-i-day-2-doubled-rate-limits)[💥 Get 2 months free...",
"html": "<!DOCTYPE html><html lang=\"en\" class=\"light\" style=\"color-scheme: light;\"><body class=\"__variable_36bd41 __variable_d7dc5d font-inter ...",
"metadata": {
"title": "Home - Firecrawl",
"description": "Firecrawl crawls and converts any website into clean markdown.",
"language": "en",
"keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "Turn any website into LLM-ready data.",
"ogUrl": "https://www.firecrawl.dev/",
"ogImage": "https://www.firecrawl.dev/og.png?123",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "https://firecrawl.dev",
"statusCode": 200
}
}
}
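Because the SDK returns the data object directly while the raw HTTP response wraps it in `success` and `data`, code that may receive either shape can normalise the payload first. The following is a minimal sketch only: `unwrap_scrape_payload` is a hypothetical helper, not part of firecrawl-py, and it assumes the payload is a plain dict shaped like the example above.

```python
from typing import Any, Dict


def unwrap_scrape_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the inner data object from either payload shape.

    Hypothetical helper (not part of firecrawl-py). It accepts the raw
    HTTP response ({"success": ..., "data": {...}}) or the data object
    that the SDK returns directly.
    """
    if "success" in payload and "data" in payload:
        if not payload["success"]:
            raise ValueError("scrape request was not successful")
        return payload["data"]
    return payload


# Trimmed copy of the example response above.
raw_response = {
    "success": True,
    "data": {
        "markdown": "Launch Week I is here! ...",
        "metadata": {"title": "Home - Firecrawl", "statusCode": 200},
    },
}

data = unwrap_scrape_payload(raw_response)
print(data["metadata"]["title"], data["metadata"]["statusCode"])
```

Passing the already unwrapped data object through the helper returns it unchanged, so the same downstream code can serve both the SDK and cURL paths.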
Extracting structured data
Used to extract structured data from scraped pages.
from firecrawl import FirecrawlApp
from pydantic import BaseModel
# Initialize FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')
# Fields the LLM should fill in from the page
class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool
data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])
Output:
{
"success": true,
"data": {
"extract": {
"company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
"supports_sso": true,
"is_open_source": false,
"is_in_yc": true
},
"metadata": {
"title": "Mendable",
"description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"robots": "follow, index",
"ogTitle": "Mendable",
"ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"ogUrl": "https://docs.firecrawl.dev/",
"ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
"ogLocaleAlternate": [],
"ogSiteName": "Mendable",
"sourceURL": "https://docs.firecrawl.dev/"
}
}
}
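Extracting without a schema (new)
You can now extract without a schema by passing only a prompt to the endpoint. The LLM chooses the structure of the data.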
curl -X POST https://api.firecrawl.dev/v1/scrape \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://docs.firecrawl.dev/",
"formats": ["extract"],
"extract": {
"prompt": "Extract the company mission from the page."
}
}'
Output:
{
"success": true,
"data": {
"extract": {
"company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"
},
"metadata": {
"title": "Mendable",
"description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"robots": "follow, index",
"ogTitle": "Mendable",
"ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
"ogUrl": "https://docs.firecrawl.dev/",
"ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
"ogLocaleAlternate": [],
"ogSiteName": "Mendable",
"sourceURL": "https://docs.firecrawl.dev/"
}
}
}
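The same prompt-only request can also be assembled in Python without the SDK. The sketch below uses only the standard library and builds the request without sending it; `build_prompt_only_request` is a hypothetical helper name, while the endpoint URL, headers, and body fields are copied from the cURL example above.

```python
import json
import urllib.request

# Endpoint taken from the cURL example above.
API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_prompt_only_request(api_key: str, url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a prompt-only extract request.

    Hypothetical helper for illustration; it mirrors the cURL call.
    """
    body = {
        "url": url,
        "formats": ["extract"],
        "extract": {"prompt": prompt},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_prompt_only_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, which needs a valid API key and network access.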
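The extract object accepts the following parameters:
schema: The schema to use for the extraction.
systemPrompt: The system prompt to use for the extraction.
prompt: The prompt to use for extraction without a schema.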
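One way to keep the three parameters straight is to assemble the extract object in a single place and leave out whatever was not supplied. This is a sketch under assumptions: `build_extract_options` is a hypothetical helper, and the hand-written JSON Schema dict is assumed to be accepted wherever the output of `model_json_schema()` is.

```python
from typing import Any, Dict, Optional


def build_extract_options(
    schema: Optional[Dict[str, Any]] = None,
    prompt: Optional[str] = None,
    system_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble the extract object, omitting anything not provided.

    Hypothetical helper; the key names come from the parameter list above.
    """
    options: Dict[str, Any] = {}
    if schema is not None:
        options["schema"] = schema
    if system_prompt is not None:
        options["systemPrompt"] = system_prompt
    if prompt is not None:
        options["prompt"] = prompt
    return options


# Hand-written JSON Schema (an assumption: used in place of a pydantic model).
mission_schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
    },
    "required": ["company_mission"],
}

params = {
    "formats": ["extract"],
    "extract": build_extract_options(
        schema=mission_schema,
        system_prompt="Answer only from the page content.",
    ),
}
print(sorted(params["extract"]))
```

The resulting `params` dict has the same shape as the second argument passed to `scrape_url` in the structured extraction example.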
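Interacting with the page with Actions
Firecrawl lets you perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating between pages, or reaching content that requires user interaction.
Here is an example of how to use actions to navigate to google.com, search for Firecrawl, click the first result, and take a screenshot.
It is almost always important to use the wait action before or after executing other actions, to give the page enough time to load.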
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Open google.com, then drive the page with actions before scraping:
scrape_result = app.scrape_url('google.com',
params={
'formats': ['markdown', 'html'],
'actions': [
{"type": "wait", "milliseconds": 2000},
{"type": "click", "selector": "textarea[title=\"Search\"]"},
{"type": "wait", "milliseconds": 2000},
{"type": "write", "text": "firecrawl"},
{"type": "wait", "milliseconds": 2000},
{"type": "press", "key": "ENTER"},
{"type": "wait", "milliseconds": 3000},
{"type": "click", "selector": "h3"},
{"type": "wait", "milliseconds": 3000},
{"type": "scrape"},
{"type": "screenshot"}
]
}
)
print(scrape_result)
{
"success": true,
"data": {
"markdown": "Our first Launch Week is over! [See the recap 🚀](blog/firecrawl-launch-week-1-recap)...",
"actions": {
"screenshots": [
"https://alttmdsdujxrfnakrkyi.supabase.co/storage/v1/object/public/media/screenshot-75ef2d87-31e0-4349-a478-fb432a29e241.png"
],
"scrapes": [
{
"url": "https://www.firecrawl.dev/",
"html": "<html><body><h1>Firecrawl</h1></body></html>"
}
]
},
"metadata": {
"title": "Home - Firecrawl",
"description": "Firecrawl crawls and converts any website into clean markdown.",
"language": "en",
"keywords": "Firecrawl,Markdown,Data,Mendable,Langchain",
"robots": "follow, index",
"ogTitle": "Firecrawl",
"ogDescription": "Turn any website into LLM-ready data.",
"ogUrl": "https://www.firecrawl.dev/",
"ogImage": "https://www.firecrawl.dev/og.png?123",
"ogLocaleAlternate": [],
"ogSiteName": "Firecrawl",
"sourceURL": "http://google.com",
"statusCode": 200
}
}
}
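Since nearly every step in the example is preceded by a wait, the interleaving can be generated rather than written by hand. The sketch below is illustrative: `with_waits` is a hypothetical helper, the 2000 ms default is simply the value used in the example above, and the action dicts reuse the types shown there.

```python
from typing import Any, Dict, List


def with_waits(actions: List[Dict[str, Any]], milliseconds: int = 2000) -> List[Dict[str, Any]]:
    """Return a new list with a wait action before every non-wait action.

    Hypothetical helper. Wait actions already present are kept, and no
    extra wait is added directly after an existing one.
    """
    result: List[Dict[str, Any]] = []
    for action in actions:
        previous_is_wait = bool(result) and result[-1]["type"] == "wait"
        if action["type"] != "wait" and not previous_is_wait:
            result.append({"type": "wait", "milliseconds": milliseconds})
        result.append(action)
    return result


steps = [
    {"type": "click", "selector": 'textarea[title="Search"]'},
    {"type": "write", "text": "firecrawl"},
    {"type": "press", "key": "ENTER"},
    {"type": "screenshot"},
]
actions = with_waits(steps)
print([action["type"] for action in actions])
```

The returned list can be passed as the `actions` value in `params`. Steps that need a longer pause, such as the page load after pressing ENTER, can carry their own explicit wait, which the helper leaves untouched.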
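For more details about the actions parameters, see the API Reference.
Batch scraping multiple URLs
You can now batch scrape multiple URLs at the same time. It takes the starting URLs and optional parameters as arguments. The params argument lets you specify additional options for the batch scrape job, such as the output formats.
How it works
It works much like the /crawl endpoint. It submits a batch scrape job and returns a job ID that you can use to check the status of the batch scrape.
The SDK provides two methods, synchronous and asynchronous. The synchronous method returns the results of the batch scrape job, while the asynchronous method returns a job ID that you can use to check the status of the batch scrape.
Usage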
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape multiple websites:
batch_scrape_result = app.batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
print(batch_scrape_result)
# Or, you can use the asynchronous method:
batch_scrape_job = app.async_batch_scrape_urls(['firecrawl.dev', 'mendable.ai'], {'formats': ['markdown', 'html']})
print(batch_scrape_job)
# (async) You can then use the job ID to check the status of the batch scrape:
batch_scrape_status = app.check_batch_scrape_status(batch_scrape_job['id'])
print(batch_scrape_status)
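If you use the synchronous method from the SDK, it returns the results of the batch scrape job. Otherwise, it returns a job ID that you can use to check the status of the batch scrape.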
{
"status": "completed",
"total": 36,
"completed": 36,
"creditsUsed": 36,
"expiresAt": "2024-00-00T00:00:00.000Z",
"next": "https://api.firecrawl.dev/v1/crawl/123-456-789?skip=26",
"data": [
{
"markdown": "[Firecrawl Docs home page!...",
"html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
"metadata": {
"title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
"language": "en",
"sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
"description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
"ogLocaleAlternate": [],
"statusCode": 200
}
},
...
]
}
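The response above carries a `next` URL alongside `data`, which suggests that larger result sets arrive in pages. The sketch below shows one way to gather every document. It rests on an assumption inferred from that field: following `next` returns another object of the same shape. `collect_all_documents` and the `fetch_page` callable are hypothetical names, and `fetch_page` stands in for whatever authenticated GET you use.

```python
from typing import Any, Callable, Dict, List


def collect_all_documents(
    first_page: Dict[str, Any],
    fetch_page: Callable[[str], Dict[str, Any]],
    max_pages: int = 100,
) -> List[Dict[str, Any]]:
    """Gather `data` from the first page and from every `next` page.

    Hypothetical helper. `fetch_page` takes a URL and returns the parsed
    JSON of that page; `max_pages` guards against an endless chain.
    """
    documents = list(first_page.get("data", []))
    next_url = first_page.get("next")
    pages_fetched = 0
    while next_url and pages_fetched < max_pages:
        page = fetch_page(next_url)
        documents.extend(page.get("data", []))
        next_url = page.get("next")
        pages_fetched += 1
    return documents


# Stand-in pages so the example runs without network access.
fake_pages = {
    "https://example.invalid/page-2": {"data": [{"markdown": "third"}]},
}
first = {
    "status": "completed",
    "next": "https://example.invalid/page-2",
    "data": [{"markdown": "first"}, {"markdown": "second"}],
}

documents = collect_all_documents(first, fake_pages.__getitem__)
print(len(documents))
```

Injecting `fetch_page` keeps the paging logic independent of the HTTP client, so it can be exercised with canned pages as shown here.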
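You can then use the job ID to check the status of the batch scrape by calling the /batch/scrape/{id} endpoint. This endpoint is meant to be used while the job is still running or right after it has completed, **because batch scrape jobs expire after 24 hours**.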
{
"success": true,
"id": "123-456-789",
"url": "https://api.firecrawl.dev/v1/batch/scrape/123-456-789"
}
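With the asynchronous method, checking the status is usually done in a loop. Below is a minimal polling sketch, not the SDK's own mechanism: `wait_for_batch_scrape` is a hypothetical helper, the only status value taken from this page is "completed", and the "scraping" value in the canned responses is an assumed in-progress status.

```python
import time
from typing import Any, Callable, Dict


def wait_for_batch_scrape(
    check_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll `check_status(job_id)` until the status is "completed".

    Hypothetical helper. Raises TimeoutError if the job has not
    completed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = check_status(job_id)
        if status.get("status") == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch scrape {job_id} did not complete in time")
        sleep(poll_interval)


# Canned responses so the example runs offline; "scraping" is an assumed value.
responses = iter([
    {"status": "scraping", "completed": 1, "total": 2},
    {"status": "completed", "completed": 2, "total": 2, "data": []},
])

result = wait_for_batch_scrape(
    lambda job_id: next(responses),
    "123-456-789",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(result["status"], result["completed"])
```

With the SDK, `app.check_batch_scrape_status` could be passed as `check_status` and `batch_scrape_job['id']` as `job_id`, keeping the timeout well under the 24-hour expiry.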
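Location and language
Specify a country and preferred languages to get content that is relevant to your target location and language preferences.
How it works
When you specify the location settings, Firecrawl uses an appropriate proxy if one is available and emulates the corresponding language and timezone settings. If no location is specified, it defaults to 'US'.
Usage
To use the location and language settings, include a location object with the following properties in your request body:
country: ISO 3166-1 alpha-2 country code (for example, 'US', 'AU', 'DE', 'JP'). Defaults to 'US'.
languages: An array of preferred languages and locales for the request, in order of priority. Defaults to the language of the specified location.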
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
# Scrape a website:
scrape_result = app.scrape_url('airbnb.com',
params={
'formats': ['markdown', 'html'],
'location': {
'country': 'BR',
'languages': ['pt-BR']
}
}
)
print(scrape_result)
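A malformed country code is easy to catch before the request is sent. The sketch below builds the location object with a basic shape check. `build_location` is a hypothetical helper, uppercasing the input is this sketch's own choice, and the check only confirms that the code is two letters, not that it is an assigned ISO 3166-1 code.

```python
import re
from typing import Any, Dict, List, Optional

# Two ASCII letters: the shape of an ISO 3166-1 alpha-2 code.
COUNTRY_CODE = re.compile(r"[A-Z]{2}")


def build_location(country: str = "US", languages: Optional[List[str]] = None) -> Dict[str, Any]:
    """Build the location object, validating the shape of the country code.

    Hypothetical helper. `languages` is omitted when not given, so the
    default for the chosen location applies.
    """
    code = country.upper()
    if not COUNTRY_CODE.fullmatch(code):
        raise ValueError(f"expected a two-letter country code, got {country!r}")
    location: Dict[str, Any] = {"country": code}
    if languages:
        location["languages"] = list(languages)
    return location


params = {
    "formats": ["markdown", "html"],
    "location": build_location("br", ["pt-BR"]),
}
print(params["location"])
```

The resulting `params` dict matches the one passed to `scrape_url` in the example above.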