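Scrape and extract structured data with Firecrawl
Firecrawl uses large language models (LLMs) to extract structured data from web pages. The process has three steps:
- Schema definition: Define the URL to scrape and the desired data schema as a JSON Schema (following the OpenAI tool schema). The schema specifies the structure of the data you expect to extract from the page.
- Scrape endpoint: Pass the URL and the schema to the scrape endpoint. Documentation for this endpoint can be found here: Scrape endpoint documentation
- Structured data retrieval: Receive the scraped data in the structured format defined by your schema. You can then use it in your application or process it further as needed.
This approach simplifies data extraction, reduces manual processing, and improves efficiency.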
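The schema in the first step is ordinary JSON Schema, so it can also be written by hand as a plain dictionary. The following is a minimal sketch, assuming the four example fields used later on this page and the request shape (`url`, `formats`, `extract`) shown in the curl example; the `required` list is an assumption of this sketch, not something the page specifies.

```python
import json

# Hand-written JSON Schema describing the data we want back.
extract_schema = {
    "type": "object",
    "properties": {
        "company_mission": {"type": "string"},
        "supports_sso": {"type": "boolean"},
        "is_open_source": {"type": "boolean"},
        "is_in_yc": {"type": "boolean"},
    },
    # Assumption: mark every field as required.
    "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
}

# Request body for the scrape endpoint: the URL plus the schema under "extract".
payload = {
    "url": "https://docs.firecrawl.dev/",
    "formats": ["extract"],
    "extract": {"schema": extract_schema},
}

print(json.dumps(payload, indent=2))
```

Writing the schema as a dictionary avoids a dependency on a modelling library, at the cost of losing the type checking a model class would give you.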
Extracting structured data
Use this to extract structured data from scraped pages.
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# Initialize FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

# Fields to extract from the page
class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

# Scrape the page and request extraction against the schema
data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])
Output:
{
  "success": true,
  "data": {
    "extract": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    },
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://docs.firecrawl.dev/",
      "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://docs.firecrawl.dev/"
    }
  }
}
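Once a response like the one above arrives, it is worth checking that the extracted fields have the types the schema asked for, since the values come from an LLM. Here is a minimal sketch using only the standard library. The helper names `get_extract` and `mismatched_fields` are hypothetical, and the sample string is a trimmed copy of the response above with the metadata left out.

```python
import json

# Trimmed copy of the response shown above (metadata omitted for brevity).
raw_response = '''
{
  "success": true,
  "data": {
    "extract": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
'''

# Python types we expect for each field of the schema.
EXPECTED_TYPES = {
    "company_mission": str,
    "supports_sso": bool,
    "is_open_source": bool,
    "is_in_yc": bool,
}


def get_extract(response: dict) -> dict:
    """Return the data.extract object, or raise if the request failed."""
    if not response.get("success"):
        raise ValueError("scrape request did not succeed")
    return response["data"]["extract"]


def mismatched_fields(extract: dict, expected: dict) -> list:
    """List fields that are missing or whose value has an unexpected type."""
    return [
        name for name, expected_type in expected.items()
        if not isinstance(extract.get(name), expected_type)
    ]


extract = get_extract(json.loads(raw_response))
problems = mismatched_fields(extract, EXPECTED_TYPES)
print(extract["supports_sso"], problems)
```

An empty `problems` list means every expected field is present with the expected type; anything else is a signal to retry or fall back before using the data.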
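Schema-less extraction (new)
You can now extract data without a schema by passing only a prompt to the endpoint. The LLM chooses the structure of the data.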
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["extract"],
      "extract": {
        "prompt": "Extract the company mission from the page."
      }
    }'
Output:
{
  "success": true,
  "data": {
    "extract": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to"
    },
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://docs.firecrawl.dev/",
      "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://docs.firecrawl.dev/"
    }
  }
}
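The same schema-less request can be issued from Python without the SDK. This sketch mirrors the curl call above with the standard library's `urllib.request`. The helper name `build_scrape_request` is hypothetical, and the request is only built here, not sent, so it runs without an API key or network access.

```python
import json
import urllib.request

API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(api_key: str, page_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a schema-less extraction request."""
    body = {
        "url": page_url,
        "formats": ["extract"],
        "extract": {"prompt": prompt},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_scrape_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
)

# To send it, uncomment the lines below (needs a real API key and network access):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["data"]["extract"])
```

Because the LLM picks the field names in schema-less mode, code that reads the result should not assume a particular key such as `company_mission` is always present.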
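The extract object accepts the following parameters:
schema: The schema to use for the extraction.
systemPrompt: The system prompt to use for the extraction.
prompt: The prompt to use for schema-less extraction.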
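The three parameters can be combined in a small helper that emits only the keys that were actually supplied. This is a sketch rather than part of the Firecrawl SDK: `build_extract_options` is a hypothetical name, and the rule that a schema or a prompt must be present is an assumption made here for illustration.

```python
from typing import Optional


def build_extract_options(
    schema: Optional[dict] = None,
    system_prompt: Optional[str] = None,
    prompt: Optional[str] = None,
) -> dict:
    """Assemble the extract object, leaving out parameters that were not given."""
    # Assumption for this sketch: a request needs a schema, a prompt, or both.
    if schema is None and prompt is None:
        raise ValueError("provide a schema, a prompt, or both")
    candidates = {"schema": schema, "systemPrompt": system_prompt, "prompt": prompt}
    return {key: value for key, value in candidates.items() if value is not None}


# Schema-less extraction: only a prompt.
schema_less = build_extract_options(prompt="Extract the company mission from the page.")

# Schema-based extraction with a custom system prompt.
with_schema = build_extract_options(
    schema={"type": "object", "properties": {"company_mission": {"type": "string"}}},
    system_prompt="Answer only from the page content.",
)

print(schema_less)
print(with_schema)
```

The returned dictionary can be placed under the `extract` key of the request body alongside `url` and `formats`.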