Skip to main content

使用Firecrawl抓取和提取结构化数据

Firecrawl利用大型语言模型(LLMs)高效地从网页中提取结构化数据。以下是方法:
  1. 模式定义: 定义要抓取的URL和使用JSON Schema(遵循OpenAI工具模式)的所需数据模式。此模式指定您期望从页面中提取的数据结构。
  2. 抓取端点: 将URL和模式传递给抓取端点。此端点的文档可在此处找到: 抓取端点文档
  3. 结构化数据检索: 接收以您的模式定义的结构化格式的抓取数据。然后,您可以根据需要在应用程序中使用此数据或进行进一步处理。
这种方法简化了数据提取,减少了手动处理并提高了效率。

提取结构化数据

/scrape(带extract)端点

用于从抓取的页面中提取结构化数据。
from firecrawl import FirecrawlApp
from pydantic import BaseModel, Field

# 使用你的API密钥初始化FirecrawlApp
app = FirecrawlApp(api_key='your_api_key')

class ExtractSchema(BaseModel):
    company_mission: str
    supports_sso: bool
    is_open_source: bool
    is_in_yc: bool

data = app.scrape_url('https://docs.firecrawl.dev/', {
    'formats': ['extract'],
    'extract': {
        'schema': ExtractSchema.model_json_schema(),
    }
})
print(data["extract"])
输出:
JSON
{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
        "supports_sso": true,
        "is_open_source": false,
        "is_in_yc": true
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      },
    }
}

无模式提取(新功能)

现在您可以通过仅向端点传递prompt来进行无模式提取。LLM会选择数据的结构。
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["extract"],
      "extract": {
        "prompt": "从页面中提取公司使命。"
      }
    }'
输出:
JSON
{
    "success": true,
    "data": {
      "extract": {
        "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      },
      "metadata": {
        "title": "Mendable",
        "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "robots": "follow, index",
        "ogTitle": "Mendable",
        "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
        "ogUrl": "https://docs.firecrawl.dev/",
        "ogImage": "https://docs.firecrawl.dev/mendable_new_og1.png",
        "ogLocaleAlternate": [],
        "ogSiteName": "Mendable",
        "sourceURL": "https://docs.firecrawl.dev/"
      },
    }
}

Extract对象

extract对象接受以下参数:
  • schema:用于提取的模式。
  • systemPrompt:用于提取的系统提示。
  • prompt:用于无模式提取的提示。