Python SDK | Firecrawl

Installation

To install the Firecrawl Python SDK, use pip:

Python

# Install the firecrawl-py package
pip install firecrawl-py

Usage

Get an API key from firecrawl.dev.
Set the API key as an environment variable named FIRECRAWL_API_KEY, or pass it as a parameter to the FirecrawlApp class.
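
A minimal sketch of the environment-variable route, assuming you want to fail early when the key is missing. The helper name `load_api_key` is hypothetical and not part of the SDK; only the variable name FIRECRAWL_API_KEY comes from the steps above.

```python
import os


def load_api_key(env=None):
    """Return the Firecrawl API key from the environment, or raise if it is absent.

    `load_api_key` is an illustrative helper, not an SDK function.
    """
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key
```

The returned value could then be passed along as `FirecrawlApp(api_key=load_api_key())`.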

Here is an example of how to use the SDK:

Python

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a website:
scrape_status = app.scrape_url(
  'https://firecrawl.dev', 
  params={'formats': ['markdown', 'html']}
)
print(scrape_status)

# Crawl a website:
crawl_status = app.crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  }
)
print(crawl_status)

Scraping a single URL

To scrape a single URL, use the scrape_url method. It takes the URL as a parameter and returns the scraped data as a dictionary.

Python

# Scrape a website:
scrape_result = app.scrape_url('firecrawl.dev', params={'formats': ['markdown', 'html']})
print(scrape_result)
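
Because the result is a dictionary, individual formats can be pulled out by key. The sketch below assumes the dictionary is keyed by the format names requested in `params['formats']`; that key layout and the helper `get_format` are assumptions, not something this page states.

```python
def get_format(scrape_result, fmt="markdown"):
    """Return the content for one requested format, or None if it is missing.

    Assumption: the result dictionary is keyed by the format names passed in
    params['formats'] (for example 'markdown' and 'html').
    """
    if not isinstance(scrape_result, dict):
        raise TypeError("expected the dictionary returned by scrape_url")
    return scrape_result.get(fmt)


# Hand-written stand-in for a real scrape_url result, so the sketch runs offline.
sample_result = {"markdown": "# Firecrawl", "html": "<h1>Firecrawl</h1>"}
print(get_format(sample_result, "markdown"))
```

Using `.get` rather than indexing keeps the lookup from raising when a format was not requested.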

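Crawling a website

To crawl a website, use the crawl_url method. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.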

Python

# Crawl a website
crawl_status = app.crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  }, 
  poll_interval=30
)
print(crawl_status)
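
Printing the whole crawl result can be noisy, so a small summary helper may be easier to read. The sketch below assumes the result keeps its pages in a `data` list and that each page exposes `markdown` and `metadata['sourceURL']`; those field names and the helper `list_crawled_pages` are assumptions to verify against a real response.

```python
def list_crawled_pages(crawl_status):
    """Return (source_url, markdown_length) pairs for each crawled page.

    Assumptions: pages live under a 'data' list, and each page carries its
    address in page['metadata']['sourceURL'] and its text in page['markdown'].
    """
    pages = []
    for page in crawl_status.get("data") or []:
        metadata = page.get("metadata") or {}
        url = metadata.get("sourceURL", "<unknown>")
        pages.append((url, len(page.get("markdown") or "")))
    return pages


# Hand-written stand-in for a crawl_url result, so the sketch runs offline.
sample_status = {
    "status": "completed",
    "data": [
        {"markdown": "# Home", "metadata": {"sourceURL": "https://firecrawl.dev"}},
        {"markdown": "", "metadata": {}},
    ],
}
for url, size in list_crawled_pages(sample_status):
    print(url, size)
```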

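Asynchronous crawling

To crawl a website asynchronously, use the async_crawl_url method. It takes the starting URL and optional parameters as arguments and returns a crawl ID that you can use to check the status of the crawl job. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.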

Python

# Crawl a website asynchronously
crawl_status = app.async_crawl_url(
  'https://firecrawl.dev', 
  params={
    'limit': 100, 
    'scrapeOptions': {'formats': ['markdown', 'html']}
  }
)
print(crawl_status)

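Checking crawl status

To check the status of a crawl job, use the check_crawl_status method. It takes the job ID as a parameter and returns the current status of the crawl job.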

Python

# Check the status of a crawl job
crawl_status = app.check_crawl_status("<crawl_id>")
print(crawl_status)
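
A single status check is rarely enough for an asynchronous crawl, so a polling loop is a natural companion. The sketch below assumes the status dictionary has a `status` key and that `completed`, `failed`, and `cancelled` mark a finished job; those values, and the helper `wait_for_crawl`, are assumptions rather than documented behaviour.

```python
import time


def wait_for_crawl(app, crawl_id, poll_interval=5.0, timeout=300.0, sleep=time.sleep):
    """Poll check_crawl_status until the job leaves its in-progress state.

    Assumptions: the status dictionary has a 'status' key, and the values
    'completed', 'failed', and 'cancelled' mean the job has finished.
    """
    finished = {"completed", "failed", "cancelled"}
    deadline = time.monotonic() + timeout
    while True:
        status = app.check_crawl_status(crawl_id)
        if status.get("status") in finished:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl {crawl_id} still running after {timeout}s")
        sleep(poll_interval)
```

The `sleep` argument is injectable so the loop can be exercised in tests without waiting; in normal use, leave it at its default and pass the crawl ID obtained from the asynchronous crawl.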

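Cancelling a crawl

To cancel an asynchronous crawl job, use the cancel_crawl method. It takes the job ID of the asynchronous crawl as a parameter and returns the cancellation status.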

Python

# Cancel a crawl job
cancel_crawl = app.cancel_crawl("<crawl_id>")
print(cancel_crawl)

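Mapping a website

Use map_url to generate a list of URLs from a website. The params argument lets you customize the mapping process, including options to exclude subdomains or to use the sitemap.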

Python

# Map a website:
map_result = app.map_url('https://firecrawl.dev')
print(map_result)
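
The call above uses the defaults. A sketch of building a params dictionary for the two options mentioned follows; the key names `includeSubdomains` and `ignoreSitemap` are assumptions inferred from that description, and `build_map_params` is a hypothetical helper, so check the API reference for the exact keys before relying on them.

```python
def build_map_params(include_subdomains=False, ignore_sitemap=False):
    """Build a params dictionary for map_url.

    Assumption: the option keys are 'includeSubdomains' and 'ignoreSitemap'.
    """
    return {
        "includeSubdomains": bool(include_subdomains),
        "ignoreSitemap": bool(ignore_sitemap),
    }


params = build_map_params(include_subdomains=False)
print(params)
# With a configured client, this could be passed as:
#   map_result = app.map_url('https://firecrawl.dev', params=params)
```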

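Crawling a website with WebSockets

To crawl a website with WebSockets, use the crawl_url_and_watch method. It takes the starting URL and optional parameters as arguments. The params argument lets you specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.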

Python

import nest_asyncio

# Inside an async function...
nest_asyncio.apply()

# Define event handlers
def on_document(detail):
    print("DOC", detail)

def on_error(detail):
    print("ERR", detail['error'])

def on_done(detail):
    print("DONE", detail['status'])

# Function that starts the crawl and watches it
async def start_crawl_and_watch():
    # Start the crawl job and get a watcher
    watcher = app.crawl_url_and_watch('firecrawl.dev', { 'excludePaths': ['blog/*'], 'limit': 5 })

    # Add event listeners
    watcher.add_event_listener("document", on_document)
    watcher.add_event_listener("error", on_error)
    watcher.add_event_listener("done", on_done)

    # Start the watcher
    await watcher.connect()

# Run the event loop
await start_crawl_and_watch()
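
The handlers above only print each event. If the documents are needed after the crawl finishes, one option is to collect them in an object. The sketch below is illustrative: `CrawlCollector` is a hypothetical class, and it assumes the event payloads are dictionaries shaped like those the handlers above index into (`detail['error']`, `detail['status']`).

```python
class CrawlCollector:
    """Accumulates watcher events so they can be inspected after the crawl."""

    def __init__(self):
        self.documents = []
        self.errors = []
        self.final_status = None

    def on_document(self, detail):
        self.documents.append(detail)

    def on_error(self, detail):
        self.errors.append(detail.get("error"))

    def on_done(self, detail):
        self.final_status = detail.get("status")

    def attach(self, watcher):
        """Register the three handlers on a watcher object and return it."""
        watcher.add_event_listener("document", self.on_document)
        watcher.add_event_listener("error", self.on_error)
        watcher.add_event_listener("done", self.on_done)
        return watcher
```

After `collector.attach(watcher)` and `await watcher.connect()`, the collected pages would be available in `collector.documents`.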

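Error handling

The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception is raised with a descriptive error message.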
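
A minimal sketch of catching such an exception and keeping its message. This page does not name the exception classes the SDK raises, so the example catches the broad `Exception` type; `call_safely` and the failing stand-in function are hypothetical.

```python
def call_safely(operation, *args, **kwargs):
    """Run a call and return (result, error_message).

    Catches the broad Exception type because the concrete exception
    classes raised by the SDK are not named on this page.
    """
    try:
        return operation(*args, **kwargs), None
    except Exception as exc:  # narrow this once the concrete type is known
        return None, str(exc)


def _failing_call(url):
    # Stand-in for an SDK method that fails; the message is invented for the demo.
    raise RuntimeError(f"request for {url} failed")


result, error = call_safely(_failing_call, "https://firecrawl.dev")
print(result, error)
```

With a configured client the same wrapper could be used as `call_safely(app.scrape_url, 'https://firecrawl.dev')`.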
