Crawlee-Python: Running Crawlers in a Web Server Environment

2025-06-06 23:28:02 Author: 彭桢灵Jeremy

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project URL: https://gitcode.com/GitHub_Trending/cr/crawlee-python

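Crawlee is an open-source crawling framework, and its Python version (crawlee-python) gives developers an efficient way to scrape data. This article looks at how to deploy and run Crawlee crawlers inside a web server so that crawling is exposed as an API service.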

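Core Scenarios and Requirements

In practice, we often need to wrap crawling capabilities as a web service and expose data scraping through API endpoints. Typical use cases include:

Accepting a URL submitted by a user and returning structured data from the page
Starting and stopping crawl tasks on demand
Monitoring and managing crawl tasks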
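
The second and third scenarios (starting, stopping, and monitoring tasks) can be sketched with nothing more than asyncio tasks. The following is a minimal illustration, not part of Crawlee: the class name CrawlJobRegistry and its methods are hypothetical, and the job passed in is assumed to be any coroutine function that performs one crawl.

```python
import asyncio
import uuid
from typing import Any, Callable, Coroutine, Dict


class CrawlJobRegistry:
    """Hypothetical in-memory registry of crawl jobs (not a Crawlee API)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}

    def start(self, job: Callable[[], Coroutine[Any, Any, Any]]) -> str:
        """Schedule the job on the running event loop and return its id."""
        job_id = uuid.uuid4().hex
        self._tasks[job_id] = asyncio.create_task(job())
        return job_id

    def status(self, job_id: str) -> str:
        """Report 'running', 'cancelled', 'failed', or 'finished'."""
        task = self._tasks[job_id]
        if not task.done():
            return "running"
        if task.cancelled():
            return "cancelled"
        if task.exception() is not None:
            return "failed"
        return "finished"

    async def stop(self, job_id: str) -> None:
        """Cancel the job and wait until the cancellation has taken effect."""
        task = self._tasks[job_id]
        task.cancel()
        # return_exceptions=True keeps the CancelledError from propagating here
        await asyncio.gather(task, return_exceptions=True)
```

A web endpoint could call start() when a crawl is requested, return the job id to the client, and expose status() and stop() through further endpoints. The registry lives in process memory, so under this sketch job state is lost on restart and is not shared between worker processes.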

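Key Implementation Techniques

1. Disabling Local Storage

In a web server environment, we usually do not need to persist crawl results to the local file system. Crawlee provides configuration options for this: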

from crawlee import service_locator

# Get the global configuration object
configuration = service_locator.get_configuration()

# Disable persisting storage to disk
configuration.persist_storage = False

# Disable writing metadata files
configuration.write_metadata = False

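These options are especially useful with MemoryStorageClient: they avoid unnecessary disk I/O and help keep service response times low.

2. Integrating with a Web Framework

Taking FastAPI as an example, a Crawlee crawler can be wrapped as an API endpoint: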

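from fastapi import FastAPI
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

app = FastAPI()

@app.post("/crawl")
async def crawl_url(url: str):
    # Results are collected per request, so concurrent calls do not share state
    results = []

    # Crawlee passes a crawling context to the handler, not a bare Playwright page
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        results.append({"url": context.request.url, "title": title})

    crawler = PlaywrightCrawler(
        request_handler=handle_page,
        headless=True,
    )

    await crawler.run([url])
    return {"results": results}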
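
Because each request to the endpoint above launches a browser, a service built this way may want to cap the number of simultaneous crawls and bound how long a single crawl can take. The sketch below shows one way to do that with asyncio alone. The helper name run_limited and its parameters are hypothetical, and the crawl argument stands in for a coroutine function that builds a crawler and awaits its run. How promptly a cancelled crawl releases its browser is not covered here.

```python
import asyncio
from typing import Any, Callable, Coroutine


async def run_limited(
    crawl: Callable[[], Coroutine[Any, Any, Any]],
    slots: asyncio.Semaphore,
    timeout: float,
) -> Any:
    """Run one crawl while holding a concurrency slot, with a time limit.

    Raises asyncio.TimeoutError if the crawl exceeds `timeout` seconds.
    The slot is released whether the crawl succeeds, fails, or times out.
    """
    async with slots:
        # wait_for cancels the inner coroutine when the timeout expires
        return await asyncio.wait_for(crawl(), timeout=timeout)
```

Inside an endpoint, the body that builds and runs the crawler could be moved into a nested coroutine function and passed to run_limited together with a semaphore shared by all requests. An asyncio.TimeoutError can then be translated into an appropriate HTTP error response.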