Crawlee-python Project Tutorial

2026-01-30 05:25:26 Author: 裴锟轩Denise

Crawlee: a web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Parsel, BeautifulSoup, Playwright, and raw HTTP. Supports both headful and headless modes, with proxy rotation.

Project repository: https://gitcode.com/GitHub_Trending/cr/crawlee-python

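1. Directory structure and overview

Crawlee-python is a Python library for web scraping and browser automation. Its repository is laid out as follows:

src/crawlee: The core code of the Crawlee library.
tests: Unit tests for Crawlee's functionality.
website: Files for the project website.
.github: GitHub configuration, including the GitHub Actions workflow files.
.gitignore: Lists the files Git should ignore.
CHANGELOG.md: Records the project's release and change history.
CONTRIBUTING.md: Explains how to contribute code to the project.
LICENSE: The project's Apache-2.0 license.
Makefile: Targets for building and testing the project.
README.md: The project's introductory documentation.
pyproject.toml: Project metadata, dependencies, and tool configuration.
renovate.json: Configuration for Renovate, the tool that updates dependencies automatically.
uv.lock: The uv lock file that pins the project's dependency versions.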
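
To check a local clone against the layout listed above, a small sketch like the following could be used. The function name missing_entries and the repo_root argument are illustrative assumptions and are not part of Crawlee. The list of expected entries is copied from this section and may not match every version of the repository.

```python
from pathlib import Path

# Entries taken from the directory listing in this tutorial (may differ per version).
EXPECTED_ENTRIES = [
    'src/crawlee', 'tests', 'website', '.github', '.gitignore',
    'CHANGELOG.md', 'CONTRIBUTING.md', 'LICENSE', 'Makefile',
    'README.md', 'pyproject.toml', 'renovate.json', 'uv.lock',
]


def missing_entries(repo_root: Path) -> list[str]:
    """Return the expected entries that do not exist under repo_root."""
    return [entry for entry in EXPECTED_ENTRIES if not (repo_root / entry).exists()]
```

Calling missing_entries(Path('crawlee-python')) on a fresh clone would return the entries from the list that are absent in that checkout.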

2. Entry script

A Crawlee-python project is usually started from a Python script that you write yourself. The following simple entry script creates a BeautifulSoupCrawler instance and starts crawling.

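import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Stop after 10 requests; raise or remove this limit to crawl more pages.
        max_requests_per_crawl=10,
    )

    # The default handler is called for every request without a more specific route.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}...')

        # Extract the URL and the page title (None if the page has no <title>).
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }

        # Store the extracted record in the default dataset.
        await context.push_data(data)

        # Add the links found on the page to the request queue.
        await context.enqueue_links()

    # Start the crawl from the initial list of URLs.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())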

In this script, main() is an asynchronous function. It creates a crawler instance, registers request_handler as the default handler, and then starts the crawl.
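
The decorator-based registration used above can be illustrated with a toy model. The sketch below is not Crawlee's implementation: ToyRouter, dispatch, and crawl are hypothetical names invented for this example, and it only records URLs instead of fetching pages. It shows the general idea of a decorator that stores a handler and a loop that calls it for each request up to a limit.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Handler = Callable[[str], Awaitable[None]]


class ToyRouter:
    """Hypothetical stand-in for a router; not Crawlee's actual class."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None

    def default_handler(self, func: Handler) -> Handler:
        # Used as a decorator: remember the function and return it unchanged.
        self._default = func
        return func

    async def dispatch(self, url: str) -> None:
        # Call the registered default handler for this URL.
        if self._default is None:
            raise RuntimeError('No default handler registered')
        await self._default(url)


async def crawl(start_urls: list[str], max_requests: int) -> list[str]:
    router = ToyRouter()
    processed: list[str] = []

    @router.default_handler
    async def request_handler(url: str) -> None:
        # A real handler would parse the page; here we only record the URL.
        processed.append(url)

    # Process at most max_requests URLs, mirroring a request limit.
    for url in start_urls[:max_requests]:
        await router.dispatch(url)
    return processed
```

For example, asyncio.run(crawl(['a', 'b', 'c'], max_requests=2)) processes only the first two entries.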

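3. Configuration file

Crawlee-python uses a pyproject.toml file to manage project configuration, including its dependencies. The following is a simplified, illustrative example of such a file rather than a verbatim copy of the one in the repository: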

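# Simplified fragment for illustration, written as valid TOML.
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"

# Discover packages automatically under src/ (where src/crawlee lives).
[tool.setuptools.packages.find]
where = ["src"]

# Console entry point: installs a `crawlee` command that calls crawlee.cli:main.
[project.scripts]
crawlee = "crawlee.cli:main"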