Crawlee-Python: Implementing a Continuously Running Web Crawler Loop

2025-06-07 05:18:57 Author: 农烁颖Land

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project address: https://gitcode.com/GitHub_Trending/cr/crawlee-python

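Many developers need a crawler in Crawlee-Python that keeps running instead of exiting after a single pass. Such a crawler periodically checks for new requests and runs the scraping tasks, which makes it a good fit for monitoring changes in website content or for processing a dynamically generated list of URLs.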

Core Implementation Principle

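With asyncio, Python's asynchronous programming framework, we can build a crawler loop that never exits. On each pass, the loop performs the following steps:

Fetch the list of URLs to crawl
Process those URLs with BeautifulSoupCrawler
Sleep for a set interval, then repeat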
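
The three steps above can be sketched independently of Crawlee using only the standard library. This is a minimal illustration, not the project's API: `run_forever`, `fetch_urls`, `process`, and `max_cycles` are hypothetical names introduced here, and `max_cycles` exists only so the demo terminates (pass `None` for a loop that never exits).

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable


async def run_forever(
    fetch_urls: Callable[[], list[str]],
    process: Callable[[list[str]], Awaitable[None]],
    interval: float,
    max_cycles: int | None = None,
) -> int:
    """Repeat fetch -> process -> sleep; return the number of completed cycles."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        urls = fetch_urls()            # step 1: get the pending URLs
        if urls:
            await process(urls)        # step 2: hand them to the crawler
        cycles += 1
        await asyncio.sleep(interval)  # step 3: wait before the next check
    return cycles


async def demo() -> list[str]:
    # Hypothetical source: three polls, the second one finds nothing new.
    batches = [['https://example.com/a'], [], ['https://example.com/b']]
    processed: list[str] = []

    async def fake_process(urls: list[str]) -> None:
        processed.extend(urls)

    await run_forever(lambda: batches.pop(0), fake_process, interval=0.01, max_cycles=3)
    return processed
```

In a real crawler, `process` would be the place to call the crawler's `run` method. Skipping the call when the list is empty avoids starting a run that has nothing to do.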

Key Implementation

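import asyncio
# Note: in newer Crawlee releases these classes may instead be imported from crawlee.crawlers.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


def get_urls_from_source() -> list[str]:
    # Placeholder (not part of Crawlee): replace with your own logic,
    # e.g. reading from a database, a queue, or an API.
    return ['https://crawlee.dev']

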
async def main() -> None:
    crawler = BeautifulSoupCrawler()
    
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Extract page data
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'headers': {
                'h1': [h1.text for h1 in context.soup.find_all('h1')],
                'h2': [h2.text for h2 in context.soup.find_all('h2')],
                'h3': [h3.text for h3 in context.soup.find_all('h3')],
            }
        }
        await context.push_data(data)

    while True:
        urls = get_urls_from_source()  # custom URL-fetching logic
        await crawler.run(urls)
        await asyncio.sleep(60)  # check once per minute

if __name__ == '__main__':
    asyncio.run(main())
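
The listing leaves the logic behind `get_urls_from_source()` to the reader. One possible shape, assuming the upstream source may repeat URLs between polls, is a small helper that remembers what it has already handed out and returns only unseen entries. `NewUrlSource` and `get_new_urls` are hypothetical names for illustration and are not part of Crawlee.

```python
class NewUrlSource:
    """Hypothetical helper: returns only URLs it has not returned before."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def get_new_urls(self, candidates: list[str]) -> list[str]:
        # Keep the input order while dropping anything already handed out,
        # including duplicates within the same batch.
        fresh = []
        for url in candidates:
            if url not in self._seen:
                self._seen.add(url)
                fresh.append(url)
        return fresh
```

The seen-set here lives in memory, so it is lost on restart. A long-running deployment would likely persist it, for example in a file or database.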