Crawlee Python Crawler Framework: Analyzing Failed Requests

2025-06-07 13:08:54 Author: 袁立春Spencer

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project repository: https://gitcode.com/GitHub_Trending/cr/crawlee-python

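Symptoms

When crawling a particular website with the Crawlee Python framework, a developer found that requests to some URLs returned empty responses. Specifically, certain pages (such as deck pages with a particular URL format) yielded no usable content, while other page types were crawled normally.

Root Cause Analysis

Investigation showed that the root cause was the target site returning HTTP status code 406 (Not Acceptable) for requests to certain paths. This status normally indicates that the server cannot produce a response matching the content characteristics the client requested (for example, via its Accept headers).

The behavior can be confirmed with curl: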

curl -vvv https://www.mtggoldfish.com/deck/6610848
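
The same check can be scripted. The sketch below is not part of the original report: it uses only Python's standard library, and the helper name `fetch_status` is hypothetical. It returns the status code of a GET request instead of raising on 4xx/5xx responses, so a 406 shows up as a plain integer that can be logged or compared.

```python
import urllib.error
import urllib.request
from typing import Mapping, Optional


def fetch_status(
    url: str,
    headers: Optional[Mapping[str, str]] = None,
    timeout: float = 10.0,
) -> int:
    """Return the HTTP status code for a GET request to `url`.

    Hypothetical helper for illustration. urllib raises HTTPError on
    4xx/5xx responses, so the error is caught and its code returned.
    """
    request = urllib.request.Request(url, headers=dict(headers or {}))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `fetch_status("https://www.mtggoldfish.com/deck/6610848")` would be expected to return 406 as long as the site still rejects the default client, though that depends on the server's current behavior. Passing different `headers` makes it easy to see which request characteristics change the result.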

Solutions

1. Use CurlImpersonateHttpClient

Crawlee provides a dedicated HTTP client implementation for dealing with this kind of anti-scraping mechanism:

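# The BeautifulSoupCrawler import path is an assumption; it differs between Crawlee versions.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient

crawler = BeautifulSoupCrawler(
    http_client=CurlImpersonateHttpClient(),
    # other configuration...
)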

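By impersonating the request characteristics of a real browser (such as its TLS fingerprint and headers), this client can get past the anti-scraping checks of some websites.

2. Handling the Windows platform warning

When CurlImpersonateHttpClient is used on Windows, an event loop warning may appear. It can be resolved by setting the selector event loop policy before the crawler starts: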

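import asyncio
import sys

# WindowsSelectorEventLoopPolicy only exists on Windows, so guard the call.
# Set the policy before the event loop is created (before asyncio.run).
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())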

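3. Request rate control

When crawling large amounts of data, the website may rate-limit requests. There are two ways to deal with this:

Method 1: Configure concurrency settings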

from crawlee import ConcurrencySettings

crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(max_tasks_per_minute=60),
    # other configuration...
)
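
To make the effect of `max_tasks_per_minute=60` concrete, here is a minimal standard-library sketch of per-minute pacing. It is not Crawlee's implementation: the class name `MinutePacer` and its injectable `clock` and `sleep` parameters are assumptions made for illustration. The only idea it shows is that N tasks per minute means starting at most one task every 60/N seconds.

```python
import asyncio
import time
from typing import Awaitable, Callable


class MinutePacer:
    """Space out task starts so at most `max_tasks_per_minute` begin per minute.

    Hypothetical illustration only; Crawlee's own scheduling may work differently.
    """

    def __init__(
        self,
        max_tasks_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    ) -> None:
        if max_tasks_per_minute <= 0:
            raise ValueError("max_tasks_per_minute must be positive")
        self.interval = 60.0 / max_tasks_per_minute  # seconds between starts
        self._clock = clock
        self._sleep = sleep
        self._next_start = clock()

    async def wait_for_slot(self) -> float:
        """Sleep until the next free start slot and return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_start - now)
        # Reserve the following slot before sleeping so concurrent callers queue up.
        self._next_start = max(now, self._next_start) + self.interval
        if delay > 0:
            await self._sleep(delay)
        return delay
```

With a limit of 60, a worker that calls `await pacer.wait_for_slot()` before each request starts roughly one request per second, however many workers share the pacer.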

Method 2: Custom retry logic

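import asyncio
import random


class ThrottlingHandler:
    def __init__(self):
        self.wait_time = 5  # seconds to wait before the next retry
        self.throttle_count = 0  # number of throttled responses seen so far

    async def handle(self, context):
        # Any 4xx/5xx status is treated as throttling here.
        if context.http_response.status_code >= 400:
            self.throttle_count += 1
            context.log.warning(f"Request throttled, waiting {self.wait_time} seconds")
            # asyncio.sleep does not block the event loop, unlike time.sleep.
            await asyncio.sleep(self.wait_time)
            self.wait_time += random.randint(5, 10)
            # Raising marks the request as failed so that it can be retried.
            raise RuntimeError("Request throttled")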
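
The handler above grows its wait time additively: it starts at 5 seconds and adds a random 5 to 10 seconds after each throttled response. The sketch below isolates that schedule in a pure generator so it can be inspected without sleeping. The name `additive_backoff` and the `max_wait` cap are hypothetical additions, not part of the original code.

```python
import random
from typing import Iterator, Optional


def additive_backoff(
    initial: int = 5,
    min_step: int = 5,
    max_step: int = 10,
    max_wait: int = 120,
    rng: Optional[random.Random] = None,
) -> Iterator[int]:
    """Yield successive wait times in seconds, growing by a random step each time.

    Mirrors the `wait_time += random.randint(5, 10)` rule; `max_wait` is an
    assumed cap so the delay cannot grow without bound.
    """
    rng = rng or random.Random()
    wait = initial
    while True:
        yield min(wait, max_wait)
        wait += rng.randint(min_step, max_step)
```

For example, `itertools.islice(additive_backoff(), 4)` yields four waits: the first is always 5, and each later one is 5 to 10 seconds longer than the previous until the cap is reached. Passing a seeded `random.Random` makes the schedule reproducible in tests.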