Building an AI-Driven Web Scraper: Using WebFetcherTool in Chrome MCP Server

2026-02-05 04:59:45 Author: 柏廷章Berta

Chrome MCP Server is a Chrome extension-based Model Context Protocol (MCP) server that exposes your Chrome browser functionality to AI assistants like Claude, enabling complex browser automation, content analysis, and semantic search.

Project repository: https://gitcode.com/gh_mirrors/mc/mcp-chrome

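Still struggling with complex web scraping tasks? JavaScript-rendered dynamic content, anti-scraping measures, and intricate page structures often defeat traditional crawlers. This article shows how to use the WebFetcherTool in Chrome MCP Server to build an AI-driven web scraper. Because it runs inside a real Chrome browser, JavaScript-rendered pages need no separate headless-browser setup. After reading this article, you will be able to:

Understand how WebFetcherTool works
Use an AI assistant to help extract web page content
Filter and parse web page data intelligently
Handle dynamically loaded pages and pages that require complex interaction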

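Introduction to WebFetcherTool

WebFetcherTool is one of the core components of Chrome MCP Server (MCP stands for Model Context Protocol). It lets an AI assistant drive the Chrome browser directly to extract web page content. Compared with traditional crawlers, WebFetcherTool offers the following advantages:

Real browser environment: uses Chrome's rendering engine to handle JavaScript and dynamic content
Intelligent content recognition: a built-in extractor based on the Readability algorithm identifies the main content area automatically
Flexible parameters: supports HTML or text extraction, targeting by CSS selector, and an explicit URL
AI collaboration: integrates directly with AI assistants such as Claude for complex content analysis and decision logic

The core implementation of WebFetcherTool lives in app/chrome-extension/entrypoints/background/tools/browser/web-fetcher.ts. It works through the Chrome extension's background script in cooperation with a helper script injected into the page.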

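How It Works

The WebFetcherTool workflow can be divided into four main stages:

Tab management: decide, based on the supplied URL, whether to reuse an existing tab or create a new one
Script injection: load the helper script and inject it into the target page
Content extraction: extract HTML or text content according to the parameters
Result handling: assemble and return the extracted result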

Tab Management

WebFetcherTool first checks whether the target URL is already open in an existing tab. Reusing that tab saves time and avoids a duplicate request:

if (url) {
  // Check whether the URL is already open in an existing tab
  console.log(`Checking if URL is already open: ${url}`);
  const allTabs = await chrome.tabs.query({});
  
  // Find matching tabs
  const matchingTabs = allTabs.filter((t) => {
    // Normalize both URLs before comparing (strip one trailing slash)
    const tabUrl = t.url?.endsWith('/') ? t.url.slice(0, -1) : t.url;
    const targetUrl = url.endsWith('/') ? url.slice(0, -1) : url;
    return tabUrl === targetUrl;
  });
  
  if (matchingTabs.length > 0) {
    // Reuse the existing tab
    tab = matchingTabs[0];
    console.log(`Found existing tab with URL: ${url}, tab ID: ${tab.id}`);
  } else {
    // Create a new tab
    console.log(`No existing tab found with URL: ${url}, creating new tab`);
    tab = await chrome.tabs.create({ url, active: true });
    
    // Wait for the page to load (a fixed 3-second delay)
    console.log('Waiting for page to load...');
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}
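The excerpt above compares URL strings after stripping one trailing slash and then sleeps for a fixed three seconds. As a minimal sketch of two possible refinements, the helpers below compare parsed URLs and wait for the tab to report a completed load. The names `sameUrl` and `waitForTabComplete` are hypothetical and not part of the project; `tabsApi` stands in for `chrome.tabs`, and the sketch assumes the promise-returning form of `chrome.tabs.get` available to Manifest V3 extensions.

```javascript
// Hypothetical helpers for illustration; not taken from the project source.

// Compare two absolute URLs after parsing, ignoring trailing slashes in the path.
function sameUrl(a, b) {
  const normalize = (raw) => {
    try {
      const u = new URL(raw);
      // Strip trailing slashes from the path, but keep the root path "/".
      u.pathname = u.pathname.replace(/\/+$/, '') || '/';
      return u.toString();
    } catch {
      return null; // not a parseable absolute URL
    }
  };
  const na = normalize(a);
  return na !== null && na === normalize(b);
}

// Resolve once the tab reports status "complete"; reject after timeoutMs.
// tabsApi is expected to look like chrome.tabs (onUpdated, get).
function waitForTabComplete(tabsApi, tabId, timeoutMs = 15000) {
  return new Promise((resolve, reject) => {
    const cleanup = () => {
      clearTimeout(timer);
      tabsApi.onUpdated.removeListener(onUpdated);
    };
    const onUpdated = (updatedTabId, changeInfo) => {
      if (updatedTabId === tabId && changeInfo.status === 'complete') {
        cleanup();
        resolve();
      }
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error(`Tab ${tabId} did not finish loading within ${timeoutMs} ms`));
    }, timeoutMs);
    tabsApi.onUpdated.addListener(onUpdated);
    // The tab may have finished loading before the listener was attached.
    tabsApi.get(tabId).then((tab) => {
      if (tab.status === 'complete') {
        cleanup();
        resolve();
      }
    }).catch(() => {});
  });
}
```

In the extension this might be called as `await waitForTabComplete(chrome.tabs, tab.id)` in place of the fixed delay. A completed load event still does not guarantee that late JavaScript-rendered content is present, so a short extra delay can remain useful on heavily dynamic pages.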

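Content Extraction

WebFetcherTool supports two main extraction modes: HTML extraction and text extraction. Each mode uses its own message type to communicate with the injected script: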

// If HTML content is requested
if (htmlContent) {
  const htmlResponse = await this.sendMessageToTab(tab.id, {
    action: TOOL_MESSAGE_TYPES.WEB_FETCHER_GET_HTML_CONTENT,
    selector: selector,
  });
  
  if (htmlResponse.success) {
    result.htmlContent = htmlResponse.htmlContent;
  } else {
    console.error('Failed to get HTML content:', htmlResponse.error);
    result.htmlContentError = htmlResponse.error;
  }
}

// If text content is requested (and HTML content was not)
if (textContent) {
  const textResponse = await this.sendMessageToTab(tab.id, {
    action: TOOL_MESSAGE_TYPES.WEB_FETCHER_GET_TEXT_CONTENT,
    selector: selector,
  });
  
  if (textResponse.success) {
    result.textContent = textResponse.textContent;
    // Include article metadata if available
    if (textResponse.article) {
      result.article = {
        title: textResponse.article.title,
        byline: textResponse.article.byline,
        siteName: textResponse.article.siteName,
        excerpt: textResponse.article.excerpt,
        lang: textResponse.article.lang,
      };
    }
  }
}
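To make the other side of this exchange concrete, here is a minimal sketch of how an injected script might answer these two messages. It is not the project's helper script: the action strings, the `handleFetchRequest` name, and the response fields beyond those visible in the excerpt are assumptions, and the text branch returns plain `textContent` rather than running Readability.

```javascript
// Hypothetical message names; the real constants live in TOOL_MESSAGE_TYPES.
const GET_HTML = 'webFetcherGetHtmlContent';
const GET_TEXT = 'webFetcherGetTextContent';

// Build the response for one request. `doc` is anything that exposes
// querySelector and documentElement, such as the page's `document`.
function handleFetchRequest(doc, message) {
  const root = message.selector
    ? doc.querySelector(message.selector)
    : doc.documentElement;
  if (!root) {
    return { success: false, error: `No element matches selector: ${message.selector}` };
  }
  switch (message.action) {
    case GET_HTML:
      return { success: true, htmlContent: root.outerHTML };
    case GET_TEXT:
      // Simplification: plain text of the element, with no Readability pass.
      return { success: true, textContent: (root.textContent || '').trim() };
    default:
      return { success: false, error: `Unknown action: ${message.action}` };
  }
}

// In a content script this could be wired up roughly as follows:
// chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
//   sendResponse(handleFetchRequest(document, msg));
// });
```

Keeping the handler a pure function of `doc` and `message` makes it easy to exercise outside the browser with a stub document.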

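The Core of Content Processing: The Readability Algorithm

WebFetcherTool uses a content extractor based on the Arc90 Readability algorithm, located at app/chrome-extension/inject-scripts/web-fetcher-helper.js. The algorithm identifies and extracts the main content of a page in the following steps:

Preprocess the document: remove styles, scripts, and unnecessary elements
Identify candidate blocks: analyze the page structure to find elements likely to hold the main content
Score the candidates: rate each candidate on factors such as text density and link ratio
Extract the best content: select the highest-scoring block as the main content
Post-process: clean the extracted content, fix relative links, and format the output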
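The scoring and selection steps above can be illustrated with link density, the share of a block's text that sits inside links. The sketch below is a simplified model, not the helper script's code: blocks are plain objects rather than DOM nodes, and scaling the raw score by `1 - linkDensity` follows what upstream Mozilla Readability.js does when ranking candidates, which is assumed here to carry over.

```javascript
// A candidate block reduced to what the heuristic needs:
//   text:      all visible text in the block
//   linkTexts: the text of each <a> element inside the block
function linkDensity(block) {
  const total = block.text.length;
  if (total === 0) return 0;
  const linked = block.linkTexts.reduce((sum, t) => sum + t.length, 0);
  return Math.min(linked / total, 1);
}

// Scale a raw content score down by the share of text that is link text.
function adjustedScore(rawScore, block) {
  return rawScore * (1 - linkDensity(block));
}

// Pick the candidate with the highest adjusted score.
// Each candidate is { name, rawScore, block }.
function pickBest(candidates) {
  let best = null;
  for (const c of candidates) {
    const score = adjustedScore(c.rawScore, c.block);
    if (best === null || score > best.score) {
      best = { name: c.name, score };
    }
  }
  return best;
}
```

A navigation bar whose text is almost entirely links ends up with a score near zero even if its raw score is high, which is why menus and footers rarely win over an article body.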

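The core of the Readability algorithm is its content scoring mechanism, which weighs several signals. Among them are these regular expressions, matched against class names and IDs: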

// Regular expressions used for content scoring
REGEXPS: {
  unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
  okMaybeItsACandidate: /and|article|body|column|content|main|shadow/i,
  positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
  negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|footer|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|widget/i,
  // Other regular expressions...
}

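The algorithm judges whether an element holds main content by analyzing its tag, class name, and content characteristics. For example:

The tag type sets a base score. In upstream Mozilla Readability.js, from which these regular expressions appear to derive, container tags such as div start with a bonus while h1-h6 start with a penalty; paragraphs (p) contribute mainly through their text, which is scored and credited to their ancestor elements
A class name or ID containing keywords such as "article" or "content" raises the score
Elements whose class name or ID matches keywords such as "-ad-" or "sidebar" are scored down or excluded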
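The following sketch shows how such patterns could feed into filtering and scoring. It is modeled on upstream Mozilla Readability.js rather than on the helper script itself, so the weights (25 per match, the comma and length bonuses, the 25-character minimum) are assumptions about this project. The patterns are shortened versions of the ones quoted above, and elements are plain `{ className, id }` objects instead of DOM nodes.

```javascript
// Shortened versions of the patterns quoted above, for readability.
const UNLIKELY = /-ad-|banner|comment|footer|header|menu|sidebar|sponsor|popup/i;
const MAYBE = /and|article|body|column|content|main|shadow/i;
const POSITIVE = /article|body|content|entry|main|page|post|text|blog|story/i;
const NEGATIVE = /-ad-|hidden|banner|comment|footer|promo|related|share|sidebar|sponsor|widget/i;

// An element is dropped early when its class/id looks like page chrome
// and nothing in it suggests it might still hold content.
function isUnlikelyCandidate(el) {
  const matchString = `${el.className} ${el.id}`;
  return UNLIKELY.test(matchString) && !MAYBE.test(matchString);
}

// Class name and ID are weighed separately; each can add or subtract 25.
function classWeight(el) {
  let weight = 0;
  for (const value of [el.className, el.id]) {
    if (!value) continue;
    if (NEGATIVE.test(value)) weight -= 25;
    if (POSITIVE.test(value)) weight += 25;
  }
  return weight;
}

// Score one paragraph's text: a base point, one point per comma-separated
// segment, and up to three points for length. Very short text scores 0.
function paragraphScore(text) {
  const trimmed = text.trim();
  if (trimmed.length < 25) return 0;
  return 1 + trimmed.split(',').length + Math.min(Math.floor(trimmed.length / 100), 3);
}
```

For instance, an element with class `main-sidebar` matches both a positive and a negative pattern, so its class weight cancels out to zero, while class `sidebar` combined with ID `comments` is penalized twice.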

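Usage Example: Building an AI-Driven Product Information Scraper

The following practical example shows how to use WebFetcherTool to build an intelligent product information scraper. The scraper will be able to:

Visit product pages on an e-commerce site
Extract the product name, price, and specifications
Analyze user reviews
Generate a structured product report

Basic Usage

The following basic example uses WebFetcherTool to extract product information: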

// Extract basic information from a product page
async function extractProductInfo(url) {
  // Call WebFetcherTool to extract the product page content
  const result = await chrome.runtime.sendMessage({
    action: "invoke-tool",
    tool: "browser.web_fetcher",
    params: {
      url: url,
      textContent: true,
      selector: "#product-details"
    }
  });
  
  if (result.success && result.textContent) {
    // Send the extracted text to the AI for structured analysis
    const structuredData = await chrome.runtime.sendMessage({
      action: "invoke-ai",
      prompt: `Analyze the following product information and extract the name, price, specifications, and rating:\n\n${result.textContent}`,
      format: "json"
    });
    
    return structuredData;
  } else {
    console.error("Failed to extract product information:", result.error);
    return null;
  }
}

// Usage example
extractProductInfo("https://example.com/products/xyz")
  .then(product => console.log("Product information:", product))
  .catch(error => console.error("Error:", error));
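Because the structured data comes back from a language model, it is worth validating before use. The sketch below assumes the reply arrives as a string that contains one JSON object with `name`, `price`, `specs`, and `rating` fields; that shape and the name `parseProductReply` are assumptions made for illustration, not something the tool guarantees.

```javascript
// Hypothetical validator for an AI reply that should contain product JSON.
function parseProductReply(reply) {
  const text = String(reply);
  // Models sometimes wrap JSON in prose or a code fence, so take the
  // outermost {...} span instead of parsing the whole string.
  const start = text.indexOf('{');
  const end = text.lastIndexOf('}');
  if (start === -1 || end <= start) {
    return { ok: false, error: 'No JSON object found in reply' };
  }

  let data;
  try {
    data = JSON.parse(text.slice(start, end + 1));
  } catch (err) {
    return { ok: false, error: `Invalid JSON: ${err.message}` };
  }

  const name = typeof data.name === 'string' ? data.name.trim() : '';
  // Accept a number or a plain numeric string such as "19.99".
  const price = typeof data.price === 'number' ? data.price : Number.parseFloat(data.price);

  if (!name) return { ok: false, error: 'Missing product name' };
  if (!Number.isFinite(price) || price < 0) {
    return { ok: false, error: 'Missing or invalid price' };
  }

  return {
    ok: true,
    product: { name, price, specs: data.specs ?? {}, rating: data.rating ?? null },
  };
}
```

Returning an `{ ok, error }` result instead of throwing lets the caller decide whether to retry the prompt, fall back to a CSS selector, or skip the product.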

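Advanced Usage: Handling Dynamic Content

For content that loads on scroll or expands on click, WebFetcherTool can be combined with other tools, such as the interaction tool: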

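// Handle user reviews that load as the page scrolls
async function extractProductReviews(url) {
  // First, load the page without extracting any content
  await chrome.runtime.sendMessage({
    action: "invoke-tool",
    tool: "browser.web_fetcher",
    params: {
      url: url,
      htmlContent: false,
      textContent: false
    }
  });
  
  // Scroll down to trigger loading of more reviews
  for (let i = 0; i < 3; i++) {
    await chrome.runtime.sendMessage({
      action: "invoke-tool",
      tool: "browser.interaction",
      params: {
        action: "scroll",
        direction: "down",
        distance: 1000
      }
    });
    
    // Wait for the new content to load
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
  
  // Extract all reviews. No url is passed here, on the assumption that
  // the tool then operates on the tab that is already open.
  const reviews = await chrome.runtime.sendMessage({
    action: "invoke-tool",
    tool: "browser.web_fetcher",
    params: {
      textContent: true,
      selector: ".review-item"
    }
  });
  
  if (!reviews.success) {
    console.error("Failed to extract reviews:", reviews.error);
    return null;
  }
  return reviews.textContent;
}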
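Scrolling a fixed three times either wastes time on short pages or stops too early on long ones. One alternative, sketched here under the assumption that the number of loaded items can be counted between scrolls, is to keep scrolling until that count stops growing. `scrollUntilStable` and its two callbacks are hypothetical; in practice `scrollOnce` would wrap the scroll message shown above and `countItems` would count the elements matching the review selector.

```javascript
// Scroll repeatedly until the item count stops increasing or maxRounds is hit.
//   scrollOnce: async function that performs one scroll step
//   countItems: async function that returns how many items are loaded
async function scrollUntilStable(scrollOnce, countItems, { maxRounds = 10, delayMs = 2000 } = {}) {
  let previous = await countItems();
  for (let round = 1; round <= maxRounds; round++) {
    await scrollOnce();
    // Give lazy-loaded content time to arrive before counting again.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    const current = await countItems();
    if (current <= previous) {
      return { rounds: round, items: current };
    }
    previous = current;
  }
  return { rounds: maxRounds, items: previous };
}
```

The `maxRounds` cap matters for infinite feeds, where the count may never stabilize; the returned `rounds` value shows whether the loop ended because loading stopped or because the cap was reached.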