Tips for Efficiently Scraping Large Amazon Product Images with Crawl4AI

2025-05-02 22:55:11 Author: 余洋婵Anita

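Collecting high-resolution product images is a common requirement in e-commerce data scraping. Using the Crawl4AI project as an example, this article explains how to scrape images that are loaded dynamically.

Background

When scraping Amazon product pages, developers often find that they can retrieve only the thumbnails and not the large images. This happens because Amazon loads images dynamically: a large image is added to the page only after the user interacts with it, for example by hovering over or clicking a thumbnail.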

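How It Works

Image display on Amazon product pages follows a typical "lazy loading" pattern:

Only thumbnails are shown when the page first loads
When the user interacts with a thumbnail, JavaScript dynamically loads the corresponding large image
Large-image URLs usually contain a distinctive marker such as "SX679_"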
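
The URL marker described above can be checked with a small helper. The following is a minimal sketch, not part of Crawl4AI: the function name `image_width_marker` and the sample URLs are hypothetical, and it assumes the marker always has the form "SX", digits, underscore, as in "SX679_".

```python
import re

# Assumption: the size marker looks like "SX679_" ("SX", digits, underscore).
WIDTH_MARKER = re.compile(r"SX(\d+)_")


def image_width_marker(url: str):
    """Return the number inside the first "SX<digits>_" marker, or None if absent."""
    match = WIDTH_MARKER.search(url)
    if match is None:
        return None
    return int(match.group(1))


# Hypothetical URLs, for illustration only.
thumbnail_url = "https://m.media-amazon.com/images/I/EXAMPLE._AC_US40_.jpg"
large_url = "https://m.media-amazon.com/images/I/EXAMPLE._AC_SX679_.jpg"

print(image_width_marker(thumbnail_url))
print(image_width_marker(large_url))
```

A helper like this makes it possible to filter on any "SX" size rather than hard-coding a single value, should the marker differ between pages.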

Solution

With Crawl4AI, we can inject custom JavaScript code to simulate the user interaction:

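import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode


async def main():
    # headless=False opens a visible browser window, which helps when checking the clicks
    async with AsyncWebCrawler(
            headless=False,
            verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="Amazon product URL",  # placeholder: replace with a real product page URL
            cache_mode=CacheMode.BYPASS,  # bypass the cache so the page is fetched fresh
            js_code="""
                // Helper that resolves after the given number of milliseconds
                const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

                // Scroll to the top so the image gallery is in view
                window.scrollTo(0, 0);

                // Click each thumbnail in turn, pausing so the large image can load
                async function clickWithDelay() {
                    const items = document.querySelectorAll('#altImages .a-button.a-button-thumbnail');

                    for (let item of items) {
                        item.click();
                        await delay(1000);
                    }
                }

                // Start the click sequence (the returned promise is not awaited here)
                clickWithDelay();
            """,
        )

        # Print only the large product images: Amazon image host, .jpg, and the SX679_ marker
        for img in result.media['images']:
            src = img['src']
            if (src.startswith('https://m.media-amazon.com/images/I/')
                    and src.endswith('.jpg')
                    and 'SX679_' in src):
                print(src)


if __name__ == "__main__":
    asyncio.run(main())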
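
The filtering loop at the end of the script can also be pulled out into a reusable function. The sketch below is one possible way to do that, not the article's original method: the name `filter_large_images` and the sample data are hypothetical, and it assumes each entry is a dict with a 'src' key, as in the loop above. It also drops duplicate URLs, since the same image may appear more than once after several thumbnails have been clicked.

```python
AMAZON_IMAGE_PREFIX = "https://m.media-amazon.com/images/I/"


def filter_large_images(images, marker="SX679_"):
    """Return unique large-image URLs, keeping their original order."""
    seen = set()
    urls = []
    for img in images:
        src = img.get("src") or ""  # tolerate missing or empty 'src' values
        is_large = (
            src.startswith(AMAZON_IMAGE_PREFIX)
            and src.endswith(".jpg")
            and marker in src
        )
        if is_large and src not in seen:
            seen.add(src)
            urls.append(src)
    return urls


# Hypothetical sample shaped like result.media['images'], for illustration only.
sample_images = [
    {"src": AMAZON_IMAGE_PREFIX + "AAA._AC_US40_.jpg"},
    {"src": AMAZON_IMAGE_PREFIX + "AAA._AC_SX679_.jpg"},
    {"src": AMAZON_IMAGE_PREFIX + "AAA._AC_SX679_.jpg"},
    {"src": None},
    {"src": AMAZON_IMAGE_PREFIX + "BBB._AC_SX679_.jpg"},
]

for url in filter_large_images(sample_images):
    print(url)
```

In the script above, this would be called as filter_large_images(result.media['images']) in place of the inline condition.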