Pangolinfo

Posted on May 18

Bulk Downloading Amazon Product Images via API and MCP: A Complete Developer Guide

#python #webscraping #automation #api

Tags: #api #python #ecommerce #automation #mcp #webscraping

TL;DR: Amazon product images are injected by JavaScript, so a static request usually returns no usable image data. Bot detection tends to block DIY scrapers within about 48 hours. The reliable path: call the Pangolinfo Scrape API for structured JSON containing every image URL, or connect via MCP and drive collection from Claude or Open Claw without writing code.

The Problem Space

Amazon product image bulk download sounds trivial. In practice it's a multi-layer engineering problem:

Layer 1: Dynamic Rendering. Amazon's product pages populate the image gallery with client-side JavaScript, so the full set of image URLs is not available as plain <img> tags in the initial HTML response. A bare requests.get() call typically returns a skeleton page with no usable images.

Layer 2: Bot Detection. Amazon checks TLS fingerprints (JA3/JA4), HTTP/2 frame parameters, behavioral entropy (mouse patterns, scroll timing), and IP reputation. The median stable window for unmasked automation is under 48 hours.

Layer 3: URL Instability. CDN links on images-amazon.com may carry expiring signature parameters, and each image exists in multiple resolution variants (._SL1500_, ._AC_SL1000_, ._SX679_). URLs therefore need to be normalized and the images downloaded promptly.

Solution A: Pangolinfo Scrape API (Direct Call)

The Pangolinfo Scrape API handles layers 1–3 internally. You send an ASIN and get back structured JSON.

Image response schema

{
  "asin": "B09XYZ1234",
  "main_image": "https://images-amazon.com/images/I/71abc._SL2000_.jpg",
  "images": ["url1", "url2", "..."],
  "aplus_images": ["https://images-amazon.com/images/S/aplus-media/..."],
  "video_thumbnail": "https://..."
}
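
To make the schema concrete, the sketch below flattens one response into typed records that are easier to queue for download. The field names come from the schema above; the `ImageRecord` class and `flatten_images` helper are hypothetical names introduced here, and the assumption that missing fields are simply absent or empty has not been checked against the real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    asin: str
    kind: str  # "main", "gallery", "aplus", or "video_thumbnail"
    url: str


def flatten_images(payload: dict) -> list[ImageRecord]:
    """Turn one product payload into a de-duplicated list of image records."""
    asin = payload.get("asin", "")
    candidates = [("main", payload.get("main_image"))]
    candidates += [("gallery", u) for u in payload.get("images") or []]
    candidates += [("aplus", u) for u in payload.get("aplus_images") or []]
    candidates.append(("video_thumbnail", payload.get("video_thumbnail")))

    seen: set[str] = set()
    records: list[ImageRecord] = []
    for kind, url in candidates:
        # Skip empty fields and URLs already recorded under an earlier kind
        if not url or url in seen:
            continue
        seen.add(url)
        records.append(ImageRecord(asin, kind, url))
    return records


# Placeholder payload (not real API output) to exercise the helper
sample = {
    "asin": "B09XYZ1234",
    "main_image": "https://example.invalid/main.jpg",
    "images": ["https://example.invalid/main.jpg", "https://example.invalid/side.jpg"],
    "aplus_images": [],
}
for record in flatten_images(sample):
    print(record.kind, record.url)
```

The main image often reappears in the gallery list, which is why the helper keeps only the first occurrence of each URL.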

Async bulk collection (Python)

import asyncio
import aiohttp
import json

API_KEY = "your_pangolinfo_api_key"
BASE_URL = "https://api.pangolinfo.com/v2/amazon/product"

async def fetch_images(session: aiohttp.ClientSession, asin: str) -> dict:
    params = {"asin": asin, "marketplace": "US"}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    try:
        async with session.get(BASE_URL, params=params, headers=headers,
                               timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 200:
                data = await resp.json()
                return {
                    "asin": asin,
                    "ok": True,
                    "main": data.get("main_image", ""),
                    "gallery": data.get("images", []),       # up to 9
                    "aplus": data.get("aplus_images", []),
                    "thumb": data.get("video_thumbnail", "")
                }
            return {"asin": asin, "ok": False, "status": resp.status}
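    except asyncio.TimeoutError:
        return {"asin": asin, "ok": False, "status": "timeout"}
    except aiohttp.ClientError as exc:
        # Connection resets, DNS failures, malformed payloads: report the failure
        # instead of raising, so one bad ASIN does not abort asyncio.gather()
        return {"asin": asin, "ok": False, "status": f"client_error: {exc}"}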

async def bulk_fetch(asins: list[str], concurrency: int = 20) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    async def bounded(s, asin):
        async with sem:
            return await fetch_images(s, asin)
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as s:
        tasks = [bounded(s, asin) for asin in asins]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    asins = ["B09XYZ1234", "B08ABC5678", "B07DEF9012"]
    results = asyncio.run(bulk_fetch(asins, concurrency=20))

    ok = [r for r in results if r["ok"]]
    print(f"Success: {len(ok)}/{len(asins)}")
    print(json.dumps(results[:2], indent=2))

Performance: ~3–8 min for 1,000 ASINs at concurrency=20.
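
The figure above implies a per-request latency that can be backed out with simple arithmetic: with a fixed number of requests in flight, total time is roughly requests × latency ÷ concurrency. A rough estimate, assuming every slot stays busy and ignoring retries and rate limiting (both helper names are made up for this illustration):

```python
def implied_latency_s(total_minutes: float, n_requests: int, concurrency: int) -> float:
    """Average per-request latency implied by a total wall-clock time."""
    return total_minutes * 60 * concurrency / n_requests


def estimated_minutes(n_requests: int, latency_s: float, concurrency: int) -> float:
    """Wall-clock estimate when `concurrency` requests are always in flight."""
    return n_requests * latency_s / concurrency / 60


# 1,000 ASINs in 3 to 8 minutes at concurrency=20
low = implied_latency_s(3, 1000, 20)
high = implied_latency_s(8, 1000, 20)
print(f"implied latency: {low:.1f}s to {high:.1f}s per request")

# Projection for a larger job at the same concurrency
print(f"10,000 ASINs: about {estimated_minutes(10_000, low, 20):.0f} "
      f"to {estimated_minutes(10_000, high, 20):.0f} minutes")
```

The same formula shows why raising concurrency only helps until the account's rate limit is reached: past that point, 429 responses add retries that the estimate does not model.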

Normalize image URLs to highest resolution

import re

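def to_hires(url: str, size: int = 2000) -> str:
    """Rewrite an Amazon image URL to request the ._SL{size}_ variant."""
    # Strip any existing size modifier, e.g. ._AC_SL1000_. or ._SX679_.
    url = re.sub(r'\._[A-Z]{2}[^.]*_\.', '.', url)
    # Insert the new modifier before the extension; the lookahead keeps any
    # query string (such as signature parameters) intact
    return re.sub(r'\.(jpe?g|png)(?=$|\?)', rf'._SL{size}_.\1', url)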

# Usage
raw = "https://images-amazon.com/images/I/71abc._AC_SL1000_.jpg"
print(to_hires(raw))  # → ...._SL2000_.jpg
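
Because several size variants can point at the same underlying image, it can help to de-duplicate by base image ID before downloading. A minimal sketch, assuming the ID is the filename segment before the first dot (as in the example URLs in this post); `image_key` and `dedupe_variants` are hypothetical helpers:

```python
from urllib.parse import urlsplit


def image_key(url: str) -> str:
    """Return the filename segment before the first dot, e.g. '71abc'."""
    filename = urlsplit(url).path.rsplit("/", 1)[-1]
    return filename.split(".", 1)[0]


def dedupe_variants(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each base image ID, preserving order."""
    first_seen: dict[str, str] = {}
    for url in urls:
        first_seen.setdefault(image_key(url), url)
    return list(first_seen.values())


urls = [
    "https://images-amazon.com/images/I/71abc._AC_SL1000_.jpg",
    "https://images-amazon.com/images/I/71abc._SX679_.jpg",
    "https://images-amazon.com/images/I/81xyz._SL1500_.jpg",
]
print(dedupe_variants(urls))
```

Running the de-duplication before normalization avoids downloading the same picture once per variant.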

Error handling pattern

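async def fetch_with_retry(session, asin, max_retries=3):
    # Fallback result in case max_retries is 0 and the loop never runs
    result = {"asin": asin, "ok": False, "status": "not_attempted"}
    for attempt in range(max_retries):
        result = await fetch_images(session, asin)
        if result["ok"]:
            return result
        if result.get("status") == 404:
            # ASIN delisted: retrying will not help
            return result
        if result.get("status") == 429 and attempt < max_retries - 1:
            # Rate limited: exponential backoff (1s, 2s, 4s, ...),
            # skipped after the final attempt
            await asyncio.sleep(2 ** attempt)
        # Other errors (timeouts, 5xx): retry immediately
    return result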
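
The backoff above doubles the wait on each attempt. A common refinement is to cap the delay and add random jitter so that many concurrent workers do not retry in lockstep. A sketch of that schedule under assumed values: the `base`, `cap`, and full-jitter strategy are illustrative choices, not Pangolinfo recommendations, and both function names are hypothetical.

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Upper bound for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: pick a random delay between 0 and the capped ceiling."""
    generator = rng if rng is not None else random.Random()
    return generator.uniform(0.0, backoff_ceiling(attempt, base, cap))


# Ceilings grow 1, 2, 4, 8, 16 and then stay at the 30-second cap
for attempt in range(7):
    print(attempt, backoff_ceiling(attempt))
```

In fetch_with_retry, `await asyncio.sleep(backoff_delay(attempt))` could stand in for the fixed `2 ** attempt` wait.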

Solution B: MCP Protocol — No API Code Required

Pangolinfo Amazon Scraper Skill implements the MCP (Model Context Protocol) standard. Once registered as an MCP tool server, it's callable by any MCP-compatible agent.

MCP tool definition (excerpt)

{
  "name": "get_product_images",
  "description": "Fetch image URLs for one or more Amazon ASINs",
  "inputSchema": {
    "type": "object",
    "properties": {
      "asins": { "type": "array", "items": { "type": "string" } },
      "marketplace": { "type": "string", "enum": ["US","UK","DE","JP","CA"], "default": "US" }
    },
    "required": ["asins"]
  }
}
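
Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request whose `arguments` must satisfy the tool's `inputSchema`. The sketch below builds such a request for the definition above and applies a hand-rolled check of the schema's constraints. `build_tool_call` is a hypothetical helper; a real client would normally rely on an MCP SDK rather than assembling messages by hand.

```python
import json

ALLOWED_MARKETPLACES = {"US", "UK", "DE", "JP", "CA"}  # from the enum above


def build_tool_call(asins: list[str], marketplace: str = "US", request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 tools/call message for get_product_images."""
    # "asins" is required and must be an array of strings
    if not isinstance(asins, list) or not all(isinstance(a, str) for a in asins):
        raise ValueError("asins must be a list of strings")
    # "marketplace" is optional but restricted to the enum values
    if marketplace not in ALLOWED_MARKETPLACES:
        raise ValueError(f"unsupported marketplace: {marketplace}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_product_images",
            "arguments": {"asins": asins, "marketplace": marketplace},
        },
    }


print(json.dumps(build_tool_call(["B09XYZ1234", "B08ABC5678"]), indent=2))
```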

Claude Desktop config

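Config file location: ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, or %APPDATA%\Claude\claude_desktop_config.json on Windows. JSON does not allow comments, so the file should contain only the object below.
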
{
  "mcpServers": {
    "pangolinfo-amazon": {
      "command": "npx",
      "args": ["@pangolinfo/amazon-scraper-mcp", "--log-level", "warn"],
      "env": {
        "PANGOLINFO_API_KEY": "your_key_here"
      }
    }
  }
}
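
If the config file already lists other MCP servers, pasting the snippet over it would drop them. A cautious sketch that merges the entry into an existing file instead: the path is supplied by the caller, and `add_mcp_server` is a hypothetical helper, not part of any Claude or Pangolinfo tooling. The demo writes to a temporary directory so it never touches a real config.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Insert or update one entry under "mcpServers", keeping the others."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "claude_desktop_config.json"
    # Simulate a config that already contains another server
    path.write_text('{"mcpServers": {"other": {"command": "node"}}}', encoding="utf-8")
    merged = add_mcp_server(path, "pangolinfo-amazon", {
        "command": "npx",
        "args": ["@pangolinfo/amazon-scraper-mcp", "--log-level", "warn"],
        "env": {"PANGOLINFO_API_KEY": "your_key_here"},
    })
    print(sorted(merged["mcpServers"]))
```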

After this, Claude can call get_product_images in any conversation. Example prompt:

Fetch main image URLs for the top 50 ASINs in this list: [paste list]
Then group them by image type (lifestyle vs. white background vs. infographic)
and tell me which style dominates the category.

Claude calls the MCP tool → gets image data → runs analysis → returns a structured breakdown. No API client code written by you.

Open Claw integration

Install from Tool Marketplace → configure API key → drag the "Get Product Images" node into any workflow → connect to spreadsheet output. Natural language trigger in chat mode also works.

Production Checklist

[ ] Async concurrent requests with semaphore (concurrency 10–20 for standard accounts)
[ ] Exponential backoff on 429 responses
[ ] Skip 404 ASINs (delisted), log for review
[ ] Normalize all image URLs to ._SL2000_.jpg immediately after collection
[ ] Download images to object storage (S3/GCS) promptly — don't rely on CDN URL longevity
[ ] Store metadata (ASIN, image path, collection timestamp) in a queryable DB
[ ] Set up alerts for API success rate < 95% or P99 latency > 5s
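
The last checklist item can be evaluated with a few lines over a window of request logs. A sketch using the nearest-rank method for P99: the thresholds are the ones from the checklist, while the `(ok, latency_seconds)` record shape and the `should_alert` helper are assumptions made for illustration.

```python
def p99(latencies: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(latencies)
    # ceil(0.99 * n) computed with integers to avoid float rounding
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(samples: list[tuple[bool, float]],
                 min_success: float = 0.95, max_p99_s: float = 5.0) -> bool:
    """True when success rate drops below 95% or P99 latency exceeds 5s."""
    if not samples:
        return False
    success_rate = sum(1 for ok, _ in samples if ok) / len(samples)
    latency_p99 = p99([latency for _, latency in samples])
    return success_rate < min_success or latency_p99 > max_p99_s


calm = [(True, 1.0)] * 96 + [(False, 1.0)] * 4      # 96% success, fast
slow = [(True, 1.0)] * 98 + [(True, 6.0)] * 2       # all succeed, slow tail
failing = [(True, 1.0)] * 90 + [(False, 1.0)] * 10  # 90% success
print(should_alert(calm), should_alert(slow), should_alert(failing))
```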
