Mox Loop

Bulk Downloading Amazon Product Images via API and MCP: A Complete Developer Guide

Tags: #api #python #ecommerce #automation #mcp #webscraping

TL;DR: Amazon product image galleries are assembled by client-side JavaScript, so static requests miss most of the image data. Bot detection typically blocks DIY scrapers within 48 hours. The reliable path: call Pangolinfo Scrape API for structured JSON with all image URLs, or connect via MCP to drive collection from Claude/Open Claw with zero code.


The Problem Space

Amazon product image bulk download sounds trivial. In practice it's a multi-layer engineering problem:

Layer 1: Dynamic rendering. Amazon's product pages build much of the image gallery with client-side JavaScript. The full set of high-resolution URLs is not available as plain image tags in the initial HTML, so a bare requests.get() call followed by tag parsing returns a skeleton that misses most of the images.

Layer 2: Bot detection. Amazon's defenses combine TLS fingerprinting (JA3/JA4), HTTP/2 frame parameters, behavioral signals (mouse patterns, scroll timing), and IP reputation. The median stable window for unmasked automation is under 48 hours.

Layer 3: URL instability. CDN links on images-amazon.com may carry expiring signature parameters, and each image exists in multiple resolution variants (._SL1500_, ._AC_SL1000_, ._SX679_). URLs therefore need to be normalized and the images downloaded promptly.
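The variant suffixes above sit between the image ID and the file extension, so URLs that differ only in the modifier point at the same underlying image. A minimal sketch of grouping by that ID, assuming the `<id>._MODIFIER_.<ext>` filename layout seen in the examples holds (the helper names `image_id` and `dedupe_variants` are illustrative, not part of any API):

```python
import re
from urllib.parse import urlsplit

# Assumed filename layout: <image-id>[._<MODIFIERS>_].<ext>
_NAME = re.compile(r'^(?P<id>[^.]+)(?:\._[^.]*_)?\.(?P<ext>[A-Za-z0-9]+)$')

def image_id(url: str) -> str:
    """Return the image ID with any size modifier and extension removed."""
    filename = urlsplit(url).path.rsplit('/', 1)[-1]
    match = _NAME.match(filename)
    # Fall back to the raw filename if the layout is not recognized.
    return match.group('id') if match else filename

def dedupe_variants(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each distinct image ID, preserving order."""
    seen: dict[str, str] = {}
    for url in urls:
        seen.setdefault(image_id(url), url)
    return list(seen.values())
```

Deduplicating before download avoids fetching the same picture several times at different sizes.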


Solution A: Pangolinfo Scrape API (Direct Call)

Pangolinfo Scrape API handles layers 1–3 internally. You send an ASIN, get back structured JSON.

Image response schema

{
  "asin": "B09XYZ1234",
  "main_image": "https://images-amazon.com/images/I/71abc._SL2000_.jpg",
  "images": ["url1", "url2", "..."],
  "aplus_images": ["https://images-amazon.com/images/S/aplus-media/..."],
  "video_thumbnail": "https://..."
}
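To make the schema concrete, here is a small sketch that flattens one response into (kind, url) pairs suitable for a download queue. The field names come from the schema above; the function name and the `kind` labels are assumptions made for illustration.

```python
from typing import Iterator

def iter_image_urls(product: dict) -> Iterator[tuple[str, str]]:
    """Yield (kind, url) pairs from one response, skipping blanks and repeats."""
    seen: set[str] = set()

    def emit(kind: str, urls):
        for url in urls:
            if url and url not in seen:
                seen.add(url)
                yield kind, url

    # Single-value fields are wrapped in a list so all fields share one code path.
    yield from emit("main", [product.get("main_image", "")])
    yield from emit("gallery", product.get("images") or [])
    yield from emit("aplus", product.get("aplus_images") or [])
    yield from emit("video_thumbnail", [product.get("video_thumbnail", "")])
```

The main image often reappears as the first gallery entry, which is why repeats are dropped here.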

Async bulk collection (Python)

import asyncio
import aiohttp
import json

API_KEY = "your_pangolinfo_api_key"
BASE_URL = "https://api.pangolinfo.com/v2/amazon/product"

async def fetch_images(session: aiohttp.ClientSession, asin: str) -> dict:
    params = {"asin": asin, "marketplace": "US"}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    try:
        async with session.get(BASE_URL, params=params, headers=headers,
                               timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 200:
                data = await resp.json()
                return {
                    "asin": asin,
                    "ok": True,
                    "main": data.get("main_image", ""),
                    "gallery": data.get("images", []),       # up to 9
                    "aplus": data.get("aplus_images", []),
                    "thumb": data.get("video_thumbnail", "")
                }
            return {"asin": asin, "ok": False, "status": resp.status}
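    except asyncio.TimeoutError:
        return {"asin": asin, "ok": False, "status": "timeout"}
    except aiohttp.ClientError as exc:
        # Connection failures or an unexpected response content type
        return {"asin": asin, "ok": False, "status": f"client_error: {type(exc).__name__}"}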

async def bulk_fetch(asins: list[str], concurrency: int = 20) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    async def bounded(s, asin):
        async with sem:
            return await fetch_images(s, asin)
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as s:
        tasks = [bounded(s, asin) for asin in asins]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    asins = ["B09XYZ1234", "B08ABC5678", "B07DEF9012"]
    results = asyncio.run(bulk_fetch(asins, concurrency=20))

    ok = [r for r in results if r["ok"]]
    print(f"Success: {len(ok)}/{len(asins)}")
    print(json.dumps(results[:2], indent=2))

Performance: ~3–8 min for 1,000 ASINs at concurrency=20.
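As a sanity check on that figure: at a fixed concurrency, total time is roughly the request count times the average latency, divided by the concurrency. Working backwards, 3 to 8 minutes for 1,000 ASINs at concurrency 20 implies an average of about 3.6 to 9.6 seconds per request. A small estimator, assuming steady latency and no rate limiting (both simplifications):

```python
def estimate_seconds(n_requests: int, avg_latency_s: float, concurrency: int) -> float:
    """Rough wall-clock estimate: work is spread evenly over `concurrency` slots."""
    return n_requests * avg_latency_s / concurrency

def implied_latency(total_seconds: float, n_requests: int, concurrency: int) -> float:
    """Invert the estimate to recover the average per-request latency."""
    return total_seconds * concurrency / n_requests

print(implied_latency(3 * 60, 1000, 20))  # 3.6
print(implied_latency(8 * 60, 1000, 20))  # 9.6
```

The same arithmetic helps when sizing a job: halving concurrency to 10 roughly doubles the wall-clock time.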

Normalize image URLs to highest resolution

import re

def to_hires(url: str, size: int = 2000) -> str:
    """Force Amazon image URL to specified resolution."""
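    url = re.sub(r'\._[A-Z]{2}[^.]*_\.', '.', url)  # strip existing modifier
    # Insert the size modifier before the final extension only (.jpg, .png, ...)
    return re.sub(r'\.(jpe?g|png|gif|webp)$', rf'._SL{size}_.\1', url, flags=re.IGNORECASE)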

# Usage
raw = "https://images-amazon.com/images/I/71abc._AC_SL1000_.jpg"
print(to_hires(raw))  # → ...._SL2000_.jpg

Error handling pattern

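async def fetch_with_retry(session, asin, max_retries=3):
    # Defined up front so the function still returns a dict if max_retries is 0
    result = {"asin": asin, "ok": False, "status": "not_attempted"}
    for attempt in range(max_retries):
        result = await fetch_images(session, asin)
        if result["ok"] or result.get("status") == 404:
            # Success, or ASIN delisted (retrying will not help)
            return result
        if result.get("status") == 429 and attempt < max_retries - 1:
            # Rate limited: exponential backoff (1s, 2s, 4s, ...)
            await asyncio.sleep(2 ** attempt)
        # Other errors: retry immediately
    return result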
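Note that bulk_fetch, as written earlier, calls fetch_images directly, so the retry logic only takes effect once it is wired into the worker. One possible way to combine the two is sketched below with an injected `fetch_one` coroutine so the example runs without network access. The `bulk_with_retry` name and the `base_delay` parameter are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[dict]]

async def bulk_with_retry(asins: list[str], fetch_one: Fetcher,
                          concurrency: int = 20, max_retries: int = 3,
                          base_delay: float = 1.0) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def worker(asin: str) -> dict:
        result = {"asin": asin, "ok": False, "status": "not_attempted"}
        for attempt in range(max_retries):
            async with sem:  # hold a slot only while a request is in flight
                result = await fetch_one(asin)
            if result["ok"] or result.get("status") == 404:
                break  # success, or delisted ASIN: retrying will not help
            if attempt < max_retries - 1:
                # Back off outside the semaphore so other ASINs can proceed
                await asyncio.sleep(base_delay * 2 ** attempt)
        return result

    return await asyncio.gather(*(worker(a) for a in asins))
```

With the real client, `fetch_one` could be something like `lambda asin: fetch_images(session, asin)`. Releasing the semaphore during backoff is a design choice: a rate-limited ASIN then does not block a concurrency slot while it waits.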

Solution B: MCP (No API Code Required)

Pangolinfo Amazon Scraper Skill implements the MCP (Model Context Protocol) standard. Once registered as an MCP tool server, it's callable by any MCP-compatible agent.

MCP tool definition (excerpt)

{
  "name": "get_product_images",
  "description": "Fetch image URLs for one or more Amazon ASINs",
  "inputSchema": {
    "type": "object",
    "properties": {
      "asins": { "type": "array", "items": { "type": "string" } },
      "marketplace": { "type": "string", "enum": ["US","UK","DE","JP","CA"], "default": "US" }
    },
    "required": ["asins"]
  }
}
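Under the hood, an MCP client invokes a tool with a JSON-RPC 2.0 `tools/call` request whose `arguments` must satisfy the tool's `inputSchema`. Below is a sketch of what such a request could look like for the definition above. The helper name is hypothetical, and the validation is hand-rolled for this one schema rather than a general JSON Schema validator.

```python
import json

MARKETPLACES = {"US", "UK", "DE", "JP", "CA"}  # enum from the tool definition

def build_tool_call(asins: list[str], marketplace: str = "US", request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for get_product_images."""
    # Stricter than the schema itself: an empty list is rejected here.
    if not asins or not all(isinstance(a, str) for a in asins):
        raise ValueError("asins must be a non-empty list of strings")
    if marketplace not in MARKETPLACES:
        raise ValueError(f"unsupported marketplace: {marketplace}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "get_product_images",
            "arguments": {"asins": asins, "marketplace": marketplace},
        },
    }

print(json.dumps(build_tool_call(["B09XYZ1234"]), indent=2))
```

In normal use the agent builds this message itself; the sketch is only meant to show how the schema maps onto the wire format.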

Claude Desktop config

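// macOS:   ~/Library/Application Support/Claude/claude_desktop_config.json
// Windows: %APPDATA%\Claude\claude_desktop_config.json
// (path labels only; JSON does not allow comments, so leave these lines out of the file)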
{
  "mcpServers": {
    "pangolinfo-amazon": {
      "command": "npx",
      "args": ["@pangolinfo/amazon-scraper-mcp", "--log-level", "warn"],
      "env": {
        "PANGOLINFO_API_KEY": "your_key_here"
      }
    }
  }
}

After this, Claude can call get_product_images in any conversation. Example prompt:

Fetch main image URLs for the top 50 ASINs in this list: [paste list]
Then group them by image type (lifestyle vs. white background vs. infographic)
and tell me which style dominates the category.

Claude calls the MCP tool → gets image data → runs analysis → returns a structured breakdown. No API client code written by you.

Open Claw integration

Install from Tool Marketplace → configure API key → drag the "Get Product Images" node into any workflow → connect to spreadsheet output. Natural language trigger in chat mode also works.


Production Checklist

  • [ ] Async concurrent requests with semaphore (concurrency 10–20 for standard accounts)
  • [ ] Exponential backoff on 429 responses
  • [ ] Skip 404 ASINs (delisted), log for review
  • [ ] Normalize all image URLs to ._SL2000_.jpg immediately after collection
  • [ ] Download images to object storage (S3/GCS) promptly — don't rely on CDN URL longevity
  • [ ] Store metadata (ASIN, image path, collection timestamp) in a queryable DB
  • [ ] Set up alerts for API success rate < 95% or P99 latency > 5s
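
The storage and metadata items can be sketched together. The example below derives a deterministic object key per image and records it in SQLite. The key layout, table name, and column names are assumptions chosen for illustration, and a production setup would likely swap SQLite for whatever database is already in use.

```python
import hashlib
import os
import sqlite3
from datetime import datetime, timezone
from urllib.parse import urlsplit

def object_key(asin: str, url: str) -> str:
    """Deterministic storage key: <asin>/<hash-of-url><ext> (assumed layout)."""
    ext = os.path.splitext(urlsplit(url).path)[1].lower() or ".jpg"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return f"{asin}/{digest}{ext}"

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS product_images (
            asin         TEXT NOT NULL,
            source_url   TEXT NOT NULL,
            object_key   TEXT NOT NULL,
            collected_at TEXT NOT NULL,
            PRIMARY KEY (asin, source_url)
        )""")

def record_image(conn: sqlite3.Connection, asin: str, url: str) -> str:
    """Upsert one metadata row and return the key the image should be stored under."""
    key = object_key(asin, url)
    conn.execute(
        "INSERT OR REPLACE INTO product_images VALUES (?, ?, ?, ?)",
        (asin, url, key, datetime.now(timezone.utc).isoformat()),
    )
    return key
```

Because the key depends only on the ASIN and the URL, re-running a collection overwrites the same object and row instead of creating duplicates.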

