Tags: #api #python #ecommerce #automation #mcp #webscraping
TL;DR: Amazon product images live in JavaScript-rendered DOM nodes, so a static request returns a page skeleton with no image data. Bot detection typically breaks DIY scrapers within 48 hours. The reliable path: call Pangolinfo Scrape API for structured JSON with all image URLs, or connect via MCP to drive collection from Claude/Open Claw with zero code.
The Problem Space
Bulk-downloading Amazon product images sounds trivial. In practice it's a multi-layer engineering problem:
Layer 1 - Dynamic Rendering: Amazon's product pages populate the image gallery with client-side JavaScript. The image elements are not in the initial HTML response, so a requests.get() call returns a skeleton with no images.
Layer 2 - Bot Detection: Traffic is scored on TLS fingerprint (JA3/JA4), HTTP/2 frame parameters, behavioral entropy (mouse patterns, scroll timing), and IP reputation. Median stable window for unmasked automation: under 48 hours.
Layer 3 - URL Instability: CDN links on images-amazon.com may carry expiring signature parameters, and each image exists in multiple resolution variants (._SL1500_, ._AC_SL1000_, ._SX679_). URLs need to be normalized and the files downloaded promptly.
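To make Layer 3 concrete, here is a minimal sketch of how resolution variants of the same image could be grouped before download. The helper names (canonical_key, group_variants) are hypothetical, and the modifier pattern is an assumption based only on the three variants listed above; real URLs may use modifier shapes this pattern does not cover.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed modifier shape: "._" + two capitals + anything without a dot + "_",
# sitting directly before the file extension (e.g. "._AC_SL1000_").
_MODIFIER = re.compile(r"\._[A-Z]{2}[^.]*_(?=\.[A-Za-z0-9]+$)")


def canonical_key(url: str) -> str:
    """Return the URL path with the query string and size modifier removed."""
    path = urlsplit(url).path  # drops any signature/query parameters
    return _MODIFIER.sub("", path)


def group_variants(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs that point at the same underlying image."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "https://images-amazon.com/images/I/71abc._SL1500_.jpg",
        "https://images-amazon.com/images/I/71abc._AC_SL1000_.jpg?sig=x",
        "https://images-amazon.com/images/I/81xyz._SX679_.jpg",
    ]
    for key, variants in group_variants(sample).items():
        print(key, len(variants))
```

Grouping first means each image is downloaded once, at one chosen resolution, instead of once per variant.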
Solution A: Pangolinfo Scrape API (Direct Call)
Pangolinfo Scrape API handles layers 1–3 internally. You send an ASIN, get back structured JSON.
Image response schema
{
  "asin": "B09XYZ1234",
  "main_image": "https://images-amazon.com/images/I/71abc._SL2000_.jpg",
  "images": ["url1", "url2", "..."],
  "aplus_images": ["https://images-amazon.com/images/S/aplus-media/..."],
  "video_thumbnail": "https://..."
}
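For downstream storage it can help to flatten this payload into one row per image. The sketch below assumes only the five fields shown in the schema above; ImageRecord and flatten_images are illustrative names, not part of the Pangolinfo API.

```python
from typing import TypedDict


class ImageRecord(TypedDict):
    asin: str
    kind: str       # "main", "gallery", "aplus", or "video_thumbnail"
    position: int   # index within its kind, starting at 0
    url: str


def flatten_images(payload: dict) -> list[ImageRecord]:
    """Turn one product payload into a flat list of image rows."""
    asin = payload.get("asin", "")
    rows: list[ImageRecord] = []

    def add(kind: str, urls: list) -> None:
        for position, url in enumerate(urls):
            if url:  # skip empty strings and None
                rows.append({"asin": asin, "kind": kind,
                             "position": position, "url": url})

    main = payload.get("main_image")
    add("main", [main] if main else [])
    add("gallery", payload.get("images") or [])
    add("aplus", payload.get("aplus_images") or [])
    thumb = payload.get("video_thumbnail")
    add("video_thumbnail", [thumb] if thumb else [])
    return rows
```

Missing or empty fields simply produce no rows, so the same function works for listings without A+ content or video.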
Async bulk collection (Python)
import asyncio
import json

import aiohttp

API_KEY = "your_pangolinfo_api_key"
BASE_URL = "https://api.pangolinfo.com/v2/amazon/product"


async def fetch_images(session: aiohttp.ClientSession, asin: str) -> dict:
    """Fetch image URLs for one ASIN. Never raises; failures return ok=False."""
    params = {"asin": asin, "marketplace": "US"}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    try:
        async with session.get(BASE_URL, params=params, headers=headers,
                               timeout=aiohttp.ClientTimeout(total=30)) as resp:
            if resp.status == 200:
                data = await resp.json()
                return {
                    "asin": asin,
                    "ok": True,
                    "main": data.get("main_image", ""),
                    "gallery": data.get("images", []),  # up to 9
                    "aplus": data.get("aplus_images", []),
                    "thumb": data.get("video_thumbnail", ""),
                }
            return {"asin": asin, "ok": False, "status": resp.status}
    except asyncio.TimeoutError:
        return {"asin": asin, "ok": False, "status": "timeout"}
    except aiohttp.ClientError as exc:
        # Connection errors or a non-JSON body: report instead of aborting the batch
        return {"asin": asin, "ok": False, "status": f"client_error: {exc}"}


async def bulk_fetch(asins: list[str], concurrency: int = 20) -> list[dict]:
    """Fetch many ASINs with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(s, asin):
        async with sem:
            return await fetch_images(s, asin)

    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as s:
        tasks = [bounded(s, asin) for asin in asins]
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    asins = ["B09XYZ1234", "B08ABC5678", "B07DEF9012"]
    results = asyncio.run(bulk_fetch(asins, concurrency=20))
    ok = [r for r in results if r["ok"]]
    print(f"Success: {len(ok)}/{len(asins)}")
    print(json.dumps(results[:2], indent=2))
Performance: ~3–8 min for 1,000 ASINs at concurrency=20.
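As a back-of-envelope check on that figure, the sketch below assumes the 20 concurrent slots stay busy the whole time and ignores retries and rate limiting. Under that simplification the quoted range implies an average of roughly 3.6 to 9.6 seconds per request. The function names are illustrative.

```python
def implied_latency_s(total_minutes: float, n_asins: int, concurrency: int) -> float:
    """Average per-request latency implied by a total run time."""
    return total_minutes * 60 * concurrency / n_asins


def estimate_minutes(n_asins: int, concurrency: int, avg_latency_s: float) -> float:
    """Projected run time if every slot processes requests back to back."""
    return n_asins / concurrency * avg_latency_s / 60


if __name__ == "__main__":
    low = implied_latency_s(3, 1000, 20)
    high = implied_latency_s(8, 1000, 20)
    print(f"Implied latency: {low:.1f}-{high:.1f} s per request")

    # Projection for a larger batch at the same concurrency
    for latency in (low, high):
        minutes = estimate_minutes(10_000, 20, latency)
        print(f"10,000 ASINs at {latency:.1f} s/request: about {minutes:.0f} min")
```

The same two functions give a quick sanity check when you change concurrency or batch size: halving concurrency roughly doubles the run time under this model.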
Normalize image URLs to highest resolution
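import re

def to_hires(url: str, size: int = 2000) -> str:
    """Force Amazon image URL to specified resolution."""
    # Strip any existing modifier such as ._AC_SL1000_ or ._SX679_
    url = re.sub(r'\._[A-Z]{2}[^.]*_\.', '.', url)
    # Insert the new modifier before the extension only (also handles .png/.webp
    # and leaves any query string untouched)
    return re.sub(r'\.(jpe?g|png|webp)(?=$|\?)', rf'._SL{size}_.\1', url,
                  count=1, flags=re.IGNORECASE)

# Usage
raw = "https://images-amazon.com/images/I/71abc._AC_SL1000_.jpg"
print(to_hires(raw))  # → https://images-amazon.com/images/I/71abc._SL2000_.jpg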
Error handling pattern
async def fetch_with_retry(session, asin, max_retries=3):
    for attempt in range(max_retries):
        result = await fetch_images(session, asin)
        if result["ok"]:
            return result
        if result.get("status") == 429:
            # Rate limited: exponential backoff (1s, 2s, 4s)
            await asyncio.sleep(2 ** attempt)
        elif result.get("status") == 404:
            # ASIN delisted: skip
            return result
        # Other errors: retry
    return result
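One possible refinement, offered as a sketch rather than anything the API requires: when many coroutines hit a 429 at the same moment, a fixed 2 ** attempt delay makes them all retry together. Adding random jitter and a cap spreads the retries out. The backoff_delay helper and its base/cap defaults are assumptions for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0, ceiling)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    for attempt in range(6):
        print(attempt, round(backoff_delay(attempt, rng=rng), 2))
```

In the retry loop this would replace the sleep call with await asyncio.sleep(backoff_delay(attempt)).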
Solution B: MCP Protocol — No API Code Required
Pangolinfo Amazon Scraper Skill implements the MCP (Model Context Protocol) standard. Once registered as an MCP tool server, it's callable by any MCP-compatible agent.
MCP tool definition (excerpt)
{
  "name": "get_product_images",
  "description": "Fetch image URLs for one or more Amazon ASINs",
  "inputSchema": {
    "type": "object",
    "properties": {
      "asins": { "type": "array", "items": { "type": "string" } },
      "marketplace": { "type": "string", "enum": ["US","UK","DE","JP","CA"], "default": "US" }
    },
    "required": ["asins"]
  }
}
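For orientation, this is roughly what an MCP client sends when an agent invokes the tool: a JSON-RPC 2.0 request with method tools/call, carrying the tool name and arguments that match the inputSchema. The sketch only builds and validates the message; the build_tool_call helper is hypothetical, and transport (stdio or HTTP) is left out.

```python
import json
from itertools import count

ALLOWED_MARKETPLACES = {"US", "UK", "DE", "JP", "CA"}  # from the schema's enum
_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def build_tool_call(asins: list, marketplace: str = "US") -> dict:
    """Build a tools/call request for get_product_images after basic validation."""
    if not isinstance(asins, list) or not all(isinstance(a, str) for a in asins):
        raise ValueError("asins must be a list of strings")
    if marketplace not in ALLOWED_MARKETPLACES:
        raise ValueError(f"unsupported marketplace: {marketplace!r}")
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {
            "name": "get_product_images",
            "arguments": {"asins": asins, "marketplace": marketplace},
        },
    }


if __name__ == "__main__":
    request = build_tool_call(["B09XYZ1234", "B08ABC5678"], marketplace="DE")
    print(json.dumps(request, indent=2))
```

In normal use the agent runtime constructs this message for you; seeing it spelled out mainly helps when debugging a server that rejects calls.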
Claude Desktop config
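Config file: ~/Library/Application Support/Claude/claude_desktop_config.json on macOS, %APPDATA%\Claude\claude_desktop_config.json on Windows. JSON does not allow comments, so keep the file to the object below.
{
  "mcpServers": {
    "pangolinfo-amazon": {
      "command": "npx",
      "args": ["@pangolinfo/amazon-scraper-mcp", "--log-level", "warn"],
      "env": {
        "PANGOLINFO_API_KEY": "your_key_here"
      }
    }
  }
}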
After this, Claude can call get_product_images in any conversation. Example prompt:
Fetch main image URLs for the top 50 ASINs in this list: [paste list]
Then group them by image type (lifestyle vs. white background vs. infographic)
and tell me which style dominates the category.
Claude calls the MCP tool → gets image data → runs analysis → returns a structured breakdown. No API client code written by you.
Open Claw integration
Install from Tool Marketplace → configure API key → drag the "Get Product Images" node into any workflow → connect to spreadsheet output. Natural language trigger in chat mode also works.
Production Checklist
- [ ] Async concurrent requests with semaphore (concurrency 10–20 for standard accounts)
- [ ] Exponential backoff on 429 responses
- [ ] Skip 404 ASINs (delisted), log for review
- [ ] Normalize all image URLs to ._SL2000_.jpg immediately after collection
- [ ] Download images to object storage (S3/GCS) promptly; don't rely on CDN URL longevity
- [ ] Store metadata (ASIN, image path, collection timestamp) in a queryable DB
- [ ] Set up alerts for API success rate < 95% or P99 latency > 5s
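The last checklist item can be sketched in a few lines. The example below assumes you keep the result dicts from fetch_images plus a list of per-request latencies in seconds (the fetch code above does not record latencies, so that part is an assumption). The 95% and 5 s thresholds come from the checklist, and check_health is an illustrative name.

```python
import math


def p99(latencies_s: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def check_health(results: list[dict], latencies_s: list[float],
                 min_success: float = 0.95, max_p99_s: float = 5.0) -> list[str]:
    """Return human-readable alerts; an empty list means healthy."""
    if not results:
        return ["no requests recorded"]
    alerts = []
    success_rate = sum(1 for r in results if r.get("ok")) / len(results)
    if success_rate < min_success:
        alerts.append(f"success rate {success_rate:.1%} below {min_success:.0%}")
    if latencies_s:
        observed = p99(latencies_s)
        if observed > max_p99_s:
            alerts.append(f"P99 latency {observed:.1f}s above {max_p99_s:.1f}s")
    return alerts
```

Run it at the end of each batch and forward any returned strings to whatever alerting channel you already use.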