Mox Loop
Async vs Sync Amazon Data Scraping API: A Complete Developer Guide

TL;DR

  • Sync API: ~5s per call, blocks until result returns. Use for real-time, user-triggered queries.
  • Async API: submits task in <200ms, result delivered via callback. Use for batch/scheduled work.
  • Same credit cost in both modes (1 credit per JSON call).
  • Pangolinfo Scrape API supports both modes under the same token.

The Problem with Defaulting to Sync

Amazon Scrape API calls take around 5 seconds per response—realistic given what the server has to do (load Amazon pages, bypass anti-bot systems, parse DOM, return structured JSON). For a single query, that's totally fine.

For a monitoring system tracking 5,000 ASINs daily? That's 6.9 hours of serial waiting. Add network variance and you're looking at 8–10 hours. Most business use cases can't absorb that delay.
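The back-of-envelope math for that serial wait, assuming ~5 s per call:

```python
# Serial wall-clock time for a daily sync batch at ~5 s per call
tasks = 5_000
seconds_per_call = 5

total_hours = tasks * seconds_per_call / 3600
print(f"{total_hours:.1f} hours")  # ≈ 6.9 hours before network variance
```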

Async Amazon data scraping breaks the dependency between request throughput and client-side wait time. You submit tasks in bulk, the platform processes them concurrently server-side, and results arrive at your callback endpoint. Your client never blocks.


API Endpoint Comparison

|  | Sync | Async |
|---|---|---|
| Endpoint | `POST /api/v1/scrape` | `POST /api/v1/scrape/async` |
| Client wait | ~5 seconds | <200 ms (returns `taskId`) |
| Result delivery | Inline response body | POST to your `callbackUrl` |
| Infrastructure required | None | Public callback endpoint |
| Credits (JSON) | 1/call | 1/call |

Sync Mode: Complete Example

import requests

def sync_scrape(token: str, asin: str) -> dict:
    """
    Sync Amazon product detail scrape.
    Blocks ~5s. Use for real-time queries.
    Raises RuntimeError on an API-level error.
    """
    resp = requests.post(
        "https://scrapeapi.pangolinfo.com/api/v1/scrape",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "url": f"https://www.amazon.com/dp/{asin}",
            "parserName": "amzProductDetail",
            "site": "",
            "content": "",
            "format": "json",
            "bizContext": {"zipcode": "10041"}
        },
        timeout=30  # Must exceed API processing time (~5s)
    )
    resp.raise_for_status()  # surface HTTP-level failures early
    data = resp.json()
    if data.get("code") == 0:
        return data["data"]["json"][0]["data"]["results"][0]
    raise RuntimeError(f"API error: {data.get('message')}")


# Usage
product = sync_scrape("your_token", "B0DYTF8L2W")
print(product["title"])
print(product["star"])        # e.g. "4.5"
print(product["bestSellersRank"])

Available Parser Types

PARSER_NAMES = {
    "amzProductDetail":     "Product detail by ASIN or URL",
    "amzKeyword":           "Keyword search results",
    "amzProductOfCategory": "Category product list (by Node ID)",
    "amzProductOfSeller":   "Seller product list",
    "amzBestSellers":       "Best sellers ranking",
    "amzNewReleases":       "New releases ranking"
}
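Since an unknown parser name still costs a round trip, a small guard can catch typos before a task is submitted. This is a hypothetical helper, not part of the API; it just mirrors the keys listed above:

```python
# Guard against parser-name typos before spending a credit
# (hypothetical helper; mirrors the PARSER_NAMES keys above)
VALID_PARSERS = {
    "amzProductDetail", "amzKeyword", "amzProductOfCategory",
    "amzProductOfSeller", "amzBestSellers", "amzNewReleases",
}

def validate_parser(parser_name: str) -> str:
    """Fail fast on unknown parser names."""
    if parser_name not in VALID_PARSERS:
        raise ValueError(
            f"Unknown parser {parser_name!r}; valid: {sorted(VALID_PARSERS)}"
        )
    return parser_name
```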

Async Mode: Complete Implementation

1. Submit Tasks

import requests
import time
from typing import Optional

ASYNC_URL = "https://scrapeapi.pangolinfo.com/api/v1/scrape/async"

def async_submit(
    token: str,
    asin: str,
    callback_url: str,
    parser_name: str = "amzProductDetail",
    zipcode: str = "10041"
) -> Optional[str]:
    """Submit async scraping task. Returns task_id immediately."""
    resp = requests.post(
        ASYNC_URL,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={
            "url": f"https://www.amazon.com/dp/{asin}",
            "callbackUrl": callback_url,
            "zipcode": zipcode,
            "format": "json",
            "parserName": parser_name
        },
        timeout=10  # Submission itself is fast
    )
    data = resp.json()
    if data.get("code") == 0:
        return data["data"]["data"]  # task_id
    print(f"Submit failed for {asin}: {data.get('message')}")
    return None


def bulk_submit(
    token: str,
    asins: list[str],
    callback_url: str,
    rate_delay: float = 0.1  # seconds between submissions
) -> dict[str, str]:
    """Bulk submit. Returns {asin: task_id}."""
    task_map = {}
    for i, asin in enumerate(asins):
        task_id = async_submit(token, asin, callback_url)
        if task_id:
            task_map[asin] = task_id
        if (i + 1) % 500 == 0:
            print(f"Submitted {i+1}/{len(asins)}")
        time.sleep(rate_delay)
    return task_map
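Submission itself can fail transiently (network blips, 5xx responses), and `async_submit` above returns `None` in that case. A retry wrapper with exponential backoff is a common pattern; this is a generic sketch, not part of the API:

```python
import time
from typing import Callable, Optional

def submit_with_retry(
    submit: Callable[[], Optional[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Retry a task submission with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        task_id = submit()
        if task_id is not None:
            return task_id
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return None

# Usage with async_submit from above:
# task_id = submit_with_retry(lambda: async_submit(token, asin, callback_url))
```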

2. Callback Receiver (FastAPI)

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
processed_ids: set[str] = set()  # Use Redis in production

@app.post("/api/callback")
async def receive_amazon_callback(request: Request):
    payload = await request.json()
    task_id = payload.get("taskId")

    # Idempotency guard - critical for production
    if task_id in processed_ids:
        return JSONResponse({"code": 0, "message": "duplicate"})
    processed_ids.add(task_id)

    # Extract results
    results = (
        payload
        .get("data", {})
        .get("json", [{}])[0]
        .get("data", {})
        .get("results", [])
    )
    if results:
        product = results[0]
        print(f"Received: {product.get('title', 'N/A')[:60]}")
        # persist_to_db(task_id, product)

    return JSONResponse({"code": 0, "message": "ok"})

3. Timeout Compensation

def compensate_missing_callbacks(
    token: str,
    task_map: dict[str, str],
    received_ids: set[str]
) -> None:
    """
    For tasks that didn't callback within expected window,
    fetch results via sync API as fallback.
    """
    missing = {asin: tid for asin, tid in task_map.items()
               if tid not in received_ids}

    for asin, task_id in missing.items():
        print(f"Compensating {asin} (task {task_id})")
        result = sync_scrape(token, asin)  # fallback to sync
        if result:
            print(f"Compensated: {result.get('title', 'N/A')[:40]}")
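Tying the three steps together, a scheduled batch run might look like the driver below. This is a hypothetical sketch: `submit_all`, `received_ids`, and `compensate` stand in for `bulk_submit`, however your callback receiver records completed task IDs, and `compensate_missing_callbacks` from above:

```python
import time

def run_batch(
    token: str,
    asins: list[str],
    callback_url: str,
    submit_all,        # e.g. bulk_submit -> {asin: task_id}
    received_ids,      # callable returning the set of task IDs seen so far
    compensate,        # e.g. compensate_missing_callbacks
    wait_seconds: int = 1800,
    poll_interval: int = 60,
) -> None:
    """Submit, wait for callbacks up to a deadline, then compensate stragglers."""
    task_map = submit_all(token, asins, callback_url)
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if set(task_map.values()) <= received_ids():
            break  # every task has called back
        time.sleep(poll_interval)
    compensate(token, task_map, received_ids())
```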

Decision Framework

Daily task volume?
│
├── < 100 tasks/day
│   └── Need instant results?
│       ├── Yes → Sync API ✓
│       └── No  → Sync API ✓ (async overhead not worth it)
│
└── ≥ 100 tasks/day OR scheduled batches
    └── Can deploy public callback server?
        ├── Yes → Async API ✓ (10x+ throughput)
        └── No  → Sync + threading, or AMZ Data Tracker (no-code)
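The "Sync + threading" branch can be sketched with `ThreadPoolExecutor`. This is a generic pattern wrapped around `sync_scrape` from above, not an API feature; keep `max_workers` modest to respect the platform's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def threaded_sync_scrape(scrape, token: str, asins: list[str],
                         max_workers: int = 10) -> dict[str, dict]:
    """Run blocking sync calls in parallel threads. `scrape` is e.g. sync_scrape."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, token, asin): asin for asin in asins}
        for fut in as_completed(futures):
            asin = futures[fut]
            try:
                results[asin] = fut.result()
            except Exception as exc:
                print(f"{asin} failed: {exc}")
    return results
```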

Local Development Setup

To test the async flow locally without deploying a public server:

# Terminal 1: Run your callback receiver
uvicorn main:app --port 5000

# Terminal 2: Expose it publicly via ngrok
ngrok http 5000
# → Forwarding: https://abc123.ngrok.io → http://localhost:5000

# Use the ngrok URL as your callbackUrl for testing
CALLBACK_URL = "https://abc123.ngrok.io/api/callback"

Performance Benchmarks

| Task count | Sync (single-thread) | Async |
|---|---|---|
| 100 | ~8 min | ~2 min |
| 1,000 | ~83 min | ~5–10 min |
| 5,000 | ~7 hrs | ~20–40 min |
| 10,000 | ~14 hrs | ~45–90 min |

Credit cost is identical in both modes. Performance gap grows linearly with volume.



Questions? Drop them in the comments.
