Mox Loop
Async vs Sync Amazon Data Scraping API: A Complete Developer Guide

TL;DR

  • Sync API: ~5s per call, blocks until result returns. Use for real-time, user-triggered queries.
  • Async API: submits task in <200ms, result delivered via callback. Use for batch/scheduled work.
  • Same credit cost in both modes (1 credit per JSON call).
  • Pangolinfo Scrape API supports both modes under the same token.

The Problem with Defaulting to Sync

Amazon Scrape API calls take around 5 seconds per response—realistic given what the server has to do (load Amazon pages, bypass anti-bot systems, parse DOM, return structured JSON). For a single query, that's totally fine.

For a monitoring system tracking 5,000 ASINs daily? That's 6.9 hours of serial waiting. Add network variance and you're looking at 8–10 hours. Most business use cases can't absorb that delay.
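The back-of-envelope math for that serial wait, assuming ~5 s per call:

```python
# Serial wall-clock time for a daily sync batch at ~5 s per call
tasks = 5_000
seconds_per_call = 5

total_hours = tasks * seconds_per_call / 3600
print(f"{total_hours:.1f} hours")  # ≈ 6.9 hours before network variance
```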

Async Amazon data scraping breaks the dependency between request throughput and client-side wait time. You submit tasks in bulk, the platform processes them concurrently server-side, and results arrive at your callback endpoint. Your client never blocks.


API Endpoint Comparison

|  | Sync | Async |
|---|---|---|
| Endpoint | `POST /api/v1/scrape` | `POST /api/v1/scrape/async` |
| Client wait | ~5 seconds | <200 ms (returns `taskId`) |
| Result delivery | Inline response body | POST to your `callbackUrl` |
| Infrastructure required | None | Public callback endpoint |
| Credits (JSON) | 1/call | 1/call |

Sync Mode: Complete Example

import requests

def sync_scrape(token: str, asin: str) -> dict:
    """
    Sync Amazon product detail scrape.
    Blocks ~5s. Use for real-time queries.
    Raises RuntimeError on an API-level error.
    """
    resp = requests.post(
        "https://scrapeapi.pangolinfo.com/api/v1/scrape",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json"
        },
        json={
            "url": f"https://www.amazon.com/dp/{asin}",
            "parserName": "amzProductDetail",
            "site": "",
            "content": "",
            "format": "json",
            "bizContext": {"zipcode": "10041"}
        },
        timeout=30  # Must exceed API processing time (~5s)
    )
    resp.raise_for_status()  # surface HTTP-level failures early
    data = resp.json()
    if data.get("code") == 0:
        return data["data"]["json"][0]["data"]["results"][0]
    raise RuntimeError(f"API error: {data.get('message')}")


# Usage
product = sync_scrape("your_token", "B0DYTF8L2W")
print(product["title"])
print(product["star"])        # e.g. "4.5"
print(product["bestSellersRank"])

Available Parser Types

PARSER_NAMES = {
    "amzProductDetail":     "Product detail by ASIN or URL",
    "amzKeyword":           "Keyword search results",
    "amzProductOfCategory": "Category product list (by Node ID)",
    "amzProductOfSeller":   "Seller product list",
    "amzBestSellers":       "Best sellers ranking",
    "amzNewReleases":       "New releases ranking"
}
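Since an unknown parser name still costs a round trip, a small guard can catch typos before a task is submitted. This is a hypothetical helper, not part of the API; it just mirrors the keys listed above:

```python
# Guard against parser-name typos before spending a credit
# (hypothetical helper; mirrors the PARSER_NAMES keys above)
VALID_PARSERS = {
    "amzProductDetail", "amzKeyword", "amzProductOfCategory",
    "amzProductOfSeller", "amzBestSellers", "amzNewReleases",
}

def validate_parser(parser_name: str) -> str:
    """Fail fast on unknown parser names."""
    if parser_name not in VALID_PARSERS:
        raise ValueError(
            f"Unknown parser {parser_name!r}; valid: {sorted(VALID_PARSERS)}"
        )
    return parser_name
```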

Async Mode: Complete Implementation

1. Submit Tasks

import requests
import time
from typing import Optional

ASYNC_URL = "https://scrapeapi.pangolinfo.com/api/v1/scrape/async"

def async_submit(
    token: str,
    asin: str,
    callback_url: str,
    parser_name: str = "amzProductDetail",
    zipcode: str = "10041"
) -> Optional[str]:
    """Submit async scraping task. Returns task_id immediately."""
    resp = requests.post(
        ASYNC_URL,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={
            "url": f"https://www.amazon.com/dp/{asin}",
            "callbackUrl": callback_url,
            "zipcode": zipcode,
            "format": "json",
            "parserName": parser_name
        },
        timeout=10  # Submission itself is fast
    )
    data = resp.json()
    if data.get("code") == 0:
        return data["data"]["data"]  # task_id
    print(f"Submit failed for {asin}: {data.get('message')}")
    return None


def bulk_submit(
    token: str,
    asins: list[str],
    callback_url: str,
    rate_delay: float = 0.1  # seconds between submissions
) -> dict[str, str]:
    """Bulk submit. Returns {asin: task_id}."""
    task_map = {}
    for i, asin in enumerate(asins):
        task_id = async_submit(token, asin, callback_url)
        if task_id:
            task_map[asin] = task_id
        if (i + 1) % 500 == 0:
            print(f"Submitted {i+1}/{len(asins)}")
        time.sleep(rate_delay)
    return task_map
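Submission itself can fail transiently (network blips, 5xx responses), and `async_submit` above returns `None` in that case. A retry wrapper with exponential backoff is a common pattern; this is a generic sketch, not part of the API:

```python
import time
from typing import Callable, Optional

def submit_with_retry(
    submit: Callable[[], Optional[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Retry a task submission with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(max_attempts):
        task_id = submit()
        if task_id is not None:
            return task_id
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return None

# Usage with async_submit from above:
# task_id = submit_with_retry(lambda: async_submit(token, asin, callback_url))
```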

2. Callback Receiver (FastAPI)

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
processed_ids: set[str] = set()  # Use Redis in production

@app.post("/api/callback")
async def receive_amazon_callback(request: Request):
    payload = await request.json()
    task_id = payload.get("taskId")

    # Idempotency guard - critical for production
    if task_id in processed_ids:
        return JSONResponse({"code": 0, "message": "duplicate"})
    processed_ids.add(task_id)

    # Extract results
    results = (
        payload
        .get("data", {})
        .get("json", [{}])[0]
        .get("data", {})
        .get("results", [])
    )
    if results:
        product = results[0]
        print(f"Received: {product.get('title', 'N/A')[:60]}")
        # persist_to_db(task_id, product)

    return JSONResponse({"code": 0, "message": "ok"})

3. Timeout Compensation

def compensate_missing_callbacks(
    token: str,
    task_map: dict[str, str],
    received_ids: set[str]
) -> None:
    """
    For tasks that didn't callback within expected window,
    fetch results via sync API as fallback.
    """
    missing = {asin: tid for asin, tid in task_map.items()
               if tid not in received_ids}

    for asin, task_id in missing.items():
        print(f"Compensating {asin} (task {task_id})")
        result = sync_scrape(token, asin)  # fallback to sync
        if result:
            print(f"Compensated: {result.get('title', 'N/A')[:40]}")
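Tying the three steps together, a scheduled batch run might look like the driver below. This is a hypothetical sketch: `submit_all`, `received_ids`, and `compensate` stand in for `bulk_submit`, however your callback receiver records completed task IDs, and `compensate_missing_callbacks` from above:

```python
import time

def run_batch(
    token: str,
    asins: list[str],
    callback_url: str,
    submit_all,        # e.g. bulk_submit -> {asin: task_id}
    received_ids,      # callable returning the set of task IDs seen so far
    compensate,        # e.g. compensate_missing_callbacks
    wait_seconds: int = 1800,
    poll_interval: int = 60,
) -> None:
    """Submit, wait for callbacks up to a deadline, then compensate stragglers."""
    task_map = submit_all(token, asins, callback_url)
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        if set(task_map.values()) <= received_ids():
            break  # every task has called back
        time.sleep(poll_interval)
    compensate(token, task_map, received_ids())
```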

Decision Framework

Daily task volume?
│
├── < 100 tasks/day
│   └── Need instant results?
│       ├── Yes → Sync API ✓
│       └── No  → Sync API ✓ (async overhead not worth it)
│
└── ≥ 100 tasks/day OR scheduled batches
    └── Can deploy public callback server?
        ├── Yes → Async API ✓ (10x+ throughput)
        └── No  → Sync + threading, or AMZ Data Tracker (no-code)
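The "Sync + threading" branch can be sketched with `ThreadPoolExecutor`. This is a generic pattern wrapped around `sync_scrape` from above, not an API feature; keep `max_workers` modest to respect the platform's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def threaded_sync_scrape(scrape, token: str, asins: list[str],
                         max_workers: int = 10) -> dict[str, dict]:
    """Run blocking sync calls in parallel threads. `scrape` is e.g. sync_scrape."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrape, token, asin): asin for asin in asins}
        for fut in as_completed(futures):
            asin = futures[fut]
            try:
                results[asin] = fut.result()
            except Exception as exc:
                print(f"{asin} failed: {exc}")
    return results
```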

Local Development Setup

To test the async flow locally without deploying a public server:

# Terminal 1: Run your callback receiver
uvicorn main:app --port 5000

# Terminal 2: Expose it publicly via ngrok
ngrok http 5000
# → Forwarding: https://abc123.ngrok.io → http://localhost:5000

# Use the ngrok URL as your callbackUrl for testing
CALLBACK_URL = "https://abc123.ngrok.io/api/callback"

Performance Benchmarks

| Task count | Sync (single-thread) | Async |
|---|---|---|
| 100 | ~8 min | ~2 min |
| 1,000 | ~83 min | ~5–10 min |
| 5,000 | ~7 hrs | ~20–40 min |
| 10,000 | ~14 hrs | ~45–90 min |

Credit cost is identical in both modes. Performance gap grows linearly with volume.



Questions? Drop them in the comments.
