DEV Community

Mox Loop

Amazon Scraper API Benchmark: 12M Requests Across 4 Platforms — What the Data Actually Shows

TL;DR

  • Self-built scrapers: 71.4% product page success rate at scale, 60% of engineering time on anti-bot maintenance
  • Competitor A: 89.1% product pages, 81.2% SP ad slots, 8,900ms P99 — workable but with real blind spots
  • Pangolinfo: 98.6% product pages, 97.3% SP ad slots, 3,890ms P99, full Customer Says extraction
  • Cost delta: ~¥33,500/month savings at 100K pages/day vs self-built
  • Key limitation: Non-Amazon platform (Walmart/Shopee) maturity gaps, English docs update faster than Chinese

Full benchmark methodology and data below. Code examples included.


Why I Ran This Test

Our team had been running a self-built Amazon scraping infrastructure for about eight months when the maintenance burden became impossible to ignore. Not because anything was catastrophically broken — but because the engineering economics had quietly inverted.

Amazon's anti-bot infrastructure in 2025 is not the problem it was in 2022. Behavioral fingerprinting, CAPTCHA rotation, JavaScript rendering detection, headless browser identification — each layer requires dedicated engineering to counter, and Amazon ships updates to these systems constantly. We were spending 60% of engineering hours on anti-scraping countermeasures. The remaining 40% went to actual business logic. Forrester Research's 2024 benchmark puts average self-built scraper maintenance at 40–60 hours per month; we were running hot relative to that baseline.

The decision to evaluate commercial Amazon scraper API alternatives was ultimately an engineering economics decision, not a technical one.


Test Setup

Three systems ran simultaneously across the full 60-day period: Pangolinfo Scrape API, Competitor A (commercial reasons preclude naming), and our self-built infrastructure.

All requests were production traffic — real business data needs, not synthetic benchmark loads. Coverage:

  • Amazon product detail pages (ASIN-level)
  • Search results pages (keyword + category)
  • BSR list pages (Best Sellers, New Releases, Movers & Shakers)
  • Sponsored Products ad slot data
  • Review pages including Customer Says AI summary module
  • Platforms: Amazon US, UK, Japan; Walmart US

Primary metrics: collection success rate, P50–P99 latency, JSON output completeness, SP ad slot capture rate, Customer Says field integrity.
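For context, the per-request bookkeeping behind these metrics is simple; here's a minimal sketch of the tallying logic (simplified from our harness — the `fetch` callable stands in for each provider's client):

```python
import time
import statistics

def timed_request(fetch, url: str) -> tuple:
    """Run one scrape attempt, returning (success, latency_ms)."""
    start = time.perf_counter()
    try:
        ok = bool(fetch(url))  # provider-specific call; truthy on parseable response
    except Exception:
        ok = False
    latency_ms = (time.perf_counter() - start) * 1000
    return ok, latency_ms

def summarize(samples: list) -> dict:
    """Aggregate (success, latency_ms) tuples into the reported metrics."""
    latencies = sorted(ms for ok, ms in samples if ok)
    success_rate = 100 * sum(ok for ok, _ in samples) / len(samples)
    # statistics.quantiles with n=100 yields 99 cut points; index k-1 ≈ Pk
    quantiles = statistics.quantiles(latencies, n=100)
    return {
        "success_rate": round(success_rate, 1),
        "P50": quantiles[49],
        "P99": quantiles[98],
    }
```

Success is counted over all attempts (including failures); percentiles are computed over successful responses only, which is worth keeping in mind when comparing latency numbers across providers with different failure rates.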


Results

Success Rates

60-day averages:

results = {
    "product_detail_page": {
        "pangolinfo": 98.6,
        "competitor_a": 89.1,
        "self_built": 71.4,  # drops lower during CAPTCHA events
    },
    "search_results_page": {
        "pangolinfo": 97.2,
        "competitor_a": 84.3,
        "self_built": 62.8,
    },
    "review_pages": {
        "pangolinfo": 96.8,
        "competitor_a": 79.6,
        "self_built": 55.1,
    },
    "sp_ad_slots": {
        "pangolinfo": 97.3,
        "competitor_a": 81.2,
        "self_built": 38.4,  # multiple CAPTCHA-triggered outages in this period
    },
    "bsr_list_pages": {
        "pangolinfo": 98.1,
        "competitor_a": 86.7,
        "self_built": 65.2,
    }
}

The self-built ad slot number (38.4%) includes several CAPTCHA-triggered outage windows; the stable-period average was approximately 55%. I report the 60-day average anyway, because outage events are operational reality, not edge cases, which makes it the more honest metric.

Response Latency (Amazon Product Detail Pages, ms)

latency_ms = {
    "P50":  {"pangolinfo": 890,   "competitor_a": 1450},
    "P75":  {"pangolinfo": 1240,  "competitor_a": 2100},
    "P90":  {"pangolinfo": 1780,  "competitor_a": 3200},
    "P95":  {"pangolinfo": 2340,  "competitor_a": 4800},
    "P99":  {"pangolinfo": 3890,  "competitor_a": 8900},
}

P50 gap (890ms vs 1450ms) is real but not operationally decisive for most workflows. P99 gap (3,890ms vs 8,900ms) is the constraining number — it defines what SLA you can responsibly promise for time-sensitive workflows. Pangolinfo's ceiling is less than half Competitor A's.
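Why the P99 dominates SLA math: in a batch of parallel requests, the slowest response sets completion time, and the odds of hitting at least one P99-or-worse response grow quickly with batch size. A quick back-of-envelope, assuming independent latencies:

```python
def p_hit_tail(batch_size: int, tail_prob: float = 0.01) -> float:
    """Probability that at least one request in a batch lands at P99 or slower."""
    return 1 - (1 - tail_prob) ** batch_size

for n in (10, 100, 500):
    print(f"batch of {n}: {p_hit_tail(n):.0%} chance of hitting the P99 tail")
# → ~10% at 10, ~63% at 100, ~99% at 500
```

At any realistic batch size, your batch completion time is effectively the provider's P99, which is why the 3,890ms vs 8,900ms gap matters more than the P50 gap.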

SP Ad Slot Capture — The Key Differentiator

100K dedicated SP ad slot requests. Results:

  • Pangolinfo: 97.3% (official claim 98%; 0.7pp gap, within expected variance)
  • Competitor A: 81.2%
  • Difference: 16.1 percentage points

Practical translation: monitoring 500 keywords daily, Competitor A's feed misses roughly 81 more keyword ad slot positions per day than Pangolinfo's (500 × 16.1pp). Whether those happen to be the keywords where your competitor is making a strategic push is unknowable, which is the point.
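The ~81 figure is just the capture-rate gap applied to the keyword set; the same arithmetic scales to any monitoring footprint:

```python
keywords_monitored = 500
capture_gap = 0.973 - 0.812  # SP ad-slot capture rates from the 100K-request test
extra_daily_misses = keywords_monitored * capture_gap
print(f"≈{extra_daily_misses:.1f} extra ad-slot positions missed per day")  # → ≈80.5
```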


Deep Dive: Customer Says Extraction


Amazon's Customer Says module is an AI-generated review summary that condenses product feedback into structured positive/negative highlights. It's high-information-density data for competitive product positioning. It's also technically difficult to extract reliably.

Technical challenge layers:
- Layer 1 (static HTML): Most scrapers handle this fine
- Layer 2 (JS-rendered content): Requires full browser rendering
- Layer 3 (Customer Says — dynamic conditional loading):
    - Different load triggers per ASIN/category characteristics
    - Amazon-specific protection for this module
    - Structure varies by content type

Pangolinfo Reviews Scraper API: Layer 3 ✓ (complete, stable)
Competitor A: Layer 2 only (Customer Says score: 5/10)
Self-built: Layer 1–2 (Layer 3: occasional partial returns, unreliable)

Working example with Pangolinfo Reviews Scraper API:

import requests
from typing import Optional

def fetch_reviews_with_customer_says(
    asin: str,
    api_key: str,
    marketplace: str = "US",
    star_filter: Optional[list] = None,
    count: int = 20
) -> dict:
    """
    Fetch reviews + Customer Says summary via Pangolinfo Reviews Scraper API.
    Customer Says = Amazon's AI-generated review summary module.
    Competitor A scored 5/10 on this capability in independent testing.
    """
    payload = {
        "asin": asin,
        "marketplace": marketplace,
        "sort": "recent",
        "count": count,
        "include_customer_says": True,  # Key parameter for the AI summary
        "fields": [
            "rating",
            "review_text",
            "review_title",
            "verified_purchase",
            "date",
            "helpful_votes"
        ]
    }

    if star_filter:
        payload["star_filter"] = star_filter  # e.g., [1, 2] for negative reviews only

    response = requests.post(
        "https://api.pangolinfo.com/v1/amazon/reviews",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json=payload,
        timeout=30
    )
    response.raise_for_status()
    data = response.json()

    return {
        # AI-generated summary — the high-value field
        "customer_says": {
            "positive": data.get("customer_says", {}).get("positive", ""),
            "negative": data.get("customer_says", {}).get("negative", ""),
        },
        "reviews": data.get("reviews", []),
        "total_count": data.get("total_review_count"),
        "average_rating": data.get("average_rating")
    }


# Usage: fetch negative reviews + Customer Says for competitor ASIN
result = fetch_reviews_with_customer_says(
    asin="B07EXAMPLE1",
    api_key="YOUR_PANGOLINFO_KEY",
    star_filter=[1, 2],  # Only 1-2 star reviews
    count=50
)

print("Customer Says Positive:", result["customer_says"]["positive"])
# Example: "Customers appreciate the durable build quality and simple setup process."

print("Customer Says Negative:", result["customer_says"]["negative"])  
# Example: "Some customers report size inconsistencies and slow customer support response."

# These two sentences compress dozens of reviews into actionable competitive intelligence

ZIP-Level Collection: Regional Pricing Intelligence

Amazon's Prime delivery zone pricing creates regional variation that cross-border sellers routinely underestimate. Same ASIN, different ZIP codes, potentially different prices, different Prime eligibility states, different delivery estimate messaging.

def compare_regional_pricing(
    asin: str,
    zip_codes: list,
    api_key: str
) -> dict:
    """
    Simulate what shoppers in different regions see for the same product.
    Pangolinfo supports ZIP-level differentiated collection.
    Competitor A: 5/10 support. Self-built: essentially impossible at scale.
    """
    results = {}

    for zip_code in zip_codes:
        response = requests.post(
            "https://api.pangolinfo.com/v1/amazon/product",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "asin": asin,
                "marketplace": "US",
                "zip_code": zip_code,
                "fields": ["price", "prime_eligibility", "delivery_estimate", "availability"]
            },
            timeout=30
        )
        response.raise_for_status()
        data = response.json()
        results[zip_code] = {
            "price": data.get("price", {}).get("current"),
            "prime": data.get("prime_eligibility"),
            "delivery": data.get("delivery_estimate"),
            "in_stock": data.get("availability") == "In Stock"
        }

    return results


# Compare New York (10001) vs Los Angeles (90001) vs Chicago (60601)
regional_data = compare_regional_pricing(
    asin="B07EXAMPLE1",
    zip_codes=["10001", "90001", "60601"],
    api_key="YOUR_KEY"
)

for zip_code, data in regional_data.items():
    print(f"ZIP {zip_code}: ${data['price']} | Prime: {data['prime']} | {data['delivery']}")
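Once the per-ZIP snapshots are in hand, flagging divergence is straightforward. A sketch that consumes the `compare_regional_pricing` output shape above (the 1% threshold is an arbitrary starting point, not a recommendation):

```python
def find_price_divergence(regional_data: dict, threshold_pct: float = 1.0) -> dict:
    """Flag snapshots where regional prices spread beyond a threshold percentage."""
    prices = {z: d["price"] for z, d in regional_data.items() if d["price"] is not None}
    if len(prices) < 2:
        return {"divergent": False, "spread_pct": 0.0}
    low, high = min(prices.values()), max(prices.values())
    spread_pct = 100 * (high - low) / low
    return {
        "divergent": spread_pct > threshold_pct,
        "spread_pct": round(spread_pct, 2),
        "cheapest_zip": min(prices, key=prices.get),
        "priciest_zip": max(prices, key=prices.get),
    }
```

Run nightly over your tracked ASINs, this turns the ZIP-level capability into an alerting feed rather than a manual comparison.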

Cost Model

At 100K pages/day:

monthly_costs = {
    "self_built": {
        "servers": 8_000,         # RMB
        "proxy_ips": 12_000,
        "engineering_hrs": 18_000,  # 60h × ¥300/hr
        "emergency_fixes": 4_000,
        "total": 42_000
    },
    "pangolinfo": {
        "api_fee": 8_500,          # estimated from pricing tiers
        "engineering_ongoing": 0,  # near-zero after initial integration
        "total": 8_500
    },
    "monthly_savings": 33_500
}
# 3-year TCO differential: ~¥1.2M
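The 3-year figure in the comment is the monthly delta annualized, but migration isn't free, so the fairer framing is payback time. A simple model — the one-time integration cost here is my assumption (roughly one engineer-month), not a number from the benchmark:

```python
def payback_months(monthly_savings: float, integration_cost: float) -> float:
    """Months until a one-time migration cost is recouped by monthly savings."""
    return integration_cost / monthly_savings

monthly_savings = 42_000 - 8_500   # ¥33,500/month, from the cost table above
integration_cost = 20_000          # assumed: ~1 engineer-month of migration work
print(f"payback: {payback_months(monthly_savings, integration_cost):.1f} months")
print(f"3-year TCO delta: ¥{monthly_savings * 36 - integration_cost:,}")
```

Even doubling the integration estimate keeps payback comfortably under two months, which is why the decision reduced to engineering economics.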

Honest Limitations

Documentation parity: English API Reference updates faster than Chinese. Chinese-language technical teams should monitor both versions.

Non-Amazon platform maturity: Walmart and Shopee parsing templates have measurable gaps vs Amazon coverage. Specific field availability on Walmart varies by SKU type.

Peak concurrency: At 8M+ pages/day in pressure testing, ~3.2% request queue delays appeared. Imperceptible for normal business scenarios; needs pre-negotiation if you're building financial-grade real-time data infrastructure.
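For the queue-delay tail at high concurrency, a jittered exponential backoff around the request call is usually enough; a generic sketch (error semantics are assumed, not taken from Pangolinfo's docs — check which status codes their API actually uses for queuing):

```python
import time
import random

def with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry fn() with jittered exponential backoff; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # delays of ~0.5s, 1s, 2s... plus jitter to avoid synchronized retry storms
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Usage would wrap the `requests.post` calls from the earlier examples, e.g. `with_backoff(lambda: fetch_reviews_with_customer_says(asin, key))`.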


Have you run your own Amazon scraper API comparisons? I'm particularly curious about Walmart-focused use cases where the Pangolinfo maturity gap I noted might be more or less pronounced than I found. Drop your experience in the comments.

#api #python #ecommerce #datascraping #amazonsellertools
