Mox Loop
From 1M Records/Month to 10M/Day: A Real Case Study in Amazon Data Infrastructure

Who This Is For

If you're building an Amazon seller tool, a data analytics product, or any SaaS where web data is the core value proposition — this case study is probably directly relevant to your stack decisions. It's not theoretical. It's a documented infrastructure migration, including the POC numbers, migration approach, and six-month outcomes.


Background

SellerPulse (anonymized) is a leading Amazon seller tool platform: 32,000+ registered users, ~18% paid conversion, MRR in the $1M+ range. Core product: real-time competitor price monitoring and SP ad placement tracking.

By late 2024, their self-built scraping infrastructure had become the company's biggest operational liability:

  • 200 scraper nodes + 7 dedicated maintenance engineers
  • $12,000/month in IP infrastructure spend
  • SP ad slot capture rate: 62% (38% of their most-valued data type was simply missing)
  • Average data latency: 52 hours (their marketing said "real-time")
  • Quarterly outages triggered by Amazon anti-scraping policy updates

The Infrastructure Problem, Technically

Amazon's anti-bot systems underwent a significant architectural shift in 2024. Before this, IP rotation + reasonable request throttling was largely sufficient. Post-2024, Amazon deployed behavioral fingerprinting and session continuity verification — meaning that even with rotating IPs, requests exhibiting non-human session patterns (wrong resource loading order, atypical scroll behavior, browser fingerprint anomalies) are systematically identified and served degraded content.

For data center IPs specifically, this produces a systemic SP ad capture problem: Amazon's ad auction system serves authentic ad results based on perceived request legitimacy. Requests from data center ranges get systematically stripped of ad content or receive placeholder responses. That's the architectural reason SellerPulse's capture rate was stuck at 62% — not bad engineering, but a fundamental infrastructure constraint.

The fix requires:

  1. Residential-grade IP infrastructure (looks like real user traffic to Amazon's systems)
  2. Full browser rendering (JS execution, realistic resource loading sequence, scroll simulation)
  3. Session continuity (consistent browser fingerprint across a session lifecycle)

These three requirements together are expensive and complex to maintain at scale for a company whose core competency is building a seller tool product, not running residential proxy networks.
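Requirement 3 is the least obvious of the three, so a concrete sketch helps: the fingerprint attributes must be chosen once per session and then held constant across every request in that session. This is an illustrative stand-alone model (the pools and class are mine, not Pangolinfo's API):

```python
import random
import uuid
from dataclasses import dataclass, field

# Illustrative pools -- a real deployment would use a maintained list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
VIEWPORTS = [(1920, 1080), (1366, 768), (1440, 900)]

@dataclass
class SessionProfile:
    """Fingerprint attributes held constant for a session's lifetime."""
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    user_agent: str = field(default_factory=lambda: random.choice(USER_AGENTS))
    viewport: tuple = field(default_factory=lambda: random.choice(VIEWPORTS))
    accept_language: str = "en-US,en;q=0.9"

    def headers(self) -> dict:
        # Same headers on every request in the session: rotating these
        # mid-session is exactly the pattern behavioral fingerprinting flags.
        return {
            "User-Agent": self.user_agent,
            "Accept-Language": self.accept_language,
        }
```

Randomize *between* sessions, never *within* one.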


The Solution: Pangolinfo Scrape API

After a two-week POC on their actual production URL sets, the numbers:

| Metric | Self-built | Pangolinfo |
| --- | --- | --- |
| SP ad slot capture rate | 62% | 97.3% |
| Data latency (P95) | 52 hours | 18 minutes |
| Zip-code targeting | Not supported | Supported natively |
| Customer Says field | Not supported | 91.2% coverage |
| Pricing model | Fixed high cost | Pay-per-use |

The zip-code and Customer Says findings were surprises. SellerPulse hadn't been able to do region-specific ad collection at all — Pangolinfo's native geographic targeting support opened entirely new product feature directions. The Customer Says field (Amazon's AI-generated review sentiment summary) had been completely inaccessible to self-managed scrapers; at 91%+ coverage it immediately became a candidate for a new competitive intelligence feature.


The Implementation: 90-Day Phased Migration

Core Approach: Traffic Splitting + Parallel Comparison

No downtime. No big-bang cutover. Traffic splitting from day one.

# Simplified traffic routing logic
import random

def route_scrape_request(url: str, pangolinfo_ratio: float = 0.20) -> str:
    """
    Route scraping requests between legacy and Pangolinfo systems.
    pangolinfo_ratio increases from 0.20 → 1.00 over 60 days.
    """
    if random.random() < pangolinfo_ratio:
        return "pangolinfo"
    return "legacy"

# Week 1-2: 20% Pangolinfo
# Week 3-4: 40% Pangolinfo  
# Week 5-6: 60% Pangolinfo
# Week 7-8: 80% Pangolinfo
# Week 9+:  100% Pangolinfo (core pipelines)
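
The week-by-week schedule in the comments can also be computed rather than hand-edited. A minimal sketch (the 14-day step size is taken from the schedule above; the function name is mine):

```python
def pangolinfo_ratio(days_since_start: int) -> float:
    """Deterministic ramp: 20% -> 100% in 20-point steps every 14 days,
    matching the Week 1-2 through Week 9+ schedule."""
    step = days_since_start // 14  # zero-based two-week period
    return min(1.0, 0.20 * (step + 1))
```

Feeding this into `route_scrape_request` removes the need to redeploy for each ramp step.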

Parallel Comparison Layer

Both systems run simultaneously during migration; a comparison service logs divergence for manual review:

import hashlib
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComparisonResult:
    url: str
    legacy_fingerprint: Optional[str]
    pangolinfo_fingerprint: Optional[str]
    match: bool
    divergent_fields: list

def compare_outputs(legacy_data: dict, pangolinfo_data: dict, key_fields: list) -> ComparisonResult:
    """Compare outputs from both systems on key fields."""

    def fingerprint(data, fields):
        if not data:
            return None  # distinguish "no data" from a hash of an empty dict
        subset = {k: data.get(k) for k in fields}
        return hashlib.md5(json.dumps(subset, sort_keys=True).encode()).hexdigest()

    lf = fingerprint(legacy_data, key_fields)
    pf = fingerprint(pangolinfo_data, key_fields)

    divergent = []
    if legacy_data and pangolinfo_data:
        for field in key_fields:
            if legacy_data.get(field) != pangolinfo_data.get(field):
                divergent.append(field)

    return ComparisonResult(
        url=legacy_data.get("url", "") if legacy_data else "",
        legacy_fingerprint=lf,
        pangolinfo_fingerprint=pf,
        match=(lf is not None and lf == pf),
        divergent_fields=divergent
    )
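
During the parallel run, per-field divergence rates are more actionable than raw mismatch logs: they tell you which fields to review first. A minimal aggregation sketch over the `divergent_fields` lists produced by `compare_outputs` (the function name is mine):

```python
from collections import Counter

def divergence_report(divergent_field_lists: list) -> dict:
    """Per-field divergence rate across a batch: each element is the
    divergent_fields list from one ComparisonResult."""
    total = len(divergent_field_lists)
    if total == 0:
        return {}
    counts = Counter()
    for fields in divergent_field_lists:
        counts.update(fields)
    # Fraction of compared URLs on which each field diverged
    return {name: n / total for name, n in counts.items()}
```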

Phase 3 — New Feature Rollout (Days 61-90)

With the core pipeline complete, features that had been blocked by infrastructure constraints shipped in rapid succession:

# Customer Says extraction — previously impossible with self-built scrapers
def extract_customer_says(product_data: dict) -> dict:
    """
    Extract Amazon's AI-generated review summary (Customer Says field)
    Available in Pangolinfo structured JSON output by default
    """
    customer_says = product_data.get("customer_says", {})

    return {
        "positive_attributes": customer_says.get("positive", []),
        "negative_attributes": customer_says.get("negative", []),
        "common_mentions": customer_says.get("common_themes", []),
        "sentiment_score": customer_says.get("overall_sentiment"),
        "mention_count": customer_says.get("review_count", 0)
    }

# Full category New Releases monitoring — previously too expensive to run at scale
def monitor_new_releases(categories: list, api_key: str):
    """Now economically viable at commodity per-record rates"""
    from pangolin_client import PangolinScrapeClient, ScrapeConfig

    client = PangolinScrapeClient(api_key)
    results = []

    for category in categories:
        config = ScrapeConfig(
            url=f"https://www.amazon.com/gp/new-releases/{category}/",
            parse_template="amazon_new_releases",
            extract_fields=["rank", "asin", "title", "price", "rating", "launch_date", "badge"]
        )
        data = client.scrape(config)
        if data:
            results.append(data)

    return results

Results: 6 Months Post-Migration

BEFORE vs AFTER

Daily Records:      29,000          →    12,000,000   (+41,000%)
SP Ad Capture:      62%             →    98.1%         (+36.1ppt)
Latency (avg):      52 hours        →    13 minutes    (-99.6%)
Customer Says:      0%              →    91%           (new capability)
Monthly Cost:       ~$30,000        →    ~$13,500      (-55%)
                    (at 870K/mo)          (at 330M/mo)

Paid Conversion:    18.3%           →    22.7%         (+4.4ppt)
Monthly Churn:      11.8%           →    6.4%          (-5.4ppt)
MRR:                $1.1M           →    $1.57M        (+42.5%)

Engineering (crawler maintenance): -85%
New features shipped (30 days post-migration): 4
Annualized ROI: 14.3x

Dev Notes: What Actually Tripped Us Up

Zip code matters more than you'd expect. Different US zip codes can show materially different prices (shipping cost locality), different ad inventories (regional ad targeting), and different Prime badge availability. For any feature involving geographic competitive analysis, setting geo.zip_code correctly isn't optional.
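In practice this means fanning the same ASIN out across representative zip codes. A sketch of what that looks like: the `geo.zip_code` parameter name follows the note above, but the request shape, template name, and zip list are illustrative assumptions, not from Pangolinfo's docs.

```python
# Representative zip codes in distinct pricing/ad regions
# (illustrative picks, not from the case study)
MONITOR_ZIPS = ["10001", "60601", "90001", "33101", "98101"]

def build_zip_targeted_requests(asin: str, zips=None) -> list:
    """One request config per zip code: prices, ad slots, and Prime
    badges can all differ by region, so each zip is its own capture."""
    zips = zips or MONITOR_ZIPS
    return [
        {
            "url": f"https://www.amazon.com/dp/{asin}",
            "geo": {"zip_code": z},                     # per the note above
            "parse_template": "amazon_product_detail",  # assumed name
        }
        for z in zips
    ]
```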

Customer Says field isn't on every ASIN. Amazon's AI review summaries require a threshold of reviews before generating. Expect ~8-12% of ASINs to return empty for this field — handle gracefully, don't treat as an error.
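"Handle gracefully" can be as simple as a guard that distinguishes "no summary generated yet" from a real extraction failure. A sketch wrapping the same field shape as the Phase 3 extractor:

```python
def safe_customer_says(product_data: dict):
    """Return the parsed summary, or None when Amazon has not generated
    one yet (below the review threshold) -- an expected case, not an error."""
    summary = (product_data or {}).get("customer_says")
    if not summary:
        return None  # expected for roughly 8-12% of ASINs
    return {
        "positive_attributes": summary.get("positive", []),
        "negative_attributes": summary.get("negative", []),
        "sentiment_score": summary.get("overall_sentiment"),
    }
```

Downstream code then branches on `None` instead of raising or retrying.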

Concurrency sweet spot. They settled on semaphore-controlled async at 20 concurrent requests for standard tasks, bumped to 50 during peak events with Pangolinfo's expanded allocation. Error rate at 20: <0.1%. Setting it higher without coordination with Pangolinfo's account team caused transient throttling.
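The semaphore pattern they describe is standard asyncio. A self-contained sketch (`fetch_one` is a stand-in for the real API call, not Pangolinfo's client):

```python
import asyncio

async def fetch_one(url: str) -> dict:
    """Stand-in for an async call to the scrape API."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    return {"url": url, "ok": True}

async def fetch_all(urls: list, max_concurrency: int = 20) -> list:
    """Cap in-flight requests (20 standard; 50 during peak events,
    coordinated with the account team to avoid throttling)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))
```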

Parse templates vs. manual field extraction. For common scenarios (bestsellers, search results, product detail pages), Pangolinfo's prebuilt parse templates save significant engineering time. For unusual ASINs or edge-case category structures, the extract_fields approach with explicit field targeting is more reliable.


Conclusion

The core engineering takeaway: past roughly 1-2M records/day, self-managed scraping infrastructure for Amazon data becomes increasingly difficult to maintain at production quality levels. The combination of behavioral fingerprinting, session continuity requirements, residential IP needs, and the ongoing anti-bot escalation pattern makes it a moving target that even strong in-house engineering can't consistently hit.

For teams considering a similar migration:

  1. Run your own POC first. The numbers from vendor case studies (including this one) are real, but your URL distribution, category mix, and geo targeting requirements will produce your own baseline. Two weeks of POC on your actual production URLs tells you more than any case study.

  2. Plan for a phased migration. Big-bang cutovers on live data pipelines are unnecessary and risky. Traffic splitting plus parallel comparison is the standard approach for good reason.

  3. Budget for Phase 3. The features you'll unlock post-migration are real business value. Factor them into your ROI calculations.

API documentation: docs.pangolinfo.com
Free trial: pangolinfo.com/scrape-api


Tags: #api #python #ecommerce #webdev #scrapers #amazon #dataengineering #saasbuild
