Amazon Category Traversal: Achieving 95%+ Coverage of Front-end Visible Products

The Real Challenge in E-commerce Data Collection

When building an AI-powered product selection model, you quickly face a frustrating reality: data services claiming "comprehensive coverage" often capture less than 40% of front-end visible products. This isn't a data quality issue—it's a technical ceiling.

After years of building e-commerce data collection systems and hitting countless walls, I finally arrived at a solution that consistently achieves 95%+ coverage of front-end visible products. In this post, I'm sharing the complete technical approach.

Understanding What "Coverage Rate" Really Means

Before diving into the technical solution, we need to clarify a critical question: What exactly does "coverage rate" mean?

Amazon's category database may store millions of ASINs, but these products exist in vastly different states:

  • 30-40% are zombie products - Delisted or permanently out of stock, invisible on the front-end
  • 15-25% are algorithmically hidden - Due to poor reviews, policy violations, or extremely low sales
  • Only 40-55% are truly front-end visible products - What users can actually find through search and filtering

So when a service provider claims "50% coverage" against a baseline of all ASINs in its database, it may actually be collecting less than half of the products that are visible on the front-end.

The coverage rate discussed in this article is explicitly based on "front-end visible products" - those products users can find on Amazon's front-end through normal search, filtering, and browsing. Our goal: Capture everything users can find on the front-end, achieving 95%+ coverage.

Why Traditional Pagination Fails

Most developers' first instinct is simple: start from the category homepage, scrape page by page until hitting Amazon's 400-page limit.

Sounds reasonable, right? But this strategy has a fatal flaw.

Amazon's search ranking algorithm prioritizes high-sales, highly rated products. Newly listed, niche, or specially priced items never appear in the first 400 pages. You think you're doing a full collection, but you're really just scraping the same 20-40% of popular products over and over.

Even worse, Amazon's anti-scraping mechanism is sophisticated. It won't simply block your IP—instead, it starts returning incomplete product lists. Your code still runs normally, HTTP status codes remain 200, but the number of returned products quietly decreases.

The Core Solution: Parameter Combination Strategy

To achieve true category traversal, the key is understanding Amazon's data layering logic.

Amazon's product database isn't flat—it dynamically generates search results through combinations of parameters across multiple dimensions. Products in the same category can be segmented by price range, brand, rating, Prime eligibility, and dozens of other dimensions.

Here's a concrete example:

Suppose a category has 100,000 products. Direct pagination can only capture the first 8,000 (400 pages × 20 items/page). But if you split the price range into 10 tiers, then split each tier by rating into 5 levels, you theoretically get 50 different product subsets, each with up to 8,000 items.
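
As a quick sanity check on that arithmetic (all numbers are the hypothetical ones from the example above):

# Hypothetical numbers from the example above
items_per_page = 20
max_pages = 400
price_tiers = 10
rating_levels = 5

plain_pagination = max_pages * items_per_page                     # 8,000 reachable products
combined_subsets = price_tiers * rating_levels                     # 50 product subsets
with_combinations = combined_subsets * max_pages * items_per_page  # 400,000 reachable slots

print(plain_pagination, with_combinations)  # 8000 400000 -> enough headroom to cover 100,000 products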

Through extensive testing, I found these four parameter dimensions work best:

  • Price range (price)
  • Brand filtering (rh)
  • Rating range (avg_review)
  • Prime status (prime)

Practical Implementation: Electronics Category Case Study

Let me demonstrate with Amazon US's Electronics category (node ID: 172282), which contains over 5 million products.

Step 1: Get Category Metadata

import requests
from bs4 import BeautifulSoup

def get_category_metadata(node_id):
    url = f"https://www.amazon.com/s?i=specialty-aps&bbn={node_id}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    }

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract price ranges and brands from the refinement sidebar.
    # The selectors are illustrative; Amazon's markup changes frequently,
    # so verify them against the live page.
    price_ranges = [a.get_text(strip=True)
                    for a in soup.select('a[href*="p_36"]')]   # price refinement links
    brands = [a.get_text(strip=True)
              for a in soup.select('a[href*="p_89"]')]         # brand refinement links

    return {
        'price_ranges': price_ranges,
        'brands': brands[:30]   # cap the brand list to keep the combination matrix small
    }

Step 2: Build Parameter Combination Matrix

Don't take the full Cartesian product of all parameters; that would generate tens of thousands of requests, most of them invalid or empty. A more efficient method is hierarchical traversal: first split by price range, then subdivide each range by brand.
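
A minimal sketch of that hierarchical traversal, assuming `price_ranges` comes back as (low, high) pairs from `get_category_metadata` above. The query parameters (`low-price`, `high-price`, and the `p_89` brand refinement inside `rh`) are common on Amazon search URLs but should be verified against live pages before relying on them:

from urllib.parse import quote

def build_filter_urls(node_id, price_ranges, brands):
    """First level: one URL per price tier; second level: one URL per (price tier, brand) pair."""
    base = f"https://www.amazon.com/s?i=specialty-aps&bbn={node_id}"
    urls = []

    for low, high in price_ranges:
        price_url = f"{base}&low-price={low}&high-price={high}"
        urls.append(price_url)

        for brand in brands:
            # Brand refinement goes into the rh parameter (p_89 is the brand facet)
            urls.append(f"{price_url}&rh=p_89:{quote(brand)}")

    return urls

# Each URL is then fed into smart_pagination() from Step 3, e.g.:
# urls = build_filter_urls("172282", [(0, 25), (25, 50), (50, 100)], ["Sony", "Anker", "Logitech"])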

Step 3: Smart Pagination Logic

The key is recognizing when to stop pagination. I monitor the product duplication rate between adjacent pages:

def smart_pagination(base_url, max_pages=400):
    seen_asins = set()
    all_products = []
    duplicate_threshold = 0.3
    page = 1

    while page <= max_pages:
        products = fetch_page_products(f"{base_url}&page={page}")
        if not products:          # empty page: this filtered result set is exhausted
            break

        current_asins = {p['asin'] for p in products}
        duplicate_rate = len(current_asins & seen_asins) / len(current_asins)

        # Stop once most ASINs on the page have already been seen elsewhere
        if duplicate_rate > duplicate_threshold and page > 10:
            break

        all_products.extend(p for p in products if p['asin'] not in seen_asins)
        seen_asins.update(current_asins)
        page += 1

    return all_products

Three Key Technologies for 95%+ Coverage

1. Information Gain-Based Parameter Selection

Not all parameter combinations are valuable. My strategy calculates each parameter's contribution to coverage improvement in real-time, prioritizing combinations that bring the most new ASINs.
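
Here is a simplified sketch of that idea as a greedy ranking step. The `probe(combo)` helper is hypothetical: it would fetch only the first few pages of one parameter combination and return the ASINs it sees, so the ranking itself stays cheap:

def rank_by_marginal_gain(candidates, probe):
    """Order parameter combinations by how many new ASINs each one contributes."""
    samples = {combo: probe(combo) for combo in candidates}  # small ASIN sample per combination
    covered, ordered = set(), []

    while samples:
        # Pick the combination whose sample adds the most unseen ASINs
        best = max(samples, key=lambda c: len(samples[c] - covered))
        gain = len(samples[best] - covered)
        if gain == 0:
            break                       # the remaining combinations add nothing new
        covered |= samples.pop(best)
        ordered.append((best, gain))

    return ordered                      # crawl combinations in this order, best first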

2. Bloom Filter Deduplication

When ASIN counts reach millions, traditional set or database deduplication becomes a performance bottleneck. I use Bloom Filters combined with periodic persistence. By properly setting parameters (3 hash functions, 10MB bit array), you can maintain extremely low false positive rates while controlling memory usage.
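
A minimal Bloom filter along those lines, built on the standard library only; the 10MB bit array and 3 hash functions match the parameters mentioned above, while a production setup would more likely combine a maintained library with the periodic persistence described:

import hashlib

class BloomFilter:
    def __init__(self, size_bytes=10 * 1024 * 1024, num_hashes=3):
        self.size_bits = size_bytes * 8          # 10MB -> roughly 83.9 million bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bytes)

    def _positions(self, asin):
        # Derive independent bit positions by salting the hash with the index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{asin}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, asin):
        for pos in self._positions(asin):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, asin):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(asin))

# Usage: only process ASINs the filter has (probably) not seen before
# seen = BloomFilter()
# if asin not in seen:
#     seen.add(asin)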

3. Reverse Validation Mechanism

This is the key to ensuring 95%+ coverage. After traversal completes, I randomly sample products that appear in independent front-end searches under different filter conditions and check whether each one was captured by the traversal.

If a product can be found on the front-end but wasn't captured by our traversal algorithm, it indicates blind spots in parameter combinations requiring supplementary strategies.

Through this continuous validation and optimization, we ensure front-end visible product coverage rate stays consistently above 95%, with the remaining 5% typically being extreme edge cases or temporary products.
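
A sketch of that validation loop. `frontend_search(filter_params)` is a hypothetical helper that returns the ASINs visible under one front-end filter condition; everything else uses only what the traversal already produced:

import random

def estimate_coverage(collected_asins, filter_conditions, frontend_search, sample_size=50):
    """Spot-check coverage by sampling front-end visible ASINs and checking whether we captured them."""
    visible, missed = set(), set()

    for params in random.sample(filter_conditions, min(sample_size, len(filter_conditions))):
        for asin in frontend_search(params):
            visible.add(asin)
            if asin not in collected_asins:
                missed.add(asin)   # blind spot: visible on the front-end but not captured

    coverage = 1 - len(missed) / len(visible) if visible else 1.0
    return coverage, missed        # the missed ASINs show which parameter combinations to add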

Building AI Training Datasets

Obtaining large-scale product data is just the first step. To transform it into high-quality AI training datasets, you need to consider:

1. Data Cleaning
Raw HTML contains a lot of promotional text, emoji, and HTML entity encodings, all of which need normalization.
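
A minimal normalization pass along those lines, using only the standard library; the promotional phrases and emoji ranges below are illustrative, and the real rule set depends on what your downstream model should see:

import html
import re

def clean_text(raw):
    text = html.unescape(raw)                               # decode HTML entities (&amp;, &#39;, ...)
    text = re.sub(r"<[^>]+>", " ", text)                    # drop residual HTML tags
    text = re.sub(r"[\u2600-\u27BF\U0001F300-\U0001FAFF]", "", text)        # strip common emoji ranges
    text = re.sub(r"(?i)limited[- ]time deal|save \d+%|coupon", "", text)   # promo boilerplate (illustrative)
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

# clean_text("Apple AirPods&reg; ⚡ Limited Time Deal")  ->  "Apple AirPods®"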

2. Snapshot Collection
Complete all data collection within as short a time window as possible (24-48 hours), ensuring all product data corresponds to the same time slice.

3. Stratified Sampling
E-commerce categories exhibit long-tail distribution. When building training sets, divide products into tiers based on metrics like sales and review counts, then sample proportionally within each tier.
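
A plain-Python sketch of that tiered sampling, assuming each product dict carries a `review_count` field (the field name and tier cut-offs are hypothetical; use whatever popularity metrics you actually collected):

import random
from collections import defaultdict

def stratified_sample(products, fractions):
    """Sample each tier at its own rate, e.g. fractions = {"head": 0.05, "mid": 0.2, "tail": 0.5}."""
    tiers = defaultdict(list)
    for p in products:
        reviews = p.get("review_count", 0)
        tier = "head" if reviews >= 1000 else "mid" if reviews >= 50 else "tail"  # illustrative cut-offs
        tiers[tier].append(p)

    sample = []
    for tier, items in tiers.items():
        k = max(1, int(len(items) * fractions.get(tier, 0.1)))
        sample.extend(random.sample(items, min(k, len(items))))
    return sample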

Cost-Benefit Analysis

Building an in-house scraper from scratch and getting it to stable operation typically takes 2-3 months, with costs including:

  • Labor: $2,000-4,000/month
  • Proxy IPs: $500-2,000/month
  • Cloud servers: $200-1,000/month
  • Storage: $100-500/month

In contrast, using Pangolin's Scrape API:

  • Total cost: $150-375 (one-time)
  • Integration time: 2-3 days

More important is the time cost: in fast-iterating AI projects, getting training data two months late can mean missing the market window.

Tools and Solutions

For teams needing large-scale e-commerce data collection:

Pangolin Scrape API

  • Multi-platform support: Amazon, Walmart, Shopify, eBay
  • Multiple output formats: Raw HTML, Markdown, structured JSON
  • High concurrency: Supports tens of millions of page collections daily
  • Real-time: Minute-level data updates

AMZ Data Tracker (Zero-code Solution)

  • Browser plugin with visual configuration
  • Collection by ASIN, keyword, store, or ranking
  • Minute-level scheduled tasks
  • Anomaly alerts and automatic Excel generation

Visit: www.pangolinfo.com

Conclusion

Amazon category traversal may look like just one niche of data collection, but the thinking it represents has universal value in the AI era: how to systematically explore a complex data space, how to maximize coverage under resource constraints, and how to turn raw data into structured knowledge.

Remember: Whatever users can find on the front-end, we can completely capture. This is the truly commercially valuable data coverage rate.

Whether you're doing e-commerce product selection, market research, or machine learning, mastering these technologies gives you competitive advantages.


Keywords: Amazon category traversal, full product data collection, category scraping technology, AI training dataset construction, large-scale product data acquisition, Pangolin Scrape API

Author: Pangolinfo E-commerce Data Engineering Team

Published: December 2025

Reading Time: 12 minutes
