DEV Community

agenthustler

How to Scrape Amazon Product Data in 2026: Prices, Reviews, and Rankings

Amazon is the world's largest product database — over 350 million products with real-time pricing, millions of customer reviews, and Best Sellers Rank (BSR) data that approximates actual sales velocity. For anyone building price comparison tools, market research platforms, or e-commerce analytics, Amazon data is the foundation.

This guide covers how to extract Amazon product data programmatically in 2026, with production-ready Python code and strategies for handling Amazon's aggressive anti-bot systems.

Understanding Amazon's Data Architecture

Amazon organizes products around ASINs (Amazon Standard Identification Numbers) — unique 10-character alphanumeric identifiers. Every product page, review set, and price history is tied to an ASIN.

Key data points available on each product page:

  • Product details: Title, description, bullet points, brand, category
  • Pricing: Current price, list price, deal price, Subscribe & Save price
  • Best Sellers Rank (BSR): Ranking within category — the closest proxy to actual sales volume
  • Reviews: Rating distribution, total count, individual review text
  • Variations: Size/color/style variants with their own prices and availability
  • Seller information: Sold by, fulfilled by, number of sellers, Buy Box winner

Amazon product URLs follow this structure:

https://www.amazon.com/dp/B0XXXXXXXX
https://www.amazon.com/gp/product/B0XXXXXXXX
https://www.amazon.com/Product-Name/dp/B0XXXXXXXX

All three resolve to the same page. The short /dp/ASIN format is the most reliable for scraping.
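Since all three forms carry the ASIN in a fixed position, a small helper can normalize any product URL to the short form. This is a sketch — the regex and function names are my own, not part of any Amazon API:

```python
import re

# Matches the ASIN segment in /dp/, /gp/product/, and /product-reviews/ URLs.
ASIN_RE = re.compile(r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})")


def extract_asin(url):
    """Return the 10-character ASIN from an Amazon product URL, or None."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None


def canonical_url(asin):
    """Build the short /dp/ form — the most reliable for scraping."""
    return f"https://www.amazon.com/dp/{asin}"
```

Normalizing early also helps with deduplication: the long `/Product-Name/dp/` URLs vary with the product title, but the ASIN does not.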

Setting Up Your Scraper

Unlike many sites, Amazon product pages are largely server-rendered HTML. This means you can use httpx (or requests) without needing a full browser — which is faster and cheaper:

import httpx
import random
import time
from selectolax.parser import HTMLParser


class AmazonScraper:
    def __init__(self):
        self.user_agents = [
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (X11; Linux x86_64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        ]

    def _get_headers(self):
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": (
                "text/html,application/xhtml+xml,"
                "application/xml;q=0.9,*/*;q=0.8"
            ),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                response = httpx.get(
                    url,
                    headers=self._get_headers(),
                    follow_redirects=True,
                    timeout=15.0,
                )
                if response.status_code == 200:
                    return response.text
                if response.status_code == 503:
                    # CAPTCHA page — back off
                    time.sleep(random.uniform(10, 20))
                    continue
            except httpx.TimeoutException:
                time.sleep(random.uniform(2, 5))
                continue
        return None

Scraping Product Pages

Product data extraction is mostly about knowing the right CSS selectors. Amazon's HTML is verbose, and while #productTitle has been stable for years, the price markup has churned — the old #priceblock_ourprice IDs have largely given way to span.a-price containers — so the parser below tries several selectors in order:

# The following methods belong to the AmazonScraper class above
def parse_product(self, html, asin):
    tree = HTMLParser(html)

    def text(selector):
        node = tree.css_first(selector)
        return node.text(strip=True) if node else None

    # Extract price — Amazon uses multiple price containers
    price = None
    price_selectors = [
        "span.a-price span.a-offscreen",
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span[data-a-color='price'] span.a-offscreen",
    ]
    for sel in price_selectors:
        price = text(sel)
        if price:
            break

    # Extract BSR from product details
    bsr = None
    details_nodes = tree.css("#detailBulletsWrapper_feature_div li")
    for node in details_nodes:
        node_text = node.text()
        if "Best Sellers Rank" in node_text:
            bsr = node_text.strip()
            break

    # Alternative BSR location (table format)
    if not bsr:
        table_rows = tree.css("#productDetails_detailBulletsTable_techSpec_section tr")
        for row in table_rows:
            if "Best Sellers Rank" in row.text():
                bsr = row.text(strip=True)
                break

    # Extract rating
    rating = text("#acrPopover span.a-size-base")
    review_count = text("#acrCustomerReviewText")

    return {
        "asin": asin,
        "title": text("#productTitle"),
        "price": price,
        "rating": rating,
        "review_count": review_count,
        "bsr": bsr,
        "brand": text("#bylineInfo"),
        "availability": text("#availability span"),
        "bullet_points": [
            li.text(strip=True)
            for li in tree.css("#feature-bullets li span")
        ],
    }


def scrape_product(self, asin):
    url = f"https://www.amazon.com/dp/{asin}"
    html = self.fetch_page(url)
    if not html:
        return None
    return self.parse_product(html, asin)

Extracting Reviews at Scale

Reviews live on a separate URL structure and support pagination. Each page shows 10 reviews by default. Be aware that Amazon has increasingly gated review pages behind sign-in, so unauthenticated pagination may stop earlier than expected:

# Another AmazonScraper method
def scrape_reviews(self, asin, max_pages=10):
    reviews = []

    for page in range(1, max_pages + 1):
        url = (
            f"https://www.amazon.com/product-reviews/{asin}"
            f"?pageNumber={page}&sortBy=recent"
        )
        html = self.fetch_page(url)
        if not html:
            break

        tree = HTMLParser(html)
        review_divs = tree.css('[data-hook="review"]')

        if not review_divs:
            break  # No more reviews

        for div in review_divs:
            title_node = div.css_first('[data-hook="review-title"]')
            body_node = div.css_first('[data-hook="review-body"]')
            rating_node = div.css_first('[data-hook="review-star-rating"]')
            date_node = div.css_first('[data-hook="review-date"]')
            verified_node = div.css_first(
                '[data-hook="avp-badge"]'
            )

            review = {
                "asin": asin,
                "title": (
                    title_node.text(strip=True)
                    if title_node else None
                ),
                "body": (
                    body_node.text(strip=True)
                    if body_node else None
                ),
                "rating": (
                    rating_node.text(strip=True)[:3]
                    if rating_node else None
                ),
                "date": (
                    date_node.text(strip=True)
                    if date_node else None
                ),
                "verified": verified_node is not None,
            }
            reviews.append(review)

        # Respectful delay between pages
        time.sleep(random.uniform(2.0, 5.0))

    return reviews

Tracking Prices and BSR Over Time

Single snapshots are useful, but the real value is in time-series data. Here's a lightweight approach using SQLite:

import sqlite3
from datetime import datetime, timezone


class PriceTracker:
    def __init__(self, db_path="amazon_prices.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS price_history (
                asin TEXT,
                price REAL,
                bsr TEXT,
                rating TEXT,
                review_count TEXT,
                timestamp TEXT,
                PRIMARY KEY (asin, timestamp)
            )
        """)
        self.conn.commit()

    def record(self, product_data):
        price_str = product_data.get("price", "")
        price_val = None
        if price_str:
            cleaned = price_str.replace("$", "").replace(",", "")
            try:
                price_val = float(cleaned)
            except ValueError:
                pass

        self.conn.execute(
            "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?, ?, ?)",
            (
                product_data["asin"],
                price_val,
                product_data.get("bsr"),
                product_data.get("rating"),
                product_data.get("review_count"),
                # Store UTC in SQLite's own text format so the comparison
                # against datetime('now', ...) in get_history is exact.
                # (datetime.utcnow() is deprecated since Python 3.12.)
                datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            ),
        )
        self.conn.commit()

    def get_history(self, asin, days=30):
        cursor = self.conn.execute(
            """
            SELECT price, bsr, timestamp
            FROM price_history
            WHERE asin = ?
              AND timestamp > datetime('now', ?)
            ORDER BY timestamp
            """,
            (asin, f"-{days} days"),
        )
        return cursor.fetchall()

Run your scraper on a schedule (cron, Apify schedules, or a simple loop) to build up historical data that reveals pricing patterns and sales trends.
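The simple-loop option might look like this. The function and its parameters are illustrative — it takes the scrape and record steps as plain callables so it composes with the AmazonScraper and PriceTracker classes above without depending on them:

```python
import random
import time


def run_tracking_loop(asins, scrape_fn, record_fn, interval_hours=6.0, cycles=None):
    """Scrape each ASIN, record the snapshot, then sleep until the next cycle.

    scrape_fn(asin) returns a product dict or None; record_fn(dict) persists it.
    Runs forever unless `cycles` caps the number of passes.
    """
    done = 0
    while cycles is None or done < cycles:
        for asin in asins:
            data = scrape_fn(asin)
            if data:
                record_fn(data)
            time.sleep(random.uniform(2.0, 5.0))  # per-request pacing
        done += 1
        if cycles is None or done < cycles:
            # Jitter the cycle interval so runs don't look clockwork-regular
            time.sleep(interval_hours * 3600 * random.uniform(0.9, 1.1))
```

Wired up with the classes above, that would be something like `run_tracking_loop(my_asins, scraper.scrape_product, tracker.record)`.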

Handling Amazon's Anti-Bot Defenses

Amazon has some of the most sophisticated bot detection on the web. Here's what works in 2026:

1. Request Pacing

Never hit Amazon faster than one request every 2-3 seconds from the same IP. Burst traffic is the fastest way to get blocked:

import time
import random


def respectful_delay():
    """Wait 2-5 seconds between requests."""
    time.sleep(random.uniform(2.0, 5.0))

2. Proxy Rotation

For any serious volume, residential proxies are essential. Datacenter IPs get flagged quickly:

def fetch_with_proxy(self, url, proxy):
    # httpx >= 0.26 accepts `proxy=`; older releases used `proxies=`
    return httpx.get(
        url,
        headers=self._get_headers(),
        proxy=proxy,
        follow_redirects=True,
        timeout=15.0,
    )

3. CAPTCHA Detection

When Amazon returns a 503 or a page containing "Enter the characters you see below", you've been flagged. Back off significantly:

def is_captcha_page(self, html):
    captcha_signals = [
        "Enter the characters you see below",
        "api-services-support@amazon.com",
        "Type the characters you see in this image",
    ]
    return any(signal in html for signal in captcha_signals)

4. Session Management

Rotate sessions (cookies + user agent) every 50-100 requests. Amazon tracks behavioral fingerprints across requests.

Production-Ready Alternative

If you're building a product that depends on Amazon data, maintaining your own scraping infrastructure is a significant ongoing commitment. Proxy costs, selector maintenance, CAPTCHA handling, and anti-bot countermeasures add up fast.

Our Amazon Scraper on Apify handles all of this out of the box — structured JSON output for products, reviews, pricing, and BSR data with built-in proxy management and automatic retries:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("cryptosignals/amazon-scraper").call(
    run_input={
        "searchTerms": ["wireless headphones"],
        "maxResults": 50,
        "includeReviews": True,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} - {item['price']} - BSR: {item['bsr']}")

This lets you focus on building your analytics or pricing tool rather than fighting Amazon's bot detection.

Legitimate Use Cases

Price Comparison Engines: Monitor prices across Amazon and competitors to surface the best deals. This is the foundation of services like CamelCamelCamel and Keepa.

Market Research and Product Validation: Before launching a product, analyze the competitive landscape — pricing, review sentiment, BSR trends, and market gaps. BSR data is particularly valuable because it's the best publicly available proxy for actual sales volume.

Competitive Intelligence: Track competitor products' pricing changes, new review patterns, and ranking movements. A sudden BSR improvement might indicate a successful ad campaign or viral moment.

Review Analysis and Sentiment Monitoring: Aggregate and analyze review text to identify common complaints, feature requests, and quality issues across product categories.

Inventory and Availability Monitoring: Track stock levels and availability patterns for supply chain analysis or arbitrage opportunities.

Academic Research: E-commerce researchers study pricing dynamics, review authenticity, marketplace competition, and consumer behavior using Amazon data.

Conclusion

Amazon scraping in 2026 is technically straightforward — the data is in server-rendered HTML, the URL structure is predictable, and the selectors are stable. The challenge is scale: Amazon's anti-bot systems are world-class, and maintaining reliable extraction at volume requires proxy infrastructure, session management, and CAPTCHA handling.

For prototyping and small-scale research, the code examples in this guide will get you started. For production workloads where reliability matters, consider a managed platform that handles the infrastructure complexity.

The e-commerce data market continues to grow. Whether you're building a price tracker, a market research tool, or a competitive intelligence platform, programmatic access to Amazon's product catalog is a powerful foundation.
