DEV Community

agenthustler

How to Scrape Amazon Product Data in 2026: Prices, Reviews, and Rankings

Amazon is the world's largest product database — over 350 million products with real-time pricing, millions of customer reviews, and Best Sellers Rank (BSR) data that approximates actual sales velocity. For anyone building price comparison tools, market research platforms, or e-commerce analytics, Amazon data is the foundation.

This guide covers how to extract Amazon product data programmatically in 2026, with production-ready Python code and strategies for handling Amazon's aggressive anti-bot systems.

Understanding Amazon's Data Architecture

Amazon organizes products around ASINs (Amazon Standard Identification Numbers) — unique 10-character alphanumeric identifiers. Every product page, review set, and price history is tied to an ASIN.

Key data points available on each product page:

  • Product details: Title, description, bullet points, brand, category
  • Pricing: Current price, list price, deal price, Subscribe & Save price
  • Best Sellers Rank (BSR): Ranking within category — the closest proxy to actual sales volume
  • Reviews: Rating distribution, total count, individual review text
  • Variations: Size/color/style variants with their own prices and availability
  • Seller information: Sold by, fulfilled by, number of sellers, Buy Box winner

Amazon product URLs follow this structure:

https://www.amazon.com/dp/B0XXXXXXXX
https://www.amazon.com/gp/product/B0XXXXXXXX
https://www.amazon.com/Product-Name/dp/B0XXXXXXXX

All three resolve to the same page. The short /dp/ASIN format is the most reliable for scraping.
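Since all three forms carry the ASIN in a fixed position, a small helper can normalize any product URL to the short form. This is a sketch — the regex and function names are my own, not part of any Amazon API:

```python
import re

# Matches the ASIN segment in /dp/, /gp/product/, and /product-reviews/ URLs.
ASIN_RE = re.compile(r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})")


def extract_asin(url):
    """Return the 10-character ASIN from an Amazon product URL, or None."""
    match = ASIN_RE.search(url)
    return match.group(1) if match else None


def canonical_url(asin):
    """Build the short /dp/ form — the most reliable for scraping."""
    return f"https://www.amazon.com/dp/{asin}"
```

Normalizing early also helps with deduplication: the long `/Product-Name/dp/` URLs vary with the product title, but the ASIN does not.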

Setting Up Your Scraper

Unlike many sites, Amazon product pages are largely server-rendered HTML. This means you can use httpx (or requests) without needing a full browser — which is faster and cheaper:

import httpx
import random
import time
from selectolax.parser import HTMLParser


class AmazonScraper:
    def __init__(self):
        self.user_agents = [
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (X11; Linux x86_64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        ]

    def _get_headers(self):
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": (
                "text/html,application/xhtml+xml,"
                "application/xml;q=0.9,*/*;q=0.8"
            ),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                response = httpx.get(
                    url,
                    headers=self._get_headers(),
                    follow_redirects=True,
                    timeout=15.0,
                )
                if response.status_code == 200:
                    return response.text
                if response.status_code == 503:
                    # CAPTCHA page — back off
                    time.sleep(random.uniform(10, 20))
                    continue
            except httpx.TimeoutException:
                time.sleep(random.uniform(2, 5))
                continue
        return None

Scraping Product Pages

Product data extraction is mostly about knowing the right CSS selectors. Amazon's HTML is verbose, and while #productTitle has been stable for years, the price markup has churned — the old #priceblock_ourprice IDs have largely given way to span.a-price containers — so the parser below tries several selectors in order:

# The following methods belong to the AmazonScraper class above
def parse_product(self, html, asin):
    tree = HTMLParser(html)

    def text(selector):
        node = tree.css_first(selector)
        return node.text(strip=True) if node else None

    # Extract price — Amazon uses multiple price containers
    price = None
    price_selectors = [
        "span.a-price span.a-offscreen",
        "#priceblock_ourprice",
        "#priceblock_dealprice",
        "span[data-a-color='price'] span.a-offscreen",
    ]
    for sel in price_selectors:
        price = text(sel)
        if price:
            break

    # Extract BSR from product details
    bsr = None
    details_nodes = tree.css("#detailBulletsWrapper_feature_div li")
    for node in details_nodes:
        node_text = node.text()
        if "Best Sellers Rank" in node_text:
            bsr = node_text.strip()
            break

    # Alternative BSR location (table format)
    if not bsr:
        table_rows = tree.css("#productDetails_detailBulletsTable_techSpec_section tr")
        for row in table_rows:
            if "Best Sellers Rank" in row.text():
                bsr = row.text(strip=True)
                break

    # Extract rating
    rating = text("#acrPopover span.a-size-base")
    review_count = text("#acrCustomerReviewText")

    return {
        "asin": asin,
        "title": text("#productTitle"),
        "price": price,
        "rating": rating,
        "review_count": review_count,
        "bsr": bsr,
        "brand": text("#bylineInfo"),
        "availability": text("#availability span"),
        "bullet_points": [
            li.text(strip=True)
            for li in tree.css("#feature-bullets li span")
        ],
    }


def scrape_product(self, asin):
    url = f"https://www.amazon.com/dp/{asin}"
    html = self.fetch_page(url)
    if not html:
        return None
    return self.parse_product(html, asin)

Extracting Reviews at Scale

Reviews live on a separate URL structure and support pagination. Each page shows 10 reviews by default. Be aware that Amazon has increasingly gated review pages behind sign-in, so unauthenticated pagination may stop earlier than expected:

# Another AmazonScraper method
def scrape_reviews(self, asin, max_pages=10):
    reviews = []

    for page in range(1, max_pages + 1):
        url = (
            f"https://www.amazon.com/product-reviews/{asin}"
            f"?pageNumber={page}&sortBy=recent"
        )
        html = self.fetch_page(url)
        if not html:
            break

        tree = HTMLParser(html)
        review_divs = tree.css('[data-hook="review"]')

        if not review_divs:
            break  # No more reviews

        for div in review_divs:
            title_node = div.css_first('[data-hook="review-title"]')
            body_node = div.css_first('[data-hook="review-body"]')
            rating_node = div.css_first('[data-hook="review-star-rating"]')
            date_node = div.css_first('[data-hook="review-date"]')
            verified_node = div.css_first(
                '[data-hook="avp-badge"]'
            )

            review = {
                "asin": asin,
                "title": (
                    title_node.text(strip=True)
                    if title_node else None
                ),
                "body": (
                    body_node.text(strip=True)
                    if body_node else None
                ),
                "rating": (
                    rating_node.text(strip=True)[:3]
                    if rating_node else None
                ),
                "date": (
                    date_node.text(strip=True)
                    if date_node else None
                ),
                "verified": verified_node is not None,
            }
            reviews.append(review)

        # Respectful delay between pages
        time.sleep(random.uniform(2.0, 5.0))

    return reviews

Tracking Prices and BSR Over Time

Single snapshots are useful, but the real value is in time-series data. Here's a lightweight approach using SQLite:

import sqlite3
from datetime import datetime, timezone


class PriceTracker:
    def __init__(self, db_path="amazon_prices.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS price_history (
                asin TEXT,
                price REAL,
                bsr TEXT,
                rating TEXT,
                review_count TEXT,
                timestamp TEXT,
                PRIMARY KEY (asin, timestamp)
            )
        """)
        self.conn.commit()

    def record(self, product_data):
        price_str = product_data.get("price", "")
        price_val = None
        if price_str:
            cleaned = price_str.replace("$", "").replace(",", "")
            try:
                price_val = float(cleaned)
            except ValueError:
                pass

        self.conn.execute(
            "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?, ?, ?)",
            (
                product_data["asin"],
                price_val,
                product_data.get("bsr"),
                product_data.get("rating"),
                product_data.get("review_count"),
                # Store UTC in SQLite's own text format so the comparison
                # against datetime('now', ...) in get_history is exact.
                # (datetime.utcnow() is deprecated since Python 3.12.)
                datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
            ),
        )
        self.conn.commit()

    def get_history(self, asin, days=30):
        cursor = self.conn.execute(
            """
            SELECT price, bsr, timestamp
            FROM price_history
            WHERE asin = ?
              AND timestamp > datetime('now', ?)
            ORDER BY timestamp
            """,
            (asin, f"-{days} days"),
        )
        return cursor.fetchall()

Run your scraper on a schedule (cron, Apify schedules, or a simple loop) to build up historical data that reveals pricing patterns and sales trends.
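The simple-loop option might look like this. The function and its parameters are illustrative — it takes the scrape and record steps as plain callables so it composes with the AmazonScraper and PriceTracker classes above without depending on them:

```python
import random
import time


def run_tracking_loop(asins, scrape_fn, record_fn, interval_hours=6.0, cycles=None):
    """Scrape each ASIN, record the snapshot, then sleep until the next cycle.

    scrape_fn(asin) returns a product dict or None; record_fn(dict) persists it.
    Runs forever unless `cycles` caps the number of passes.
    """
    done = 0
    while cycles is None or done < cycles:
        for asin in asins:
            data = scrape_fn(asin)
            if data:
                record_fn(data)
            time.sleep(random.uniform(2.0, 5.0))  # per-request pacing
        done += 1
        if cycles is None or done < cycles:
            # Jitter the cycle interval so runs don't look clockwork-regular
            time.sleep(interval_hours * 3600 * random.uniform(0.9, 1.1))
```

Wired up with the classes above, that would be something like `run_tracking_loop(my_asins, scraper.scrape_product, tracker.record)`.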

Handling Amazon's Anti-Bot Defenses

Amazon has some of the most sophisticated bot detection on the web. Here's what works in 2026:

1. Request Pacing

Never hit Amazon faster than one request every 2-3 seconds from the same IP. Burst traffic is the fastest way to get blocked:

import time
import random


def respectful_delay():
    """Wait 2-5 seconds between requests."""
    time.sleep(random.uniform(2.0, 5.0))

2. Proxy Rotation

For any serious volume, residential proxies are essential. Datacenter IPs get flagged quickly:

def fetch_with_proxy(self, url, proxy):
    # httpx >= 0.26 accepts `proxy=`; older releases used `proxies=`
    return httpx.get(
        url,
        headers=self._get_headers(),
        proxy=proxy,
        follow_redirects=True,
        timeout=15.0,
    )

3. CAPTCHA Detection

When Amazon returns a 503 or a page containing "Enter the characters you see below", you've been flagged. Back off significantly:

def is_captcha_page(self, html):
    captcha_signals = [
        "Enter the characters you see below",
        "api-services-support@amazon.com",
        "Type the characters you see in this image",
    ]
    return any(signal in html for signal in captcha_signals)

4. Session Management

Rotate sessions (cookies + user agent) every 50-100 requests. Amazon tracks behavioral fingerprints across requests.

Production-Ready Alternative

If you're building a product that depends on Amazon data, maintaining your own scraping infrastructure is a significant ongoing commitment. Proxy costs, selector maintenance, CAPTCHA handling, and anti-bot countermeasures add up fast.

Our Amazon Scraper on Apify handles all of this out of the box — structured JSON output for products, reviews, pricing, and BSR data with built-in proxy management and automatic retries:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("cryptosignals/amazon-scraper").call(
    run_input={
        "searchTerms": ["wireless headphones"],
        "maxResults": 50,
        "includeReviews": True,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} - {item['price']} - BSR: {item['bsr']}")

This lets you focus on building your analytics or pricing tool rather than fighting Amazon's bot detection.

Legitimate Use Cases

Price Comparison Engines: Monitor prices across Amazon and competitors to surface the best deals. This is the foundation of services like CamelCamelCamel and Keepa.

Market Research and Product Validation: Before launching a product, analyze the competitive landscape — pricing, review sentiment, BSR trends, and market gaps. BSR data is particularly valuable because it's the best publicly available proxy for actual sales volume.

Competitive Intelligence: Track competitor products' pricing changes, new review patterns, and ranking movements. A sudden BSR improvement might indicate a successful ad campaign or viral moment.

Review Analysis and Sentiment Monitoring: Aggregate and analyze review text to identify common complaints, feature requests, and quality issues across product categories.

Inventory and Availability Monitoring: Track stock levels and availability patterns for supply chain analysis or arbitrage opportunities.

Academic Research: E-commerce researchers study pricing dynamics, review authenticity, marketplace competition, and consumer behavior using Amazon data.

Conclusion

Amazon scraping in 2026 is technically straightforward — the data is in server-rendered HTML, the URL structure is predictable, and the selectors are stable. The challenge is scale: Amazon's anti-bot systems are world-class, and maintaining reliable extraction at volume requires proxy infrastructure, session management, and CAPTCHA handling.

For prototyping and small-scale research, the code examples in this guide will get you started. For production workloads where reliability matters, consider a managed platform that handles the infrastructure complexity.

The e-commerce data market continues to grow. Whether you're building a price tracker, a market research tool, or a competitive intelligence platform, programmatic access to Amazon's product catalog is a powerful foundation.
