
agenthustler

How to Scrape Amazon Product Data in 2026: Prices, Reviews, Seller Info with Python

Introduction

Amazon remains the largest e-commerce platform in the world, and extracting product data from it — prices, reviews, seller information — is one of the most common web scraping use cases. Whether you're building a price monitoring tool, doing competitive analysis, or conducting e-commerce research, knowing how to scrape Amazon effectively in 2026 is a valuable skill.

In this guide, I'll walk you through scraping Amazon product data using Python, covering the real challenges you'll face and practical solutions that actually work.

Why Scrape Amazon?

There are plenty of legitimate reasons to scrape Amazon product data:

  • Price monitoring: Track competitor pricing across thousands of products
  • Market research: Analyze product trends, review sentiment, and category performance
  • Competitive analysis: Monitor new sellers, pricing strategies, and product launches
  • Academic research: Study consumer behavior, pricing dynamics, and marketplace economics

The Challenges of Scraping Amazon in 2026

Before we dive into code, let's be honest about what you're up against:

  1. Aggressive bot detection: Amazon uses sophisticated fingerprinting, CAPTCHAs, and behavioral analysis
  2. Dynamic content: Many product pages load data via JavaScript
  3. Rate limiting: Too many requests from one IP will get you blocked fast
  4. Changing HTML structure: Amazon frequently updates their page layouts

Setting Up Your Environment

First, install the required packages:

pip install requests beautifulsoup4 lxml

Basic Amazon Scraper

Here's a straightforward scraper that extracts product data from an Amazon product page:

import requests
from bs4 import BeautifulSoup
import time
import random

class AmazonScraper:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/124.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        }

    def scrape_product(self, url: str) -> dict:
        """Scrape a single Amazon product page."""
        response = self.session.get(url, headers=self.headers, timeout=15)

        if response.status_code != 200:
            print(f"Failed to fetch {url}: {response.status_code}")
            return {}

        soup = BeautifulSoup(response.text, 'lxml')

        product = {
            'title': self._get_title(soup),
            'price': self._get_price(soup),
            'rating': self._get_rating(soup),
            'review_count': self._get_review_count(soup),
            'seller': self._get_seller(soup),
            'availability': self._get_availability(soup),
        }
        return product

    def _get_title(self, soup):
        el = soup.select_one('#productTitle')
        return el.text.strip() if el else None

    def _get_price(self, soup):
        selectors = [
            'span.a-price span.a-offscreen',
            '#priceblock_ourprice',
            '#priceblock_dealprice',
            'span.a-price-whole',
        ]
        for sel in selectors:
            el = soup.select_one(sel)
            if el:
                return el.text.strip()
        return None

    def _get_rating(self, soup):
        el = soup.select_one('#acrPopover')
        if el:
            title = el.get('title', '')
            return title.split(' ')[0] if title else None
        return None

    def _get_review_count(self, soup):
        el = soup.select_one('#acrCustomerReviewText')
        return el.text.strip() if el else None

    def _get_seller(self, soup):
        el = soup.select_one('#sellerProfileTriggerId')
        return el.text.strip() if el else 'Amazon'

    def _get_availability(self, soup):
        el = soup.select_one('#availability span')
        return el.text.strip() if el else None
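The scraper above returns prices as display strings like "$1,299.99". For price monitoring you'll usually want a number, so here's a small normalizer — a sketch that assumes US-style formatting (comma thousands separator, dot decimal); other Amazon locales format prices differently:

```python
import re
from typing import Optional

def parse_price(price_text: Optional[str]) -> Optional[float]:
    """Convert a display price like '$1,299.99' to a float.

    Assumes US-style formatting. Returns None when no numeric
    price can be found (e.g. 'Currently unavailable').
    """
    if not price_text:
        return None
    # Grab the first number, allowing comma separators and a decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', price_text)
    if not match:
        return None
    return float(match.group().replace(',', ''))

print(parse_price('$1,299.99'))            # 1299.99
print(parse_price('Currently unavailable'))  # None
```

Storing both the raw string and the parsed number is a good habit — when the parse fails you can still inspect what Amazon actually served.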

Adding Proxy Rotation

Here's the thing — the basic scraper above will work for a few requests, then Amazon will block you. You need proxy rotation for any serious scraping work.

import itertools

class ProxyRotator:
    def __init__(self, proxy_list: list[str]):
        self.proxies = itertools.cycle(proxy_list)

    def get_next(self) -> dict:
        proxy = next(self.proxies)
        return {
            'http': f'http://{proxy}',
            'https': f'http://{proxy}',
        }

# Usage with the scraper — note that scrape_with_proxy needs a
# module-level headers dict (the earlier one lived on the class)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

proxies = ProxyRotator([
    'user:pass@proxy1.example.com:8080',
    'user:pass@proxy2.example.com:8080',
    'user:pass@proxy3.example.com:8080',
])

def scrape_with_proxy(url: str):
    proxy = proxies.get_next()
    response = requests.get(url, headers=headers, proxies=proxy, timeout=15)
    return response

Pro tip: Residential proxies work much better than datacenter proxies for Amazon. Datacenter IPs are flagged almost immediately.

Scraping Search Results

Scraping individual product pages is useful, but often you want to scrape search results to find products:

def scrape_search_results(query: str, pages: int = 3) -> list[dict]:
    """Scrape Amazon search results for a given query."""
    products = []
    base_url = 'https://www.amazon.com/s'

    for page in range(1, pages + 1):
        params = {'k': query, 'page': page}
        response = requests.get(base_url, params=params, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')

        items = soup.select('[data-component-type="s-search-result"]')

        for item in items:
            asin = item.get('data-asin', '')
            title_el = item.select_one('h2 a span')
            price_el = item.select_one('span.a-price span.a-offscreen')
            rating_el = item.select_one('span.a-icon-alt')

            products.append({
                'asin': asin,
                'title': title_el.text.strip() if title_el else None,
                'price': price_el.text.strip() if price_el else None,
                'rating': rating_el.text.strip() if rating_el else None,
                'url': f'https://www.amazon.com/dp/{asin}',
            })

        # Random delay between pages
        time.sleep(random.uniform(2, 5))

    return products
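When you're tweaking selectors, test against a saved copy of a results page instead of hammering Amazon live. A regex is no substitute for a real parser, but as a smoke test on saved HTML it's enough to confirm the `data-asin` attributes are where you expect (the ASINs below are just example IDs):

```python
import re

def extract_asins(html: str) -> list:
    """Pull ASINs out of saved search-result HTML.

    ASINs are 10-character alphanumeric IDs; this greps for
    data-asin attributes and deduplicates while preserving order.
    """
    asins = re.findall(r'data-asin="([A-Z0-9]{10})"', html)
    return list(dict.fromkeys(asins))

sample = '''
<div data-asin="B08N5WRWNW" data-component-type="s-search-result"></div>
<div data-asin="B09G9FPHY6" data-component-type="s-search-result"></div>
<div data-asin="B08N5WRWNW" data-component-type="s-search-result"></div>
'''
print(extract_asins(sample))  # ['B08N5WRWNW', 'B09G9FPHY6']
```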

Extracting Reviews

Reviews are gold for sentiment analysis and product research:

def scrape_reviews(asin: str, pages: int = 5) -> list[dict]:
    """Scrape reviews for a product by ASIN."""
    reviews = []
    base_url = f'https://www.amazon.com/product-reviews/{asin}'

    for page in range(1, pages + 1):
        params = {'pageNumber': page, 'sortBy': 'recent'}
        response = requests.get(base_url, params=params, headers=headers)
        soup = BeautifulSoup(response.text, 'lxml')

        review_elements = soup.select('[data-hook="review"]')

        for review in review_elements:
            title_el = review.select_one('[data-hook="review-title"] span')
            body_el = review.select_one('[data-hook="review-body"] span')
            rating_el = review.select_one('[data-hook="review-star-rating"] span')
            date_el = review.select_one('[data-hook="review-date"]')

            reviews.append({
                'title': title_el.text.strip() if title_el else None,
                'body': body_el.text.strip() if body_el else None,
                'rating': rating_el.text.strip() if rating_el else None,
                'date': date_el.text.strip() if date_el else None,
            })

        time.sleep(random.uniform(3, 6))

    return reviews
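The rating comes back as text like "4.0 out of 5 stars". For sentiment analysis you'll want the number, so here's a tiny helper — it assumes the en-US wording; other locales phrase this differently:

```python
import re
from typing import Optional

def parse_rating(rating_text: Optional[str]) -> Optional[float]:
    """Extract the numeric value from text like '4.0 out of 5 stars'."""
    if not rating_text:
        return None
    # The rating leads the string, so match digits/dot at the start
    match = re.match(r'([\d.]+)', rating_text.strip())
    return float(match.group(1)) if match else None

print(parse_rating('4.0 out of 5 stars'))  # 4.0
```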

The Managed Solution: ScraperAPI

If you're scraping at any real scale — hundreds or thousands of products — managing your own proxies, handling CAPTCHAs, and dealing with blocks gets exhausting fast. I've spent more time debugging proxy issues than writing actual data pipelines.

ScraperAPI handles all of this for you. You send a request through their API, and they handle proxy rotation, CAPTCHA solving, browser fingerprinting, and retries. It's a single API call:

import requests

SCRAPERAPI_KEY = 'your_api_key'

def scrape_with_scraperapi(url: str) -> str:
    """Scrape any URL through ScraperAPI."""
    payload = {
        'api_key': SCRAPERAPI_KEY,
        'url': url,
        'render': 'true',  # Enable JavaScript rendering
    }
    response = requests.get(
        'https://api.scraperapi.com',
        params=payload,
        timeout=60
    )
    return response.text

# Works seamlessly with BeautifulSoup
html = scrape_with_scraperapi('https://www.amazon.com/dp/B0EXAMPLE')
soup = BeautifulSoup(html, 'lxml')
title_el = soup.select_one('#productTitle')
print(title_el.text.strip() if title_el else 'Title not found')

They also have a dedicated Amazon endpoint that returns structured JSON — no parsing needed:

def get_amazon_product(asin: str) -> dict:
    """Get structured Amazon product data via ScraperAPI."""
    response = requests.get(
        'https://api.scraperapi.com/structured/amazon/product',
        params={
            'api_key': SCRAPERAPI_KEY,
            'asin': asin,
            'country': 'us',
        }
    )
    return response.json()

Try ScraperAPI free — they offer 5,000 free API credits to get started.

Best Practices for Amazon Scraping

  1. Respect rate limits: Add random delays between requests (2-5 seconds minimum)
  2. Rotate User-Agents: Don't use the same UA string for every request
  3. Use residential proxies: Datacenter IPs get flagged immediately
  4. Handle errors gracefully: Amazon will return 503s and CAPTCHAs — retry with backoff
  5. Cache results: Don't re-scrape data you already have
  6. Monitor your success rate: If it drops below 90%, something is wrong
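Points 1 and 4 combine naturally into a single fetch wrapper. Here's a sketch of retry with exponential backoff and jitter — the retried status codes and the delay schedule are my own choices, not anything Amazon documents:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=2.0, cap=60.0):
    """Call fetch(url), retrying on 429/503 with exponential backoff.

    `fetch` is any callable returning an object with a .status_code,
    e.g. requests.get. Returns the last response, successful or not.
    """
    response = None
    for attempt in range(max_retries + 1):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        if attempt < max_retries:
            # 2s, 4s, 8s... capped at `cap`, plus jitter so a fleet of
            # workers doesn't retry in lockstep
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, base_delay))
    return response
```

Passing the fetcher in as a callable keeps this testable — you can hand it a fake that returns canned status codes instead of hitting the network.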

Storing Your Data

For any serious project, dump your scraped data into a database:

import sqlite3
from datetime import datetime

def save_product(product: dict, db_path: str = 'amazon_data.db'):
    """Upsert a product row. Expects an 'asin' key in the dict."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            asin TEXT PRIMARY KEY,
            title TEXT,
            price TEXT,
            rating TEXT,
            review_count TEXT,
            seller TEXT,
            scraped_at TEXT
        )
    ''')

    cursor.execute('''
        INSERT OR REPLACE INTO products
        VALUES (?, ?, ?, ?, ?, ?, ?)
    ''', (
        product.get('asin'),
        product.get('title'),
        product.get('price'),
        product.get('rating'),
        product.get('review_count'),
        product.get('seller'),
        datetime.now().isoformat(),
    ))

    conn.commit()
    conn.close()
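Reading the data back is just as simple. A sketch that pairs with `save_product` above — and note that because the table uses `INSERT OR REPLACE` on the ASIN, you get latest-value-only; if you want price *history*, drop the primary key and insert a new row per scrape:

```python
import sqlite3

def load_products(db_path: str = 'amazon_data.db') -> list:
    """Read saved products back as plain dicts, newest first."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    rows = conn.execute(
        'SELECT * FROM products ORDER BY scraped_at DESC'
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]
```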

Conclusion

Scraping Amazon in 2026 is definitely doable, but it requires more sophistication than it did a few years ago. For small-scale projects, the DIY approach with rotating proxies works fine. For anything production-grade, a managed service like ScraperAPI will save you significant time and headaches.

The key is to start simple, test your approach, and scale up gradually. Happy scraping!


What's your experience scraping Amazon? Drop your questions or tips in the comments below.
