agenthustler

How to Scrape Amazon in 2026: Products, Reviews, Prices, and Rankings

Amazon is the world's largest e-commerce platform with over 350 million products. Whether you're tracking competitor prices, monitoring product reviews, or building a price comparison tool, Amazon data is incredibly valuable.

But scraping Amazon in 2026 is harder than ever. Let's break down what works, what doesn't, and how to do it without getting blocked or sued.

Why Scraping Amazon Is Challenging

Anti-Bot Detection

Amazon runs one of the most sophisticated anti-bot systems on the internet. Their defenses include:

  • CAPTCHA challenges triggered by unusual browsing patterns
  • IP fingerprinting that tracks request frequency per IP
  • Browser fingerprinting detecting headless browsers and automation tools
  • Dynamic page structures where CSS classes and HTML IDs change regularly
  • Rate limiting that throttles or blocks IPs making too many requests

If you send more than a handful of requests per minute from the same IP, expect to see CAPTCHAs or outright blocks.

Legal Considerations

Amazon's robots.txt explicitly disallows crawling most product pages, and their Terms of Service prohibit automated data collection. While the legal landscape around web scraping has evolved (hiQ v. LinkedIn established that scraping publicly accessible data does not by itself violate the Computer Fraud and Abuse Act), Amazon has actively pursued legal action against scrapers.

The safest approach? Use official APIs where possible, and if you must scrape, be respectful — low volume, reasonable delays, and never scrape behind login walls.
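
If you do scrape, it's worth checking paths against robots.txt programmatically before fetching them. A minimal sketch using Python's standard library — the sample rules below are illustrative only, not Amazon's actual file (fetch https://www.amazon.com/robots.txt for the real rules):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, path: str, agent: str = "*") -> bool:
    """Check a path against robots.txt rules without hitting the network."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, f"https://www.amazon.com{path}")

# Illustrative rules only -- not Amazon's real robots.txt
SAMPLE = """\
User-agent: *
Disallow: /gp/cart
Disallow: /dp/product-availability
"""

print(is_allowed(SAMPLE, "/dp/B0BSHF7WHW"))  # True
print(is_allowed(SAMPLE, "/gp/cart/view"))   # False
```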

Method 1: Amazon Product Advertising API (Official)

The cleanest and legally safest approach is Amazon's own Product Advertising API (PA-API 5.0).

What You Get

  • Product titles, descriptions, and images
  • Current prices and availability
  • Customer ratings (aggregate, not individual reviews)
  • Browse node categories
  • Search results for keywords

What You Don't Get

  • Individual customer reviews (text)
  • Seller information
  • Historical pricing data
  • Real-time ranking data beyond bestseller lists

Setup

import requests

# You need an Amazon Associates account to get API access
ACCESS_KEY = "your-access-key"
SECRET_KEY = "your-secret-key"
PARTNER_TAG = "your-partner-tag"

ENDPOINT = "https://webservices.amazon.com/paapi5/searchitems"

def search_amazon(keywords, category="All"):
    payload = {
        "Keywords": keywords,
        "Resources": [
            "ItemInfo.Title",
            "Offers.Listings.Price",
            "Images.Primary.Large",
            "BrowseNodeInfo.BrowseNodes"
        ],
        "SearchIndex": category,
        "ItemCount": 10,
        "PartnerTag": PARTNER_TAG,
        "PartnerType": "Associates",
        "Marketplace": "www.amazon.com"
    }
    # PA-API 5.0 requires AWS Signature Version 4 signing on every
    # request. Full signing code omitted for brevity -- see the
    # PA-API docs, or use an official SDK that signs for you.
    headers = sign_request(payload, ACCESS_KEY, SECRET_KEY)  # your signing helper
    response = requests.post(ENDPOINT, json=payload, headers=headers)
    return response.json()

Limitations

PA-API has strict rate limits (1 request per second for new associates, scaling up based on your earnings). You also need an active Amazon Associates account with qualifying sales to maintain access.
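
To stay under that 1-request-per-second cap, it helps to throttle on the client side rather than hoping your loop runs slowly enough. A minimal sketch (the `fetch` call in the usage comment is a placeholder for your own PA-API call):

```python
import time

class RateLimiter:
    """Client-side throttle: at most one call per `interval` seconds."""

    def __init__(self, interval: float = 1.0):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep just long enough that calls are spaced by `interval`
        now = time.monotonic()
        sleep_for = self._last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

limiter = RateLimiter(interval=1.0)
# for asin in asins:
#     limiter.wait()
#     fetch(asin)  # your PA-API call
```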

Method 2: Third-Party APIs

If the official API doesn't cover your needs, third-party scraping APIs handle the hard parts for you.

Rainforest API

Rainforest API is purpose-built for Amazon data. It handles proxy rotation, CAPTCHA solving, and returns structured JSON:

import requests

params = {
    "api_key": "YOUR_RAINFOREST_KEY",
    "type": "product",
    "asin": "B0BSHF7WHW",
    "amazon_domain": "amazon.com"
}

response = requests.get(
    "https://api.rainforestapi.com/request",
    params=params
)
product = response.json()["product"]
print(f"Title: {product['title']}")
print(f"Price: {product['buybox_winner']['price']['value']}")
print(f"Rating: {product['rating']}")

ScraperAPI

ScraperAPI is a more general-purpose solution that works great for Amazon. It handles proxies, browsers, and CAPTCHAs automatically:

import requests

url = "https://api.scraperapi.com"
params = {
    "api_key": "YOUR_SCRAPERAPI_KEY",
    "url": "https://www.amazon.com/dp/B0BSHF7WHW",
    "render": "true"  # JavaScript rendering for dynamic content
}

response = requests.get(url, params=params)
# Parse the HTML response with BeautifulSoup

ScraperAPI is especially useful if you're scraping multiple sites beyond just Amazon, since the same API works for any website.
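
Since ScraperAPI returns raw HTML, you still parse it yourself. A small sketch with BeautifulSoup — the selectors are examples that match Amazon's current markup and will break when the page structure changes:

```python
from bs4 import BeautifulSoup

def parse_product_html(html: str) -> dict:
    """Pull a few fields out of an Amazon product page.

    Selectors are illustrative and fragile -- verify them against
    the live page before relying on them.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    return {
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```

Returning `None` for missing fields (instead of raising) makes it easier to log partial failures when Amazon serves a variant page layout.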

Method 3: DIY Scraping With Proxies

If you want full control, you can build your own scraper. But you'll need serious proxy infrastructure to avoid blocks.

Basic Setup

import requests
from bs4 import BeautifulSoup
import time
import random

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml"
}

def scrape_product(asin, proxy=None):
    url = f"https://www.amazon.com/dp/{asin}"
    proxies = {"http": proxy, "https": proxy} if proxy else None

    response = requests.get(
        url,
        headers=HEADERS,
        proxies=proxies,
        timeout=15
    )

    if response.status_code != 200:
        print(f"Blocked or error: {response.status_code}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    rating = soup.select_one("#acrPopover span.a-size-base")

    return {
        "title": title.text.strip() if title else None,
        "price": price.text.strip() if price else None,
        "rating": rating.text.strip() if rating else None
    }

# Always add delays between requests
for asin in ["B0BSHF7WHW", "B0CHX3QBCH"]:
    result = scrape_product(asin)
    print(result)
    time.sleep(random.uniform(3, 7))  # Random delay

The Proxy Problem

Without proxies, you'll get blocked after 10-20 requests. Residential proxies are essential for Amazon scraping at any scale.

ThorData provides residential proxies that work well for e-commerce scraping. Their rotating proxy pool helps distribute requests across different IPs:

# Using ThorData residential proxies
proxy = "http://user:pass@proxy.thordata.com:9090"
result = scrape_product("B0BSHF7WHW", proxy=proxy)

What to Watch Out For

  1. Rotate User-Agents — don't use the same one for every request
  2. Randomize delays — fixed intervals are a fingerprinting signal
  3. Handle CAPTCHAs gracefully — back off when you hit them, don't hammer
  4. Monitor your success rate — if it drops below 80%, slow down
  5. Don't scrape while logged in — that violates ToS more clearly
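
The points above can be tied together in one fetch loop. A minimal sketch, assuming placeholder proxy URLs and User-Agent strings (substitute your own):

```python
import random
import time

import requests

# Hypothetical pools -- replace with your own UA strings and proxies
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]
PROXIES = [
    "http://user:pass@proxy1.example.com:9090",
    "http://user:pass@proxy2.example.com:9090",
]

def backoff_delay(attempt: int) -> float:
    """Exponential backoff with jitter: ~2-4s, ~4-8s, ~8-16s, ..."""
    return (2 ** attempt) * random.uniform(2, 4)

def fetch(url: str, max_retries: int = 3):
    """Rotate User-Agent and proxy per attempt; back off on blocks."""
    for attempt in range(max_retries):
        proxy = random.choice(PROXIES)
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            resp = requests.get(url, headers=headers,
                                proxies={"http": proxy, "https": proxy},
                                timeout=15)
        except requests.RequestException:
            resp = None
        # Treat CAPTCHAs and non-200 responses as blocks and retry
        if (resp is not None and resp.status_code == 200
                and "captcha" not in resp.text.lower()):
            return resp.text
        time.sleep(backoff_delay(attempt))
    return None
```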

Method 4: Platform-Based Scraping

Platforms like Apify provide ready-made scraping actors that run in the cloud. Apify's marketplace includes several Amazon-specific actors built by the community, alongside scrapers for other e-commerce sites such as eBay and AliExpress.

The advantage of platform-based scraping is that you don't manage proxies, servers, or browser infrastructure — it's all handled for you.

What Data Can You Actually Get?

| Data Point | Official API | Third-Party API | DIY Scraping |
|---|---|---|---|
| Product title & description | ✅ | ✅ | ✅ |
| Current price | ✅ | ✅ | ✅ |
| Customer ratings (aggregate) | ✅ | ✅ | ✅ |
| Individual reviews | ❌ | ✅ | ✅ |
| Seller info | ❌ | ✅ | ✅ |
| Historical prices | ❌ | Some | ❌ |
| Search rankings | Limited | ✅ | ✅ |
| Images | ✅ | ✅ | ✅ |

Which Method Should You Choose?

Use the official PA-API if:

  • You only need basic product data (title, price, images, ratings)
  • You're building an affiliate site and already have an Associates account
  • You want zero legal risk

Use a third-party API (ScraperAPI, Rainforest) if:

  • You need review text, seller data, or search rankings
  • You want structured data without parsing HTML
  • You're scraping at moderate scale (thousands of products)

Build your own scraper if:

  • You need full control over what data you collect
  • You have proxy infrastructure already
  • You're comfortable maintaining scrapers as Amazon changes their HTML

Final Thoughts

Amazon scraping in 2026 is a cat-and-mouse game. The official API is limited but safe. Third-party APIs are the best balance of data coverage and reliability. DIY scraping gives you the most flexibility but requires ongoing maintenance.

Whatever approach you choose, respect rate limits, add delays between requests, and stay within legal boundaries. The data is valuable — but not worth a lawsuit or a permanent IP ban.


Building scrapers for e-commerce data? Check out our Apify actors for eBay and AliExpress scraping, or use ScraperAPI and ThorData proxies to build your own.
