Amazon is the world's largest product database — over 350 million products with real-time pricing, millions of customer reviews, and Best Sellers Rank (BSR) data that reveals actual sales velocity. For anyone building price comparison tools, market research platforms, or e-commerce analytics, Amazon data is the foundation.
This guide covers how to extract Amazon product data programmatically in 2026, with production-ready Python code and strategies for handling Amazon's aggressive anti-bot systems.
Understanding Amazon's Data Architecture
Amazon organizes products around ASINs (Amazon Standard Identification Numbers) — unique 10-character alphanumeric identifiers. Every product page, review set, and price history is tied to an ASIN.
Key data points available on each product page:
- Product details: Title, description, bullet points, brand, category
- Pricing: Current price, list price, deal price, Subscribe & Save price
- Best Sellers Rank (BSR): Ranking within category — the closest proxy to actual sales volume
- Reviews: Rating distribution, total count, individual review text
- Variations: Size/color/style variants with their own prices and availability
- Seller information: Sold by, fulfilled by, number of sellers, Buy Box winner
Amazon product URLs follow this structure:
https://www.amazon.com/dp/B0XXXXXXXX
https://www.amazon.com/gp/product/B0XXXXXXXX
https://www.amazon.com/Product-Name/dp/B0XXXXXXXX
All three resolve to the same page. The short /dp/ASIN format is the most reliable for scraping.
Setting Up Your Scraper
Unlike many sites, Amazon product pages are largely server-rendered HTML. This means you can use httpx (or requests) without needing a full browser — which is faster and cheaper:
```python
import httpx
import random
import time

from selectolax.parser import HTMLParser


class AmazonScraper:
    def __init__(self):
        self.user_agents = [
            (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            (
                "Mozilla/5.0 (X11; Linux x86_64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
        ]

    def _get_headers(self):
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": (
                "text/html,application/xhtml+xml,"
                "application/xml;q=0.9,*/*;q=0.8"
            ),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

    def fetch_page(self, url, retries=3):
        for attempt in range(retries):
            try:
                response = httpx.get(
                    url,
                    headers=self._get_headers(),
                    follow_redirects=True,
                    timeout=15.0,
                )
                if response.status_code == 200:
                    return response.text
                if response.status_code == 503:
                    # CAPTCHA page — back off
                    time.sleep(random.uniform(10, 20))
                    continue
            except httpx.TimeoutException:
                time.sleep(random.uniform(2, 5))
                continue
        return None
```
Scraping Product Pages
Product data extraction is mostly about knowing the right CSS selectors. Amazon's HTML is verbose but reasonably stable: #productTitle has been consistent for years, while legacy price IDs like #priceblock_ourprice have largely given way to span.a-price containers, so keep the old IDs as fallbacks rather than primary selectors:
```python
    def parse_product(self, html, asin):
        tree = HTMLParser(html)

        def text(selector):
            node = tree.css_first(selector)
            return node.text(strip=True) if node else None

        # Extract price — Amazon uses multiple price containers
        price = None
        price_selectors = [
            "span.a-price span.a-offscreen",
            "#priceblock_ourprice",
            "#priceblock_dealprice",
            "span[data-a-color='price'] span.a-offscreen",
        ]
        for sel in price_selectors:
            price = text(sel)
            if price:
                break

        # Extract BSR from product details
        bsr = None
        details_nodes = tree.css("#detailBulletsWrapper_feature_div li")
        for node in details_nodes:
            node_text = node.text()
            if "Best Sellers Rank" in node_text:
                bsr = node_text.strip()
                break

        # Alternative BSR location (table format)
        if not bsr:
            table_rows = tree.css(
                "#productDetails_detailBulletsTable_techSpec_section tr"
            )
            for row in table_rows:
                if "Best Sellers Rank" in row.text():
                    bsr = row.text(strip=True)
                    break

        # Extract rating
        rating = text("#acrPopover span.a-size-base")
        review_count = text("#acrCustomerReviewText")

        return {
            "asin": asin,
            "title": text("#productTitle"),
            "price": price,
            "rating": rating,
            "review_count": review_count,
            "bsr": bsr,
            "brand": text("#bylineInfo"),
            "availability": text("#availability span"),
            "bullet_points": [
                li.text(strip=True)
                for li in tree.css("#feature-bullets li span")
            ],
        }

    def scrape_product(self, asin):
        url = f"https://www.amazon.com/dp/{asin}"
        html = self.fetch_page(url)
        if not html:
            return None
        return self.parse_product(html, asin)
```
Extracting Reviews at Scale
Reviews live on a separate URL structure and support pagination. Each page shows 10 reviews by default:
```python
    def scrape_reviews(self, asin, max_pages=10):
        reviews = []
        for page in range(1, max_pages + 1):
            url = (
                f"https://www.amazon.com/product-reviews/{asin}"
                f"?pageNumber={page}&sortBy=recent"
            )
            html = self.fetch_page(url)
            if not html:
                break

            tree = HTMLParser(html)
            review_divs = tree.css('[data-hook="review"]')
            if not review_divs:
                break  # No more reviews

            for div in review_divs:
                title_node = div.css_first('[data-hook="review-title"]')
                body_node = div.css_first('[data-hook="review-body"]')
                rating_node = div.css_first('[data-hook="review-star-rating"]')
                date_node = div.css_first('[data-hook="review-date"]')
                verified_node = div.css_first('[data-hook="avp-badge"]')

                reviews.append({
                    "asin": asin,
                    "title": title_node.text(strip=True) if title_node else None,
                    "body": body_node.text(strip=True) if body_node else None,
                    "rating": rating_node.text(strip=True)[:3] if rating_node else None,
                    "date": date_node.text(strip=True) if date_node else None,
                    "verified": verified_node is not None,
                })

            # Respectful delay between pages
            time.sleep(random.uniform(2.0, 5.0))
        return reviews
```
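With the review dicts in hand, the rating distribution mentioned earlier can be rebuilt locally. A sketch, assuming rating strings like "5.0" as produced by scrape_reviews above:

```python
from collections import Counter


def rating_distribution(reviews: list[dict]) -> dict[int, int]:
    """Count reviews per star level from scraped review dicts."""
    counts = Counter()
    for review in reviews:
        rating = review.get("rating")
        if rating is None:
            continue
        try:
            counts[int(float(rating))] += 1
        except ValueError:
            continue  # Skip unparseable rating text
    return dict(counts)
```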
Tracking Prices and BSR Over Time
Single snapshots are useful, but the real value is in time-series data. Here's a lightweight approach using SQLite:
```python
import sqlite3
from datetime import datetime, timezone


class PriceTracker:
    def __init__(self, db_path="amazon_prices.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS price_history (
                asin TEXT,
                price REAL,
                bsr TEXT,
                rating TEXT,
                review_count TEXT,
                timestamp TEXT,
                PRIMARY KEY (asin, timestamp)
            )
        """)
        self.conn.commit()

    def record(self, product_data):
        price_str = product_data.get("price", "")
        price_val = None
        if price_str:
            cleaned = price_str.replace("$", "").replace(",", "")
            try:
                price_val = float(cleaned)
            except ValueError:
                pass

        # Store timestamps in SQLite's own "YYYY-MM-DD HH:MM:SS" format so
        # they compare correctly against datetime('now', ...) in queries
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
        self.conn.execute(
            "INSERT OR REPLACE INTO price_history VALUES (?, ?, ?, ?, ?, ?)",
            (
                product_data["asin"],
                price_val,
                product_data.get("bsr"),
                product_data.get("rating"),
                product_data.get("review_count"),
                timestamp,
            ),
        )
        self.conn.commit()

    def get_history(self, asin, days=30):
        cursor = self.conn.execute(
            """
            SELECT price, bsr, timestamp
            FROM price_history
            WHERE asin = ?
              AND timestamp > datetime('now', ?)
            ORDER BY timestamp
            """,
            (asin, f"-{days} days"),
        )
        return cursor.fetchall()
```
Run your scraper on a schedule (cron, Apify schedules, or a simple loop) to build up historical data that reveals pricing patterns and sales trends.
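Once history accumulates, simple analyses fall straight out of the rows get_history returns. For example, flagging significant price drops between consecutive snapshots (the 10% threshold here is an arbitrary choice):

```python
def find_price_drops(history: list[tuple], threshold: float = 0.10) -> list[dict]:
    """Flag consecutive snapshots where price fell by more than `threshold`.

    `history` is a list of (price, bsr, timestamp) rows, oldest first,
    as returned by PriceTracker.get_history().
    """
    drops = []
    rows = [r for r in history if r[0] is not None]  # Skip unparsed prices
    for prev, curr in zip(rows, rows[1:]):
        old_price = prev[0]
        new_price, _, when = curr
        if old_price > 0 and (old_price - new_price) / old_price > threshold:
            drops.append({
                "timestamp": when,
                "from": old_price,
                "to": new_price,
                "pct": round((old_price - new_price) / old_price * 100, 1),
            })
    return drops
```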
Handling Amazon's Anti-Bot Defenses
Amazon has some of the most sophisticated bot detection on the web. Here's what works in 2026:
1. Request Pacing
Never hit Amazon faster than one request every 2-3 seconds from the same IP. Burst traffic is the fastest way to get blocked:
```python
import time
import random


def respectful_delay():
    """Wait 2-5 seconds between requests."""
    time.sleep(random.uniform(2.0, 5.0))
```
2. Proxy Rotation
For any serious volume, residential proxies are essential. Datacenter IPs get flagged quickly:
```python
    def fetch_with_proxy(self, url, proxy):
        return httpx.get(
            url,
            headers=self._get_headers(),
            proxy=proxy,
            follow_redirects=True,
            timeout=15.0,
        )
```
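How you rotate through the pool is up to you; a minimal round-robin cycle is enough to start with. A sketch, where the proxy URLs are placeholders for whatever endpoints your provider issues:

```python
import itertools


class ProxyPool:
    """Round-robin over a fixed list of proxy URLs."""

    def __init__(self, proxies: list[str]):
        self._cycle = itertools.cycle(proxies)

    def next(self) -> str:
        return next(self._cycle)


# Placeholder credentials; substitute your provider's residential endpoints
pool = ProxyPool([
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
])
```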
3. CAPTCHA Detection
When Amazon returns a 503 or a page containing "Enter the characters you see below", you've been flagged. Back off significantly:
```python
    def is_captcha_page(self, html):
        captcha_signals = [
            "Enter the characters you see below",
            "api-services-support@amazon.com",
            "Type the characters you see in this image",
        ]
        return any(signal in html for signal in captcha_signals)
```
4. Session Management
Rotate sessions (cookies + user agent) every 50-100 requests. Amazon tracks behavioral fingerprints across requests.
Production-Ready Alternative
If you're building a product that depends on Amazon data, maintaining your own scraping infrastructure is a significant ongoing commitment. Proxy costs, selector maintenance, CAPTCHA handling, and anti-bot countermeasures add up fast.
Our Amazon Scraper on Apify handles all of this out of the box — structured JSON output for products, reviews, pricing, and BSR data with built-in proxy management and automatic retries:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("cryptosignals/amazon-scraper").call(
    run_input={
        "searchTerms": ["wireless headphones"],
        "maxResults": 50,
        "includeReviews": True,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} - {item['price']} - BSR: {item['bsr']}")
```
This lets you focus on building your analytics or pricing tool rather than fighting Amazon's bot detection.
Legitimate Use Cases
Price Comparison Engines: Monitor prices across Amazon and competitors to surface the best deals. This is the foundation of services like CamelCamelCamel and Keepa.
Market Research and Product Validation: Before launching a product, analyze the competitive landscape — pricing, review sentiment, BSR trends, and market gaps. BSR data is particularly valuable because it's the best publicly available proxy for actual sales volume.
Competitive Intelligence: Track competitor products' pricing changes, new review patterns, and ranking movements. A sudden BSR improvement might indicate a successful ad campaign or viral moment.
Review Analysis and Sentiment Monitoring: Aggregate and analyze review text to identify common complaints, feature requests, and quality issues across product categories.
Inventory and Availability Monitoring: Track stock levels and availability patterns for supply chain analysis or arbitrage opportunities.
Academic Research: E-commerce researchers study pricing dynamics, review authenticity, marketplace competition, and consumer behavior using Amazon data.
Conclusion
Amazon scraping in 2026 is technically straightforward — the data is in server-rendered HTML, the URL structure is predictable, and the selectors are stable. The challenge is scale: Amazon's anti-bot systems are world-class, and maintaining reliable extraction at volume requires proxy infrastructure, session management, and CAPTCHA handling.
For prototyping and small-scale research, the code examples in this guide will get you started. For production workloads where reliability matters, consider a managed platform that handles the infrastructure complexity.
The e-commerce data market continues to grow. Whether you're building a price tracker, a market research tool, or a competitive intelligence platform, programmatic access to Amazon's product catalog is a powerful foundation.