Introduction
Amazon remains the largest e-commerce platform in the world, and extracting product data from it — prices, reviews, seller information — is one of the most common web scraping use cases. Whether you're building a price monitoring tool, doing competitive analysis, or conducting e-commerce research, knowing how to scrape Amazon effectively in 2026 is a valuable skill.
In this guide, I'll walk you through scraping Amazon product data using Python, covering the real challenges you'll face and practical solutions that actually work.
Why Scrape Amazon?
There are plenty of legitimate reasons to scrape Amazon product data:
- Price monitoring: Track competitor pricing across thousands of products
- Market research: Analyze product trends, review sentiment, and category performance
- Competitive analysis: Monitor new sellers, pricing strategies, and product launches
- Academic research: Study consumer behavior, pricing dynamics, and marketplace economics
The Challenges of Scraping Amazon in 2026
Before we dive into code, let's be honest about what you're up against:
- Aggressive bot detection: Amazon uses sophisticated fingerprinting, CAPTCHAs, and behavioral analysis
- Dynamic content: Many product pages load data via JavaScript
- Rate limiting: Too many requests from one IP will get you blocked fast
- Changing HTML structure: Amazon frequently updates its page layouts, which silently breaks your selectors
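A quick way to see these defenses in action is to check each response for block signals before parsing it. The status codes and CAPTCHA phrase below are common signals, not an exhaustive list, so treat this as a rough heuristic:

```python
def looks_blocked(status_code: int, html: str) -> bool:
    """Rough heuristic for detecting an Amazon block or CAPTCHA page."""
    # 403/503 responses usually mean rate limiting or a hard block
    if status_code in (403, 503):
        return True
    # Amazon's robot-check interstitial asks you to type a CAPTCHA
    lowered = html.lower()
    return 'enter the characters you see below' in lowered or 'captcha' in lowered
```

Checking every response this way lets you back off early instead of parsing empty results.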
Setting Up Your Environment
First, install the required packages:
pip install requests beautifulsoup4 lxml
Basic Amazon Scraper
Here's a straightforward scraper that extracts product data from an Amazon product page:
import requests
from bs4 import BeautifulSoup
import time
import random
class AmazonScraper:
    def __init__(self):
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/124.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        }

    def scrape_product(self, url: str) -> dict:
        """Scrape a single Amazon product page."""
        response = self.session.get(url, headers=self.headers, timeout=15)
        if response.status_code != 200:
            print(f"Failed to fetch {url}: {response.status_code}")
            return {}
        soup = BeautifulSoup(response.text, 'lxml')
        product = {
            'title': self._get_title(soup),
            'price': self._get_price(soup),
            'rating': self._get_rating(soup),
            'review_count': self._get_review_count(soup),
            'seller': self._get_seller(soup),
            'availability': self._get_availability(soup),
        }
        return product

    def _get_title(self, soup):
        el = soup.select_one('#productTitle')
        return el.text.strip() if el else None

    def _get_price(self, soup):
        # Amazon uses several price layouts; try the most common selectors in order
        selectors = [
            'span.a-price span.a-offscreen',
            '#priceblock_ourprice',
            '#priceblock_dealprice',
            'span.a-price-whole',
        ]
        for sel in selectors:
            el = soup.select_one(sel)
            if el:
                return el.text.strip()
        return None

    def _get_rating(self, soup):
        el = soup.select_one('#acrPopover')
        if el:
            title = el.get('title', '')  # e.g. "4.5 out of 5 stars"
            return title.split(' ')[0] if title else None
        return None

    def _get_review_count(self, soup):
        el = soup.select_one('#acrCustomerReviewText')
        return el.text.strip() if el else None

    def _get_seller(self, soup):
        el = soup.select_one('#sellerProfileTriggerId')
        return el.text.strip() if el else 'Amazon'

    def _get_availability(self, soup):
        el = soup.select_one('#availability span')
        return el.text.strip() if el else None
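Before pointing the class at live pages, you can sanity-check the same CSS selectors offline against a saved or hand-written HTML snippet. The markup below is a minimal stand-in for a real product page, not actual Amazon output:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a saved product page; real pages are far larger
sample_html = '''
<span id="productTitle"> Example Widget, Pack of 2 </span>
<span class="a-price"><span class="a-offscreen">$19.99</span></span>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
title = soup.select_one('#productTitle').text.strip()
price = soup.select_one('span.a-price span.a-offscreen').text.strip()
print(title, price)  # Example Widget, Pack of 2 $19.99
```

If a selector stops matching after an Amazon layout change, a fixture test like this fails immediately instead of silently returning None in production.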
Adding Proxy Rotation
Here's the thing — the basic scraper above will work for a few requests, then Amazon will block you. You need proxy rotation for any serious scraping work.
import itertools

class ProxyRotator:
    def __init__(self, proxy_list: list[str]):
        self.proxies = itertools.cycle(proxy_list)

    def get_next(self) -> dict:
        proxy = next(self.proxies)
        return {
            'http': f'http://{proxy}',
            'https': f'http://{proxy}',
        }

# Usage with the scraper
proxies = ProxyRotator([
    'user:pass@proxy1.example.com:8080',
    'user:pass@proxy2.example.com:8080',
    'user:pass@proxy3.example.com:8080',
])

def scrape_with_proxy(url: str, headers: dict):
    proxy = proxies.get_next()
    response = requests.get(url, headers=headers, proxies=proxy, timeout=15)
    return response
Pro tip: Residential proxies work much better than datacenter proxies for Amazon. Datacenter IPs are flagged almost immediately.
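The same idea applies to User-Agent strings: rotating them alongside proxies makes your traffic look less uniform. Here's a minimal sketch; the UA pool is illustrative, so swap in current real browser strings:

```python
import random

# Illustrative pool of browser UA strings; keep these current in real use
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
]

def random_headers() -> dict:
    """Build per-request headers with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.9',
    }
```

Pass `random_headers()` instead of a fixed dict wherever the snippets in this guide use `headers`.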
Scraping Search Results
Scraping individual product pages is useful, but often you want to scrape search results to find products:
def scrape_search_results(query: str, pages: int = 3) -> list[dict]:
    """Scrape Amazon search results for a given query."""
    products = []
    base_url = 'https://www.amazon.com/s'
    for page in range(1, pages + 1):
        params = {'k': query, 'page': page}
        # 'headers' is the browser-style dict from the basic scraper above
        response = requests.get(base_url, params=params, headers=headers, timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
        items = soup.select('[data-component-type="s-search-result"]')
        for item in items:
            asin = item.get('data-asin', '')
            title_el = item.select_one('h2 a span')
            price_el = item.select_one('span.a-price span.a-offscreen')
            rating_el = item.select_one('span.a-icon-alt')
            products.append({
                'asin': asin,
                'title': title_el.text.strip() if title_el else None,
                'price': price_el.text.strip() if price_el else None,
                'rating': rating_el.text.strip() if rating_el else None,
                'url': f'https://www.amazon.com/dp/{asin}',
            })
        # Random delay between pages
        time.sleep(random.uniform(2, 5))
    return products
Extracting Reviews
Reviews are gold for sentiment analysis and product research:
def scrape_reviews(asin: str, pages: int = 5) -> list[dict]:
    """Scrape reviews for a product by ASIN."""
    reviews = []
    base_url = f'https://www.amazon.com/product-reviews/{asin}'
    for page in range(1, pages + 1):
        params = {'pageNumber': page, 'sortBy': 'recent'}
        response = requests.get(base_url, params=params, headers=headers, timeout=15)
        soup = BeautifulSoup(response.text, 'lxml')
        review_elements = soup.select('[data-hook="review"]')
        for review in review_elements:
            title_el = review.select_one('[data-hook="review-title"] span')
            body_el = review.select_one('[data-hook="review-body"] span')
            rating_el = review.select_one('[data-hook="review-star-rating"] span')
            date_el = review.select_one('[data-hook="review-date"]')
            reviews.append({
                'title': title_el.text.strip() if title_el else None,
                'body': body_el.text.strip() if body_el else None,
                'rating': rating_el.text.strip() if rating_el else None,
                'date': date_el.text.strip() if date_el else None,
            })
        time.sleep(random.uniform(3, 6))
    return reviews
The Managed Solution: ScraperAPI
If you're scraping at any real scale — hundreds or thousands of products — managing your own proxies, handling CAPTCHAs, and dealing with blocks gets exhausting fast. I've spent more time debugging proxy issues than writing actual data pipelines.
ScraperAPI handles all of this for you. You send a request through their API, and they handle proxy rotation, CAPTCHA solving, browser fingerprinting, and retries. It's a single API call:
import requests

SCRAPERAPI_KEY = 'your_api_key'

def scrape_with_scraperapi(url: str) -> str:
    """Scrape any URL through ScraperAPI."""
    payload = {
        'api_key': SCRAPERAPI_KEY,
        'url': url,
        'render': 'true',  # Enable JavaScript rendering
    }
    response = requests.get(
        'https://api.scraperapi.com',
        params=payload,
        timeout=60
    )
    return response.text

# Works seamlessly with BeautifulSoup
html = scrape_with_scraperapi('https://www.amazon.com/dp/B0EXAMPLE')
soup = BeautifulSoup(html, 'lxml')
title_el = soup.select_one('#productTitle')
print(title_el.text.strip() if title_el else 'title not found')
They also have a dedicated Amazon endpoint that returns structured JSON — no parsing needed:
def get_amazon_product(asin: str) -> dict:
    """Get structured Amazon product data via ScraperAPI."""
    response = requests.get(
        'https://api.scraperapi.com/structured/amazon/product',
        params={
            'api_key': SCRAPERAPI_KEY,
            'asin': asin,
            'country': 'us',
        },
        timeout=60
    )
    return response.json()
Try ScraperAPI free — they offer 5,000 free API credits to get started.
Best Practices for Amazon Scraping
- Respect rate limits: Add random delays between requests (2-5 seconds minimum)
- Rotate User-Agents: Don't use the same UA string for every request
- Use residential proxies: Datacenter IPs get flagged immediately
- Handle errors gracefully: Amazon will return 503s and CAPTCHAs — retry with backoff
- Cache results: Don't re-scrape data you already have
- Monitor your success rate: If it drops below 90%, something is wrong
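The retry advice above can be sketched as exponential backoff with jitter. The `get` callable is parameterized so you can plug in a session, proxies, or ScraperAPI; that parameterization is my own convention, not a requests feature:

```python
import random
import time

def fetch_with_backoff(get, url: str, max_retries: int = 4, base_delay: float = 2.0):
    """Retry `get(url)` until it returns HTTP 200, doubling the delay each time."""
    delay = base_delay
    for attempt in range(max_retries):
        response = get(url)
        if response.status_code == 200:
            return response
        # 503s and CAPTCHA pages are usually transient; wait with jitter, then retry
        time.sleep(delay + random.uniform(0, 1))
        delay *= 2
    raise RuntimeError(f'Giving up on {url} after {max_retries} attempts')
```

The jitter matters: if many workers retry on the same fixed schedule, their requests cluster and trip rate limits again.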
Storing Your Data
For any serious project, dump your scraped data into a database:
import sqlite3
from datetime import datetime

def save_product(product: dict, db_path: str = 'amazon_data.db'):
    # Expects an 'asin' key (e.g. from the search-results scraper)
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS products (
            asin TEXT PRIMARY KEY,
            title TEXT,
            price TEXT,
            rating TEXT,
            review_count TEXT,
            seller TEXT,
            scraped_at TEXT
        )
    ''')
    cursor.execute('''
        INSERT OR REPLACE INTO products
        VALUES (?, ?, ?, ?, ?, ?, ?)
    ''', (
        product.get('asin'),
        product.get('title'),
        product.get('price'),
        product.get('rating'),
        product.get('review_count'),
        product.get('seller'),
        datetime.now().isoformat(),
    ))
    conn.commit()
    conn.close()
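Reading the data back out is symmetric; `load_products` below assumes the table created by `save_product` already exists:

```python
import sqlite3

def load_products(db_path: str = 'amazon_data.db') -> list[dict]:
    """Return every row from the products table as a list of dicts."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    rows = conn.execute(
        'SELECT * FROM products ORDER BY scraped_at DESC'
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]
```

From here it's one step to a DataFrame or CSV export for analysis.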
Conclusion
Scraping Amazon in 2026 is definitely doable, but it requires more sophistication than it did a few years ago. For small-scale projects, the DIY approach with rotating proxies works fine. For anything production-grade, a managed service like ScraperAPI will save you significant time and headaches.
The key is to start simple, test your approach, and scale up gradually. Happy scraping!
What's your experience scraping Amazon? Drop your questions or tips in the comments below.