DEV Community

Mai Vy Ly

How to Web Scrape Amazon with Python?

Want a fast, practical guide to scraping Amazon product data with Python? Here’s a concise walkthrough using requests + BeautifulSoup, with anti-bot tips, pagination, and clean parsing. For a working reference, check the GitHub repo: https://github.com/maivyly52-gif/amazon-web-scraper-python

What You’ll Learn

  • Send realistic HTTP requests (headers, delays)

  • Parse titles, prices, ratings, URLs with BeautifulSoup

  • Handle pagination safely

  • Reduce blocks with rotating user agents/proxies

  • Know ethical & legal guardrails

Explore the full example code here: https://github.com/maivyly52-gif/amazon-web-scraper-python

pip install requests beautifulsoup4 fake-useragent

(Proxy support? Add httpx/requests[socks] or a provider SDK.)
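If you do add proxy support, a minimal pattern is to pick a proxy per request and pass it to requests via the proxies mapping. A sketch, assuming a hypothetical pool of provider endpoints (the example.com URLs are placeholders, not real gateways):

```python
import random

# Hypothetical proxy endpoints -- substitute your provider's gateway URLs.
PROXIES = [
    "http://user:pass@proxy-a.example.com:8000",
    "http://user:pass@proxy-b.example.com:8000",
]

def proxy_kwargs():
    """Pick a random proxy and return keyword arguments for requests.get()."""
    proxy = random.choice(PROXIES)
    # requests routes http and https traffic per scheme in this mapping
    return {"proxies": {"http": proxy, "https": proxy}}

# usage: requests.get(url, headers=headers, **proxy_kwargs())
```

Returning kwargs instead of calling requests directly keeps the proxy choice testable and lets you swap in a provider SDK later without touching the fetch logic.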

Core Steps

1) Build a “human-like” request

import time, random, requests
from fake_useragent import UserAgent

ua = UserAgent()

def fetch(url, *, retries=3, backoff=2):
    for i in range(retries):
        # build headers per attempt so the user agent rotates on each request
        headers = {
            "User-Agent": ua.random,
            "Accept-Language": "en-US,en;q=0.9",
        }
        resp = requests.get(url, headers=headers, timeout=20)
        if resp.status_code == 200 and "Robot Check" not in resp.text:
            return resp.text
        time.sleep(backoff * (i + 1) + random.uniform(0.2, 1.1))
    return None


2) Parse product cards

from bs4 import BeautifulSoup

def parse_search(html):
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select("div.s-main-slot div[data-asin][data-component-type='s-search-result']"):
        asin = card.get("data-asin")
        title_el = card.select_one("h2 a span")
        price_el = card.select_one("span.a-price > span.a-offscreen")
        rating = card.select_one("span.a-icon-alt")
        link_el = card.select_one("h2 a")
        if not (asin and title_el and link_el):
            continue
        items.append({
            "asin": asin,
            "title": title_el.get_text(strip=True),
            "price": price_el.get_text(strip=True) if price_el else None,
            "rating": rating.get_text(strip=True) if rating else None,
            "url": f"https://www.amazon.com{link_el['href'].split('?')[0]}",
        })
    return items


3) Walk pagination (carefully)

from urllib.parse import urlencode

def search_amazon(query, pages=1):
    base = "https://www.amazon.com/s"
    results = []
    for page in range(1, pages + 1):
        params = {"k": query, "page": page}
        html = fetch(f"{base}?{urlencode(params)}")
        if not html:
            break
        results.extend(parse_search(html))
        time.sleep(random.uniform(1.2, 3.1))  # be gentle
    return results

if __name__ == "__main__":
    data = search_amazon("wireless earbuds", pages=2)
    for row in data[:5]:
        print(row)


Prefer a ready-to-run example? See the repo’s code paths and notes: https://github.com/maivyly52-gif/amazon-web-scraper-python

Anti-Bot Tips (Reduce Blocks)

  • Rotate User-Agents per request (fake-useragent or a maintained list).

  • Respectful delays (1–5s jitter) and low concurrency.

  • Proxies: residential/mobile work best; rotate IPs and subnets.

  • Keep URLs minimal; strip tracking parameters and avoid request patterns that look automated.

  • Fallback strategies: try different storefronts or narrower filters when you hit captchas.
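The first tip above can be done without an external library: keep a small pool of user-agent strings and cycle through them. A minimal sketch (the listed agent strings are examples; maintain your own up-to-date pool or use fake-useragent):

```python
import itertools

# Example user-agent strings -- swap in fake-useragent or your own list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

# itertools.cycle yields the pool endlessly, one entry per call to next()
_ua_cycle = itertools.cycle(USER_AGENTS)

def next_headers():
    """Return fresh request headers with the next user agent in the rotation."""
    return {
        "User-Agent": next(_ua_cycle),
        "Accept-Language": "en-US,en;q=0.9",
    }
```

Call next_headers() inside your fetch loop so every attempt goes out with a different agent string.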

You’ll find a compact starter you can adapt in the GitHub project: https://github.com/maivyly52-gif/amazon-web-scraper-python

Data You Can Extract (Typical)

  • Title, price, list price, rating, review count

  • ASIN, product URL, image URL

  • Badges (e.g., “Best Seller”, “Amazon’s Choice”)

  • Availability snippets
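Badges and review counts can be pulled from the same result cards as the earlier fields. The selectors below are illustrative guesses: Amazon's markup changes often, so verify the actual class names in your browser's devtools before relying on them.

```python
from bs4 import BeautifulSoup

def parse_extras(card):
    """Extract badge text and review count from a search-result card.

    The class names here are assumptions about Amazon's markup and may
    need adjusting -- inspect the live page before using them.
    """
    badge = card.select_one("span.a-badge-text")
    reviews = card.select_one("span.a-size-base.s-underline-text")
    return {
        "badge": badge.get_text(strip=True) if badge else None,
        "review_count": reviews.get_text(strip=True) if reviews else None,
    }

# quick check against a minimal, hand-written card snippet
sample = BeautifulSoup(
    '<div data-asin="B0TEST">'
    '<span class="a-badge-text">Best Seller</span>'
    '<span class="a-size-base s-underline-text">1,234</span></div>',
    "html.parser",
)
```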

Legal & Ethical Notes

  • Check Amazon’s Terms of Use and your local laws before scraping.

  • Prefer official APIs when possible (e.g., Amazon Product Advertising API) for reliability.

  • Don’t overload servers; throttle requests and cache results.

  • Use scraped data only where you have the right to use it.

Next Steps

  • Turn results into CSV/JSON for analysis.

  • Add retry with CAPTCHA detection and proxy rotation.

  • Expand parsing to product detail pages (features, bullets, specs).
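For the first step, the dicts returned by parse_search() map directly onto the standard-library csv and json modules. A small sketch (file names are up to you):

```python
import csv
import json

# keys produced by parse_search() in the article
FIELDS = ["asin", "title", "price", "rating", "url"]

def save_results(rows, csv_path="results.csv", json_path="results.json"):
    """Write scraped rows to both CSV (for spreadsheets) and JSON (for code)."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

Hook it into the earlier __main__ block with save_results(data) and you have files ready for pandas or a notebook.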

Dive deeper, copy the boilerplate, and tweak it for your use case here: https://github.com/maivyly52-gif/amazon-web-scraper-python. If you find it useful, star the repo and explore the code examples.
