The Messy Beginning
Like many developers, I started Amazon scraping with a single Python file. What began as a "quick script" quickly turned into 500+ lines of spaghetti code:
```python
# The old way - amazon_spider_old.py (don't do this!)
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from fake_useragent import UserAgent

# 50+ lines of proxy management
# 100+ lines of parsing logic
# 200+ lines of error handling
# 150+ lines of data export
# Complete nightmare to maintain!
```
Sound familiar? Here's how I transformed this mess into a clean, maintainable Scrapy project.
The Pain Points of DIY Scraping
Problem #1: IP Blocks Everywhere
```python
# My hacky solution (don't judge!)
proxies = [
    "http://proxy1:8000",
    "http://proxy2:8000",
    # ... manually managing 20+ proxies
]

for i in range(len(proxies)):
    try:
        response = requests.get(url, proxies={"http": proxies[i]})
        if response.status_code == 200:
            break
    except:
        continue  # Prayer-driven development
```
Problem #2: Fragile Selectors
```python
# This broke every week
try:
    price = soup.select('.a-price-whole')[0].text
except:
    try:
        price = soup.select('.a-offscreen')[0].text
    except:
        try:
            price = soup.select('.a-price')[0].text
        except:
            price = "Not found"
```
Problem #3: Zero Monitoring
When scraping failed (and it failed often), I had no idea why:
- Was it the proxy?
- Did Amazon change their layout?
- Network issues?
- Rate limiting?
Debugging was pure guesswork.
The Scrapy Transformation
After one too many 3 AM debugging sessions, I decided to rebuild using Scrapy. Here's the before/after:
Before: Single Monolithic File
```
amazon_spider_old.py (534 lines)
├── Proxy management (manual)
├── User agent rotation (manual)
├── Request retries (manual)
├── Data parsing (fragile)
├── CSV export (buggy)
└── Error handling (prayer-based)
```
After: Clean Scrapy Architecture
```
amazon-scrapy-scraper/
├── amazon_scraper/
│   ├── spiders/
│   │   ├── amazon_search.py (72 lines)
│   │   └── amazon_product.py (78 lines)
│   ├── settings.py (31 lines)
│   └── data/ (auto-generated)
├── requirements.txt (3 lines)
└── README.md (comprehensive docs)
```
Result: 500+ lines → 150 lines total!
Key Architectural Decisions
1. Single Responsibility Spiders
Instead of one spider doing everything, I created focused spiders:
Search Spider (amazon_search.py):
```python
import scrapy


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def start_requests(self):
        keyword_list = ['ipad']
        for keyword in keyword_list:
            url = f'https://www.amazon.com/s?k={keyword}'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract each search result card
        products = response.css('div[data-component-type="s-search-result"]')
        for product in products:
            yield {
                'title': product.css('h2 a span::text').get(),
                'price': product.css('.a-price-whole::text').get(),
                'url': product.css('h2 a::attr(href)').get(),
            }
```
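One small refinement I'd suggest on top of that snippet (my addition, not necessarily in the repo): Amazon returns relative hrefs, and `response.urljoin()` turns them into absolute URLs before they hit the CSV:

```python
# Inside parse(): resolve Amazon's relative product links
relative_url = product.css('h2 a::attr(href)').get()
absolute_url = response.urljoin(relative_url) if relative_url else None
```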
Product Spider (amazon_product.py):
```python
import scrapy


class AmazonProductSpider(scrapy.Spider):
    name = 'amazon_product'

    def parse_product_data(self, response):
        # Detailed product extraction
        yield {
            "name": response.css("#productTitle::text").get("").strip(),
            "price": self.extract_price(response),
            "features": self.extract_features(response),
            "images": self.extract_images(response),
        }
```
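The helper methods aren't shown here. As an illustration, a minimal `extract_price` might look like the sketch below; the selectors are assumptions based on Amazon's usual price markup, not the exact ones from the repo:

```python
def extract_price(self, response):
    # Hypothetical helper: try a few common Amazon price locations
    price = (
        response.css('.a-price .a-offscreen::text').get()
        or response.css('#priceblock_ourprice::text').get()
        or response.css('.a-price-whole::text').get()
    )
    return price.strip() if price else None
```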
2. Configuration Over Code
All settings in one place:
```python
# settings.py
BOT_NAME = 'amazon_scraper'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1

# ScrapeOps Integration (game changer!)
SCRAPEOPS_API_KEY = 'your-api-key'
SCRAPEOPS_PROXY_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

# Auto-export to CSV
FEEDS = {
    'data/%(name)s_%(time)s.csv': {'format': 'csv'},
}
```
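Because FEEDS accepts multiple targets, adding a JSON Lines export alongside the CSV is just one more dictionary entry (the path below is illustrative):

```python
# settings.py - write both CSV and JSON Lines on every run
FEEDS = {
    'data/%(name)s_%(time)s.csv': {'format': 'csv'},
    'data/%(name)s_%(time)s.jl': {'format': 'jsonlines'},
}
```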
3. Built-in Superpowers
Scrapy + ScrapeOps gives you so much for free:
Scrapy provides:
- Automatic retries for failed requests and HTTP error codes
- Concurrent requests with rate limiting
- Robust selectors with fallbacks
- Multiple export formats (CSV, JSON, XML)
- Middleware system for custom logic
ScrapeOps provides:
- Proxy rotation and IP management
- Real-time monitoring dashboard
- Request success tracking
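To make the Scrapy side of that list concrete, these are the built-in settings that replace my hand-rolled retry and rate-limiting loops. The values are illustrative starting points, not the exact ones in the repo:

```python
# settings.py - retry and throttling knobs Scrapy ships with
RETRY_ENABLED = True
RETRY_TIMES = 3                        # re-queue failed requests up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 429]
AUTOTHROTTLE_ENABLED = True            # adapt the delay to observed latencies
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
```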
The ScrapeOps Game Changer
The biggest improvement wasn't Scrapy itself; it was integrating ScrapeOps:
Before: Manual Proxy Hell
```python
# 50+ lines of proxy management code
proxies = load_proxy_list()
current_proxy = 0

def get_next_proxy():
    global current_proxy
    if current_proxy >= len(proxies):
        current_proxy = 0
        time.sleep(60)  # Wait and pray
    proxy = proxies[current_proxy]
    current_proxy += 1
    return proxy
```
After: One Line Integration
```python
# settings.py
SCRAPEOPS_PROXY_ENABLED = True
```
That's it! ScrapeOps handles:
- Proxy rotation
- IP geolocation
- User agent rotation
- Request success monitoring
- Automatic retries
Monitoring Dashboard
The ScrapeOps dashboard shows real-time metrics:
- Request success rates
- Response times
- Error patterns
- Bandwidth usage
No more guessing what went wrong.
Performance Comparison
| Metric | Old Script | New Scrapy Project |
|---|---|---|
| Lines of Code | 534 | 150 |
| Success Rate | ~70% | ~95% |
| Maintainability | Nightmare | Easy |
| Debugging Time | Hours | Minutes |
| Scalability | Limited | Unlimited |
| Monitoring | None | Full dashboard |
Real-World Usage Examples
Quick Product Search
```bash
# Search for iPads
scrapy crawl amazon_search

# Output: data/amazon_search_2024-01-15T10-30-45.csv
```
Detailed Product Analysis
```bash
# Get comprehensive product data
scrapy crawl amazon_product

# Output includes features, images, variants
```
Custom Keywords
```python
# Edit amazon_search.py
keyword_list = ['mechanical keyboards', 'gaming mice', 'monitors']
```
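If you'd rather not edit the file for every run, Scrapy's spider arguments work nicely too. A sketch (the `keywords` argument name is my own, not part of the repo):

```python
# amazon_search.py - accept a comma-separated keyword list from the CLI
import scrapy


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def __init__(self, keywords='ipad', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keyword_list = [k.strip() for k in keywords.split(',')]

    def start_requests(self):
        for keyword in self.keyword_list:
            url = f'https://www.amazon.com/s?k={keyword}'
            yield scrapy.Request(url=url, callback=self.parse)
```

Then run it with `scrapy crawl amazon_search -a keywords="gaming mice,monitors"`.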
Lessons for Fellow Developers
1. Don't Reinvent the Wheel
I wasted months building retry logic and export functionality that Scrapy provides out of the box, plus proxy rotation that ScrapeOps handles.
2. Monitoring is Essential
Without ScrapeOps monitoring, I was flying blind. Now I can optimize based on real data.
3. Architecture Matters
Two focused spiders are infinitely more maintainable than one giant spider.
4. Start with Free Tiers
ScrapeOps free tier (1,000 requests/month) is perfect for development and small projects.
Getting Started (5-Minute Setup)
1. Clone and set up:

```bash
git clone https://github.com/Simple-Python-Scrapy-Scrapers/amazon-scrapy-scraper.git
cd amazon-scrapy-scraper
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\Activate.ps1 on Windows
pip install -r requirements.txt
```

2. Get a free ScrapeOps API key:
   - Visit ScrapeOps.io
   - 1,000 free requests/month

3. Configure and run:

```bash
# Add your API key to amazon_scraper/settings.py
cd amazon_scraper
scrapy crawl amazon_search
```
What's Next?
This architecture scales beautifully. I'm already planning:
- Review scraper for sentiment analysis
- Price tracking with alerts
- Multi-marketplace support (eBay, Walmart)
- API endpoints for real-time data
The clean foundation makes adding features straightforward.
Repository & Resources
Full source code: amazon-scrapy-scraper (https://github.com/Simple-Python-Scrapy-Scrapers/amazon-scrapy-scraper)
Discussion
Have you refactored similar projects? What architectural patterns work best for web scraping?
Drop a comment with:
- Your biggest scraping pain points
- Scrapy tips and tricks
- Alternative tools you've tried
Let's help each other build better scrapers!
P.S. - Web scraping should always respect robots.txt and rate limits. This project is for educational purposes.
Top comments (2)
I'm seriously getting into Python. Last week I recoded a 2200 line Golang utility into 200 lines of Python and my Python loving workmate won't stop saying, "Told you so!". I hate him so much right now for being right these past 2 years, ha ha!!!
haha, love this! nothing like having a workmate *gloat* for two years… and then realizing they were onto something all along! I totally get the feeling - my early scraping scripts felt like badge-of-honor messes until I bit the bullet and rebuilt everything in Python/Scrapy. It's wild how much cleaner and shorter the code can get.
What surprised you most when switching from Go to Python? Anything you miss from Go, or are you fully converted now?