The Messy Beginning
Like many developers, I started Amazon scraping with a single Python file. What began as a "quick script" quickly turned into 500+ lines of spaghetti code:
```python
# The old way - amazon_spider_old.py (don't do this!)
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from fake_useragent import UserAgent

# 50+ lines of proxy management
# 100+ lines of parsing logic
# 200+ lines of error handling
# 150+ lines of data export
# Complete nightmare to maintain!
```
Sound familiar? Here's how I transformed this mess into a clean, maintainable Scrapy project.
The Pain Points of DIY Scraping
Problem #1: IP Blocks Everywhere
```python
# My hacky solution (don't judge!)
proxies = [
    "http://proxy1:8000",
    "http://proxy2:8000",
    # ... manually managing 20+ proxies
]

for i in range(len(proxies)):
    try:
        response = requests.get(url, proxies={"http": proxies[i]})
        if response.status_code == 200:
            break
    except:
        continue  # Prayer-driven development
```
Problem #2: Fragile Selectors
```python
# This broke every week
try:
    price = soup.select('.a-price-whole')[0].text
except:
    try:
        price = soup.select('.a-offscreen')[0].text
    except:
        try:
            price = soup.select('.a-price')[0].text
        except:
            price = "Not found"
```
Problem #3: Zero Monitoring
When scraping failed (and it failed often), I had no idea why:
- Was it the proxy?
- Did Amazon change their layout?
- Network issues?
- Rate limiting?
Debugging was pure guesswork.
The Scrapy Transformation
After one too many 3 AM debugging sessions, I decided to rebuild using Scrapy. Here's the before/after:
Before: Single Monolithic File
```
amazon_spider_old.py (534 lines)
├── Proxy management (manual)
├── User agent rotation (manual)
├── Request retries (manual)
├── Data parsing (fragile)
├── CSV export (buggy)
└── Error handling (prayer-based)
```
After: Clean Scrapy Architecture
```
amazon-scrapy-scraper/
├── amazon_scraper/
│   ├── spiders/
│   │   ├── amazon_search.py (72 lines)
│   │   └── amazon_product.py (78 lines)
│   ├── settings.py (31 lines)
│   └── data/ (auto-generated)
├── requirements.txt (3 lines)
└── README.md (comprehensive docs)
```
Result: 500+ lines → 150 lines total!
Key Architectural Decisions
1. Single Responsibility Spiders
Instead of one spider doing everything, I created focused spiders:
Search Spider (amazon_search.py):
```python
import scrapy


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def start_requests(self):
        keyword_list = ['ipad']
        for keyword in keyword_list:
            url = f'https://www.amazon.com/s?k={keyword}'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract each search result card
        products = response.css('div[data-component-type="s-search-result"]')
        for product in products:
            yield {
                'title': product.css('h2 a span::text').get(),
                'price': product.css('.a-price-whole::text').get(),
                'url': product.css('h2 a::attr(href)').get(),
            }
```
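One small refinement I'd suggest on top of that snippet (my addition, not necessarily in the repo): Amazon returns relative hrefs, and `response.urljoin()` turns them into absolute URLs before they hit the CSV:

```python
# Inside parse(): resolve Amazon's relative product links
relative_url = product.css('h2 a::attr(href)').get()
absolute_url = response.urljoin(relative_url) if relative_url else None
```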
Product Spider (amazon_product.py):
```python
import scrapy


class AmazonProductSpider(scrapy.Spider):
    name = 'amazon_product'

    def parse_product_data(self, response):
        # Detailed product extraction
        yield {
            "name": response.css("#productTitle::text").get("").strip(),
            "price": self.extract_price(response),
            "features": self.extract_features(response),
            "images": self.extract_images(response),
        }
```
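The helper methods aren't shown here. As an illustration, a minimal `extract_price` might look like the sketch below; the selectors are assumptions based on Amazon's usual price markup, not the exact ones from the repo:

```python
def extract_price(self, response):
    # Hypothetical helper: try a few common Amazon price locations
    price = (
        response.css('.a-price .a-offscreen::text').get()
        or response.css('#priceblock_ourprice::text').get()
        or response.css('.a-price-whole::text').get()
    )
    return price.strip() if price else None
```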
2. Configuration Over Code
All settings in one place:
```python
# settings.py
BOT_NAME = 'amazon_scraper'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1

# ScrapeOps Integration (game changer!)
SCRAPEOPS_API_KEY = 'your-api-key'
SCRAPEOPS_PROXY_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

# Auto-export to CSV
FEEDS = {
    'data/%(name)s_%(time)s.csv': {'format': 'csv'},
}
```
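Because FEEDS accepts multiple targets, adding a JSON Lines export alongside the CSV is just one more dictionary entry (the path below is illustrative):

```python
# settings.py - write both CSV and JSON Lines on every run
FEEDS = {
    'data/%(name)s_%(time)s.csv': {'format': 'csv'},
    'data/%(name)s_%(time)s.jl': {'format': 'jsonlines'},
}
```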
3. Built-in Superpowers
Scrapy + ScrapeOps gives you so much for free:
Scrapy provides:
- Automatic retries for failed requests and HTTP error codes
- Concurrent requests with rate limiting
- Robust selectors with fallbacks
- Multiple export formats (CSV, JSON, XML)
- Middleware system for custom logic
ScrapeOps provides:
- Proxy rotation and IP management
- Real-time monitoring dashboard
- Request success tracking
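To make the Scrapy side of that list concrete, these are the built-in settings that replace my hand-rolled retry and rate-limiting loops. The values are illustrative starting points, not the exact ones in the repo:

```python
# settings.py - retry and throttling knobs Scrapy ships with
RETRY_ENABLED = True
RETRY_TIMES = 3                        # re-queue failed requests up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 429]
AUTOTHROTTLE_ENABLED = True            # adapt the delay to observed latencies
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 1
```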
The ScrapeOps Game Changer
The biggest improvement wasn't Scrapy itself; it was integrating ScrapeOps:
Before: Manual Proxy Hell
```python
# 50+ lines of proxy management code
proxies = load_proxy_list()
current_proxy = 0

def get_next_proxy():
    global current_proxy
    if current_proxy >= len(proxies):
        current_proxy = 0
        time.sleep(60)  # Wait and pray
    proxy = proxies[current_proxy]
    current_proxy += 1
    return proxy
```
After: One Line Integration
```python
# settings.py
SCRAPEOPS_PROXY_ENABLED = True
```
That's it! ScrapeOps handles:
- Proxy rotation
- IP geolocation
- User agent rotation
- Request success monitoring
- Automatic retries
Monitoring Dashboard
The ScrapeOps dashboard shows real-time metrics:
- Request success rates
- Response times
- Error patterns
- Bandwidth usage
No more guessing what went wrong.
Performance Comparison
| Metric | Old Script | New Scrapy Project |
|---|---|---|
| Lines of Code | 534 | 150 |
| Success Rate | ~70% | ~95% |
| Maintainability | Nightmare | Easy |
| Debugging Time | Hours | Minutes |
| Scalability | Limited | Unlimited |
| Monitoring | None | Full dashboard |
Real-World Usage Examples
Quick Product Search
```bash
# Search for iPads
scrapy crawl amazon_search

# Output: data/amazon_search_2024-01-15T10-30-45.csv
```
Detailed Product Analysis
```bash
# Get comprehensive product data
scrapy crawl amazon_product

# Output includes features, images, variants
```
Custom Keywords
```python
# Edit amazon_search.py
keyword_list = ['mechanical keyboards', 'gaming mice', 'monitors']
```
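If you'd rather not edit the file for every run, Scrapy's spider arguments work nicely too. A sketch (the `keywords` argument name is my own, not part of the repo):

```python
# amazon_search.py - accept a comma-separated keyword list from the CLI
import scrapy


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def __init__(self, keywords='ipad', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.keyword_list = [k.strip() for k in keywords.split(',')]

    def start_requests(self):
        for keyword in self.keyword_list:
            url = f'https://www.amazon.com/s?k={keyword}'
            yield scrapy.Request(url=url, callback=self.parse)
```

Then run it with `scrapy crawl amazon_search -a keywords="gaming mice,monitors"`.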
Lessons for Fellow Developers
1. Don't Reinvent the Wheel
I wasted months building retry logic and export functionality that Scrapy provides out of the box, plus proxy rotation that ScrapeOps handles.
2. Monitoring is Essential
Without ScrapeOps monitoring, I was flying blind. Now I can optimize based on real data.
3. Architecture Matters
Two focused spiders are infinitely more maintainable than one giant spider.
4. Start with Free Tiers
ScrapeOps free tier (1,000 requests/month) is perfect for development and small projects.
Getting Started (5-Minute Setup)
1. Clone and set up:

```bash
git clone https://github.com/Simple-Python-Scrapy-Scrapers/amazon-scrapy-scraper.git
cd amazon-scrapy-scraper
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\Activate.ps1 on Windows
pip install -r requirements.txt
```

2. Get a free ScrapeOps API key:
   - Visit ScrapeOps.io
   - 1,000 free requests/month

3. Configure and run:

```bash
# Add your API key to amazon_scraper/settings.py
cd amazon_scraper
scrapy crawl amazon_search
```
What's Next?
This architecture scales beautifully. I'm already planning:
- Review scraper for sentiment analysis
- Price tracking with alerts
- Multi-marketplace support (eBay, Walmart)
- API endpoints for real-time data
The clean foundation makes adding features straightforward.
Repository & Resources
Full source code: amazon-scrapy-scraper (https://github.com/Simple-Python-Scrapy-Scrapers/amazon-scrapy-scraper)
Discussion
Have you refactored similar projects? What architectural patterns work best for web scraping?
Drop a comment with:
- Your biggest scraping pain points
- Scrapy tips and tricks
- Alternative tools you've tried
Let's help each other build better scrapers!
P.S. - Web scraping should always respect robots.txt and rate limits. This project is for educational purposes.
Top comments (2)
I'm seriously getting into Python. Last week I recoded a 2200 line Golang utility into 200 lines of Python and my Python loving workmate won't stop saying, "Told you so!". I hate him so much right now for being right these past 2 years, ha ha!!!
haha, love this! nothing like having a workmate *gloat* for two years… and then realizing they were onto something all along! I totally get the feeling - my early scraping scripts felt like badge-of-honor messes until I bit the bullet and rebuilt everything in Python/Scrapy. It's wild how much cleaner and shorter the code can get.
What surprised you most when switching from Go to Python? Anything you miss from Go, or are you fully converted now?