From 500+ Lines to 150: How I Refactored My Amazon Scraper Into a Clean Scrapy Project

The Messy Beginning

Like many developers, I started Amazon scraping with a single Python file. What began as a "quick script" quickly turned into 500+ lines of spaghetti code:

# The old way - amazon_spider_old.py (don't do this!)
import requests
from bs4 import BeautifulSoup
import csv
import time
import random
from fake_useragent import UserAgent

# 50+ lines of proxy management
# 100+ lines of parsing logic
# 200+ lines of error handling
# 150+ lines of data export
# Complete nightmare to maintain!

Sound familiar? Here's how I transformed this mess into a clean, maintainable Scrapy project.


The Pain Points of DIY Scraping

Problem #1: IP Blocks Everywhere

# My hacky solution (don't judge!)
proxies = [
    "http://proxy1:8000",
    "http://proxy2:8000", 
    # ... manually managing 20+ proxies
]

for i in range(len(proxies)):
    try:
        response = requests.get(url, proxies={"http": proxies[i]})
        if response.status_code == 200:
            break
    except:
        continue  # Prayer-driven development 🙏

Problem #2: Fragile Selectors

# This broke every week
try:
    price = soup.select('.a-price-whole')[0].text
except:
    try:
        price = soup.select('.a-offscreen')[0].text
    except:
        try:
            price = soup.select('.a-price')[0].text
        except:
            price = "Not found"  # 😭

Problem #3: Zero Monitoring

When scraping failed (and it failed often), I had no idea why:

  • Was it the proxy?
  • Did Amazon change their layout?
  • Network issues?
  • Rate limiting?

Debugging was pure guesswork.


The Scrapy Transformation

After one too many 3 AM debugging sessions, I decided to rebuild using Scrapy. Here's the before/after:

Before: Single Monolithic File

amazon_spider_old.py (534 lines)
├── Proxy management (manual)
├── User agent rotation (manual)
├── Request retries (manual)
├── Data parsing (fragile)
├── CSV export (buggy)
└── Error handling (prayer-based)

After: Clean Scrapy Architecture

amazon-scrapy-scraper/
├── amazon_scraper/
│   ├── spiders/
│   │   ├── amazon_search.py      (72 lines)
│   │   └── amazon_product.py     (78 lines)
│   ├── settings.py               (31 lines)
│   └── data/                     (auto-generated)
├── requirements.txt              (3 lines)
└── README.md                     (comprehensive docs)

Result: 500+ lines → 150 lines total! 🎉


Key Architectural Decisions

1. Single Responsibility Spiders

Instead of one spider doing everything, I created focused spiders:

Search Spider (amazon_search.py):

import scrapy


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def start_requests(self):
        keyword_list = ['ipad']
        for keyword in keyword_list:
            url = f'https://www.amazon.com/s?k={keyword}'
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract search results
        products = response.css('div[data-component-type="s-search-result"]')
        for product in products:
            yield {
                'title': product.css('h2 a span::text').get(),
                'price': product.css('.a-price-whole::text').get(),
                'url': product.css('h2 a::attr(href)').get(),
            }
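
The excerpt stops after one results page. If you want pagination, Scrapy makes following links trivial with response.follow, which resolves relative URLs for you. A minimal sketch (the .s-pagination-next selector is my assumption about Amazon's current markup, not something taken from the repo):

        # At the end of parse(), after yielding the product dicts
        next_page = response.css('a.s-pagination-next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)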

Product Spider (amazon_product.py):

import scrapy


class AmazonProductSpider(scrapy.Spider):
    name = 'amazon_product'

    def parse_product_data(self, response):
        # Detailed product extraction
        yield {
            "name": response.css("#productTitle::text").get("").strip(),
            "price": self.extract_price(response),
            "features": self.extract_features(response),
            "images": self.extract_images(response),
        }
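
The extract_price, extract_features, and extract_images helpers live in the full spider; check the repo for the actual implementations. To give a flavour of the fallback-chain idea, here is a hedged sketch of what a price helper can look like (the selectors are my assumptions, not the repo's):

    def extract_price(self, response):
        # Try a few common price locations and return the first non-empty match
        for selector in ('.a-price .a-offscreen::text',
                         '#priceblock_ourprice::text',
                         '.a-price-whole::text'):
            price = response.css(selector).get()
            if price:
                return price.strip()
        return None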

2. Configuration Over Code

All settings in one place:

# settings.py
BOT_NAME = 'amazon_scraper'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1

# ScrapeOps Integration (game changer!)
SCRAPEOPS_API_KEY = 'your-api-key'
SCRAPEOPS_PROXY_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}

# Auto-export to CSV
FEEDS = {
    'data/%(name)s_%(time)s.csv': {'format': 'csv'}
}
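
With FEEDS in place, every crawl writes a timestamped CSV into data/ without a single line of exporter code. In recent Scrapy versions you can also override the output per run from the command line with -O (overwrite) or -o (append); the filename below is just an example:

scrapy crawl amazon_search -O data/ipads.json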

3. Built-in Superpowers

Scrapy + ScrapeOps gives you so much for free (a minimal settings sketch follows these lists):

Scrapy provides:

  • Automatic retries with exponential backoff
  • Concurrent requests with rate limiting
  • Robust selectors with fallbacks
  • Multiple export formats (CSV, JSON, XML)
  • Middleware system for custom logic

ScrapeOps provides:

  • Proxy rotation and IP management
  • Real-time monitoring dashboard
  • Request success tracking
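
To make the "for free" part concrete: the retry and throttling behaviour that took me hundreds of lines is plain configuration in Scrapy. A minimal settings sketch using only built-in options (the values are illustrative, not what the repo ships):

# settings.py -- Scrapy built-ins, no custom code required
RETRY_ENABLED = True
RETRY_TIMES = 3                                # retry failed requests up to 3 times
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]   # which responses count as failures

AUTOTHROTTLE_ENABLED = True                    # adapt request rate to server responses
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10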

The ScrapeOps Game Changer

The biggest improvement wasn't Scrapy itself; it was integrating ScrapeOps:

Before: Manual Proxy Hell

# 50+ lines of proxy management code
proxies = load_proxy_list()
current_proxy = 0

def get_next_proxy():
    global current_proxy
    if current_proxy >= len(proxies):
        current_proxy = 0
        time.sleep(60)  # Wait and pray
    proxy = proxies[current_proxy]
    current_proxy += 1
    return proxy

After: One Line Integration

# settings.py
SCRAPEOPS_PROXY_ENABLED = True

That's it! ScrapeOps handles:

  • Proxy rotation
  • IP geolocation
  • User agent rotation
  • Request success monitoring
  • Automatic retries

Monitoring Dashboard

The ScrapeOps dashboard shows real-time metrics:

  • Request success rates
  • Response times
  • Error patterns
  • Bandwidth usage

So, no more guessing what went wrong.
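
The dashboard comes from the ScrapeOps monitoring SDK, which is a separate package from the proxy SDK used above. A hedged sketch of wiring it up, assuming the scrapeops-scrapy package (double-check the extension path against the ScrapeOps docs):

# settings.py -- enable the ScrapeOps monitoring extension
SCRAPEOPS_API_KEY = 'your-api-key'

EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}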


Performance Comparison

| Metric          | Old Script | New Scrapy Project |
|-----------------|------------|--------------------|
| Lines of Code   | 534        | 150                |
| Success Rate    | ~70%       | ~95%               |
| Maintainability | Nightmare  | Easy               |
| Debugging Time  | Hours      | Minutes            |
| Scalability     | Limited    | Unlimited          |
| Monitoring      | None       | Full dashboard     |

Real-World Usage Examples

Quick Product Search

# Search for iPads
scrapy crawl amazon_search

# Output: data/amazon_search_2024-01-15T10-30-45.csv

Detailed Product Analysis

# Get comprehensive product data
scrapy crawl amazon_product

# Output includes features, images, variants

Custom Keywords

# Edit amazon_search.py
keyword_list = ['mechanical keyboards', 'gaming mice', 'monitors']
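
Editing the file works, but if you would rather not touch code for every run, Scrapy spider arguments (-a) are a nice alternative. A hedged sketch of how AmazonSearchSpider could accept a comma-separated keywords argument (not how the repo currently does it; parse stays the same):

import scrapy
from urllib.parse import quote_plus


class AmazonSearchSpider(scrapy.Spider):
    name = 'amazon_search'

    def __init__(self, keywords='ipad', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # "-a keywords=..." arrives as one string; split it into a list
        self.keyword_list = [k.strip() for k in keywords.split(',')]

    def start_requests(self):
        for keyword in self.keyword_list:
            # quote_plus encodes spaces in multi-word keywords
            url = f'https://www.amazon.com/s?k={quote_plus(keyword)}'
            yield scrapy.Request(url=url, callback=self.parse)

Run it with:

scrapy crawl amazon_search -a keywords="mechanical keyboards,gaming mice"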

Lessons for Fellow Developers

1. Don't Reinvent the Wheel

I wasted months building retry logic and export functionality that Scrapy provides out of the box, plus proxy rotation that ScrapeOps handles.

2. Monitoring is Essential

Without ScrapeOps monitoring, I was flying blind. Now I can optimize based on real data.

3. Architecture Matters

Two focused spiders are infinitely more maintainable than one giant spider.

4. Start with Free Tiers

ScrapeOps free tier (1,000 requests/month) is perfect for development and small projects.


Getting Started (5 Minutes Setup)

  1. Clone and setup:
   git clone https://github.com/Simple-Python-Scrapy-Scrapers/amazon-scrapy-scraper.git
   cd amazon-scrapy-scraper
   python -m venv .venv
   source .venv/bin/activate  # or .venv\Scripts\Activate.ps1 on Windows
   pip install -r requirements.txt
  2. Get a free ScrapeOps API key.

  3. Configure and run:

   # Add API key to amazon_scraper/settings.py
   cd amazon_scraper
   scrapy crawl amazon_search

What's Next?

This architecture scales beautifully. I'm already planning:

  • Review scraper for sentiment analysis
  • Price tracking with alerts
  • Multi-marketplace support (eBay, Walmart)
  • API endpoints for real-time data

The clean foundation makes adding features straightforward.


Repository & Resources

🔗 Full source code: amazon-scrapy-scraper



Discussion

Have you refactored similar projects? What architectural patterns work best for web scraping?

Drop a comment with:

  • Your biggest scraping pain points
  • Scrapy tips and tricks
  • Alternative tools you've tried

Let's help each other build better scrapers!


P.S. - Web scraping should always respect robots.txt and rate limits. This project is for educational purposes.

Top comments (2)

George Johnson

I'm seriously getting into Python. Last week I recoded a 2200 line Golang utility into 200 lines of Python and my Python loving workmate won't stop saying, "Told you so!". I hate him so much right now for being right these past 2 years, ha ha!!!

Noorsimar Singh • Edited

haha, love this! nothing like having a workmate _gloat_ for two years… and then realizing they were onto something all along! I totally get the feeling - my early scraping scripts felt like badge-of-honor messes until I bit the bullet and rebuilt everything in Python/Scrapy. It's wild how much cleaner and shorter the code can get.

What surprised you most when switching from Go to Python? Anything you miss from Go, or are you fully converted now?