
Building Scalable Web Scrapers with Python and Scrapy from Scratch

Scrapy isn’t just another Python framework. It’s the industry workhorse for fast, efficient, and customizable web scraping. Its asynchronous engine, built on Twisted, keeps many requests in flight at once, which is where its speed comes from. Add in middleware, and you’re crafting a scraper tailored to your exact needs.
But speed alone won’t get you far. Big data grabs require stealth. Proxies, user-agent rotation, and anti-detection techniques are non-negotiable. This guide dives deep into all of that, showing you how to start, scale, and safeguard your Scrapy projects.

Setting Up Scrapy on Windows

First things first, install Python. If you haven’t installed it, grab the latest Windows version (3.13.3 as of writing) from the official source. One crucial tip — during installation, check the box to add Python to your system PATH. This makes running Python commands in Command Prompt seamless.
Next, fire up Command Prompt and install Scrapy with a simple command:

pip install scrapy

Give it a moment — you’ll see a success message once it’s done.
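To confirm the install worked, you can ask Scrapy for its version:

scrapy version
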
To launch your project, run:

scrapy startproject your_project_name

Let’s call ours ScrapyTutorial.
Scrapy will scaffold your project with a clean, logical structure:

scrapy.cfg is the deploy configuration file; it points Scrapy to your project’s settings module.
settings.py holds project-wide settings such as delays, middlewares, and pipelines.
items.py defines the data structure you want to scrape.
pipelines.py controls how you process scraped data.
The spiders folder is where your spiders live — each spider is a Python class that dictates your scraping rules.
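The full scaffold looks roughly like this:

ScrapyTutorial/
    scrapy.cfg
    ScrapyTutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py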

Crafting Your First Spider

Navigate into your project folder:

cd ScrapyTutorial

Generate a spider for your target site:

scrapy genspider SpiderName example.com

Don’t run the spider file directly with Python. You’ll want an IDE like Visual Studio Code to edit it. Open it up and get ready to tweak.
Here’s the barebones spider setup:

import scrapy

class SpidernameSpider(scrapy.Spider):
    # name, allowed_domains, and start_urls come from the genspider arguments
    name = 'SpiderName'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        pass

allowed_domains keeps your spider focused. Without it, your scraper might wander off, hitting unintended sites — wasting time and risking bans.

Extracting Data

Let’s move from a skeleton spider to one that extracts useful info.
Run your spider from Command Prompt like this:

scrapy crawl SpiderName

If your parse() function just has pass, it’ll finish quickly but show no data.
Replace pass with:

print(response.body.decode('utf-8'))

Now, run the crawl again. The Command Prompt will dump the raw HTML of your target page.
Looks messy? That’s the raw web — unfiltered, complex.
To extract exactly what you want, you’ll rely on CSS selectors — a powerful way to zero in on page elements.
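A handy way to experiment with selectors before touching the spider is Scrapy’s interactive shell, which fetches a page and drops you into a Python prompt with the response ready to query:

scrapy shell "https://example.com"

Inside the shell, expressions like response.css('p::text').getall() run immediately, so you can refine a selector before pasting it into parse().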

Utilizing CSS Selectors to Target Data

Open your target site in a browser, then hit Ctrl + Shift + I (or right-click and choose Inspect). This reveals the page’s HTML.
Identify the element you want — say, pricing data inside:

<p class="tp-headline-m text-neutral-0">$0.22</p>

Notice those classes? Perfect for CSS selectors.
In your spider’s parse() method, add:

pricing = response.css('[class="tp-headline-m text-neutral-0"]::text').getall()
if pricing:
    print("Price details:")
    for price in pricing:
        print(f"- {price.strip()}")

Boom. This tells Scrapy to grab every element whose class attribute is exactly those two classes (the <p> tags above) and extract the text inside.
You get a neat list of prices instead of a flood of HTML.
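Printing is fine for a quick check, but Scrapy is built around yielding items, so pipelines and feed exports can take over. A minimal sketch of parse() that yields the same prices as dictionaries:

def parse(self, response):
    # one item per price; pipelines and feed exports receive these dicts
    for price in response.css('[class="tp-headline-m text-neutral-0"]::text').getall():
        yield {'price': price.strip()}

Run scrapy crawl SpiderName -o prices.json and the yielded items land in a JSON file with no extra code.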

Utilizing XPath

If CSS selectors aren’t enough, XPath can slice through the DOM with surgical precision.
XPath lets you navigate by element position, hierarchy, and attributes — invaluable for complex pages.
Example:

//*/parent::p

This selects every <p> element that is the parent of at least one other element, anywhere on the page.
Get comfortable with XPath axes like child::, parent::, and following-sibling:: — they’re like GPS for the HTML tree.
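As a concrete example, here is the earlier pricing extraction rewritten with XPath; the class value is the same one we inspected above:

# XPath equivalent of the earlier CSS selector (exact class-attribute match)
pricing = response.xpath('//p[@class="tp-headline-m text-neutral-0"]/text()').getall()
for price in pricing:
    print(f"- {price.strip()}")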

Tackling JavaScript and Dynamic Content

Websites are getting smarter. JavaScript loads data after the page loads — often invisible to Scrapy, which only fetches raw HTML.
Price updates, weather widgets, interactive buttons — all powered by JS.
Enter tools like Selenium and Playwright.
Selenium automates real browsers, simulating clicks, logins, scrolling — everything you need to grab JS-loaded data.
Playwright, from Microsoft, is faster and handles waiting for page load out of the box.
Both can be plugged into Scrapy through community packages (scrapy-selenium provides a downloader middleware, scrapy-playwright a download handler), turning your scraper into a powerhouse that can mimic real user behavior.
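As a rough sketch of the Playwright route, using the scrapy-playwright plugin (install it with pip install scrapy-playwright, then run playwright install once; check the plugin’s docs for the current options):

# settings.py -- hand requests to Playwright's browser instead of Scrapy's default downloader
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# in your spider: flag the requests that need a real browser
def start_requests(self):
    yield scrapy.Request('https://example.com', meta={'playwright': True})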

Employing Proxies to Prevent Blocks

Scraping from a single IP? You’re asking for trouble.
Sites block suspicious IPs quickly. The fix: proxy rotation.
We recommend residential proxies — these mimic real users better than data center IPs, reducing block risks.
Install rotating proxy middleware with:

pip install scrapy-rotating-proxies

Add this to your settings.py:

ROTATING_PROXY_LIST = [
  'http://username:password@proxy_address:port',
]

DOWNLOADER_MIDDLEWARES = {
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
  'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
  'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

Save and run your spider as usual.
You can add multiple proxies to the list for automatic rotation, which goes a long way toward keeping your scraper clear of bans and CAPTCHAs.

Anti-Detection Musts

Rotating IPs alone won’t fool every site.
Rotate user-agent strings to simulate requests from different browsers/devices.
Manage sessions and cookies carefully. Scrapy’s built-in CookiesMiddleware helps here.
Throttle request rates with DOWNLOAD_DELAY in settings — scraping too fast screams “bot!”
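A minimal sketch of how those pieces might fit together, assuming a hand-maintained user-agent list (fill in real browser strings yourself):

# settings.py -- slow down and adapt to the target site
DOWNLOAD_DELAY = 2                # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay so timing looks less mechanical
AUTOTHROTTLE_ENABLED = True       # let Scrapy adjust its pace to server response times
COOKIES_ENABLED = True            # keep sessions consistent via CookiesMiddleware

# middlewares.py -- rotate user agents per request
import random

USER_AGENTS = [
    # placeholder: add a handful of real browser user-agent strings here
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        if USER_AGENTS:
            request.headers['User-Agent'] = random.choice(USER_AGENTS)

Register the middleware in DOWNLOADER_MIDDLEWARES (something like 'ScrapyTutorial.middlewares.RotateUserAgentMiddleware': 545) alongside the proxy middlewares above.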

Common Errors and Solutions

407 Proxy Authentication Error: Proxy creds must be formatted like http://username:password@host:port. Any deviation triggers this.
Proxy Downtime: Residential proxies can go offline if the device disconnects. Use a proxy checker tool to verify.
403 Forbidden: Your IP or user-agent got flagged. Increase delays, rotate proxies and user-agents, and double-check your headers.

Final Thoughts

Combining Scrapy with proxies provides a powerful way to scale web scraping projects both efficiently and discreetly. However, to handle JavaScript-heavy websites, it’s important to also become proficient with tools like Selenium or Playwright. Keep yourself updated on the latest anti-scraping techniques, as they are constantly evolving.
