Kev the bur

Posted on May 2

Proxy Integration with Scrapy

#tutorial #proxies #automation

How to Integrate Proxies with Scrapy for Effective Web Scraping

Scrapy is a robust Python framework beloved by developers for building fast and scalable web crawlers. When tackling complex scraping tasks, protecting your IP and maintaining stable access to target websites is essential. One effective approach is using proxies to route your requests, helping evade blocks and preserve your anonymity.

In this guide, we’ll walk you through setting up Scrapy from scratch and integrating proxies using DataImpulse, a proxy provider. You’ll learn two proxy integration methods: passing a proxy directly with each request and writing a custom middleware that handles proxies for every request. Finally, we’ll explore how to configure rotating proxies to make your scraper more resilient.

Getting Started with Scrapy: Setup and Basic Spider

Before diving into proxies, let’s set up a basic Scrapy project that scrapes book titles and prices from a sample site.

Installing Scrapy

Open your terminal and install Scrapy via pip:

pip install scrapy

Creating a New Scrapy Project

Create a project named scrapyproject:

scrapy startproject scrapyproject

Navigate into the project directory:

cd scrapyproject

Generating a Spider

Generate a spider to crawl “Books to Scrape”:

scrapy genspider books books.toscrape.com

This creates a books.py file inside the spiders folder with the basic spider boilerplate.

Customizing Your Spider to Extract Data

Open books.py and modify it to extract book titles and prices:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        start_urls = ['http://books.toscrape.com/']
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for article in response.css('article.product_pod'):
            yield {
                'title': article.css('h3 > a::attr(title)').get(),
                'price': article.css('.price_color::text').get()
            }

start_requests sends the initial HTTP request.
parse extracts the desired fields using CSS selectors.

You can run the spider like this to see results in the console:

scrapy crawl books

Or save the data to a CSV:

scrapy crawl books -o books.csv

Why Use Proxies with Scrapy?

Websites often limit the request rate per IP or ban scrapers outright. Proxies help by:

Masking your real IP address.
Distributing requests across multiple IPs.
Reducing the risk of getting blocked.
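To make the idea of distributing requests across multiple IPs concrete, here is a minimal sketch that pairs URLs with proxies in round-robin order. The proxy endpoints and the page URLs are placeholders chosen for illustration, and the helper name assign_proxies is hypothetical, not part of Scrapy.

```python
from itertools import cycle

# Hypothetical placeholder endpoints; substitute real proxy URLs.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    pool = cycle(proxies)
    return [(url, next(pool)) for url in urls]


# Illustrative page URLs on the sample site.
urls = [f"http://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 6)]
assignments = assign_proxies(urls, PROXIES)

for url, proxy in assignments:
    print(proxy, "<-", url)
```

With three proxies and five URLs, the fourth URL wraps around to the first proxy again, so no single IP carries all the traffic.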

DataImpulse offers affordable, reliable residential proxies you can integrate seamlessly into Scrapy projects.

Method 1: Using Proxies as a Request Parameter

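You can specify a proxy on a per-request basis by setting the proxy key in the meta argument of scrapy.Request; Scrapy’s built-in HttpProxyMiddleware reads that key and routes the request accordingly. Here’s how to add a DataImpulse residential proxy to your spider: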

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        start_urls = ['http://books.toscrape.com/']
        proxy = 'http://YourProxyPlanUsername:YourProxyPlanPassword@gw.dataimpulse.com:823'

        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={'proxy': proxy}
            )

    def parse(self, response):
        for article in response.css('article.product_pod'):
            yield {
                'title': article.css('h3 > a::attr(title)').get(),
                'price': article.css('.price_color::text').get()
            }

Notes:

Replace YourProxyPlanUsername and YourProxyPlanPassword with your actual DataImpulse credentials.
The proxy address uses the gateway gw.dataimpulse.com on port 823 for residential HTTP proxies.
For more customization options such as country-specific proxy entry points, refer to DataImpulse’s documentation.
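One practical detail when replacing the placeholders: if a username or password contains characters such as @, : or /, the proxy URL becomes ambiguous unless the credentials are percent-encoded. Below is a small sketch of one way to build the URL safely. The helper name build_proxy_url and the sample password are made up for illustration; Scrapy’s built-in proxy middleware is expected to decode the credentials before sending them, but it is worth verifying against your Scrapy version.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port):
    """Return an HTTP proxy URL with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


# The password below is a made-up example containing reserved characters.
proxy = build_proxy_url(
    "YourProxyPlanUsername", "p@ss:word/1", "gw.dataimpulse.com", 823
)
print(proxy)
```

The resulting string can be passed as meta={'proxy': proxy} exactly like the hard-coded URL in the spider above.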

Method 2: Implementing a Custom Proxy Middleware

For larger projects with multiple spiders, managing proxies via middleware is cleaner and more scalable. Middleware intercepts requests and modifies them without touching individual spiders.

Step 1: Create Proxy Middleware

Inside your project, open or create middlewares.py and add:

class BookProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.username = settings.get('PROXY_USER')
        self.password = settings.get('PROXY_PASSWORD')
        self.url = settings.get('PROXY_URL')
        self.port = settings.get('PROXY_PORT')

    def process_request(self, request, spider):
        proxy_url = f'http://{self.username}:{self.password}@{self.url}:{self.port}'
        request.meta['proxy'] = proxy_url

Step 2: Configure Settings

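In settings.py, add your proxy credentials and register the middleware. The custom middleware gets a lower order number than Scrapy’s built-in HttpProxyMiddleware so that it sets request.meta['proxy'] before the built-in middleware processes the request: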

PROXY_USER = 'YourProxyPlanUsername'
PROXY_PASSWORD = 'YourProxyPlanPassword'
PROXY_URL = 'gw.dataimpulse.com'
PROXY_PORT = '823'

DOWNLOADER_MIDDLEWARES = {
    'scrapyproject.middlewares.BookProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
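If you would rather not commit credentials to version control, one option is to read them from environment variables in settings.py. This is only a sketch: the environment variable names are our own choice, not something Scrapy or DataImpulse defines, and the fallbacks simply mirror the placeholders used above.

```python
import os

# Hypothetical environment variable names; the fallback values are the
# same placeholders used in the hard-coded settings.
PROXY_USER = os.environ.get("PROXY_USER", "YourProxyPlanUsername")
PROXY_PASSWORD = os.environ.get("PROXY_PASSWORD", "YourProxyPlanPassword")
PROXY_URL = os.environ.get("PROXY_URL", "gw.dataimpulse.com")
PROXY_PORT = os.environ.get("PROXY_PORT", "823")
```

The middleware does not need to change, since it still reads the same four setting names.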

Benefits:

Credentials and proxy details are centralized.
Easier to update proxy info without touching spiders.
Middleware works transparently for all requests.

Once set up, running scrapy crawl books will route requests through your proxy automatically.
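Before running a full crawl, the middleware logic can be checked in isolation. The sketch below repeats the middleware class from Step 1 and feeds it stand-ins: a plain dict in place of Scrapy’s settings object (both expose .get) and a SimpleNamespace with a meta dict in place of a real scrapy.Request. The credentials are dummy values, and the stand-ins are assumptions made purely for this quick check.

```python
from types import SimpleNamespace


class BookProxyMiddleware:
    """Same middleware as in Step 1, repeated so this snippet runs on its own."""

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.username = settings.get('PROXY_USER')
        self.password = settings.get('PROXY_PASSWORD')
        self.url = settings.get('PROXY_URL')
        self.port = settings.get('PROXY_PORT')

    def process_request(self, request, spider):
        proxy_url = f'http://{self.username}:{self.password}@{self.url}:{self.port}'
        request.meta['proxy'] = proxy_url


# Stand-in for the Scrapy settings object, with dummy credentials.
fake_settings = {
    'PROXY_USER': 'user',
    'PROXY_PASSWORD': 'secret',
    'PROXY_URL': 'gw.dataimpulse.com',
    'PROXY_PORT': '823',
}

# Stand-in for a scrapy.Request: only the meta dict matters here.
fake_request = SimpleNamespace(meta={})

middleware = BookProxyMiddleware(fake_settings)
middleware.process_request(fake_request, spider=None)
print(fake_request.meta['proxy'])
```

If the printed URL looks right, the remaining things to verify in a real crawl are the middleware registration and the credentials themselves.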

Enhancing Scraping with Rotating Proxies

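To further reduce the risk of bans, rotating proxies draw on a pool of IPs that changes periodically or on every request. No setup rules out bans completely, but spreading traffic over many addresses makes them less likely.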

Installing Scrapy Rotating Proxies

pip install scrapy-rotating-proxies

Configure Proxy List

Add a proxy list to settings.py that contains your DataImpulse proxy endpoints, including credentials:

ROTATING_PROXY_LIST = [
    'http://YourProxyPlanUsername:YourProxyPlanPassword@gw.dataimpulse.com:823',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@gw.dataimpulse.com:10000',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_1:10000',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@Specific_IP_2:10000',
    # add more proxies as needed
]

Alternatively, load proxies from a file:

ROTATING_PROXY_LIST_PATH = '/path/to/file/proxieslist.txt'
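The file is expected to contain one proxy per line. As a rough sketch, the snippet below writes such a file and reads it back to confirm the format. The temporary directory is used only so the example runs anywhere; in a real project you would write to the path referenced by ROTATING_PROXY_LIST_PATH.

```python
import tempfile
from pathlib import Path

# Placeholder endpoints taken from the list above.
proxies = [
    'http://YourProxyPlanUsername:YourProxyPlanPassword@gw.dataimpulse.com:823',
    'http://YourProxyPlanUsername:YourProxyPlanPassword@gw.dataimpulse.com:10000',
]

# Write one proxy per line.
list_path = Path(tempfile.gettempdir()) / 'proxieslist.txt'
list_path.write_text('\n'.join(proxies) + '\n', encoding='utf-8')

# Read the file back, skipping blank lines, to check the result.
loaded = [
    line.strip()
    for line in list_path.read_text(encoding='utf-8').splitlines()
    if line.strip()
]
print(loaded)
```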

Update Middleware Settings

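Add the rotating proxies middleware to your downloader middlewares. If the same settings.py still registers the custom middleware from Method 2, remove that entry: scrapy-rotating-proxies leaves requests alone when they already have a proxy set in their meta, so they would not be rotated.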

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

With this in place, each request goes out through one of the listed proxies, and proxies that appear to be banned are temporarily taken out of rotation. This should improve success rates and reduce blocks.
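The package also exposes a few optional settings for tuning retries and back-off. The fragment below lists the ones we are aware of, with what we believe are their default values; treat both the names and the numbers as assumptions to check against the README of the version you install.

```python
# Optional scrapy-rotating-proxies settings for settings.py.
# Names and defaults are quoted from memory; verify before relying on them.

# How many times to retry a page with a different proxy before giving up.
ROTATING_PROXY_PAGE_RETRY_TIMES = 5

# If True, stop the spider when no working proxies are left.
ROTATING_PROXY_CLOSE_SPIDER = False

# Base and maximum back-off (in seconds) before a dead proxy is rechecked.
ROTATING_PROXY_BACKOFF_BASE = 300
ROTATING_PROXY_BACKOFF_CAP = 3600
```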

Summary

Proxies are vital tools in web scraping to maintain access and protect your identity online. Scrapy offers flexible ways to integrate proxies, either directly per request or through middleware, with additional options for proxy rotation.

Using DataImpulse proxies, you can easily secure affordable and reliable residential IPs for your scraper’s needs.

Start experimenting with proxy integration today to build more robust scraping workflows!

Whether you are a beginner or an experienced scraper, incorporating proxies into your Scrapy projects will help improve your data extraction strategy in a scalable, maintainable way.

DEV Community