Muhammad Ikramullah Khan

Using Proxies with Scrapy: The Beginner's Guide

I scraped 100 pages from a website. Everything worked perfectly. Then I tried to scrape 1,000 pages.

After page 150, the website blocked me. My IP address was banned. I couldn't even visit the website normally anymore.

I had to wait 24 hours for the ban to lift. Then I learned about proxies, and I could scrape thousands of pages without any problems.

Let me show you what proxies are and how to use them with Scrapy, in the simplest way possible.


What is a Proxy? (Super Simple Explanation)

Imagine you want to send a letter, but you don't want the receiver to know your address.

Without proxy:

You → Letter → Receiver
(Receiver sees your address)

With proxy:

You → Friend's house → Letter → Receiver
(Receiver sees friend's address, not yours)

A proxy is like having a friend send the letter for you. The receiver sees your friend's address instead of yours.

In web scraping:

  • You send request to proxy
  • Proxy sends request to website
  • Website sees proxy's IP, not yours
  • Proxy sends response back to you

Why Do You Need Proxies?

Problem 1: IP Bans

Websites track how many requests come from each IP address.

Without proxy:

Your IP: 203.0.113.50
Request 1 → Website sees 203.0.113.50
Request 2 → Website sees 203.0.113.50
Request 3 → Website sees 203.0.113.50
...
Request 100 → Website says "Too many requests! BANNED!"

With proxy:

Request 1 → Proxy 1 → Website sees 1.1.1.1
Request 2 → Proxy 2 → Website sees 2.2.2.2
Request 3 → Proxy 3 → Website sees 3.3.3.3
...
Request 100 → Proxy 4 → Website sees 4.4.4.4
(Website never sees same IP too many times)

Problem 2: Geographic Restrictions

Some websites only work in certain countries.

Example:

  • Website only works in USA
  • You're in India
  • Website blocks you

With USA proxy:

  • You connect through USA proxy
  • Website thinks you're in USA
  • Website works!

Problem 3: Rate Limiting

Websites limit requests per IP.

Example:

  • Website allows 10 requests per minute per IP
  • You want to make 100 requests per minute

With 10 proxies (see the sketch after this list):

  • Each proxy makes 10 requests
  • Total: 100 requests per minute
  • No limits hit!
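
In code, the simplest way to spread the load like this is to hand proxies out round-robin. A minimal sketch (the proxy addresses and URLs are placeholders):

from itertools import cycle

# Ten placeholder proxies; swap in real, tested ones
proxies = cycle([f'http://10.0.0.{i}:8080' for i in range(1, 11)])

for n in range(1, 101):
    proxy = next(proxies)  # each proxy ends up with every 10th request
    print(f'https://example.com/page/{n} -> {proxy}')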

Types of Proxies (Simple Version)

1. Free Proxies

What they are:

  • Free proxy lists online
  • Anyone can use them

Pros:

  • Free!
  • Good for testing

Cons:

  • Slow
  • Often don't work
  • Not secure
  • Shared with many users

When to use:

  • Just learning
  • Testing your code
  • Small projects

2. Paid Proxies (Datacenter)

What they are:

  • Proxies from data centers
  • You pay to use them

Pros:

  • Fast
  • Reliable
  • Not expensive

Cons:

  • Websites can detect them
  • Might still get blocked

Cost:

  • $1-$5 per IP per month

When to use:

  • Medium projects
  • When free proxies don't work

3. Residential Proxies

What they are:

  • Real home internet connections
  • Look like real users

Pros:

  • Very hard to detect
  • Rarely get blocked
  • Best quality

Cons:

  • Expensive
  • Slower than datacenter

Cost:

  • $5-$15 per GB of traffic

When to use:

  • Serious projects
  • Websites with strong anti-bot
  • Professional scraping

Getting Free Proxies (For Practice)

Method 1: Free Proxy Lists

Websites like free-proxy-list.net publish lists of free proxies.

Example free proxy:

IP: 123.45.67.89
Port: 8080

How to test if it works:

import requests

proxy = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}

try:
    response = requests.get('http://example.com', proxies=proxy, timeout=5)
    print("Proxy works!")
except requests.exceptions.RequestException:
    print("Proxy doesn't work")

Method 2: Using Python to Get Free Proxies

import requests
from bs4 import BeautifulSoup

def get_free_proxies():
    url = 'https://free-proxy-list.net'
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    proxies = []
    # Skip the header row; IP is in column 0, port in column 1
    for row in soup.find('table').find_all('tr')[1:]:
        cols = row.find_all('td')
        if len(cols) > 6:
            ip = cols[0].text
            port = cols[1].text
            proxies.append(f'{ip}:{port}')

    return proxies

# Get list of proxies
proxy_list = get_free_proxies()
print(f"Found {len(proxy_list)} proxies")

Using Proxies in Scrapy (Simple Way)

Method 1: Single Proxy (Easiest)

Set one proxy for all requests. Scrapy's built-in HttpProxyMiddleware is enabled by default and picks up the proxy from each request's meta, so there is nothing to change in settings.py. Just set it in your spider:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com']
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://123.45.67.89:8080'}
            )

    def parse(self, response):
        yield {'data': response.css('h1::text').get()}

What this does:

  • Every request goes through the proxy
  • Website sees proxy IP, not yours

Method 2: Rotating Proxies (Better)

Use different proxy for each request:

# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of proxies
        self.proxies = [
            'http://123.45.67.89:8080',
            'http://98.76.54.32:8080',
            'http://111.222.33.44:8080',
        ]

    def process_request(self, request, spider):
        # Pick random proxy
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

Enable in settings:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

What this does:

  • Each request uses different proxy
  • Harder to detect and block
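
Hardcoding the list inside the middleware works, but you may prefer to keep it in settings.py. A sketch, assuming a custom PROXY_LIST setting that you define yourself:

# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is not a built-in setting; add it to settings.py yourself
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)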

Step-by-Step: Your First Proxy Spider

Let's create a complete example from scratch.

Step 1: Get a Free Proxy

Go to https://free-proxy-list.net and copy one proxy:

Example:
IP: 45.76.97.183
Port: 8080

Step 2: Create Your Spider

# myspider.py
import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxyspider'
    start_urls = ['http://httpbin.org/ip']  # This shows your IP

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://45.76.97.183:8080'}
            )

    def parse(self, response):
        # This will show the proxy's IP, not yours!
        print(response.text)
        yield {'ip': response.json()['origin']}

Step 3: Run It

scrapy crawl proxyspider

Step 4: Check the Output

You should see the proxy's IP address, not your real IP!

{"ip": "45.76.97.183"}

Success! You used a proxy!


Rotating Proxies (Complete Example)

Here's a complete working example with proxy rotation:

Create Middleware

# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of free proxies (test these first!)
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
            'http://195.158.14.118:3128',
        ]

    def process_request(self, request, spider):
        # Pick random proxy
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy

        # Log which proxy we're using
        spider.logger.info(f'Request {request.url} using proxy {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

Enable Middleware

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

Create Spider

# spider.py
import scrapy

class RotatingProxySpider(scrapy.Spider):
    name = 'rotating'
    start_urls = [
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
    ]

    def parse(self, response):
        # Each request should show different IP
        yield {
            'url': response.url,
            'ip': response.json()['origin']
        }

Run It

scrapy crawl rotating

You should see different IPs for each request!
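
Free proxies die constantly, so in practice you will also want to retry failed requests through a fresh proxy. A minimal sketch (the retry accounting is deliberately simple, and the proxy list is a placeholder):

# middlewares.py
import random

class ProxyRetryMiddleware:
    PROXIES = [
        'http://45.76.97.183:8080',
        'http://103.149.194.10:36107',
    ]
    MAX_RETRIES = 3

    def process_exception(self, request, exception, spider):
        # Called when the download fails (dead proxy, timeout, ...)
        retries = request.meta.get('proxy_retries', 0)
        if retries < self.MAX_RETRIES:
            retry = request.replace(dont_filter=True)
            retry.meta['proxy'] = random.choice(self.PROXIES)
            retry.meta['proxy_retries'] = retries + 1
            spider.logger.info(f'Retrying {request.url} via {retry.meta["proxy"]}')
            return retry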


Using Paid Proxies (Better Quality)

If free proxies don't work, use paid services.

Popular Proxy Services

1. Bright Data (expensive, best quality)

2. SmartProxy (good balance)

3. ProxyMesh (simple, cheap)

Using Paid Proxy Service

Most services give you a single endpoint:

# The service rotates IPs behind one endpoint, so you don't rotate yourself
import scrapy

PROXY = 'http://username:password@proxy.service.com:8080'

class PaidProxySpider(scrapy.Spider):
    name = 'paidproxy'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': PROXY})

The service rotates proxies automatically!
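
One more tip: keep the credentials out of your code. Reading them from environment variables is a common pattern (the variable names here are placeholders you'd set in your shell):

import os

# Set PROXY_USER and PROXY_PASS in your shell before running
user = os.environ['PROXY_USER']
password = os.environ['PROXY_PASS']
PROXY = f'http://{user}:{password}@proxy.service.com:8080'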


Testing Proxies

Before using proxies, test if they work:

Simple Test Script

import requests

def test_proxy(proxy):
    """Test if a proxy works"""
    proxies = {
        'http': proxy,
        'https': proxy
    }

    try:
        response = requests.get(
            'http://httpbin.org/ip',
            proxies=proxies,
            timeout=5
        )
        if response.status_code == 200:
            print(f"{proxy} works!")
            return True
        else:
            print(f"{proxy} failed (status {response.status_code})")
            return False
    except Exception as e:
        print(f"{proxy} failed ({str(e)})")
        return False

# Test your proxies
proxies = [
    'http://45.76.97.183:8080',
    'http://103.149.194.10:36107',
    'http://195.158.14.118:3128',
]

working_proxies = []
for proxy in proxies:
    if test_proxy(proxy):
        working_proxies.append(proxy)

print(f"\n{len(working_proxies)} out of {len(proxies)} proxies work")

Only use proxies that pass the test!
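
Checking proxies one at a time gets slow once you have dozens. A thread pool can test them in parallel; this sketch reuses test_proxy() from above:

from concurrent.futures import ThreadPoolExecutor

def filter_working(proxies, workers=10):
    # Run test_proxy() on up to 10 proxies at once
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(test_proxy, proxies)
    return [p for p, ok in zip(proxies, results) if ok]

working_proxies = filter_working(proxies)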


Common Problems and Solutions

Problem 1: Proxy Doesn't Work

Error:

ProxyError: Cannot connect to proxy

Solutions:

  1. Proxy is dead (try another one)
  2. Wrong format (should be http://IP:PORT)
  3. Needs authentication (use http://user:pass@IP:PORT)

Problem 2: Still Getting Blocked

Even with proxies, you get banned?

Reasons:

  1. Using same proxy too much (rotate more)
  2. No delays between requests (add DOWNLOAD_DELAY)
  3. Bad User-Agent (add realistic headers)
  4. Cookies tracking you (clear cookies between requests)

Solution:

# settings.py
DOWNLOAD_DELAY = 2  # Wait 2 seconds
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'
COOKIES_ENABLED = False

Problem 3: Proxies Too Slow

Free proxies are very slow?

Solutions:

  1. Test proxies first, only use fast ones
  2. Increase timeout: DOWNLOAD_TIMEOUT = 30
  3. Use paid proxies (much faster)
  4. Increase concurrency: CONCURRENT_REQUESTS = 32 (the default is 16; see the sketch below)
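
Put together, a settings sketch for slow proxies might look like this:

# settings.py
DOWNLOAD_TIMEOUT = 30     # give slow proxies more time to respond
CONCURRENT_REQUESTS = 32  # fetch more pages in parallel (default is 16)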

Problem 4: Authentication Required

Some proxies need username and password:

# Format: http://username:password@IP:PORT
proxy = 'http://myuser:mypass@123.45.67.89:8080'

# In spider
meta={'proxy': proxy}

Best Practices

1. Always Test Proxies First

Don't use proxies without testing:

# Test before adding to list
if test_proxy(proxy):
    working_proxies.append(proxy)

2. Rotate Proxies

Don't use same proxy for all requests:

# Good: rotate
proxy = random.choice(proxy_list)

# Bad: always same
proxy = 'http://123.45.67.89:8080'

3. Add Delays Even With Proxies

Proxies don't mean you can spam:

# settings.py
DOWNLOAD_DELAY = 1

4. Monitor Proxy Performance

Track which proxies work best:

class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            if proxy not in self.stats:
                self.stats[proxy] = {'success': 0, 'fail': 0}

            if response.status == 200:
                self.stats[proxy]['success'] += 1
            else:
                self.stats[proxy]['fail'] += 1

        return response
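
To actually see those numbers, you can hook Scrapy's spider_closed signal and log the totals when the crawl finishes. A sketch of the same middleware with the hook added:

from scrapy import signals

class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # Log totals when the spider finishes
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            bucket = self.stats.setdefault(proxy, {'success': 0, 'fail': 0})
            bucket['success' if response.status == 200 else 'fail'] += 1
        return response

    def spider_closed(self, spider):
        for proxy, counts in self.stats.items():
            spider.logger.info(f"{proxy}: {counts['success']} ok, {counts['fail']} failed")

Enable it in DOWNLOADER_MIDDLEWARES just like the proxy middleware above.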

5. Have Backup Proxies

Always have more proxies than you need:

# Good: 20 proxies for scraping 100 pages
# Bad: 2 proxies for scraping 1000 pages

When You DON'T Need Proxies

Proxies aren't always necessary:

You DON'T need proxies if:

  • Scraping less than 100 pages
  • Website has no rate limiting
  • You add proper delays
  • Small personal project
  • Website explicitly allows scraping

You DO need proxies if:

  • Scraping thousands of pages
  • Website blocks after few requests
  • Need to bypass geo-restrictions
  • Professional/commercial scraping
  • Website has strict anti-bot

Free vs Paid: What to Choose?

Use Free Proxies When:

  • Learning and practicing
  • Testing your spider
  • Small one-time projects
  • Scraping <1000 pages

Use Paid Proxies When:

  • Professional projects
  • Scraping >10,000 pages
  • Need reliability
  • Time is valuable
  • Can't afford to get blocked

My recommendation for beginners:
Start with free proxies for learning. When you need reliability, invest in paid proxies.


Complete Real Example

Here's everything together:

Project Structure

myproject/
├── scrapy.cfg
├── myproject/
│   ├── __init__.py
│   ├── settings.py
│   ├── middlewares.py
│   └── spiders/
│       └── product_spider.py

middlewares.py

import random

class RotateProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

settings.py

# Proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

# Be polite
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Look like real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Don't save cookies
COOKIES_ENABLED = False

product_spider.py

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

This spider:

  • Rotates between proxies
  • Adds delays
  • Uses realistic headers
  • Follows pagination
  • Logs everything

Perfect!
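
To run it and save the scraped items to a file:

scrapy crawl products -o products.json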


Quick Reference

Add single proxy:

meta={'proxy': 'http://123.45.67.89:8080'}

Add proxy with auth:

meta={'proxy': 'http://user:pass@123.45.67.89:8080'}

Rotate proxies:

proxy = random.choice(proxy_list)
meta={'proxy': proxy}

Test proxy:

response = requests.get('http://httpbin.org/ip', proxies={'http': proxy})

Summary

What are proxies?
Intermediary servers that hide your real IP address.

Why use them?

  • Avoid IP bans
  • Bypass rate limits
  • Access geo-restricted content
  • Scrape at scale

Types:

  • Free: For learning
  • Paid Datacenter: For medium projects
  • Residential: For serious projects

Basic usage in Scrapy:

meta={'proxy': 'http://IP:PORT'}

Rotating proxies:

proxy = random.choice(proxy_list)
meta={'proxy': proxy}

Best practices:

  • Test proxies first
  • Rotate proxies
  • Add delays anyway
  • Monitor performance
  • Start with free, upgrade to paid when needed

Remember:
Proxies are a tool, not a license to spam. Always be respectful, add delays, and follow robots.txt even when using proxies.

Happy scraping! 🕷️
