I scraped 100 pages from a website. Everything worked perfectly. Then I tried to scrape 1,000 pages.
After page 150, the website blocked me. My IP address was banned. I couldn't even visit the website normally anymore.
I had to wait 24 hours for the ban to lift. Then I learned about proxies, and I could scrape thousands of pages without any problems.
Let me show you what proxies are and how to use them with Scrapy, in the simplest way possible.
What is a Proxy? (Super Simple Explanation)
Imagine you want to send a letter, but you don't want the receiver to know your address.
Without proxy:
You → Letter → Receiver
(Receiver sees your address)
With proxy:
You → Friend's house → Letter → Receiver
(Receiver sees friend's address, not yours)
A proxy is like having a friend send the letter for you. The receiver sees your friend's address instead of yours.
In web scraping:
- You send request to proxy
- Proxy sends request to website
- Website sees proxy's IP, not yours
- Proxy sends response back to you
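To make that flow concrete, here is a minimal sketch using the requests library. The proxy address is just a placeholder, not a working proxy; httpbin.org/ip echoes back the IP the site sees, so with a real proxy it should print the proxy's IP instead of yours:
import requests

# Placeholder proxy address - substitute one that actually works
proxy_url = 'http://123.45.67.89:8080'
proxies = {'http': proxy_url, 'https': proxy_url}

# The request goes to the proxy first; the proxy forwards it to the website
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())  # shows the proxy's IP, not yours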
Why Do You Need Proxies?
Problem 1: IP Bans
Websites track how many requests come from each IP address.
Without proxy:
Your IP: 203.0.113.45
Request 1 → Website sees 203.0.113.45
Request 2 → Website sees 203.0.113.45
Request 3 → Website sees 203.0.113.45
...
Request 100 → Website says "Too many requests! BANNED!"
With proxy:
Request 1 → Proxy 1 → Website sees 1.1.1.1
Request 2 → Proxy 2 → Website sees 2.2.2.2
Request 3 → Proxy 3 → Website sees 3.3.3.3
...
Request 100 → Proxy 4 → Website sees 4.4.4.4
(Website never sees same IP too many times)
Problem 2: Geographic Restrictions
Some websites only work in certain countries.
Example:
- Website only works in USA
- You're in India
- Website blocks you
With USA proxy:
- You connect through USA proxy
- Website thinks you're in USA
- Website works!
Problem 3: Rate Limiting
Websites limit requests per IP.
Example:
- Website allows 10 requests per minute per IP
- You want to make 100 requests per minute
With 10 proxies:
- Each proxy makes 10 requests
- Total: 100 requests per minute
- No limits hit!
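A simple way to spread requests evenly like that is round-robin rotation, for example with itertools.cycle. This is only a sketch with made-up proxy addresses to show the distribution, not something to run against a real site:
from itertools import cycle

# Ten hypothetical proxies - in practice, use addresses you have tested
proxies = [f'http://10.0.0.{i}:8080' for i in range(1, 11)]
proxy_pool = cycle(proxies)

urls = [f'https://example.com/page/{n}' for n in range(1, 101)]

# Each URL gets the next proxy in the cycle,
# so each of the 10 proxies handles about 10 of the 100 requests
for url in urls:
    proxy = next(proxy_pool)
    print(f'{url} -> {proxy}')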
Types of Proxies (Simple Version)
1. Free Proxies
What they are:
- Free proxy lists online
- Anyone can use them
Pros:
- Free!
- Good for testing
Cons:
- Slow
- Often don't work
- Not secure
- Shared with many users
When to use:
- Just learning
- Testing your code
- Small projects
2. Paid Proxies (Datacenter)
What they are:
- Proxies from data centers
- You pay to use them
Pros:
- Fast
- Reliable
- Not expensive
Cons:
- Websites can detect them
- Might still get blocked
Cost:
- $1-$5 per IP per month
When to use:
- Medium projects
- When free proxies don't work
3. Residential Proxies
What they are:
- Real home internet connections
- Look like real users
Pros:
- Very hard to detect
- Rarely get blocked
- Best quality
Cons:
- Expensive
- Slower than datacenter
Cost:
- $5-$15 per GB of traffic
When to use:
- Serious projects
- Websites with strong anti-bot protection
- Professional scraping
Getting Free Proxies (For Practice)
Method 1: Free Proxy Lists
Websites like free-proxy-list.net publish lists of free proxies.
Example free proxy:
IP: 123.45.67.89
Port: 8080
How to test if it works:
import requests

proxy = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}

try:
    response = requests.get('http://example.com', proxies=proxy, timeout=5)
    print("Proxy works!")
except requests.RequestException:
    print("Proxy doesn't work")
Method 2: Using Python to Get Free Proxies
import requests
from bs4 import BeautifulSoup

def get_free_proxies():
    url = 'https://free-proxy-list.net'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    proxies = []
    for row in soup.find('table').find_all('tr')[1:]:
        cols = row.find_all('td')
        if len(cols) > 6:
            ip = cols[0].text
            port = cols[1].text
            proxies.append(f'{ip}:{port}')
    return proxies

# Get list of proxies
proxy_list = get_free_proxies()
print(f"Found {len(proxy_list)} proxies")
Using Proxies in Scrapy (Simple Way)
Method 1: Single Proxy (Easiest)
Set one proxy for all requests. Scrapy's built-in HttpProxyMiddleware picks up the proxy from each request's meta and is enabled by default, so the explicit settings entry below is optional:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Your proxy (a constant for your own reference - Scrapy doesn't read a PROXY setting automatically)
PROXY = 'http://123.45.67.89:8080'
Then in your spider:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com']
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://123.45.67.89:8080'}
            )

    def parse(self, response):
        yield {'data': response.css('h1::text').get()}
What this does:
- Every request goes through the proxy
- Website sees proxy IP, not yours
Method 2: Rotating Proxies (Better)
Use different proxy for each request:
# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of proxies
        self.proxies = [
            'http://123.45.67.89:8080',
            'http://98.76.54.32:8080',
            'http://111.222.33.44:8080',
        ]

    def process_request(self, request, spider):
        # Pick a random proxy for this request
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')
Enable in settings:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}
What this does:
- Each request uses different proxy
- Harder to detect and block
Step-by-Step: Your First Proxy Spider
Let's create a complete example from scratch.
Step 1: Get a Free Proxy
Go to https://free-proxy-list.net and copy one proxy:
Example:
IP: 45.76.97.183
Port: 8080
Step 2: Create Your Spider
# myspider.py
import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxyspider'
    start_urls = ['http://httpbin.org/ip']  # This page shows your IP

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://45.76.97.183:8080'}
            )

    def parse(self, response):
        # This will show the proxy's IP, not yours!
        print(response.text)
        yield {'ip': response.json()['origin']}
Step 3: Run It
scrapy crawl proxyspider
Step 4: Check the Output
You should see the proxy's IP address, not your real IP!
{"ip": "45.76.97.183"}
Success! You used a proxy!
Rotating Proxies (Complete Example)
Here's a complete working example with proxy rotation:
Create Middleware
# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of free proxies (test these first!)
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
            'http://195.158.14.118:3128',
        ]

    def process_request(self, request, spider):
        # Pick random proxy
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        # Log which proxy we're using
        spider.logger.info(f'Request {request.url} using proxy {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()
Enable Middleware
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}
Create Spider
# spider.py
import scrapy

class RotatingProxySpider(scrapy.Spider):
    name = 'rotating'
    start_urls = [
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
    ]

    def parse(self, response):
        # Each request should show a different IP
        yield {
            'url': response.url,
            'ip': response.json()['origin']
        }
Run It
scrapy crawl rotating
You should see different IPs for each request!
Using Paid Proxies (Better Quality)
If free proxies don't work, use paid services.
Popular Proxy Services
1. Bright Data (expensive, best quality)
- https://brightdata.com
- Cost: ~$500/month minimum
- Residential proxies
2. SmartProxy (good balance)
- https://smartproxy.com
- Cost: ~$75/month for 5GB
- Residential proxies
3. ProxyMesh (simple, cheap)
- https://proxymesh.com
- Cost: ~$10/month
- Datacenter proxies
Using Paid Proxy Service
Most services give you a single endpoint:
# Instead of rotating yourself, use the service's single endpoint
proxy = 'http://username:password@proxy.service.com:8080'

# In your spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            meta={'proxy': proxy}
        )
The service rotates proxies automatically!
Testing Proxies
Before using proxies, test if they work:
Simple Test Script
import requests

def test_proxy(proxy):
    """Test if a proxy works"""
    proxies = {
        'http': proxy,
        'https': proxy
    }
    try:
        response = requests.get(
            'http://httpbin.org/ip',
            proxies=proxies,
            timeout=5
        )
        if response.status_code == 200:
            print(f"✓ {proxy} works!")
            return True
        else:
            print(f"✗ {proxy} failed (status {response.status_code})")
            return False
    except Exception as e:
        print(f"✗ {proxy} failed ({str(e)})")
        return False

# Test your proxies
proxies = [
    'http://45.76.97.183:8080',
    'http://103.149.194.10:36107',
    'http://195.158.14.118:3128',
]

working_proxies = []
for proxy in proxies:
    if test_proxy(proxy):
        working_proxies.append(proxy)

print(f"\n{len(working_proxies)} out of {len(proxies)} proxies work")
Only use proxies that pass the test!
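One way to connect the test script to Scrapy is to keep only the working proxies and hand them to the rotation middleware through a custom setting. PROXY_LIST below is a name invented for this sketch, not a built-in Scrapy setting:
# middlewares.py - a variation of RotateProxyMiddleware that reads proxies from settings
import random

class RotateProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a custom setting you define yourself in settings.py
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
Then paste the addresses that passed the test into settings.py as PROXY_LIST = ['http://45.76.97.183:8080', ...].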
Common Problems and Solutions
Problem 1: Proxy Doesn't Work
Error:
ProxyError: Cannot connect to proxy
Solutions:
- Proxy is dead (try another one)
- Wrong format (should be http://IP:PORT)
- Needs authentication (use http://user:pass@IP:PORT)
Problem 2: Still Getting Blocked
Even with proxies, you get banned?
Reasons:
- Using same proxy too much (rotate more)
- No delays between requests (add DOWNLOAD_DELAY)
- Bad User-Agent (add realistic headers)
- Cookies tracking you (clear cookies between requests)
Solution:
# settings.py
DOWNLOAD_DELAY = 2 # Wait 2 seconds
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'
COOKIES_ENABLED = False
Problem 3: Proxies Too Slow
Free proxies are very slow?
Solutions:
- Test proxies first, only use fast ones
- Increase the timeout: DOWNLOAD_TIMEOUT = 30
- Use paid proxies (much faster)
- Use more concurrent requests: CONCURRENT_REQUESTS = 16 (both settings shown below)
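Those last two tweaks are one line each in settings.py; the numbers are starting points to adjust, not magic values:
# settings.py
DOWNLOAD_TIMEOUT = 30      # give slow proxies more time before giving up
CONCURRENT_REQUESTS = 16   # keep more requests in flight to offset slow proxies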
Problem 4: Authentication Required
Some proxies need username and password:
# Format: http://username:password@IP:PORT
proxy = 'http://myuser:mypass@123.45.67.89:8080'
# In spider
meta={'proxy': proxy}
Best Practices
1. Always Test Proxies First
Don't use proxies without testing:
# Test before adding to list
if test_proxy(proxy):
    working_proxies.append(proxy)
2. Rotate Proxies
Don't use same proxy for all requests:
# Good: rotate
proxy = random.choice(proxy_list)
# Bad: always same
proxy = 'http://123.45.67.89:8080'
3. Add Delays Even With Proxies
Proxies don't mean you can spam:
# settings.py
DOWNLOAD_DELAY = 1
4. Monitor Proxy Performance
Track which proxies work best:
class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            if proxy not in self.stats:
                self.stats[proxy] = {'success': 0, 'fail': 0}
            if response.status == 200:
                self.stats[proxy]['success'] += 1
            else:
                self.stats[proxy]['fail'] += 1
        return response
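The middleware above only collects the counts. To actually see them, one option is to connect Scrapy's spider_closed signal and log the totals when the crawl finishes; a sketch of that extension:
from scrapy import signals

class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Ask Scrapy to call spider_closed() on this middleware when the crawl ends
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            entry = self.stats.setdefault(proxy, {'success': 0, 'fail': 0})
            entry['success' if response.status == 200 else 'fail'] += 1
        return response

    def spider_closed(self, spider):
        # Log a success/failure summary per proxy
        for proxy, counts in self.stats.items():
            spider.logger.info(f"{proxy}: {counts['success']} ok, {counts['fail']} failed")
Register it in DOWNLOADER_MIDDLEWARES just like the rotation middleware.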
5. Have Backup Proxies
Always have more proxies than you need:
# Good: 20 proxies for scraping 100 pages
# Bad: 2 proxies for scraping 1000 pages
When You DON'T Need Proxies
Proxies aren't always necessary:
You DON'T need proxies if:
- Scraping less than 100 pages
- Website has no rate limiting
- You add proper delays
- Small personal project
- Website explicitly allows scraping
You DO need proxies if:
- Scraping thousands of pages
- Website blocks after few requests
- Need to bypass geo-restrictions
- Professional/commercial scraping
- Website has strict anti-bot protection
Free vs Paid: What to Choose?
Use Free Proxies When:
- Learning and practicing
- Testing your spider
- Small one-time projects
- Scraping <1000 pages
Use Paid Proxies When:
- Professional projects
- Scraping >10,000 pages
- Need reliability
- Time is valuable
- Can't afford to get blocked
My recommendation for beginners:
Start with free proxies for learning. When you need reliability, invest in paid proxies.
Complete Real Example
Here's everything together:
Project Structure
myproject/
├── scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    ├── middlewares.py
    └── spiders/
        └── product_spider.py
middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()
settings.py
# Proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

# Be polite
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Look like a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Don't save cookies
COOKIES_ENABLED = False
product_spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
This setup:
- Rotates between proxies
- Adds delays between requests
- Uses a realistic User-Agent
- Follows pagination
- Logs which proxy each request uses
Perfect!
Quick Reference
Add single proxy:
meta={'proxy': 'http://123.45.67.89:8080'}
Add proxy with auth:
meta={'proxy': 'http://user:pass@123.45.67.89:8080'}
Rotate proxies:
proxy = random.choice(proxy_list)
meta={'proxy': proxy}
Test proxy:
response = requests.get('http://httpbin.org/ip', proxies={'http': proxy})
Summary
What are proxies?
Intermediary servers that hide your real IP address.
Why use them?
- Avoid IP bans
- Bypass rate limits
- Access geo-restricted content
- Scrape at scale
Types:
- Free: For learning
- Paid Datacenter: For medium projects
- Residential: For serious projects
Basic usage in Scrapy:
meta={'proxy': 'http://IP:PORT'}
Rotating proxies:
proxy = random.choice(proxy_list)
meta={'proxy': proxy}
Best practices:
- Test proxies first
- Rotate proxies
- Add delays anyway
- Monitor performance
- Start with free, upgrade to paid when needed
Remember:
Proxies are a tool, not a license to spam. Always be respectful, add delays, and follow robots.txt even when using proxies.
Happy scraping! 🕷️