I scraped 100 pages from a website. Everything worked perfectly. Then I tried to scrape 1,000 pages.
After page 150, the website blocked me. My IP address was banned. I couldn't even visit the website normally anymore.
I had to wait 24 hours for the ban to lift. Then I learned about proxies, and I could scrape thousands of pages without any problems.
Let me show you what proxies are and how to use them with Scrapy, in the simplest way possible.
What is a Proxy? (Super Simple Explanation)
Imagine you want to send a letter, but you don't want the receiver to know your address.
Without proxy:
You → Letter → Receiver
(Receiver sees your address)
With proxy:
You → Friend's house → Letter → Receiver
(Receiver sees friend's address, not yours)
A proxy is like having a friend send the letter for you. The receiver sees your friend's address instead of yours.
In web scraping:
- You send request to proxy
- Proxy sends request to website
- Website sees proxy's IP, not yours
- Proxy sends response back to you
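To make that flow concrete, here is a minimal sketch using the requests library. The proxy address is just a placeholder, not a working proxy; httpbin.org/ip echoes back the IP the site sees, so with a real proxy it should print the proxy's IP instead of yours:
import requests

# Placeholder proxy address - substitute one that actually works
proxy_url = 'http://123.45.67.89:8080'
proxies = {'http': proxy_url, 'https': proxy_url}

# The request goes to the proxy first; the proxy forwards it to the website
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
print(response.json())  # shows the proxy's IP, not yours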
Why Do You Need Proxies?
Problem 1: IP Bans
Websites track how many requests come from each IP address.
Without proxy:
Your IP: 203.0.113.45
Request 1 → Website sees 203.0.113.45
Request 2 → Website sees 203.0.113.45
Request 3 → Website sees 203.0.113.45
...
Request 100 → Website says "Too many requests! BANNED!"
With proxy:
Request 1 → Proxy 1 → Website sees 1.1.1.1
Request 2 → Proxy 2 → Website sees 2.2.2.2
Request 3 → Proxy 3 → Website sees 3.3.3.3
...
Request 100 → Proxy 4 → Website sees 4.4.4.4
(Website never sees same IP too many times)
Problem 2: Geographic Restrictions
Some websites only work in certain countries.
Example:
- Website only works in USA
- You're in India
- Website blocks you
With USA proxy:
- You connect through USA proxy
- Website thinks you're in USA
- Website works!
Problem 3: Rate Limiting
Websites limit requests per IP.
Example:
- Website allows 10 requests per minute per IP
- You want to make 100 requests per minute
With 10 proxies:
- Each proxy makes 10 requests
- Total: 100 requests per minute
- No limits hit!
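A simple way to spread requests evenly like that is round-robin rotation, for example with itertools.cycle. This is only a sketch with made-up proxy addresses to show the distribution, not something to run against a real site:
from itertools import cycle

# Ten hypothetical proxies - in practice, use addresses you have tested
proxies = [f'http://10.0.0.{i}:8080' for i in range(1, 11)]
proxy_pool = cycle(proxies)

urls = [f'https://example.com/page/{n}' for n in range(1, 101)]

# Each URL gets the next proxy in the cycle,
# so each of the 10 proxies handles about 10 of the 100 requests
for url in urls:
    proxy = next(proxy_pool)
    print(f'{url} -> {proxy}')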
Types of Proxies (Simple Version)
1. Free Proxies
What they are:
- Free proxy lists online
- Anyone can use them
Pros:
- Free!
- Good for testing
Cons:
- Slow
- Often don't work
- Not secure
- Shared with many users
When to use:
- Just learning
- Testing your code
- Small projects
2. Paid Proxies (Datacenter)
What they are:
- Proxies from data centers
- You pay to use them
Pros:
- Fast
- Reliable
- Not expensive
Cons:
- Websites can detect them
- Might still get blocked
Cost:
- $1-$5 per IP per month
When to use:
- Medium projects
- When free proxies don't work
3. Residential Proxies
What they are:
- Real home internet connections
- Look like real users
Pros:
- Very hard to detect
- Rarely get blocked
- Best quality
Cons:
- Expensive
- Slower than datacenter
Cost:
- $5-$15 per GB of traffic
When to use:
- Serious projects
- Websites with strong anti-bot protection
- Professional scraping
Getting Free Proxies (For Practice)
Method 1: Free Proxy Lists
Websites like free-proxy-list.net publish lists of free proxies.
Example free proxy:
IP: 123.45.67.89
Port: 8080
How to test if it works:
import requests

proxy = {
    'http': 'http://123.45.67.89:8080',
    'https': 'http://123.45.67.89:8080'
}

try:
    response = requests.get('http://example.com', proxies=proxy, timeout=5)
    print("Proxy works!")
except requests.RequestException:
    print("Proxy doesn't work")
Method 2: Using Python to Get Free Proxies
import requests
from bs4 import BeautifulSoup

def get_free_proxies():
    url = 'https://free-proxy-list.net'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    proxies = []
    for row in soup.find('table').find_all('tr')[1:]:
        cols = row.find_all('td')
        if len(cols) > 6:
            ip = cols[0].text
            port = cols[1].text
            proxies.append(f'{ip}:{port}')
    return proxies

# Get list of proxies
proxy_list = get_free_proxies()
print(f"Found {len(proxy_list)} proxies")
Using Proxies in Scrapy (Simple Way)
Method 1: Single Proxy (Easiest)
Set one proxy for all requests. Scrapy's built-in HttpProxyMiddleware picks up the proxy from each request's meta and is enabled by default, so the explicit settings entry below is optional:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Your proxy (a constant for your own reference - Scrapy doesn't read a PROXY setting automatically)
PROXY = 'http://123.45.67.89:8080'
Then in your spider:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://example.com']
        for url in urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://123.45.67.89:8080'}
            )

    def parse(self, response):
        yield {'data': response.css('h1::text').get()}
What this does:
- Every request goes through the proxy
- Website sees proxy IP, not yours
Method 2: Rotating Proxies (Better)
Use different proxy for each request:
# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of proxies
        self.proxies = [
            'http://123.45.67.89:8080',
            'http://98.76.54.32:8080',
            'http://111.222.33.44:8080',
        ]

    def process_request(self, request, spider):
        # Pick a random proxy for this request
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')
Enable in settings:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}
What this does:
- Each request uses different proxy
- Harder to detect and block
Step-by-Step: Your First Proxy Spider
Let's create a complete example from scratch.
Step 1: Get a Free Proxy
Go to https://free-proxy-list.net and copy one proxy:
Example:
IP: 45.76.97.183
Port: 8080
Step 2: Create Your Spider
# myspider.py
import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxyspider'
    start_urls = ['http://httpbin.org/ip']  # This page shows your IP

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={'proxy': 'http://45.76.97.183:8080'}
            )

    def parse(self, response):
        # This will show the proxy's IP, not yours!
        print(response.text)
        yield {'ip': response.json()['origin']}
Step 3: Run It
scrapy crawl proxyspider
Step 4: Check the Output
You should see the proxy's IP address, not your real IP!
{"ip": "45.76.97.183"}
Success! You used a proxy!
Rotating Proxies (Complete Example)
Here's a complete working example with proxy rotation:
Create Middleware
# middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        # List of free proxies (test these first!)
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
            'http://195.158.14.118:3128',
        ]

    def process_request(self, request, spider):
        # Pick random proxy
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        # Log which proxy we're using
        spider.logger.info(f'Request {request.url} using proxy {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()
Enable Middleware
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}
Create Spider
# spider.py
import scrapy

class RotatingProxySpider(scrapy.Spider):
    name = 'rotating'
    start_urls = [
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
        'http://httpbin.org/ip',
    ]

    def parse(self, response):
        # Each request should show a different IP
        yield {
            'url': response.url,
            'ip': response.json()['origin']
        }
Run It
scrapy crawl rotating
You should see different IPs for each request!
Using Paid Proxies (Better Quality)
If free proxies don't work, use paid services.
Popular Proxy Services
1. Bright Data (expensive, best quality)
- https://brightdata.com
- Cost: ~$500/month minimum
- Residential proxies
2. SmartProxy (good balance)
- https://smartproxy.com
- Cost: ~$75/month for 5GB
- Residential proxies
3. ProxyMesh (simple, cheap)
- https://proxymesh.com
- Cost: ~$10/month
- Datacenter proxies
Using Paid Proxy Service
Most services give you a single endpoint:
# Instead of rotating yourself, use the service's single endpoint
proxy = 'http://username:password@proxy.service.com:8080'

# In your spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            meta={'proxy': proxy}
        )
The service rotates proxies automatically!
Testing Proxies
Before using proxies, test if they work:
Simple Test Script
import requests

def test_proxy(proxy):
    """Test if a proxy works"""
    proxies = {
        'http': proxy,
        'https': proxy
    }
    try:
        response = requests.get(
            'http://httpbin.org/ip',
            proxies=proxies,
            timeout=5
        )
        if response.status_code == 200:
            print(f"✓ {proxy} works!")
            return True
        else:
            print(f"✗ {proxy} failed (status {response.status_code})")
            return False
    except Exception as e:
        print(f"✗ {proxy} failed ({str(e)})")
        return False

# Test your proxies
proxies = [
    'http://45.76.97.183:8080',
    'http://103.149.194.10:36107',
    'http://195.158.14.118:3128',
]

working_proxies = []
for proxy in proxies:
    if test_proxy(proxy):
        working_proxies.append(proxy)

print(f"\n{len(working_proxies)} out of {len(proxies)} proxies work")
Only use proxies that pass the test!
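One way to connect the test script to Scrapy is to keep only the working proxies and hand them to the rotation middleware through a custom setting. PROXY_LIST below is a name invented for this sketch, not a built-in Scrapy setting:
# middlewares.py - a variation of RotateProxyMiddleware that reads proxies from settings
import random

class RotateProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a custom setting you define yourself in settings.py
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
Then paste the addresses that passed the test into settings.py as PROXY_LIST = ['http://45.76.97.183:8080', ...].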
Common Problems and Solutions
Problem 1: Proxy Doesn't Work
Error:
ProxyError: Cannot connect to proxy
Solutions:
- Proxy is dead (try another one)
- Wrong format (should be http://IP:PORT)
- Needs authentication (use http://user:pass@IP:PORT)
Problem 2: Still Getting Blocked
Even with proxies, you get banned?
Reasons:
- Using same proxy too much (rotate more)
- No delays between requests (add DOWNLOAD_DELAY)
- Bad User-Agent (add realistic headers)
- Cookies tracking you (clear cookies between requests)
Solution:
# settings.py
DOWNLOAD_DELAY = 2 # Wait 2 seconds
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...'
COOKIES_ENABLED = False
Problem 3: Proxies Too Slow
Free proxies are very slow?
Solutions:
- Test proxies first, only use fast ones
- Increase the timeout: DOWNLOAD_TIMEOUT = 30
- Use paid proxies (much faster)
- Use more concurrent requests: CONCURRENT_REQUESTS = 16 (both settings shown below)
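Those last two tweaks are one line each in settings.py; the numbers are starting points to adjust, not magic values:
# settings.py
DOWNLOAD_TIMEOUT = 30      # give slow proxies more time before giving up
CONCURRENT_REQUESTS = 16   # keep more requests in flight to offset slow proxies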
Problem 4: Authentication Required
Some proxies need username and password:
# Format: http://username:password@IP:PORT
proxy = 'http://myuser:mypass@123.45.67.89:8080'
# In spider
meta={'proxy': proxy}
Best Practices
1. Always Test Proxies First
Don't use proxies without testing:
# Test before adding to list
if test_proxy(proxy):
    working_proxies.append(proxy)
2. Rotate Proxies
Don't use same proxy for all requests:
# Good: rotate
proxy = random.choice(proxy_list)
# Bad: always same
proxy = 'http://123.45.67.89:8080'
3. Add Delays Even With Proxies
Proxies don't mean you can spam:
# settings.py
DOWNLOAD_DELAY = 1
4. Monitor Proxy Performance
Track which proxies work best:
class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            if proxy not in self.stats:
                self.stats[proxy] = {'success': 0, 'fail': 0}
            if response.status == 200:
                self.stats[proxy]['success'] += 1
            else:
                self.stats[proxy]['fail'] += 1
        return response
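The middleware above only collects the counts. To actually see them, one option is to connect Scrapy's spider_closed signal and log the totals when the crawl finishes; a sketch of that extension:
from scrapy import signals

class ProxyStatsMiddleware:
    def __init__(self):
        self.stats = {}

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Ask Scrapy to call spider_closed() on this middleware when the crawl ends
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_response(self, request, response, spider):
        proxy = request.meta.get('proxy')
        if proxy:
            entry = self.stats.setdefault(proxy, {'success': 0, 'fail': 0})
            entry['success' if response.status == 200 else 'fail'] += 1
        return response

    def spider_closed(self, spider):
        # Log a success/failure summary per proxy
        for proxy, counts in self.stats.items():
            spider.logger.info(f"{proxy}: {counts['success']} ok, {counts['fail']} failed")
Register it in DOWNLOADER_MIDDLEWARES just like the rotation middleware.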
5. Have Backup Proxies
Always have more proxies than you need:
# Good: 20 proxies for scraping 100 pages
# Bad: 2 proxies for scraping 1000 pages
When You DON'T Need Proxies
Proxies aren't always necessary:
You DON'T need proxies if:
- Scraping less than 100 pages
- Website has no rate limiting
- You add proper delays
- Small personal project
- Website explicitly allows scraping
You DO need proxies if:
- Scraping thousands of pages
- Website blocks after few requests
- Need to bypass geo-restrictions
- Professional/commercial scraping
- Website has strict anti-bot protection
Free vs Paid: What to Choose?
Use Free Proxies When:
- Learning and practicing
- Testing your spider
- Small one-time projects
- Scraping <1000 pages
Use Paid Proxies When:
- Professional projects
- Scraping >10,000 pages
- Need reliability
- Time is valuable
- Can't afford to get blocked
My recommendation for beginners:
Start with free proxies for learning. When you need reliability, invest in paid proxies.
Complete Real Example
Here's everything together:
Project Structure
myproject/
├── scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    ├── middlewares.py
    └── spiders/
        └── product_spider.py
middlewares.py
import random

class RotateProxyMiddleware:
    def __init__(self):
        self.proxies = [
            'http://45.76.97.183:8080',
            'http://103.149.194.10:36107',
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy
        spider.logger.info(f'Using proxy: {proxy}')

    @classmethod
    def from_crawler(cls, crawler):
        return cls()
settings.py
# Proxy middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateProxyMiddleware': 350,
}

# Be polite
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True

# Look like a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Don't save cookies
COOKIES_ENABLED = False
product_spider.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            }

        # Follow next page
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
This setup:
- Rotates between proxies
- Adds delays between requests
- Uses a realistic User-Agent
- Follows pagination
- Logs which proxy each request uses
Perfect!
Quick Reference
Add single proxy:
meta={'proxy': 'http://123.45.67.89:8080'}
Add proxy with auth:
meta={'proxy': 'http://user:pass@123.45.67.89:8080'}
Rotate proxies:
proxy = random.choice(proxy_list)
meta={'proxy': proxy}
Test proxy:
response = requests.get('http://httpbin.org/ip', proxies={'http': proxy})
Summary
What are proxies?
Intermediary servers that hide your real IP address.
Why use them?
- Avoid IP bans
- Bypass rate limits
- Access geo-restricted content
- Scrape at scale
Types:
- Free: For learning
- Paid Datacenter: For medium projects
- Residential: For serious projects
Basic usage in Scrapy:
meta={'proxy': 'http://IP:PORT'}
Rotating proxies:
proxy = random.choice(proxy_list)
meta={'proxy': proxy}
Best practices:
- Test proxies first
- Rotate proxies
- Add delays anyway
- Monitor performance
- Start with free, upgrade to paid when needed
Remember:
Proxies are a tool, not a license to spam. Always be respectful, add delays, and follow robots.txt even when using proxies.
Happy scraping! 🕷️