You're scraping a website with Scrapy. You've set everything up perfectly. User agent? Check. Headers? Check. Cookies? Check.
But you keep getting 403 errors. Or blank pages. Or weird responses that make no sense.
Frustrated, you decide to test the same URL with Python's requests library. Just to see if the site is even working.
And it works perfectly. First try.
What the hell?
Same URL. Same headers. Same user agent. How is requests working when Scrapy isn't?
I've been there. I've lost hours debugging this exact problem. And the answer isn't what you'd expect. Let me show you what's really happening under the hood.
The Problem: They're Not Actually Sending the Same Request
Here's what blows most people's minds: even when you think you've configured Scrapy and requests identically, they send different HTTP requests.
Let me show you.
What You Think You're Sending
You set up your headers like this in Scrapy:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache'
}
yield scrapy.Request(url='https://example.com', headers=headers)
And like this in requests:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache'
}
response = requests.get('https://example.com', headers=headers)
Looks identical, right?
Wrong.
The Hidden Differences
Let me break down exactly what's different between these two requests.
Difference #1: Scrapy Capitalizes Everything
This is the biggest gotcha.
What you set:
headers = {
    'user-agent': 'Mozilla/5.0',
    'accept': 'text/html',
    'cache-control': 'no-cache'
}
What Scrapy actually sends:
User-Agent: Mozilla/5.0
Accept: text/html
Cache-Control: no-cache
Notice? Scrapy automatically capitalizes every header. Your lowercase user-agent becomes User-Agent. Your cache-control becomes Cache-Control.
What requests sends:
user-agent: Mozilla/5.0
accept: text/html
cache-control: no-cache
Requests keeps your headers exactly as you wrote them.
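Don't take my word for it. Here's a minimal sketch of a throwaway local server that prints the raw bytes a client sends, so you can point Scrapy and requests at http://127.0.0.1:8000/ and compare header casing (and order) yourself. The host, port, and buffer size are arbitrary choices, not anything either library requires:
import socket

def dump_raw_requests(host='127.0.0.1', port=8000):
    # Accept one connection at a time and print exactly what arrives on the wire
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        print(f'Listening on http://{host}:{port} - point your client here')
        while True:
            conn, _ = server.accept()
            with conn:
                raw = conn.recv(65536)        # request line + headers
                print(raw.decode('latin-1'))  # header names appear exactly as sent
                # Send a tiny valid response so the client doesn't hang
                conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok')

dump_raw_requests()
Run it in one terminal, fire the same request from Scrapy and from requests, and the differences jump out immediately.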
Why This Matters
"But wait," you're thinking, "HTTP headers are case-insensitive! The spec says so!"
You're right. According to the HTTP specification, Accept and accept should be treated identically.
But here's the dirty secret: not every website follows the spec.
Some websites:
- Use header capitalization to fingerprint bots
- Have buggy implementations that expect specific casing
- Check for capital letters as a bot detection signal
I've seen websites that specifically look for capitalized headers and block requests with them. Why? Because real browsers typically send lowercase headers in HTTP/2, while many bots and tools still send capitalized headers.
If your target website is one of these, Scrapy will get blocked while requests works fine.
Difference #2: Scrapy Adds Extra Headers You Didn't Ask For
Even if you don't specify certain headers, Scrapy adds them automatically.
You write this:
headers = {'user-agent': 'Mozilla/5.0'}
yield scrapy.Request(url='https://example.com', headers=headers)
Scrapy actually sends:
User-Agent: Mozilla/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en
See that Accept-Language: en? You never asked for it. Scrapy added it automatically.
With requests:
headers = {'user-agent': 'Mozilla/5.0'}
response = requests.get('https://example.com', headers=headers)
You only get what you specified. No surprises.
Why This Is a Problem
Let's say you're trying to replicate an exact request. You've captured a real browser's headers using developer tools. You copy them carefully into your Scrapy spider.
But Scrapy adds extra headers on top of yours. Now your request doesn't match the fingerprint you're trying to mimic. The website notices the mismatch and blocks you.
Meanwhile, requests sends exactly what you specified, matches the fingerprint perfectly, and works.
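The good news: those extras mostly come from Scrapy's own settings, so you can turn them off. Here's a minimal sketch of a settings.py that stops Scrapy from injecting its default Accept and Accept-Language headers (the setting names are real Scrapy settings; the user-agent string is just a placeholder):
# settings.py
# Scrapy's DefaultHeadersMiddleware injects Accept and Accept-Language on every
# request unless you override this. An empty dict means "add nothing I didn't ask for".
DEFAULT_REQUEST_HEADERS = {}

# The User-Agent header comes from this setting when you don't set one per request
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Accept-Encoding: gzip, deflate is added by HttpCompressionMiddleware;
# uncomment this only if you need to control that header yourself too
# COMPRESSION_ENABLED = False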
Difference #3: They Use Different HTTP Libraries
This is where it gets technical, but stick with me.
Scrapy uses Twisted
Scrapy is built on top of a networking library called Twisted. Specifically, it uses Twisted's HTTP client to make requests.
Requests uses urllib3
The requests library uses urllib3 under the hood, which in turn builds on Python's standard-library http.client.
Why this matters:
These libraries format HTTP requests slightly differently. Even with identical headers, the actual bytes sent over the network can differ in:
- Header order
- Line endings
- Whitespace
- Connection handling
Sophisticated websites can detect these differences.
Difference #4: Header Order Is Different
Here's something most people don't know: the order of HTTP headers matters to some websites.
Scrapy might send:
User-Agent: Mozilla/5.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Accept-Language: en
Host: example.com
Requests might send:
Host: example.com
User-Agent: Mozilla/5.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Different order. Same headers.
Websites that fingerprint requests based on header order will see these as different clients.
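If order matters for your target, you can at least pin it on the requests side. A minimal sketch, assuming the header values below are placeholders you'd replace with ones captured from a real browser: clear the session's defaults, then insert your headers in the order you want them sent. requests generally preserves that insertion order, with Host added first by the underlying http.client:
import requests

session = requests.Session()
session.headers.clear()  # drop requests' defaults (User-Agent, Accept-Encoding, Accept, Connection)
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
    'accept-encoding': 'gzip, deflate',
})
response = session.get('https://example.com')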
Difference #5: Cookie Handling Is Different
Scrapy and requests handle cookies very differently.
In Scrapy:
Scrapy has a CookiesMiddleware that automatically manages cookies for you. When you set cookies, you have to do it like this:
yield scrapy.Request(
    url='https://example.com',
    cookies={'session': 'abc123'}
)
If you try to set cookies via headers like this:
yield scrapy.Request(
    url='https://example.com',
    headers={'Cookie': 'session=abc123'}  # This doesn't work!
)
It gets ignored! Scrapy's CookiesMiddleware takes over and manages cookies separately.
In requests:
Both methods work:
# Method 1
response = requests.get('https://example.com', cookies={'session': 'abc123'})
# Method 2 (also works!)
response = requests.get('https://example.com', headers={'Cookie': 'session=abc123'})
If you're trying to replicate a specific request with cookies in the headers, Scrapy might not send them the way you expect.
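If you really do need Scrapy to send a Cookie header exactly as you wrote it, you can tell the CookiesMiddleware to keep its hands off a specific request. A minimal sketch (the cookie value is a placeholder):
yield scrapy.Request(
    url='https://example.com',
    headers={'Cookie': 'session=abc123'},
    # dont_merge_cookies makes CookiesMiddleware skip this request,
    # so the raw Cookie header above is sent as-is instead of being managed for you
    meta={'dont_merge_cookies': True},
)
Setting COOKIES_ENABLED = False in settings.py has the same effect project-wide, at the cost of losing Scrapy's automatic cookie handling everywhere.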
A Real Example: The Mysterious 403
Let me tell you about a time this happened to me.
I was scraping a retail website. My Scrapy spider kept getting 403 Forbidden errors. I checked everything:
- User agent looked good
- Headers seemed fine
- No robots.txt restrictions
- The site wasn't even using JavaScript
Out of desperation, I copied the URL and tried it with requests. Worked perfectly. Same IP, same everything.
I spent three hours debugging before I finally captured the actual HTTP requests and compared them side by side.
Here's what I found:
My requests code (worked):
GET /products HTTP/1.1
host: example.com
user-agent: Mozilla/5.0
accept: text/html
My Scrapy code (blocked):
GET /products HTTP/1.1
User-Agent: Mozilla/5.0
Accept: text/html
Accept-Language: en
Host: example.com
The differences:
- Capitalized headers (User-Agent vs user-agent)
- Extra Accept-Language header I never asked for
- Different header order
The website was looking for lowercase headers as a signal of a real browser. When it saw capitalized headers, it assumed "bot" and blocked me.
Once I understood this, I could fix it. But it took way too long to figure out.
The Solution
Now that you understand what's happening, here's how to fix it.
Control Header Capitalization Directly in Twisted (The Surgical Approach)
Here's the most elegant solution that directly addresses Twisted's header capitalization at its source. This approach manipulates Twisted's internal case mappings to force specific headers to remain lowercase.
Add this to your spider initialization or settings file:
from twisted.web.http_headers import _NameEncoder as TwistedHeaders
# Override Twisted's capitalization mappings for specific headers
TwistedHeaders._caseMappings.update({
    b'access_token': b'access_token',   # Keep access_token lowercase
    b'user-agent': b'user-agent',       # Keep user-agent lowercase
    b'accept': b'accept',               # Keep accept lowercase
    b'content-type': b'content-type',   # Keep content-type lowercase
})
How it works:
Twisted keeps an internal _caseMappings dictionary of header names whose outgoing form should not follow the usual dash-capitalization rule (by default it only holds special cases like b'etag' → b'ETag'). Any header not in the dictionary gets dash-capitalized automatically, which is exactly how user-agent becomes User-Agent. By adding entries that map a lowercase name to itself, you're telling Twisted: "When you see this header, don't capitalize it, send it exactly as I wrote it." One caveat: this is a private API whose location has moved between Twisted releases (older versions keep _caseMappings directly on twisted.web.http_headers.Headers), so pin your Twisted version and confirm the import above matches it.
Example usage in a spider:
from twisted.web.http_headers import _NameEncoder as TwistedHeaders
import scrapy

# Configure case mappings before the spider runs
TwistedHeaders._caseMappings.update({
    b'access_token': b'access_token',
    b'authorization': b'authorization',
})

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://api.example.com']

    def start_requests(self):
        headers = {
            'access_token': 'secret123',
            'authorization': 'Bearer token',
            'user-agent': 'MyBot/1.0'
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)
What actually gets sent:
access_token: secret123
authorization: Bearer token
User-Agent: MyBot/1.0
Notice access_token and authorization stayed lowercase (because we mapped them), while user-agent got capitalized (because we didn't map it).
When to use this solution:
- You need precise control over specific header casing
- You're dealing with APIs that require exact lowercase headers
- You want a global solution that applies to all requests
- You need to bypass Twisted's capitalization for specific headers only
Advantages:
- Clean and surgical—only affects the headers you specify
- No middleware needed
- Works globally across your entire Scrapy project
- Doesn't break other Scrapy functionality
- Minimal performance impact
When to Use What
After all this, here's my advice on which tool to use:
Use Python requests when:
- You need exact control over every header
- You're making simple, one-off requests
- The site is extremely picky about request formatting
- You've tried Scrapy and can't make it work
Use Scrapy when:
- You're crawling multiple pages
- You need advanced features (pipelines, middlewares)
- Performance and concurrency matter
- The site isn't too sensitive about request details
Use the hybrid approach when:
- Scrapy gets blocked but requests works
- You need requests' compatibility with Scrapy's power
- You want the best of both worlds (see the sketch below)
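Here's a minimal sketch of that hybrid approach: let Scrapy drive the crawl, and when it hits a blocked page, refetch that one URL with requests and wrap the body in a Scrapy response object so your selectors and pipelines keep working. Keep in mind this is a blocking call that bypasses Scrapy's scheduler and concurrency, so reserve it for the few URLs that genuinely need it. The URL, headers, and CSS selector are placeholders:
import requests
import scrapy
from scrapy.http import HtmlResponse

class HybridSpider(scrapy.Spider):
    name = 'hybrid'
    start_urls = ['https://example.com/products']
    # Let 403 responses reach parse() instead of being dropped by HttpErrorMiddleware
    handle_httpstatus_list = [403]

    def parse(self, response):
        if response.status == 403:
            # Scrapy got blocked - retry this one URL with requests, which sends
            # the headers exactly as written (blocking call, use sparingly)
            resp = requests.get(
                response.url,
                headers={'user-agent': 'Mozilla/5.0', 'accept': 'text/html'},
                timeout=30,
            )
            # Wrap the body so the usual Scrapy selectors still work downstream
            response = HtmlResponse(url=resp.url, body=resp.content,
                                    encoding=resp.encoding or 'utf-8')
        for title in response.css('h2::text').getall():
            yield {'title': title}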
Final Thoughts
The difference between Scrapy and requests isn't just configuration. They're fundamentally different tools that make HTTP requests in different ways.
Understanding these differences saves you hours of debugging. Now when Scrapy gets blocked and requests works, you'll know exactly where to look.
Key takeaways:
- Scrapy capitalizes headers (requests doesn't)
- Scrapy adds extra headers automatically
- They use different HTTP libraries (Twisted vs urllib3)
- Cookie handling is different
- You can override Twisted's case mappings directly for surgical control
- Sometimes the best solution is using both together
Don't fight it. Use the right tool for the job. And when all else fails, there's no shame in using requests inside a Scrapy spider.
Happy scraping! 🕷️