Why Scrapy Gets Blocked But Python Requests Works (And How to Fix It)

You're scraping a website with Scrapy. You've set everything up perfectly. User agent? Check. Headers? Check. Cookies? Check.

But you keep getting 403 errors. Or blank pages. Or weird responses that make no sense.

Frustrated, you decide to test the same URL with Python's requests library. Just to see if the site is even working.

And it works perfectly. First try.

What the hell?

Same URL. Same headers. Same user agent. How is requests working when Scrapy isn't?

I've been there. I've lost hours debugging this exact problem. And the answer isn't what you'd expect. Let me show you what's really happening under the hood.


The Problem: They're Not Actually Sending the Same Request

Here's what blows most people's minds: even when you think you've configured Scrapy and requests identically, they send different HTTP requests.

Let me show you.

What You Think You're Sending

You set up your headers like this in Scrapy:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache'
}

yield scrapy.Request(url='https://example.com', headers=headers)

And like this in requests:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'accept': 'text/html,application/xhtml+xml',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache'
}

response = requests.get('https://example.com', headers=headers)

Looks identical, right?

Wrong.


The Hidden Differences

Let me break down exactly what's different between these two requests.

Difference #1: Scrapy Capitalizes Everything

This is the biggest gotcha.

What you set:

headers = {
    'user-agent': 'Mozilla/5.0',
    'accept': 'text/html',
    'cache-control': 'no-cache'
}

What Scrapy actually sends:

User-Agent: Mozilla/5.0
Accept: text/html
Cache-Control: no-cache

Notice? Scrapy automatically capitalizes every header. Your lowercase user-agent becomes User-Agent. Your cache-control becomes Cache-Control.

What requests sends:

user-agent: Mozilla/5.0
accept: text/html
cache-control: no-cache

Requests keeps your headers exactly as you wrote them.
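
You can see Scrapy's normalization without sending a single packet. Here's a minimal sketch (assuming a recent Scrapy release) that just builds a Request and inspects its headers:

from scrapy.http import Request

# Scrapy's Headers class title-cases keys as it stores them, so the
# rewrite is visible before anything goes over the wire.
req = Request(
    'https://example.com',
    headers={'user-agent': 'Mozilla/5.0', 'cache-control': 'no-cache'},
)
print(req.headers)
# {b'User-Agent': [b'Mozilla/5.0'], b'Cache-Control': [b'no-cache']}
# (exact repr may vary slightly by version, but the keys are title-cased)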

Why This Matters

"But wait," you're thinking, "HTTP headers are case-insensitive! The spec says so!"

You're right. According to the HTTP specification, Accept and accept should be treated identically.

But here's the dirty secret: not every website follows the spec.

Some websites:

  • Use header capitalization to fingerprint bots
  • Have buggy implementations that expect specific casing
  • Check for capital letters as a bot detection signal

I've seen websites that specifically look for capitalized headers and block requests that carry them. Why? Because HTTP/2 requires lowercase header names, so real browsers send lowercase headers, while many bots and HTTP/1.1 tools still send capitalized ones.

If your target website is one of these, Scrapy will get blocked while requests works fine.


Difference #2: Scrapy Adds Extra Headers You Didn't Ask For

Even if you don't specify certain headers, Scrapy adds them automatically.

You write this:

headers = {'user-agent': 'Mozilla/5.0'}

yield scrapy.Request(url='https://example.com', headers=headers)

Scrapy actually sends:

User-Agent: Mozilla/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en

See that Accept and that Accept-Language: en? You never asked for them. Scrapy added them automatically.

With requests:

headers = {'user-agent': 'Mozilla/5.0'}

response = requests.get('https://example.com', headers=headers)

You only get what you specified. No surprises.
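
Those extra headers come from Scrapy's DEFAULT_REQUEST_HEADERS setting, applied by the built-in DefaultHeadersMiddleware, so you can simply empty it. A minimal sketch for settings.py (the User-Agent value is a placeholder):

# settings.py

# Scrapy ships with Accept and Accept-Language defaults; emptying the
# setting stops DefaultHeadersMiddleware from adding them for you.
DEFAULT_REQUEST_HEADERS = {}

# The default User-Agent comes separately, via the USER_AGENT setting.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'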

Why This Is a Problem

Let's say you're trying to replicate an exact request. You've captured a real browser's headers using developer tools. You copy them carefully into your Scrapy spider.

But Scrapy adds extra headers on top of yours. Now your request doesn't match the fingerprint you're trying to mimic. The website notices the mismatch and blocks you.

Meanwhile, requests sends exactly what you specified, matches the fingerprint perfectly, and works.


Difference #3: They Use Different HTTP Libraries

This is where it gets technical, but stick with me.

Scrapy uses Twisted

Scrapy is built on top of a networking library called Twisted. Specifically, it uses Twisted's HTTP client to make requests.

Requests uses urllib3

The requests library uses urllib3 under the hood, which is built on Python's standard HTTP client.

Why this matters:

These libraries format HTTP requests slightly differently. Even with identical headers, the actual bytes sent over the network can differ in:

  • Header order
  • Line endings
  • Whitespace
  • Connection handling

Sophisticated websites can detect these differences.
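
You don't have to take this on faith. A quick way to see the raw bytes each client sends is a throwaway local server that prints whatever arrives; here's a minimal sketch (port 8000 is arbitrary):

import socket

# One-shot server: accept a single connection, dump the raw request
# bytes, return an empty 200 so the client doesn't hang.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('127.0.0.1', 8000))
srv.listen(1)
conn, _ = srv.accept()
print(conn.recv(65536).decode('latin-1'))
conn.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n')
conn.close()
srv.close()

Run it, point Scrapy at http://127.0.0.1:8000 in one run and requests in another, and diff the printouts: the casing, the extra headers, and the ordering differences are all right there.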


Difference #4: Header Order Is Different

Here's something most people don't know: the order of HTTP headers matters to some websites.

Scrapy might send:

User-Agent: Mozilla/5.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Accept-Language: en
Host: example.com

Requests might send:

Host: example.com
User-Agent: Mozilla/5.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

Different order. Same headers.

Websites that fingerprint requests based on header order will see these as different clients.


Difference #5: Cookie Handling Is Different

Scrapy and requests handle cookies very differently.

In Scrapy:

Scrapy has a CookiesMiddleware that automatically manages cookies for you. When you set cookies, you have to do it like this:

yield scrapy.Request(
    url='https://example.com',
    cookies={'session': 'abc123'}
)

If you try to set cookies via headers like this:

yield scrapy.Request(
    url='https://example.com',
    headers={'Cookie': 'session=abc123'}  # This doesn't work!
)

It gets ignored! Scrapy's CookiesMiddleware takes over and manages cookies separately.

In requests:

Both methods work:

# Method 1
response = requests.get('https://example.com', cookies={'session': 'abc123'})

# Method 2 (also works!)
response = requests.get('https://example.com', headers={'Cookie': 'session=abc123'})

If you're trying to replicate a specific request with cookies in the headers, Scrapy might not send them the way you expect.
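
If the Cookie header has to go out exactly as you wrote it, one workaround is switching Scrapy's cookie management off so nothing rewrites it. A sketch, assuming you can live without automatic cookie tracking (behavior details vary a bit between Scrapy versions, so verify against your target):

# settings.py

# Disable CookiesMiddleware entirely. With it off, a 'Cookie' header you
# set on the request passes through untouched. The trade-offs: Scrapy no
# longer tracks Set-Cookie responses for you, and the Request(cookies=...)
# argument stops working.
COOKIES_ENABLED = False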


A Real Example: The Mysterious 403

Let me tell you about a time this happened to me.

I was scraping a retail website. My Scrapy spider kept getting 403 Forbidden errors. I checked everything:

  • User agent looked good
  • Headers seemed fine
  • No robots.txt restrictions
  • The site wasn't even using JavaScript

Out of desperation, I copied the URL and tried it with requests. Worked perfectly. Same IP, same everything.

I spent three hours debugging before I finally captured the actual HTTP requests and compared them side by side.

Here's what I found:

My requests code (worked):

GET /products HTTP/1.1
host: example.com
user-agent: Mozilla/5.0
accept: text/html

My Scrapy code (blocked):

GET /products HTTP/1.1
User-Agent: Mozilla/5.0
Accept: text/html
Accept-Language: en
Host: example.com

The differences:

  1. Capitalized headers (User-Agent vs user-agent)
  2. Extra Accept-Language header I never asked for
  3. Different header order

The website was looking for lowercase headers as a signal of a real browser. When it saw capitalized headers, it assumed "bot" and blocked me.

Once I understood this, I could fix it. But it took way too long to figure out.


The Solution

Now that you understand what's happening, here's how to fix it.

Control Header Capitalization (The Twisted Way)

Here's the most elegant solution that directly addresses Twisted's header capitalization at its source. This approach manipulates Twisted's internal case mappings to force specific headers to remain lowercase.

Add this to your spider initialization or settings file:

from twisted.web.http_headers import _NameEncoder as TwistedHeaders

# NOTE: _caseMappings is a private Twisted API, and its home has moved
# between releases (older versions keep it on
# twisted.web.http_headers.Headers), so pin your Twisted version or
# guard this import.
# Override Twisted's capitalization mappings for specific headers:
TwistedHeaders._caseMappings.update({
    b'access_token': b'access_token',      # Keep access_token lowercase
    b'user-agent': b'user-agent',          # Keep user-agent lowercase
    b'accept': b'accept',                  # Keep accept lowercase
    b'content-type': b'content-type',      # Keep content-type lowercase
})

How it works:

Twisted's _NameEncoder class has an internal _caseMappings dictionary that maps lowercase header names to their "canonical" capitalized form (e.g., b'user-agent' → b'User-Agent'). By updating this dictionary, you're telling Twisted: "When you see this header, don't capitalize it—keep it exactly as I specify."

Example usage in a spider:

from twisted.web.http_headers import _NameEncoder as TwistedHeaders
import scrapy

# Configure case mappings before spider runs
TwistedHeaders._caseMappings.update({
    b'access_token': b'access_token',
    b'authorization': b'authorization',
})

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://api.example.com']

    def start_requests(self):
        headers = {
            'access_token': 'secret123',
            'authorization': 'Bearer token',
            'user-agent': 'MyBot/1.0'
        }

        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

What actually gets sent:

access_token: secret123
authorization: Bearer token
User-Agent: MyBot/1.0

Notice access_token and authorization stayed lowercase (because we mapped them), while user-agent got capitalized (because we didn't map it).

When to use this solution:

  • You need precise control over specific header casing
  • You're dealing with APIs that require exact lowercase headers
  • You want a global solution that applies to all requests
  • You need to bypass Twisted's capitalization for specific headers only

Advantages:

  • Clean and surgical—only affects the headers you specify
  • No middleware needed
  • Works globally across your entire Scrapy project
  • Doesn't break other Scrapy functionality
  • Minimal performance impact

When to Use What

After all this, here's my advice on which tool to use:

Use Python requests when:

  • You need exact control over every header
  • You're making simple, one-off requests
  • The site is extremely picky about request formatting
  • You've tried Scrapy and can't make it work

Use Scrapy when:

  • You're crawling multiple pages
  • You need advanced features (pipelines, middlewares)
  • Performance and concurrency matter
  • The site isn't too sensitive about request details

Use the hybrid approach (sketched below) when:

  • Scrapy gets blocked but requests works
  • You need requests' compatibility with Scrapy's power
  • You want the best of both worlds
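
For completeness, here's roughly what that hybrid looks like: fetch the picky page with requests inside a Scrapy callback, then wrap the body in an HtmlResponse so Scrapy's selectors still work. This is a minimal sketch with placeholder URLs and selectors; keep in mind that requests is blocking, so heavy use of it will stall Scrapy's async engine.

import requests
import scrapy
from scrapy.http import HtmlResponse

class HybridSpider(scrapy.Spider):
    name = 'hybrid'

    def start_requests(self):
        # Kick off with a plain Scrapy request; the picky page is
        # fetched with requests in the callback below.
        yield scrapy.Request('https://example.com', callback=self.parse)

    def parse(self, response):
        # requests sends the headers exactly as written (lowercase,
        # nothing extra), which is the whole point of the detour.
        resp = requests.get(
            'https://example.com/products',   # placeholder URL
            headers={'user-agent': 'Mozilla/5.0', 'accept': 'text/html'},
            timeout=30,
        )
        # Wrap the body so Scrapy selectors work on it as usual.
        page = HtmlResponse(url=resp.url, body=resp.content, encoding='utf-8')
        for title in page.css('h2::text').getall():  # placeholder selector
            yield {'title': title}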

Final Thoughts

The difference between Scrapy and requests isn't just configuration. They're fundamentally different tools that make HTTP requests in different ways.

Understanding these differences saves you hours of debugging. Now when Scrapy gets blocked and requests works, you'll know exactly where to look.

Key takeaways:

  • Scrapy capitalizes headers (requests doesn't)
  • Scrapy adds extra headers automatically
  • They use different HTTP libraries (Twisted vs urllib3)
  • Cookie handling is different
  • You can override Twisted's case mappings directly for surgical control
  • Sometimes the best solution is using both together

Don't fight it. Use the right tool for the job. And when all else fails, there's no shame in using requests inside a Scrapy spider.

Happy scraping! 🕷️
