DEV Community

John Rooney for Extract by Zyte


Why Your Scraper Works in the Browser But Fails in Python

When a requests.get() call returns a 403, or a 200 with an "Access Denied" body, the first instinct is usually to blame the site. But the more likely explanation is that the server received a request that doesn't look anything like what a browser sends — and responded accordingly.

HTTP servers see every header your client sends. A bare requests call sends four. Chrome sends around fifteen, and the values are specific enough that the gap is obvious server-side. This post covers what that gap looks like, why it matters, and how to close it.


What requests actually sends by default

Start a fresh Python session and inspect what requests puts on the wire:

import requests

session = requests.Session()
req = requests.Request("GET", "https://example.com")
prepared = session.prepare_request(req)

for header, value in prepared.headers.items():
    print(f"{header}: {value}")

Output:

User-Agent: python-requests/2.33.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

Four headers. That's it (Host is added later, at the transport layer). A real Chrome browser on the same request would send around fifteen, and the content of each one is meaningfully different.

The User-Agent is the most obvious signal — python-requests/2.33.1 is not subtle — but it's also the least interesting one to fixate on, because it's rarely the only reason a request fails. The deeper issue is the overall fingerprint: the combination of which headers are present, in what order, and with what values.
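
To make "fingerprint" a bit more concrete, here's a rough sketch of how a check could collapse the header names and their order into one comparable value. The function name header_fingerprint and the hashing scheme are assumptions made up for illustration; real detection systems use their own, usually more elaborate, signals.

```python
import hashlib


def header_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash the ordered, lower-cased header names (hypothetical scheme)."""
    ordered_names = ",".join(name.lower() for name, _ in headers)
    return hashlib.sha256(ordered_names.encode("ascii")).hexdigest()[:12]


# The four headers requests sets by default, in the order printed above.
requests_default = [
    ("User-Agent", "python-requests/2.33.1"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Accept", "*/*"),
    ("Connection", "keep-alive"),
]

# Same headers, different order: the fingerprint changes.
reordered = [
    requests_default[2],
    requests_default[0],
    requests_default[1],
    requests_default[3],
]

print(header_fingerprint(requests_default))
print(header_fingerprint(reordered))
```

The point is only that presence and order are cheap to compare, so changing a single value such as User-Agent leaves the rest of the profile untouched.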


What a browser actually sends

For a standard top-level page navigation, Chrome sends something like this:

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate, br
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Connection: keep-alive

A few things stand out:

Accept is specific. The browser advertises exactly which content types it can handle, with quality weights (q=0.9). The requests default of */* says "anything goes," which is an unusual declaration for something claiming to be a browser.

Sec-Fetch-* headers are a family added by Chrome in 2019. Sec-Fetch-Dest: document says the request is fetching a top-level document. Sec-Fetch-Mode: navigate says it's a navigation between documents, and Sec-Fetch-User: ?1 says a user action triggered it. Sec-Fetch-Site: none says there's no referring site (i.e., it was typed directly or bookmarked). These headers don't affect what most sites return, but a site that checks for them can treat a request with none of them as likely non-browser traffic.

Accept-Language identifies the browser's locale. Absent entirely in a raw requests call.
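
If you want to see the gap rather than eyeball it, a small comparison helper does the job. This is a minimal sketch: header_gap is a hypothetical helper, and the two dicts simply restate the header sets shown earlier in this post. Swap in values captured from your own browser.

```python
REQUESTS_DEFAULT = {
    "User-Agent": "python-requests/2.33.1",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}

BROWSER_NAVIGATION = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Connection": "keep-alive",
}


def header_gap(client: dict[str, str], browser: dict[str, str]) -> dict[str, list[str]]:
    """List browser headers the client lacks or sends with another value."""
    # Header names are case-insensitive, so compare on lower-cased keys.
    client_lc = {name.lower(): value for name, value in client.items()}
    missing, different = [], []
    for name, value in browser.items():
        if name.lower() not in client_lc:
            missing.append(name)
        elif client_lc[name.lower()] != value:
            different.append(name)
    return {"missing": missing, "different": different}


gap = header_gap(REQUESTS_DEFAULT, BROWSER_NAVIGATION)
print("Missing:  ", gap["missing"])
print("Different:", gap["different"])
```

Only Connection matches. Six headers are absent and three carry different values.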


Building a session with browser-like headers

The right approach is to set your headers once on a Session object, not on every individual request. Sessions also handle cookies automatically across requests, which matters as soon as you need to scrape pages behind a login or track state.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

session = requests.Session()
session.headers.update(HEADERS)

With this session, every session.get() and session.post() call will include these headers automatically. You can override individual headers per-request by passing a headers dict to the call — the per-request dict is merged with the session headers, with the per-request values winning on collision:

# Adds Referer for this request only; all other session headers still apply
resp = session.get(
    "https://example.com/products/",
    headers={"Referer": "https://example.com/"},
)
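
You can check the merge behaviour without touching the network by preparing a request instead of sending it. The sketch below uses a trimmed-down header set for brevity. It also relies on a documented requests detail: a per-request value of None removes that header from the merged result.

```python
import requests

session = requests.Session()
session.headers.update({
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Sec-Fetch-User": "?1",
})

# Build the request without sending it, so the merged headers can be inspected.
req = requests.Request(
    "GET",
    "https://example.com/products/",
    headers={
        "Accept": "application/json",       # overrides the session value
        "Referer": "https://example.com/",  # added for this request only
        "Sec-Fetch-User": None,             # None drops the session header
    },
)
prepared = session.prepare_request(req)

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Accept-Language comes through unchanged from the session, Accept takes the per-request value, and Sec-Fetch-User is gone for this request only.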

A working example

books.toscrape.com is a site built specifically for scraping practice. Here's a complete working scraper using the session approach:

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

session = requests.Session()
session.headers.update(HEADERS)

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.raise_for_status()
resp.encoding = "utf-8"  # server omits charset in Content-Type; set it explicitly

soup = BeautifulSoup(resp.text, "html.parser")
books = soup.find_all("article", class_="product_pod")

for book in books:
    title  = book.find("h3").find("a")["title"]
    price  = book.find("p", class_="price_color").text.strip()
    rating = book.find("p", class_="star-rating")["class"][1]
    print(f"{title} | {price} | {rating} stars")

Output:

A Light in the Attic | £51.77 | Three stars
Tipping the Velvet | £53.74 | One stars
Soumission | £50.10 | One stars
Sharp Objects | £47.82 | Four stars
Sapiens: A Brief History of Humankind | £54.23 | Five stars
...

One note on resp.encoding = "utf-8": when a server sends Content-Type: text/html without a charset parameter, requests defaults to ISO-8859-1 per the HTTP spec. That produces garbled currency symbols (£ becomes Â£). Setting it explicitly to UTF-8 before accessing resp.text fixes it. Alternatively, pass resp.content (raw bytes) to BeautifulSoup and let it detect the encoding from the HTML meta tags, but that can fail on malformed pages, so explicit is safer.
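
The garbling is easy to reproduce with plain byte strings, no network needed. This sketch decodes the same UTF-8 bytes both ways:

```python
# Bytes as a UTF-8 server would send them for the first price.
raw = "£51.77".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # what the ISO-8859-1 fallback produces
right = raw.decode("utf-8")       # what resp.encoding = "utf-8" produces

print(wrong)  # Â£51.77
print(right)  # £51.77
```

The pound sign is two bytes in UTF-8, and ISO-8859-1 reads each byte as its own character, which is where the stray Â comes from.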


When you need to scrape multiple pages concurrently

If you're pulling many pages and speed matters, httpx gives you an async-compatible API with the same session model. The header setup is identical; you just swap the client:

import asyncio
import httpx
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

async def fetch_pages(urls: list[str]) -> list[httpx.Response]:
    async with httpx.AsyncClient(
        headers=HEADERS,
        timeout=15,
        follow_redirects=True,
    ) as client:
        tasks = [client.get(url) for url in urls]
        return await asyncio.gather(*tasks)

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
]

responses = asyncio.run(fetch_pages(urls))

for url, resp in zip(urls, responses):
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    print(f"{url} -> {len(books)} books")

Output:

https://books.toscrape.com/catalogue/page-1.html -> 20 books
https://books.toscrape.com/catalogue/page-2.html -> 20 books
https://books.toscrape.com/catalogue/page-3.html -> 20 books

httpx.AsyncClient fetches all three pages concurrently rather than sequentially. For 3 pages the difference is negligible; for 50 it's significant.

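One thing to watch: asyncio.gather() will raise as soon as any request fails at the transport level (a timeout or a connection error), and you lose the results of the others. HTTP error statuses such as 403 don't raise on their own in httpx; call resp.raise_for_status() if you want them to. In production you'd want to wrap each client.get() call in a try/except or pass return_exceptions=True to gather() so a single failed request doesn't kill the whole batch.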
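
Here's a minimal sketch of the return_exceptions=True route. To keep it runnable offline, fake_get is a made-up stand-in for client.get() that fails for one URL; the gather() handling is the part that carries over.

```python
import asyncio


async def fake_get(url: str) -> str:
    """Stand-in for client.get(); fails for one URL to simulate a timeout."""
    await asyncio.sleep(0)
    if "page-2" in url:
        raise TimeoutError(f"timed out fetching {url}")
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str]) -> list:
    tasks = [fake_get(url) for url in urls]
    # Exceptions are returned in place of results instead of being raised.
    return await asyncio.gather(*tasks, return_exceptions=True)


urls = ["page-1.html", "page-2.html", "page-3.html"]
results = asyncio.run(fetch_all(urls))

succeeded = [u for u, r in zip(urls, results) if not isinstance(r, Exception)]
failed = [u for u, r in zip(urls, results) if isinstance(r, Exception)]
print("ok:", succeeded)
print("failed:", failed)
```

Results come back in the same order as the input URLs, so zipping them together is enough to tell which requests need a retry.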


A helper to catch silent failures

A 200 response doesn't mean you got what you wanted. Some sites return 200 with a block page or a challenge in the body. Add a quick check to your fetch function:

import requests

def fetch(session: requests.Session, url: str, **kwargs) -> requests.Response:
    kwargs.setdefault("timeout", 15)  # default timeout; callers can override it
    resp = session.get(url, **kwargs)
    resp.raise_for_status()

    # Look for common block-page wording in an otherwise successful response
    body = resp.text.lower()
    if "access denied" in body or "captcha" in body or "enable javascript" in body:
        raise ValueError(f"Block signal detected in response body for {url}")

    return resp

This won't catch every case — some challenge pages use different wording, and JavaScript-rendered content simply won't appear in the HTML at all — but it catches the obvious ones early and turns a silent data quality problem into a loud error.
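
One way to narrow that gap is to also assert that something you expect is present, not just that block wording is absent. The sketch below is an assumption-laden variant: check_body and its required_marker parameter are hypothetical, and the marker is just a string every real page of that type should contain.

```python
BLOCK_PHRASES = ("access denied", "captcha", "enable javascript")


def check_body(html: str, required_marker: str) -> None:
    """Raise if the body looks like a block page or lacks expected content.

    required_marker is a hypothetical parameter: a string that every real
    page of the kind being scraped should contain, e.g. "product_pod".
    """
    lowered = html.lower()
    for phrase in BLOCK_PHRASES:
        if phrase in lowered:
            raise ValueError(f"Block phrase found: {phrase!r}")
    if required_marker.lower() not in lowered:
        raise ValueError(f"Expected marker missing: {required_marker!r}")


# A page with the expected markup passes silently.
check_body('<article class="product_pod">...</article>', "product_pod")

# A page with no block wording but no content either is still rejected.
try:
    check_body("<html><body>Loading...</body></html>", "product_pod")
except ValueError as exc:
    print(exc)
```

This catches the empty-shell case where the data is rendered by JavaScript and the HTML contains no warning text at all.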


Quick reference: headers you should always set

Header | Why it matters
User-Agent | Identifies the client; python-requests/x.x is an instant flag
Accept | Advertises supported content types; */* is unusual for browser traffic
Accept-Language | Absent in raw requests; browsers always send it
Upgrade-Insecure-Requests | Tells the server you prefer HTTPS; browsers send it on page navigations
Sec-Fetch-Dest | Part of the Fetch metadata spec; absent headers are a signal
Sec-Fetch-Mode | As above
Sec-Fetch-Site | As above
Sec-Fetch-User | Indicates user-initiated navigation

Copy the values directly from your browser's DevTools for the site you're targeting. The values above are correct for Chrome on Linux navigating to a top-level page; they differ slightly for XHR requests, POST submissions, and sub-resource loads (images, scripts).
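
As a sketch of how those differences might be organised, you can keep one header profile per request type and pick the right one per call. The XHR values below are typical for a same-origin fetch() call but are assumptions; confirm them in DevTools for your target. headers_for is a hypothetical helper.

```python
from typing import Optional

NAVIGATION_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

# Assumed values for a same-origin fetch()/XHR call made by the page's own
# JavaScript. A value of None tells requests to drop that header if the
# session already sets it.
XHR_HEADERS = {
    "Accept": "*/*",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-User": None,
    "Upgrade-Insecure-Requests": None,
}

PROFILES = {"navigation": NAVIGATION_HEADERS, "xhr": XHR_HEADERS}


def headers_for(kind: str) -> dict[str, Optional[str]]:
    """Return a copy of the header profile for 'navigation' or 'xhr'."""
    return dict(PROFILES[kind])  # copy so callers can't mutate the profile


print(headers_for("xhr"))
```

With a requests session, pass the result per call, for example session.get(url, headers=headers_for("xhr")), so the navigation-only headers set on the session don't leak into API-style requests.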


What this doesn't solve

Better headers get you further than the bare requests default, but they're not a complete solution. Two scenarios they won't help with:

JavaScript-rendered content. If the data you want is injected by JavaScript after the initial HTML loads, requests and httpx will never see it. The HTML response simply won't contain those elements. The fix for that is a real browser via Playwright or Selenium — or finding the underlying API call the JavaScript is making (often easier and more reliable). That's what the next post in this series covers.

TLS fingerprinting. The HTTP headers your scraper sends are one signal; the TLS handshake is another. Some detection systems check the cipher suite order and TLS extension profile your client presents, which differs from Chrome's even if your headers match exactly. requests uses Python's ssl module, which has a distinct fingerprint. Addressing that requires a library like curl_cffi that wraps libcurl's TLS stack and can impersonate Chrome's handshake.
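
For reference, a minimal curl_cffi sketch looks like the following. It assumes curl_cffi is installed and that your version accepts the generic "chrome" impersonation target (older releases may need a versioned name from the curl_cffi docs). fetch_with_chrome_tls is a hypothetical wrapper name.

```python
def fetch_with_chrome_tls(url: str, timeout: float = 15):
    """Fetch a URL while impersonating Chrome's TLS handshake.

    Assumes `pip install curl_cffi`; the import is local so the rest of the
    module still loads when the package is missing.
    """
    from curl_cffi import requests as curl_requests

    # impersonate selects which browser's TLS profile curl_cffi presents.
    return curl_requests.get(url, impersonate="chrome", timeout=timeout)


# Usage (performs a real network request):
# resp = fetch_with_chrome_tls("https://books.toscrape.com/")
# print(resp.status_code)
```

The response object is modelled on the requests one, so the parsing code from the earlier examples should need little or no change.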

For most sites and most scraping tasks, the session approach above is enough to get started. When it isn't, the failure mode is usually clear: you'll get 403s, CAPTCHAs, or suspiciously empty responses regardless of how you set your headers.

Tags: python scrapy webscraping tutorial
