Posting this because the symptom was weird and the fix was one line, and I want the next person who hits it to find the answer faster than I did.
The symptom
I run a freelance-job scraper against two public sites: PeoplePerHour and Guru.com. Both public listing pages, both HTML, nothing fancy.
For seven days, the scraper logged:
[scrape] pph: parsed 0 job listings
[scrape] guru: parsed 0 job listings
No exceptions. No 4xx or 5xx responses. Both endpoints returned 200 OK. My regex over the page was just returning an empty list.
The diagnostic
First, I assumed the HTML structure had changed. I pulled the page up in a browser, viewed source, and grabbed the actual anchor pattern. It matched what my regex expected.
Then I dumped the response body my scraper was getting:
r = requests.get(url, headers=HEADERS, timeout=15)
print(repr(r.text[:100]))
Output:
'\x8b\xa1\x03\x00\x11Zw\x1f\xc2\xa1...'
That's not HTML. That's compressed binary.
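A small sniffing helper makes this diagnosis faster next time. Here's a sketch (the function and its rules are mine, not from any library): gzip and zlib streams have recognizable leading bytes, but brotli deliberately has no magic number, so the best you can do is rule the other formats out.

```python
def classify_body(sample: bytes) -> str:
    """Rough guess at what a response body actually is, from its first bytes."""
    if sample[:2] == b"\x1f\x8b":
        return "gzip stream (undecoded)"
    if sample[:1] == b"\x78":
        return "zlib/deflate stream (undecoded, probably)"
    if sample.lstrip()[:1] == b"<":
        return "looks like HTML/XML"
    return "unknown binary (brotli has no magic bytes, so it lands here)"
```

Calling `classify_body(r.content[:40])` on the dump above would have said "unknown binary" instead of leaving me staring at escape sequences.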
The bug
My headers had:
HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}
That Accept-Encoding: gzip, deflate, br advertises that my client can decode three encodings: gzip, deflate, and brotli.
requests auto-decodes gzip and deflate. It does NOT auto-decode brotli unless you pip install brotli (or brotlicffi). Without that, r.text returns the raw brotli-compressed bytes decoded as latin-1, which looks like the mojibake above.
The server sees the br in my Accept-Encoding, picks brotli (modern servers often prefer it because it typically compresses better than gzip), sends a brotli-encoded response, and my code fails silently because requests quietly passes the undecoded bytes through as r.text.
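You can reproduce this failure mode without any network at all. Here's a sketch using stdlib gzip as a stand-in (undecoded brotli behaves the same way): compressed bytes decoded as latin-1 are a perfectly valid str, so nothing raises, and no regex written for HTML will ever match it.

```python
import gzip

html = b"<html><body>47 job listings</body></html>"
compressed = gzip.compress(html)

# What r.text looks like when the body is passed through undecoded:
# latin-1 maps every byte to some character, so decoding never fails.
mojibake = compressed.decode("latin-1")

assert "<html" not in mojibake             # a regex over this finds nothing
assert gzip.decompress(compressed) == html  # the content was there all along
```

That's why there was no exception anywhere: every step "succeeded", just on the wrong bytes.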
The fix
Two options:
Option 1 — install brotli support:
pip install brotli
Now requests knows how to decode all three encodings, and the server's response comes back as HTML like you expect.
Option 2 — don't advertise what you can't decode:
HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "identity",  # or just "gzip, deflate"
    "Accept-Language": "en-US,en;q=0.9",
}
identity means 'send the body unencoded'. Servers generally honor it. Slightly more bandwidth, zero extra dependencies.
I went with Option 2 because the scrapers are lightweight and the bandwidth delta is not meaningful at my volume.
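If you want the compression but don't want to hard-require the dependency, a hypothetical middle ground (my own sketch, not a requests feature) is to build the header from what the environment can actually decode, so the advertisement can never outrun the client:

```python
def safe_accept_encoding() -> str:
    """Advertise only the content encodings this client can actually decode."""
    encodings = ["gzip", "deflate"]  # requests/urllib3 handle these natively
    try:
        import brotli  # noqa: F401 -- brotlicffi satisfies requests too
        encodings.append("br")
    except ImportError:
        pass
    return ", ".join(encodings)
```

Then set `HEADERS["Accept-Encoding"] = safe_accept_encoding()` and the header stays truthful whether or not brotli is installed.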
Verification
After the fix:
[scrape] pph: parsed 47 job listings
[scrape] guru: parsed 18 job listings
Zero to 65 jobs per scan. Exactly the same code, one header changed.
The broader lesson
The real bug here was my scraper. Not the header — the scraper.
It should have:
- Detected that the response body didn't parse as HTML
- Logged a distinguishable error
- Refused to silently return an empty list
Instead it trusted r.text, ran a regex over gibberish, got zero matches, and cheerfully logged 'parsed 0 job listings' as if that were a normal outcome.
Here's the validation I added:
import requests

def fetch_html(url, headers):
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    body = r.text
    # Sanity: response should look like HTML
    if "<html" not in body.lower() and "<body" not in body.lower():
        sample = repr(r.content[:40])
        raise ValueError(f"Response doesn't look like HTML; first 40 bytes: {sample}")
    return body
If I'd had that from day one, the bug would have been a single loud error in the log. Instead it was seven days of quiet failure.
Small gotcha, big impact
One header. Seven days. 65 jobs a scan I wasn't getting. The class of bugs where the symptom is 'nothing, except no results' is always worth a defensive check at the boundary — treat empty results as suspicious, not as a valid answer.
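Concretely, the same idea applied to the parse step (a sketch; parse_jobs and the pattern are illustrative, not my real scraper code): zero matches on a non-trivial body is an error until proven otherwise.

```python
import re

# Illustrative pattern; a real scraper's regex would be site-specific.
JOB_PATTERN = re.compile(r'<a[^>]+class="job-title"[^>]*>', re.I)

def parse_jobs(body: str, source: str) -> list:
    jobs = JOB_PATTERN.findall(body)
    if not jobs and len(body) > 1000:
        # A sizeable listing page that matches nothing is suspicious, not "0 jobs".
        raise ValueError(f"[{source}] 0 matches on a {len(body)}-byte page")
    return jobs
```

With this in place, "parsed 0 job listings" can only appear in the log when the page is genuinely small and empty, not when the parser is chewing on compressed bytes.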