Lucas Gragg
My scraper returned 0 results for a week. The bug was one HTTP header.

Posting this because the symptom was weird and the fix was one line, and I want the next person who hits it to find the answer faster than I did.

The symptom

I run a freelance-job scraper against two public sites: PeoplePerHour and Guru.com. Both public listing pages, both HTML, nothing fancy.

For seven days, the scraper logged:

[scrape] pph: parsed 0 job listings
[scrape] guru: parsed 0 job listings

No exceptions. No 4xx or 5xx responses. Both endpoints returned 200 OK. My regex over the page was just returning an empty list.

The diagnostic

First, I assumed the HTML structure had changed. I pulled the page in a browser, viewed source, and grabbed the actual anchor pattern. It matched what my regex expected.

Then I dumped the response body my scraper was getting:

r = requests.get(url, headers=HEADERS, timeout=15)
print(repr(r.text[:100]))

Output:

'\x8b\xa1\x03\x00\x11Zw\x1f\xc2\xa1...'

That's not HTML. That's compressed binary.
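With hindsight, a magic-byte sniff on the first few bytes would have named the problem immediately: gzip and zlib both have recognizable headers, while brotli notably has no magic number at all. A minimal sketch (the function name is mine):

```python
def sniff_body(body: bytes) -> str:
    """Guess what a response body actually is from its leading bytes."""
    if body[:2] == b"\x1f\x8b":
        return "gzip"                    # gzip magic number
    if body[:1] == b"\x78":
        return "zlib/deflate"            # common zlib header byte
    if body.lstrip()[:1] == b"<":
        return "html/xml"                # looks like markup
    return "unknown (possibly brotli, which has no magic number)"
```

Running it on the dump above lands in the "unknown" bucket, which is itself a strong hint once you know brotli is the one encoding with no signature.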

The bug

My headers had:

HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}

That Accept-Encoding: gzip, deflate, br advertises that my client can decode three encodings: gzip, deflate, and brotli.

requests auto-decodes gzip and deflate out of the box. It does NOT auto-decode brotli unless the brotli (or brotlicffi) package is installed. Without one of those, the compressed bytes pass through untouched, and r.text decodes them with a text codec (latin-1 here), which produces the mojibake above.

The server sees the br in my Accept-Encoding, picks brotli (it's more efficient than gzip, so modern servers prefer it), sends a brotli-encoded response, and my code fails silently because requests quietly passes through the undecoded bytes as r.text.
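The failure mode is easy to reproduce with only the standard library. In this sketch gzip stands in for brotli (the stdlib can't produce brotli), but the mechanism is identical: compressed bytes get treated as text, the regex finds nothing, and no exception is ever raised.

```python
import gzip
import re

# A page-like payload with a repeated listing anchor.
html = '<a class="job-link" href="/jobs/123">Python dev</a>' * 20
compressed = gzip.compress(html.encode("utf-8"))

# What my scraper effectively did: decode still-compressed bytes as text.
# latin-1 maps every possible byte to a character, so this never errors.
as_text = compressed.decode("latin-1")
pattern = re.compile(r'class="job-link"')
print(len(pattern.findall(as_text)))   # 0 matches, no exception

# What should happen: decompress first, then parse.
decoded = gzip.decompress(compressed).decode("utf-8")
print(len(pattern.findall(decoded)))   # 20
```

The first print is the whole bug in miniature: zero matches looks exactly like "no jobs posted today".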

The fix

Two options:

Option 1 — install brotli support:

pip install brotli

Now requests knows how to decode all three encodings, and the server's response comes back as HTML like you expect.

Option 2 — don't advertise what you can't decode:

HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "identity",  # or just "gzip, deflate"
    "Accept-Language": "en-US,en;q=0.9",
}

identity means 'no content coding at all, send the bytes as-is'. Well-behaved servers honor it. Slightly more bandwidth, zero dependencies.

I went with Option 2 because the scrapers are lightweight and the bandwidth delta is not meaningful at my volume.
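If you want compression without the dependency footgun, a middle path is to build the header from what's actually importable, so you never advertise an encoding you can't decode. A sketch (variable names are mine; requests does something similar internally when you don't override Accept-Encoding):

```python
# Advertise only the encodings this environment can actually decode.
codings = ["gzip", "deflate"]
try:
    import brotli  # noqa: F401 -- brotlicffi works too
    codings.append("br")
except ImportError:
    pass

ACCEPT_ENCODING = ", ".join(codings)

HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": ACCEPT_ENCODING,
    "Accept-Language": "en-US,en;q=0.9",
}
```

This way the header stays honest whether or not brotli is installed on the box running the scraper.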

Verification

After the fix:

[scrape] pph: parsed 47 job listings
[scrape] guru: parsed 18 job listings

Zero to 65 jobs per scan. Exactly the same code, one header changed.

The broader lesson

The real bug here was my scraper. Not the header — the scraper.

It should have:

  1. Detected that the response body didn't parse as HTML
  2. Logged a distinguishable error
  3. Refused to silently return an empty list

Instead it trusted r.text, ran a regex over gibberish, got zero matches, and cheerfully logged 'parsed 0 job listings' as if that were a normal outcome.

Here's the validation I added:

import requests

def fetch_html(url, headers):
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    body = r.text
    # Sanity: response should look like HTML
    if "<html" not in body.lower() and "<body" not in body.lower():
        sample = repr(r.content[:40])
        raise ValueError(f"Response doesn't look like HTML. first 40 bytes: {sample}")
    return body

If I'd had that from day one, the bug would have been a single loud error in the log. Instead it was seven days of quiet failure.
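That sanity check is pure string logic, so it can be exercised without any network. A quick sketch (the helper name is mine; it mirrors the in-line check in fetch_html, and it's deliberately cheap rather than a real HTML parser):

```python
def looks_like_html(body: str) -> bool:
    """Same heuristic as fetch_html: does the body contain an html/body tag?"""
    low = body.lower()
    return "<html" in low or "<body" in low

# A real page passes; compressed-bytes mojibake fails loudly.
print(looks_like_html("<!DOCTYPE html><HTML lang='en'>...</HTML>"))  # True
print(looks_like_html("\x8b\xa1\x03\x00\x11Zw\x1f"))                 # False
```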

Small gotcha, big impact

One header. Seven days. 65 jobs a scan I wasn't getting. Bugs whose only symptom is 'no errors, just no results' are always worth a defensive check at the boundary: treat empty results as suspicious, not as a valid answer.
