Posting this because the symptom was weird and the fix was one line, and I want the next person who hits it to find the answer faster than I did.
The symptom
I run a freelance-job scraper against two public sites: PeoplePerHour and Guru.com. Both public listing pages, both HTML, nothing fancy.
For seven days, the scraper logged:
[scrape] pph: parsed 0 job listings
[scrape] guru: parsed 0 job listings
No exceptions. No 4xx or 5xx responses. Both endpoints returned 200 OK. My regex over the page was just returning an empty list.
The diagnostic
First, I assumed the HTML structure had changed. I pulled the page up in a browser, viewed source, and grabbed the actual anchor pattern. It matched what my regex expected.
Then I dumped the response body my scraper was getting:
r = requests.get(url, headers=HEADERS, timeout=15)
print(repr(r.text[:100]))
Output:
'\x8b\xa1\x03\x00\x11Zw\x1f\xc2\xa1...'
That's not HTML. That's compressed binary.
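A small sniffing helper makes this diagnosis faster next time. Here's a sketch (the function and its rules are mine, not from any library): gzip and zlib streams have recognizable leading bytes, but brotli deliberately has no magic number, so the best you can do is rule the other formats out.

```python
def classify_body(sample: bytes) -> str:
    """Rough guess at what a response body actually is, from its first bytes."""
    if sample[:2] == b"\x1f\x8b":
        return "gzip stream (undecoded)"
    if sample[:1] == b"\x78":
        return "zlib/deflate stream (undecoded, probably)"
    if sample.lstrip()[:1] == b"<":
        return "looks like HTML/XML"
    return "unknown binary (brotli has no magic bytes, so it lands here)"
```

Calling `classify_body(r.content[:40])` on the dump above would have said "unknown binary" instead of leaving me staring at escape sequences.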
The bug
My headers had:
HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
}
That Accept-Encoding: gzip, deflate, br advertises that my client can decode three encodings: gzip, deflate, and brotli.
requests auto-decodes gzip and deflate. It does NOT auto-decode brotli unless you pip install brotli (or brotlicffi). Without that, r.text returns the raw brotli-compressed bytes decoded as latin-1, which looks like the mojibake above.
The server sees the br in my Accept-Encoding, picks brotli (modern servers often prefer it because it typically compresses better than gzip), sends a brotli-encoded response, and my code fails silently because requests quietly passes the undecoded bytes through as r.text.
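You can reproduce this failure mode without any network at all. Here's a sketch using stdlib gzip as a stand-in (undecoded brotli behaves the same way): compressed bytes decoded as latin-1 are a perfectly valid str, so nothing raises, and no regex written for HTML will ever match it.

```python
import gzip

html = b"<html><body>47 job listings</body></html>"
compressed = gzip.compress(html)

# What r.text looks like when the body is passed through undecoded:
# latin-1 maps every byte to some character, so decoding never fails.
mojibake = compressed.decode("latin-1")

assert "<html" not in mojibake             # a regex over this finds nothing
assert gzip.decompress(compressed) == html  # the content was there all along
```

That's why there was no exception anywhere: every step "succeeded", just on the wrong bytes.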
The fix
Two options:
Option 1 — install brotli support:
pip install brotli
Now requests knows how to decode all three encodings, and the server's response comes back as HTML like you expect.
Option 2 — don't advertise what you can't decode:
HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "identity",  # or just "gzip, deflate"
    "Accept-Language": "en-US,en;q=0.9",
}
identity means 'send the body unencoded'. Servers generally honor it. Slightly more bandwidth, zero extra dependencies.
I went with Option 2 because the scrapers are lightweight and the bandwidth delta is not meaningful at my volume.
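If you want the compression but don't want to hard-require the dependency, a hypothetical middle ground (my own sketch, not a requests feature) is to build the header from what the environment can actually decode, so the advertisement can never outrun the client:

```python
def safe_accept_encoding() -> str:
    """Advertise only the content encodings this client can actually decode."""
    encodings = ["gzip", "deflate"]  # requests/urllib3 handle these natively
    try:
        import brotli  # noqa: F401 -- brotlicffi satisfies requests too
        encodings.append("br")
    except ImportError:
        pass
    return ", ".join(encodings)
```

Then set `HEADERS["Accept-Encoding"] = safe_accept_encoding()` and the header stays truthful whether or not brotli is installed.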
Verification
After the fix:
[scrape] pph: parsed 47 job listings
[scrape] guru: parsed 18 job listings
Zero to 65 jobs per scan. Exactly the same code, one header changed.
The broader lesson
The real bug here was my scraper. Not the header — the scraper.
It should have:
- Detected that the response body didn't parse as HTML
- Logged a distinguishable error
- Refused to silently return an empty list
Instead it trusted r.text, ran a regex over gibberish, got zero matches, and cheerfully logged 'parsed 0 job listings' as if that were a normal outcome.
Here's the validation I added:
import requests

def fetch_html(url, headers):
    r = requests.get(url, headers=headers, timeout=15)
    r.raise_for_status()
    body = r.text
    # Sanity: response should look like HTML
    if "<html" not in body.lower() and "<body" not in body.lower():
        sample = repr(r.content[:40])
        raise ValueError(f"Response doesn't look like HTML; first 40 bytes: {sample}")
    return body
If I'd had that from day one, the bug would have been a single loud error in the log. Instead it was seven days of quiet failure.
Small gotcha, big impact
One header. Seven days. 65 jobs a scan I wasn't getting. The class of bugs where the symptom is 'nothing, except no results' is always worth a defensive check at the boundary — treat empty results as suspicious, not as a valid answer.
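Concretely, the same idea applied to the parse step (a sketch; parse_jobs and the pattern are illustrative, not my real scraper code): zero matches on a non-trivial body is an error until proven otherwise.

```python
import re

# Illustrative pattern; a real scraper's regex would be site-specific.
JOB_PATTERN = re.compile(r'<a[^>]+class="job-title"[^>]*>', re.I)

def parse_jobs(body: str, source: str) -> list:
    jobs = JOB_PATTERN.findall(body)
    if not jobs and len(body) > 1000:
        # A sizeable listing page that matches nothing is suspicious, not "0 jobs".
        raise ValueError(f"[{source}] 0 matches on a {len(body)}-byte page")
    return jobs
```

With this in place, "parsed 0 job listings" can only appear in the log when the page is genuinely small and empty, not when the parser is chewing on compressed bytes.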