Scraper worked on my laptop. Deployed to server and got instant 403s.
Wrote a scraper last week for product data. Tested it locally, worked fine. Collected 200 products, zero issues. Deployed to my VPS Friday night thinking I could run it on a cron and forget about it.
Saturday morning I check the logs. Every single request: 403 Forbidden. Zero data collected.
Fun times.
What broke
Turns out the target site was checking User-Agent. On my laptop the requests were going out with a normal browser User-Agent, because I'd been using Playwright for something else and had set one globally in my profile.
The server? Fresh Ubuntu install. Default Python requests User-Agent looks like this:
python-requests/2.31.0
Site took one look at that and said no thanks.
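If you want to see what your own environment sends before you deploy, you can ask requests directly. This is just a quick sketch, and it makes no network calls. The version number printed depends on whatever requests version is installed on that machine.

```python
import requests

# The User-Agent string requests uses when you don't set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A fresh Session carries the same value in its default headers.
session = requests.Session()
print(session.headers["User-Agent"])
```

Running this on both the laptop and the server would have shown the difference right away.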
Fixed it
Added a custom User-Agent to the request headers:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get('https://example.com/products', headers=headers)
if response.status_code == 200:
    # Parse the data
    products = response.json()
else:
    print(f"Failed: {response.status_code}")
That fixed it. Site started returning 200s again.
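One variation I'd consider for a cron job: set the header once on a Session so every request picks it up, instead of passing headers to each call. A minimal sketch, assuming the same example.com URL as above. The make_session helper and the BROWSER_UA constant are names I made up for illustration. It builds the request without sending it, so you can check what would go out.

```python
import requests

# Same browser User-Agent string as in the fix above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Hypothetical helper: a Session that sends user_agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_session()

# Prepare the request without sending it, to inspect the final headers.
request = requests.Request("GET", "https://example.com/products")
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])
```

The upside is that the User-Agent lives in the code and not in whatever happens to be configured on the machine, which is exactly what bit me here.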
Other things that sometimes matter
Besides User-Agent, sites sometimes check:
Referer header. Some sites want to see where you came from. If you're hitting an API endpoint directly without browsing the site first, they block you.
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com/'
}
Accept headers. Real browsers send these. Scrapers often don't.
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'br' only works if the brotli (or brotlicffi) package is installed; otherwise drop it
    'Accept-Encoding': 'gzip, deflate, br'
}
Most of the time just User-Agent is enough. But when it's not, adding these usually works.
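Here's roughly how I'd organize that escalation: start with just User-Agent, then add Referer, then the Accept set. This is a sketch, not something the site requires. build_headers and its parameters are hypothetical names, and I left 'br' out of Accept-Encoding on the assumption that brotli isn't installed.

```python
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_headers(referer=None, browser_accept=False):
    """Hypothetical helper: User-Agent always, the rest only when asked for."""
    headers = {"User-Agent": BROWSER_UA}
    if referer:
        headers["Referer"] = referer
    if browser_accept:
        headers.update({
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            # No 'br' here: assumes the brotli package is not installed.
            "Accept-Encoding": "gzip, deflate",
        })
    return headers


# Escalation order: bare User-Agent, then Referer, then the Accept headers.
attempts = [
    build_headers(),
    build_headers(referer="https://example.com/"),
    build_headers(referer="https://example.com/", browser_accept=True),
]
for headers in attempts:
    # Each dict can be passed as requests.get(url, headers=headers).
    print(sorted(headers))
```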
Still check response.status_code though. Saves you from weird parsing errors when the site just blocked you and you're trying to parse an error page as JSON.
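A small guard like this is what I mean. It's a sketch that works on any response object from requests. The function name parse_json_or_none is made up, and the Content-Type check is an extra assumption on my part (that the endpoint labels its JSON responses as JSON).

```python
def parse_json_or_none(response):
    """Return the decoded JSON body, or None if the response looks like a block page."""
    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None

    # A block page is usually HTML, even when the endpoint normally returns JSON.
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        print(f"Expected JSON, got {content_type!r}")
        return None

    try:
        return response.json()
    except ValueError:
        # JSON decode errors from requests and the json module subclass ValueError.
        print("Body was not valid JSON")
        return None
```

Usage would be something like products = parse_json_or_none(requests.get(url, headers=headers)), and then skipping the run if it comes back None.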