Nico Reyes
Scraper worked on my laptop. Deployed to server and got instant 403s.

Wrote a scraper last week for product data. Tested it locally, worked fine. Collected 200 products, zero issues. Deployed to my VPS Friday night thinking I could run it on a cron and forget about it.

Saturday morning I check the logs. Every single request: 403 Forbidden. Zero data collected.

Fun times.

What broke

Turns out the target site was checking the User-Agent header. On my laptop, my requests had been going out with a normal browser User-Agent because I was using Playwright for something else and had set one globally in my profile.

The server? Fresh Ubuntu install. The default Python requests User-Agent looks like this:

python-requests/2.31.0

Site took one look at that and said no thanks.
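You can see exactly what your scraper advertises without hitting any site. requests exposes its defaults through `requests.utils.default_headers()`:

```python
import requests

# Inspect the headers requests sends when you don't override anything
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # e.g. python-requests/2.31.0
```

If that string shows up in your server logs, it's the library version talking, not a browser. Worth running once on a fresh box before blaming the site.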

Fixed it

Added a custom User-Agent to the request headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get('https://example.com/products', headers=headers)

if response.status_code == 200:
    # Parse the data
    products = response.json()
else:
    print(f"Failed: {response.status_code}")

That fixed it. Site started returning 200s again.

Other things that sometimes matter

Besides User-Agent, sites sometimes check:

Referer header. Some sites want to see where you came from. If you're hitting an API endpoint directly without browsing the site first, they block you.

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com/'
}

Accept headers. Real browsers send these. Scrapers often don't.

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br'
}

Most of the time just User-Agent is enough. But when it's not, adding these usually works.
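If you end up setting several headers, a `requests.Session` keeps you from repeating them on every call. This is just a sketch of that pattern; the header values are the same browser-style ones from above:

```python
import requests

# Set browser-like headers once on a Session; every request through it
# reuses them, and cookies from earlier responses persist automatically
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://example.com/',
})

# response = session.get('https://example.com/products')
```

The cookie persistence is a nice side effect: some sites set a cookie on the first page load and block requests that come back without it.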

Still check response.status_code though. Saves you from weird parsing errors when the site just blocked you and you're trying to parse an error page as JSON.
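A small guard function makes that check hard to forget. `parse_products` is a hypothetical helper, not part of the scraper above; it only calls `.json()` when the status and Content-Type both look right:

```python
def parse_products(response):
    """Return parsed JSON only if the site actually answered with JSON.

    A 403 block page or an HTML error page would otherwise blow up
    with a confusing JSONDecodeError deep in your parsing code.
    """
    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None
    if 'application/json' not in response.headers.get('Content-Type', ''):
        print("Got a non-JSON body (probably a block or error page)")
        return None
    return response.json()
```

Call it with any requests response: `products = parse_products(response)`, then check for `None` before looping over the data.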
