Nico Reyes
Scraper worked on my laptop. Deployed to server and got instant 403s.

Wrote a scraper last week for product data. Tested it locally, worked fine. Collected 200 products, zero issues. Deployed to my VPS Friday night thinking I could run it on a cron and forget about it.

Saturday morning I check the logs. Every single request: 403 Forbidden. Zero data collected.

Fun times.

What broke

Turns out the target site was checking the User-Agent header. On my laptop, my requests had been going out with a normal browser User-Agent because I was using Playwright for something else and had set one globally in my profile.

The server? Fresh Ubuntu install. The default Python requests User-Agent looks like this:

python-requests/2.31.0

Site took one look at that and said no thanks.
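You can see exactly what your scraper advertises without hitting any site. requests exposes its defaults through `requests.utils.default_headers()`:

```python
import requests

# Inspect the headers requests sends when you don't override anything
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # e.g. python-requests/2.31.0
```

If that string shows up in your server logs, it's the library version talking, not a browser. Worth running once on a fresh box before blaming the site.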

Fixed it

Added a custom User-Agent to the request headers:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get('https://example.com/products', headers=headers)

if response.status_code == 200:
    # Parse the data
    products = response.json()
else:
    print(f"Failed: {response.status_code}")

That fixed it. Site started returning 200s again.

Other things that sometimes matter

Besides User-Agent, sites sometimes check:

Referer header. Some sites want to see where you came from. If you're hitting an API endpoint directly without browsing the site first, they block you.

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com/'
}

Accept headers. Real browsers send these. Scrapers often don't.

headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br'
}

Most of the time just User-Agent is enough. But when it's not, adding these usually works.
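If you end up setting several headers, a `requests.Session` keeps you from repeating them on every call. This is just a sketch of that pattern; the header values are the same browser-style ones from above:

```python
import requests

# Set browser-like headers once on a Session; every request through it
# reuses them, and cookies from earlier responses persist automatically
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://example.com/',
})

# response = session.get('https://example.com/products')
```

The cookie persistence is a nice side effect: some sites set a cookie on the first page load and block requests that come back without it.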

Still check response.status_code though. Saves you from weird parsing errors when the site just blocked you and you're trying to parse an error page as JSON.
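A small guard function makes that check hard to forget. `parse_products` is a hypothetical helper, not part of the scraper above; it only calls `.json()` when the status and Content-Type both look right:

```python
def parse_products(response):
    """Return parsed JSON only if the site actually answered with JSON.

    A 403 block page or an HTML error page would otherwise blow up
    with a confusing JSONDecodeError deep in your parsing code.
    """
    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None
    if 'application/json' not in response.headers.get('Content-Type', ''):
        print("Got a non-JSON body (probably a block or error page)")
        return None
    return response.json()
```

Call it with any requests response: `products = parse_products(response)`, then check for `None` before looping over the data.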
