Careful when you Scrape

Jelena ・8 min read

Have you ever run into something like this:

“Unfortunately, our system has blocked your request for this page. We had to do this because we routinely get crawlers and bots overloading our system....
...
...
The IP address you are using comes from a bad group of IP addresses: we had so much trouble with it that we’ve blocked the entire range...
...
...
We are sorry we had to do this, but we get so many badly behaved bots who absolutely slam our servers and ruin it for everyone else....”

We don't want this, right?

There are a few things we need to keep in mind when building scraping projects.
If we don't pay attention to them, we can get blocked by the website we are scraping, or we can make that website unstable by sending far too many requests... or both.
So we want to take all the necessary steps to prevent that from happening.
But remember: even if we take every step we can, there is still a chance of getting blocked, since technology is changing constantly.

There are things I have tried and things I haven't.
So if any of you have experimented with something else, feel free to share your experience in the comment section. I would be happy to read what you have discovered, since there are (as usual) many possibilities.

IP Address

Some of you may use a VPN, some of you may not.
In this context, a VPN helps you change your IP address.
Why is this relevant?
Occasionally, your IP address can be blocked by a website you want to visit.
In some cases, the reason is suspicious activity on your side (for example, scraping).
In other cases, you are not the problem at all, yet 'your IP address has been blocked' anyway.

In those situations, you can simply switch to another VPN server, so your IP address changes and you can access the website you want.
But what if someone really needs that website and doesn't use a VPN or isn't into programming at all, yet that user's IP address is blocked?

Because of situations like these, good scraping practice is desirable: we don't want to cause problems.
So, while scraping, one of the things you can do is change your IP address from time to time.
How you implement this depends mostly on the project you are writing.
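
For example, here is a minimal sketch of what rotating IP addresses could look like with the requests library, assuming you have access to a pool of proxy servers (the proxy addresses and the URL below are just placeholders):

import random

import requests


# Placeholder proxy addresses: use proxies you actually have access to.
proxies_list = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

# Pick a proxy at random for each request, so the website
# sees requests coming from different IP addresses.
proxy = random.choice(proxies_list)
page = requests.get(
    "https://example.com/",
    proxies={"http": proxy, "https": proxy},
)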

User-Agent

Let's start with what a User-Agent is.
When you send a request to a web page, the website sees the headers of your request, and one of those headers is your User-Agent.
Basically, the User-Agent helps servers and websites identify you.
One of its roles is to let the website adjust its content to the browser or device you are using.

You can type 'What is my User-Agent' into your browser and see what yours is.
The websites you visit see it as well.

So, why is this important?
When you, as a human being, go to a website, the headers look something like this (depending on the device, browser, OS, etc.):

Accept: text/html,....
Accept-Encoding: gzip,...
Accept-Language: en-US...
User-Agent: Mozilla/5.0 (X11; Linux x86_64) ....
SEC-CH-UA: "Google Chrome";v="89", ....
SEC-FETCH-SITE: none...
SEC-FETCH-USER: ?1...

But when you are scraping a website, this is not the case.
If you are scraping with Python (BeautifulSoup), you will most likely also use the requests module to send the request to the desired web page.

In that case, the header will be different.
So let’s take an example from my previous post.

We have a file called scraper.py and a testing website to scrape:

import requests


page = requests.get("http://books.toscrape.com/")

Now, if we check the page's headers in our terminal, we get:

page.request.headers
>>> {'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Here we can see that the User-Agent is problematic.
This is what the website we are scraping sees.
Nowadays, websites are equipped to detect scrapers, and this website can see that we are using a Python library.
So the website knows there is no human behind the request.
If we leave it like this, we are one step closer to getting blocked.

What can we do?
Fake the User-Agent.
Let's try something like this:

import requests


user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
headers = {"User-Agent": user_agent}

page = requests.get("http://books.toscrape.com/", headers=headers)

Cool.

Let's see what we get when we check the headers in the terminal:

page.request.headers
>>>{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Now we can see that the User-Agent no longer reveals that we are using the requests library. It looks as if an actual person accessed the website.

So, what did we do?
First, we created a variable called user_agent which contains a User-Agent from an actual person/device.
Then we created a headers dictionary, put our user_agent in it, and passed it along with the request our scraper is sending.

So now, when we send a request to the website, a proper User-Agent is shown.

We have just shown how to fake the User-Agent in the request's headers.

An even smarter thing to do is to fake the entire headers, not just the User-Agent.

Headers

Why should we fake the entire headers?

We have seen what the headers look like when there is a real person behind the request.

We have also seen what the headers look like when we use the requests module as-is.

And we have seen what the headers look like when we fake only the User-Agent.

Let's not forget that the website can see all of this too, and websites will often block you because your request doesn't look like a human request.

So now, we will fake the entire headers.

Let's change our code a bit more:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0', 
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0'}

page = requests.get("http://books.toscrape.com/", headers=headers)

Aaand, when we check what we are now sending to the web page we are scraping, we get:

page.request.headers
>>>{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Connection': 'keep-alive', 'Accept-Language': 'en-US,en;q=0.5', 'Upgrade-Insecure-Requests': '1', 'Cache-Control': 'max-age=0'}

Here, we removed the user_agent variable and changed the headers dictionary to include data that is more realistic for a request with a human behind it.

The exact content of the headers differs depending on the OS, browser, device, etc., but faking it is another thing we should do to reduce the chance of being blocked by a website.

Ideally, when we are scraping a web page, we will rotate the headers: every time we send a request to the page, we send a different set of headers.

Now we will make our code a bit nicer by including multiple headers that will be rotated.

Rotating Headers

Here we want to send different headers each time we send a request to the web page.
Let's change our code:

import requests
import random


headers_list = [
    # 1. Headers
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0', 
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0'},
    # 2. Headers
    {'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.82 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,'
                  'image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'none',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-language': 'en-US,en;q=0.9'},
]

headers = random.choice(headers_list)
page = requests.get("http://books.toscrape.com/", headers=headers)

Now we have created headers_list.
Each time we send a request to the website, we use the random library to select a random set of headers from our list and send it along with the request.

Now that we have changed the code, let's check the headers:

page.request.headers
>>>{'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Connection': 'keep-alive', 'Accept-Language': 'en-US,en;q=0.5', 'Upgrade-Insecure-Requests': '1', 'Cache-Control': 'max-age=0'}

Let's send another request to see if the headers have changed:

page.request.headers
>>>{'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"', 'sec-ch-ua-mobile': '?0', 'upgrade-insecure-requests': '1', 'sec-fetch-site': 'none', 'sec-fetch-mode': 'navigate', 'sec-fetch-user': '?1', 'sec-fetch-dest': 'document', 'accept-language': 'en-US,en;q=0.9'}

Nice, we can see our request randomly sends different headers.
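
To see the rotation in action inside an actual scraping loop, here is a minimal sketch that reuses the headers_list defined above and picks a fresh set of headers for every request (the catalogue URLs are just example pages from the same testing site):

import random

import requests


# headers_list is the same list of header dictionaries defined above.
urls = [
    "http://books.toscrape.com/catalogue/page-1.html",
    "http://books.toscrape.com/catalogue/page-2.html",
]

for url in urls:
    # Choose a new set of headers for every single request.
    headers = random.choice(headers_list)
    page = requests.get(url, headers=headers)
    print(page.request.headers["User-Agent"])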

Adding sleep()

I haven’t used this tip, but it would be a good thing to implement.

Some web scrapers send a lot of requests in a short period of time.
The websites being scraped can detect that as suspicious activity.

In those situations, a smart decision would be to mimic a human being.

What do I mean by that?
No human being spends exactly the same amount of time between two clicks while browsing a page.
We look around the page, check where to click next, read, and so on.
So you can use Python's time module (sleep()) to pause your scraper for a few seconds between requests, as in the sketch below.
You can also make your scraper 'accidentally' click some random link on the website.
Like 'oops, didn't want to go there... let's go back'.
Humans make those mistakes; bots don't... unless you make them look more like a human.
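
Here is a minimal sketch of what that pause could look like, using time.sleep() with a random delay between requests; the 2-5 second range is an arbitrary choice, and urls and headers_list are the same kind of objects as in the rotating-headers example above:

import random
import time

import requests


for url in urls:  # urls and headers_list as in the previous example
    page = requests.get(url, headers=random.choice(headers_list))
    # Pause for a random 2-5 seconds so the requests don't arrive
    # at a perfectly regular, bot-like pace.
    time.sleep(random.uniform(2, 5))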

Finally...

As I said, technology moves fast, so we can expect websites to get even better at detecting bots.
But some precautionary measures should be put in place, simply because we don't want to cause problems for ourselves, for the websites, or for other users.

If you have some opinions, tips, or experiences, let me know!
Cheers!

Discussion (2)

Pavel Morava

The way you put it is similar to online advice on how to annoy people.

You should read robots.txt and respect it, at least in theory.

I would recommend reading this or similar articles before conducting such actions.

tutorialspoint.com/python_web_scra...

Moreover, you may get your IP on some blacklist which might be troublesome.

Jelena (Author)

Ah yes, robots.txt. I didn't mention that in the post.
Before I started learning about scraping, I visited a website that had blocked an entire range of IP addresses (including mine) because of badly behaved bots.
Scraping can definitely cause a lot of trouble for many people and companies.
Once again, thanks for the feedback :)
