Peter Hansen

7 ways to avoid getting blocked or blacklisted when Web scraping

If you're doing a lot of web scraping, you might eventually get blocked. This is because some websites don't want to be scraped and will take steps to prevent it. However, there are a number of techniques you can use to avoid getting blocked or blacklisted by the website you're scraping. In short:

  1. Use proxy servers or tools that rotate your IP address. There are many web scraping tools available, both free and paid.

  2. Don't scrape too aggressively. If you make too many requests to a website too quickly, you're likely to get blocked. Space out your requests so that they don't look like they're coming from a bot, and make sure to obey any rate limits that the website has in place.

  3. Scrape responsibly: respect the site's robots.txt and don't overload its servers.

By following these tips, you can avoid getting blocked or blacklisted when web scraping. Let's look at each of them in more detail.

1. IP Rotation

If you're serious about web scraping, then you need to be using IP rotation. This is because most websites will block IP addresses that make too many requests in a short period of time. By using IP rotation, you can keep your scraping activities under the radar and avoid getting blocked or blacklisted.

There are a few different ways to rotate IP addresses. One way is to use a proxy server. A proxy server is basically a middleman that routes your requests through a different IP address. This means that the website you're scraping will only see the proxy server's IP address, not your IP address.

Using proxy servers has several benefits. First, it makes it much harder for websites to track and block your activity. Second, it allows you to make more requests in a shorter period of time, since each proxy can have its own IP address. And third, it allows you to rotate your IP address quickly and easily, which is important for avoiding detection and getting blocked.
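
To make this concrete, here is a minimal sketch of rotating requests through a small pool of proxies with the Node.js request library; the proxy URLs and credentials are placeholders you would replace with addresses from your provider.

const request = require('request');

// Placeholder proxy pool -- replace with real proxies from your provider
const proxies = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080'
];

function fetchWithRandomProxy(url, callback) {
    // Pick a random proxy so consecutive requests come from different IPs
    const proxy = proxies[Math.floor(Math.random() * proxies.length)];
    request({ url: url, proxy: proxy }, callback);
}

fetchWithRandomProxy('https://www.amazon.com', (err, res, body) => {
    if (err) return console.error(err);
    console.log(res.statusCode);
});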

Another way to rotate IP addresses is to use a VPN. A VPN encrypts all of your traffic and routes it through a different IP address. This is a bit more secure than using a proxy server, but it can be slower, since your traffic has to be encrypted and decrypted. While this is a reliable approach, not many vendors on the market today offer easy-to-use and affordable options.

Finally, you can also use a service that provides rotating IP addresses. These services usually have a pool of IP addresses that they rotate between users. This is the easiest way to use IP rotation: you simply route your request through the proxy service's URL, something like this.

Your target website is: https://www.amazon.com

Instead of sending your request directly to the target, you send it through the proxy service, for example:

const request = require('request');
request({
    method: 'GET',
    url: 'https://proxybot.io?url=https://www.amazon.com'
}, (err, res, body) => {
    if (err) return console.error(err);
    console.log(body); // HTML of the target page
});

In this case, every request goes through a random proxy server. The target has no idea that all the requests are coming from you, because there is no obvious connection between them.

A list of popular proxy providers can be found here.

2. Set a User-Agent header

A User-Agent is a special type of HTTP header that tells a website what kind of browser you are using. By setting a real user agent, you are less likely to get blocked or blacklisted, because the website will think you are a regular person browsing the internet with a normal browser.

Some websites block requests from User-Agents that don't belong to a major browser. Setting a proper User-Agent for your web crawler matters because most websites want to be indexed by Google and therefore let Googlebot through. In general, setting the User-Agent header leads to more success when web scraping.

Keep the user agent up to date, because the string changes with every browser release, especially for Google Chrome. A list of popular User-Agents can be found here.
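
For instance, with the Node.js request library used earlier, a browser-like User-Agent can be set per request; the string below is just one example and should be swapped for a current one:

const request = require('request');

request({
    url: 'https://www.amazon.com',
    headers: {
        // Example Chrome User-Agent string -- keep it up to date
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
}, (err, res, body) => {
    if (err) return console.error(err);
    console.log(res.statusCode);
});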

3. Set Other HTTP Request Headers

To make web scraping less conspicuous, you can set other request headers as well. The idea is to mimic the headers a real user's browser sends, which makes your scraper look like a regular website visitor. The most important headers are Accept, Accept-Encoding, and Upgrade-Insecure-Requests, which make your requests look like they come from a genuine browser rather than a robot.

Read the full guide on how to use headers for web scraping.

You can also send a Referer header so that it appears as though you found the site through another website (more on this in tip 5).

For example, the headers sent by a recent version of Google Chrome look like this:

"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip",
"Accept-Language": "en-US,en;q=0.9,es;q=0.8",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
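
As a sketch, these headers can be attached to every request with the same request library; the values are copied from the Chrome example above and should be refreshed periodically:

const request = require('request');

// Browser-like headers, reused for every request
const browserHeaders = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Encoding': 'gzip',
    'Accept-Language': 'en-US,en;q=0.9,es;q=0.8',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
};

// gzip: true lets the request library decompress the gzip-encoded response
request({ url: 'https://www.amazon.com', headers: browserHeaders, gzip: true }, (err, res, body) => {
    if (err) return console.error(err);
    console.log(body.length);
});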

4. Randomize the Time Between Your Requests

When web scraping, you might be making requests at a rapid pace, which can get you blocked or blacklisted. To avoid this, add random intervals between your requests so that your traffic pattern looks less like a bot.

Pro tip: A website's robots.txt file may include a Crawl-delay directive, which tells you how long to wait between requests so you don't overwhelm their servers with heavy traffic.
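
Here is a minimal sketch of randomized delays between requests, assuming a list of URLs to fetch and a 2-10 second pause; tune the interval to the target site (and its robots.txt, if it specifies a crawl delay):

const request = require('request');

// Resolve after a random delay between minMs and maxMs milliseconds
function randomDelay(minMs, maxMs) {
    const ms = minMs + Math.random() * (maxMs - minMs);
    return new Promise(resolve => setTimeout(resolve, ms));
}

async function scrapeAll(urls) {
    for (const url of urls) {
        request(url, (err, res, body) => {
            if (err) return console.error(err);
            console.log(url, res.statusCode);
        });
        // Wait 2-10 seconds before firing the next request
        await randomDelay(2000, 10000);
    }
}

scrapeAll(['https://www.example.com/page/1', 'https://www.example.com/page/2']);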

5. Set a Referrer

When you are web scraping, setting a referrer helps you avoid getting blocked or blacklisted. The referrer is the URL of the page you supposedly arrived from, and it is sent as the Referer HTTP request header (note the header's historical one-"r" spelling). There is no wildcard syntax for it: each request carries one concrete URL, typically a page that plausibly links to the one you are scraping, such as the site's homepage or a search engine results page.
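
A minimal sketch, again with the request library; the URLs are placeholders for your target page and for a page that could plausibly link to it:

const request = require('request');

request({
    url: 'https://www.example.com/some-product-page',
    headers: {
        // Pretend we followed a link from the site's homepage
        'Referer': 'https://www.example.com/'
    }
}, (err, res, body) => {
    if (err) return console.error(err);
    console.log(res.statusCode);
});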

6. Use a Headless Browser

Some websites can be tricky to scrape because they check for tiny details such as browser cookies, web fonts, extensions, and JavaScript execution in order to determine whether or not the request is coming from a real user.

If you're planning to do any serious web scraping, you'll need to use a headless browser. A headless browser is a web browser without a graphical user interface (GUI). Headless browsers provide a way to programmatically interact with web pages, and are used in many applications, including web scraping.

There are many advantages to using a headless browser for web scraping. First, headless browsers are much less likely to be detected and blocked by websites. This is because they behave like a real browser: they execute JavaScript, load page resources, handle cookies, and can send the same User-Agent and other headers that a regular browser does, which makes their traffic much harder to single out as automated.

Another advantage of using a headless browser is that it can render JavaScript, which is important for many modern websites. Many website features are powered by JavaScript, and if you want to scrape data from these kinds of sites, you'll need a browser that can execute JavaScript code. Headless browsers can do this, whereas traditional HTTP-based scraping tools often can't.

I have a guide explaining how to scrape data from a website built with a JavaScript framework.

If you're doing any serious web scraping, using a headless browser is essential. Headless browsers are more stealthy and can render JavaScript, which traditional web scraping tools often can't.
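
As an illustration, here is a minimal sketch using Puppeteer, one popular headless-browser library for Node.js (other headless browsers work similarly):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Optionally present a regular Chrome User-Agent instead of the headless default
    await page.setUserAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36');

    // Wait until network activity settles so JavaScript-rendered content is loaded
    await page.goto('https://www.amazon.com', { waitUntil: 'networkidle2' });

    const html = await page.content(); // fully rendered HTML
    console.log(html.length);

    await browser.close();
})();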

7. Avoid hidden traps

You can avoid being blocked by webmasters by checking for invisible links. Some websites detect web crawlers by planting invisible links that only a robot would follow. If you scrape a website and find these, avoid following them. Usually this type of link is hidden with CSS properties like display: none or visibility: hidden. You may also want to check for color-based invisibility, where the link's color is set to the same color as the background, for example color: #fff; on a white page.
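
As a rough sketch, hidden links can be filtered out with cheerio, a common HTML parser for Node.js; this version only checks inline styles, so treat it as a starting point rather than a complete honeypot detector:

const cheerio = require('cheerio');

// Return the hrefs of links that are not obviously hidden via inline styles
function visibleLinks(html) {
    const $ = cheerio.load(html);
    const links = [];
    $('a[href]').each((i, el) => {
        const style = ($(el).attr('style') || '').replace(/\s+/g, '').toLowerCase();
        const hidden = style.includes('display:none') ||
                       style.includes('visibility:hidden');
        if (!hidden) {
            links.push($(el).attr('href'));
        }
    });
    return links;
}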
