How to Crawl a Website Without Getting Blocked: 17 Tips

Collecting publicly available data is hardly possible without web crawling and web scraping. The first indexes web pages and discovers target URLs, while the second extracts the data you need. Unfortunately, these processes can unintentionally harm websites, so websites set up defenses against them.

You must be prepared to tackle anti-scraping measures and know how to crawl a website without getting blocked. Otherwise, your data collection process can become a nightmare and require more resources than necessary. The list below covers the most common tips and a few extra measures on how to avoid bans while web scraping.

Follow the rules in Robots.txt

Robots.txt, also known as the robots exclusion protocol, is the standard way for websites to tell web crawlers and other bots how to behave when visiting. Most importantly, it specifies how frequently bots can send requests to the website and which pages they should not access.

Checking robots.txt is one of the first things to do when web scraping. You can find it by adding “/robots.txt” to the website's root address. It will show you whether you can crawl the website at all. Some exclusion protocols restrict all bots from entering. Others allow only major search engines, such as Google.

Even if the website allows scraping, there will be rules outlined in Robots.txt. These rules are in place to keep the website performing well for regular visitors and protect personal or copyrighted data from being collected.

Most website defenses are made specifically to prevent bots from violating exclusion rules. If you don’t follow them, anti-scraping defenses will see your bot as a threat that can slow down or even crash the website, which can lead to bans.
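
Here's a minimal sketch of such a check using Python's standard library; the site address, target path, and bot name below are placeholders:

```python
# Check robots.txt before crawling, using only the Python standard library.
from urllib import robotparser

TARGET_SITE = "https://example.com"   # placeholder domain
BOT_NAME = "my-crawler"               # hypothetical bot name

parser = robotparser.RobotFileParser()
parser.set_url(f"{TARGET_SITE}/robots.txt")
parser.read()

url = f"{TARGET_SITE}/products/page-1"
if parser.can_fetch(BOT_NAME, url):
    delay = parser.crawl_delay(BOT_NAME)  # None if robots.txt sets no Crawl-delay
    print(f"Allowed to fetch {url}, crawl delay: {delay}")
else:
    print(f"robots.txt disallows fetching {url}")
```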

Use proxies

Tracking IP addresses is one of the most common anti-scraping measures. It allows websites to assign different metrics to individual visitors and discover bots. Many other anti-scraping implementations are possible only if the website can track IP addresses. Therefore, web scraping is almost impossible without switching between IP addresses to appear as multiple different visitors.

Additionally, using a proxy server enables you to bypass geo-restrictions and access content that would otherwise be unavailable in your country. Some websites serve different content depending on the visitor's IP location, but a proxy will let you access data from other regions as well.

Other means of changing your IP address, such as a VPN or the Tor browser, are inferior to proxies when it comes to data collection. They are hard to integrate with most tools, offer poor IP selection, and the practice might even be discouraged by providers.
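
A minimal sketch of routing requests through a proxy with Python's requests library; the proxy address and credentials are placeholders you'd replace with your provider's details:

```python
import requests

# Placeholder proxy endpoint and credentials.
PROXY = "http://username:password@proxy.example.com:8080"

proxies = {"http": PROXY, "https": PROXY}
response = requests.get("https://example.com", proxies=proxies, timeout=30)
print(response.status_code)
```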

Rotate residential IP addresses

IP rotation is when a proxy server automatically assigns a different IP address after a set interval or for every new connection. Switching between proxies manually takes time and puts you at higher risk, as sending too many requests from one IP without rotating will get it banned.

Rotating residential proxies are the safest choice for avoiding bans while web scraping. Such IP addresses originate from physical devices with connections verified by internet service providers. They allow your crawlers and scrapers to blend in with regular visitors, significantly lowering the chance of bans.
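
Most residential proxy providers expose a rotating gateway that changes the exit IP for you, but if you manage your own pool, a simple rotation sketch might look like this (the proxy addresses and URLs are placeholders):

```python
import itertools
import requests

# Hypothetical proxy pool; a provider's rotating gateway usually replaces this
# whole list with a single endpoint.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]
proxy_cycle = itertools.cycle(PROXY_POOL)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    proxy = next(proxy_cycle)  # different exit IP for every request
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    print(url, response.status_code)
```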

Limit the traffic you create

Crawlers and scrapers can put enormous pressure on a website's server, sending requests and creating traffic at the pace of thousands of humans. So much traffic from a single visitor is a clear sign it's a bot. Sometimes websites might allow it, but you should still mind the traffic you create.

If you don't respect the website's limitations, your actions can look like an attack meant to take down the website rather than a peaceful data collection effort. Needless to say, you will lose your progress and get your IP address banned.

There are different ways to limit the traffic a scraper bot creates, and they vary with the scraping software. The most common, however, are delaying requests and crawling during off-peak hours.

Delay requests

Web scrapers can send requests constantly, without any breaks or time for thinking, while humans, even the fastest ones, have to pause for a few seconds between requests. No delay between requests is an obvious sign of a bot to any website that monitors visitor behavior.

However, a delay between requests is not enough; the delays must also be randomized. Sending a request exactly every five seconds or minutes seems suspicious, as no ordinary user browses a website like that. Spreading the requests at random intervals is the best way to imitate human activity.

If you see that your crawler is slowing down, your requests are probably too frequent and you should increase the delay. Robots.txt often specifies a minimum delay via the Crawl-delay directive, but don't use exactly that value every time if you want to blend in better.
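
A minimal sketch of randomized delays with the requests library; the URLs and the delay range are placeholders, so adjust the range to whatever robots.txt and the site's behavior suggest:

```python
import random
import time
import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs
MIN_DELAY, MAX_DELAY = 5, 12  # seconds; keep the lower bound above any Crawl-delay

for url in urls:
    response = requests.get(url, timeout=30)
    print(url, response.status_code)
    time.sleep(random.uniform(MIN_DELAY, MAX_DELAY))  # random pause, not a fixed beat
```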

Choose the right time for scraping

It might seem that the best time for collecting data is whenever it's convenient for you. Unfortunately, that often overlaps with a website's peak hours, as most visitors tend to flock in at around the same time. It is best to avoid scraping during such hours, and not only because it can slow down your progress.

Putting pressure on the server when ordinary users are also visiting in great numbers can lead to lowered performance or even a website crash. Such performance hits might negatively affect the regular user’s experience.

The best time to scrape might not be the most convenient for you, but it can reduce the number of IP blocks significantly. A great way to solve this problem is to test the site at different hours and schedule your web scraper only when the server performs at its best.
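
In practice a scheduler such as cron usually handles this, but as a rough sketch, you could gate the crawl on a hypothetical off-peak window (the hours below are an assumption, not a rule):

```python
import datetime
import time

# Hypothetical off-peak window, in the website's local time zone.
OFF_PEAK_START, OFF_PEAK_END = 2, 6  # 02:00-06:00


def wait_for_off_peak():
    """Sleep until the clock enters the off-peak window."""
    while not (OFF_PEAK_START <= datetime.datetime.now().hour < OFF_PEAK_END):
        time.sleep(600)  # check again in ten minutes


wait_for_off_peak()
# ...start the crawl here...
```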

Avoid predictable crawling patterns

Ordinary visitors browse websites somewhat randomly, looking for information in a trial-and-error fashion. Bots already know the website's structure and can go straight to the needed data. Such crawling patterns are recognizable, and some websites even employ advanced AI technologies to catch them more efficiently.

Changing how your crawler bot travels around the website will help you appear more like a regular user and reduce the probability of bans. Include random actions, such as scrolling and visiting unrelated links or pages that aren't your target. Mix these actions up on every visit to the website, like an ordinary user would.

This tip goes together with randomizing the delays between your requests, as even the most natural browsing pattern will seem suspicious if done too quickly. But while delays can be random, changing the crawling pattern will require you to study the website's layout and find natural ways of browsing.
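
One simple way to break up a predictable pattern is to shuffle the order of target pages and occasionally visit an unrelated page; the URLs below are placeholders and the probabilities are arbitrary:

```python
import random
import time
import requests

target_urls = [f"https://example.com/products/{i}" for i in range(1, 11)]  # placeholders
filler_urls = ["https://example.com/", "https://example.com/about", "https://example.com/blog"]

random.shuffle(target_urls)  # don't walk the site in a predictable order
for url in target_urls:
    requests.get(url, timeout=30)
    if random.random() < 0.3:  # occasionally wander off to an unrelated page
        requests.get(random.choice(filler_urls), timeout=30)
    time.sleep(random.uniform(4, 10))  # randomized pause between actions
```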

Avoid scraping behind logins

Collecting data protected by a login creates a few issues for crawlers. On most websites, logging in requires registering an account and agreeing to its Terms and Conditions. Most of the time, these explicitly state that the use of bots isn't allowed or is restricted to some degree. Similarly, they will forbid data scraping, so scraping there would break an agreement and, possibly, the law.

A lot of information behind a login isn't public either. Scraping copyrighted or personal information could even be against the law. It's best to avoid scraping data behind logins entirely. An exception could be data that is also available without logging in, with something extra displayed only to logged-in users. In that case, the publicly visible data would likely be fair game.

Deal with honeypots

Honeypot traps are security mechanisms that protect websites from online threats by identifying bots, including crawlers and scrapers. Spider honeypots are specifically designed to catch web crawlers: they place links that only bots will see, usually hidden within the code.

If your crawler visits such a link, the website will know it is a bot, and an IP ban will follow. However, implementing efficient honeypot traps requires a lot of knowledge and resources. Most websites can't afford sophisticated decoy systems and use simple tricks that are easy to notice.

Before following any link with your web crawler, check that it is properly visible to regular users. Honeypot links are usually hidden with CSS, so set your scraper to avoid links with “visibility: hidden”, “display: none”, a color matching the background, or similar properties.
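
A rough sketch of such a filter with requests and BeautifulSoup is below. It only checks inline styles and the hidden attribute, so links hidden through external stylesheets would still need a rendered-DOM check; the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup

HIDDEN_MARKERS = ("display:none", "display: none", "visibility:hidden", "visibility: hidden")

html = requests.get("https://example.com", timeout=30).text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")

safe_links = []
for a in soup.find_all("a", href=True):
    style = (a.get("style") or "").lower()
    if any(marker in style for marker in HIDDEN_MARKERS) or a.has_attr("hidden"):
        continue  # likely a honeypot: the link is invisible to human visitors
    safe_links.append(a["href"])

print(safe_links)
```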

Optimize HTTP headers

HTTP request headers provide crucial context for servers when transferring data online. They determine what data has to be sent over and how it must be organized. Most web scraping tools send few or no browser-like headers by default, while browsers send multiple ones to tailor their requests.

Websites can use information in HTTP headers to tell whether the request is coming from a regular visitor or a bot. You should optimize your headers to make your web crawler blend in with other users and avoid bans. Here are some of the most common HTTP request headers to know:

  • The Referer header tells the website the last page you visited.
  • The Accept header instructs the server about the type of media the client accepts.
  • The Accept-Language header informs the server about the client's preferred language.
  • The Accept-Encoding header communicates the acceptable data compression formats.
  • The Host header contains the domain name and port number of the target server.

You should set these headers to look like ones from regular internet users. While optimizing these and other HTTP headers will reduce the risk of bans, one header - the User-Agent - is essential and deserves additional attention.
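
A minimal sketch of browser-like headers with the requests library; the values are examples borrowed from a desktop Firefox setup and aren't guaranteed to match any particular browser build:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",  # pretend the visit came from a search result
}

response = requests.get("https://example.com", headers=headers, timeout=30)
print(response.status_code)
```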

Change User-Agents

The User-Agent header provides the server with information about the operating system and browser used for sending the request. Every regular internet user has a User-Agent set so that websites would load properly. Here's what a User-Agent header could look like:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0

If there is no user agent set, the website might not send you the data at all. The User-Agent is usually the first header checked when detecting bots. It is not enough to set one user agent for your web scraper and forget about it. You should use only the most popular user agents and change them frequently.

Using widespread user agents will let you blend in better, and frequently changing them will create the impression of multiple different users. If you're looking for more tips on optimizing the User-Agent and other HTTP headers, check our article on the topic.
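
A minimal rotation sketch is shown below; the User-Agent strings are examples of common desktop browsers and should be kept up to date in a real setup:

```python
import random
import requests

# Example pool of common desktop User-Agent strings; refresh these regularly.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/16.5 Safari/605.1.15",
]

for url in ["https://example.com/a", "https://example.com/b"]:  # placeholder URLs
    ua = random.choice(USER_AGENTS)  # a different browser identity per request
    response = requests.get(url, headers={"User-Agent": ua}, timeout=30)
    print(url, response.status_code)
```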

Don’t scrape images

Images on websites make them pleasant to the eye and easier to navigate, but that only matters to humans, not bots. Avoid images while web scraping: they are a burden for your bot and can cause unnecessary risks.

The graphical content of a website can take up a lot of space, meaning you will need to store it somewhere if you download it. If images aren't your target, the web scraping process will be much faster without them. Additionally, you might even save some money on bandwidth.

Images can also be copyrighted, and collecting them from a website might be a serious infringement. In case you need any images, make sure to double-check whether you can extract them. But if they are not necessary, don't scrape them.
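
If your crawler builds its queue from page links, a simple extension filter keeps images out of it; this is a rough sketch with placeholder URLs, and sites that serve images without file extensions would slip past it:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")

page_url = "https://example.com/catalog"  # placeholder URL
soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")

links = [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]
# Keep only non-image links in the crawl queue.
crawl_queue = [url for url in links if not url.lower().endswith(IMAGE_EXTENSIONS)]
print(crawl_queue)
```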

Use a unique browser fingerprint

Browser fingerprinting identifies the visitor's device by creating a digital signature, also known as a fingerprint. Using Transmission Control Protocol (TCP) parameters and other signals from the browser, websites can identify the browser type, operating system, screen resolution and even hardware.

When you visit a website, your browser reveals these parameters, and they help track you even if cookies are disabled. Bots leave fingerprints too, so it is an increasingly popular method of restricting web crawling and scraping. The number of bans can grow significantly without good anti-fingerprinting measures.

Besides using proxies and common user agents, there are two main strategies to fight browser fingerprinting. One is to decrease the uniqueness of your browser. Uninstall any add-ons and use only up-to-date versions of the most popular browsers, such as Google Chrome or Mozilla Firefox. If your browser looks like millions of others used by many other people, it becomes very hard to track you through fingerprinting.

A nuclear option is to disable JavaScript with an extension such as Ghostery or NoScript. That makes browser fingerprinting almost impossible, but more and more websites don't load properly without JavaScript enabled, so you'd likely lose out on some data.

Don’t scrape unnecessary JavaScript elements

Checking whether the visitor's browser runs JavaScript is one of the easiest anti-bot measures websites implement. They simply send some JavaScript elements and see if the visitor can load them. If the visitor can't, it is most likely a bot, as almost all regular browsers have JavaScript enabled.

It may seem then that you should always render all JavaScript elements. However, dynamic website elements can be a tough nut to crack for a web scraper. They are hard to load, depend upon specific user interactions, and might add a lot of unnecessary traffic.

Therefore, it is best to avoid JavaScript elements when you can. But if the checks are strict, or if you want to scrape some dynamic elements, you will have to render at least the necessary ones.

Use a headless browser

A headless browser is one without any graphical user interface (GUI). Instead of navigating through websites with buttons and search bars, you can use commands to access the content. Because of their speed, headless browsers are used for testing websites and automating tasks, but they are also beneficial for web scraping.

They allow easier access to dynamic elements and can pass JavaScript checks. Some of the most popular web browsers, such as Google Chrome or Mozilla Firefox, have headless modes. A more advanced headless browsing experience can be achieved with tools like Selenium, HtmlUnit, Puppeteer or Playwright.

Setting up a headless browser isn't for everyone. You need to be familiar with at least basic commands and have some development knowledge to get one running properly.
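
For the curious, here is a minimal headless Chromium sketch with Playwright (installed via pip install playwright, then playwright install chromium); the target URL is a placeholder:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no GUI
    page = browser.new_page()
    page.goto("https://example.com")            # placeholder URL
    page.wait_for_load_state("networkidle")     # let JavaScript-rendered content settle
    html = page.content()                       # fully rendered HTML
    browser.close()

print(len(html))
```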

Solve CAPTCHAs

Sometimes you do everything correctly, and the website still wants to block you. CAPTCHA, or Completely Automated Public Turing Test to Tell Computers and Humans Apart, is one of the signs that the website might be looking for an excuse to ban your IP address.

CAPTCHAs work by providing a test that only a human can solve. Most frequently, it's an image-based task asking visitors to select pictures according to a provided theme. If the test isn't solved within the given timeframe, the visitor is denied access and considered a bot. An IP ban might follow after several unsolved CAPTCHAs.

It is always better to avoid CAPTCHAs, but it is not the end of the world if you get some. Most web scrapers either solve some CAPTCHAs themselves or can work together with CAPTCHA solving services. With good API integration, such services can solve these tests rather quickly.

However, success isn't always guaranteed, and the cost of CAPTCHA-solving services adds up quickly if you receive too many tests. It is more efficient to use every other tool first and keep CAPTCHA solving as a backup plan.
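
As a rough sketch, a scraper can at least detect a likely CAPTCHA or block page and back off before retrying; the "captcha" substring check below is a crude, hypothetical heuristic, and real detection depends on the site and its CAPTCHA provider:

```python
import time
import requests


def fetch_with_backoff(url, max_attempts=3):
    """Fetch a page, backing off when the response looks like a CAPTCHA or a block."""
    for attempt in range(max_attempts):
        response = requests.get(url, timeout=30)
        blocked = response.status_code in (403, 429) or "captcha" in response.text.lower()
        if not blocked:
            return response
        time.sleep(60 * (attempt + 1))  # slow down before trying again
        # ...rotate the proxy or User-Agent here, or hand the page to a solving service...
    return None
```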

Scrape Google cache

If nothing else seems to work and you are getting banned, there is one last trick to try. Instead of crawling and scraping the website itself, you can scrape the copy of its data in Google's cache. Simply prepend "http://webcache.googleusercontent.com/search?q=cache:" to the original URL you intended to visit.

Even if the website has strong defenses and blocks most scrapers, it wants to be visible on Google. It is the source of most internet traffic, after all. Therefore, it will allow Googlebot, and there will be a copy of the website in their archive.

Unfortunately, it isn't a perfect solution. Caches of less popular websites are rarely updated, so frequently changing content can be missing. Other websites block Googlebot from caching their pages altogether, so you won't find a copy at all. But if you can locate the website you need, scraping Google cache will make things easier.
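
Building the cache URL is just string concatenation; in this sketch the target URL is a placeholder, and a 404 usually means Google holds no cached copy of that page:

```python
import requests

CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

original_url = "https://example.com/products/page-1"  # placeholder target
response = requests.get(CACHE_PREFIX + original_url, timeout=30)
print(response.status_code)
```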

Frequently Asked Questions

How do I stop being blocked from web scraping?

Web scraping tools, just like crawlers, employ bots to visit websites. Both processes trigger the same defensive measures. In short, the most effective methods are following the rules outlined in Robots.txt, using proxies and optimizing the user agent along with other HTTP headers.

Is it legal to crawl a website?

Crawling and scraping websites is generally legal, but there are some exceptions you should know. The first is copyrighted content. Such data is often marked in some way, for example with the copyright symbol ©. Keep an eye out for these signs to avoid getting in trouble.

Personal information is also off limits unless the owner explicitly gives their consent. On social media sites, for instance, personal data sits behind logins, so it is better to avoid crawling and scraping there altogether.

Accessing publicly available, copyright-free and non-personal data is the safest option. Always consider consulting a legal professional. Another great idea is to check out our in-depth blog post about this topic.

Can you be banned from scraping?

You can get your IP address banned while scraping large amounts of data without any precautions. In some cases, websites can block registered accounts and browser fingerprints too. Luckily, no ban is permanent, as you can change your IP address with a proxy and start all over again.

So neither your web scraper nor the devices you use can be permanently shut out of web scraping.

Can websites detect scraping?

Many websites can detect web scrapers and crawlers if you do not take any precautions. For example, crawling without a proxy and ignoring robots.txt will most likely arouse suspicion. There are some signs that your actions have been detected:

  • An increased number of CAPTCHAs
  • Slow website loading speed
  • Frequent HTTP errors, such as 404 Not Found, 403 Forbidden or 503 Service Unavailable
  • An IP address block

Conclusion

The tips covered here will give you a head start against most websites, but anti-scraping tools will continue to evolve. Be sure to keep your knowledge of web scraping and industry best practices up to date. But no matter how much data collection changes, following robots.txt, switching user agents, and using proxies will remain the most effective measures.
