Even though web scraping is commonly used across most industries, most websites do not appreciate it, and new anti-scraping methods are developed regularly. The main reason is that aggressive web scraping can slow down a website for its regular users and, in the worst case, result in a denial of service. To prevent you from scraping their websites, companies use various strategies.
Limiting the scraping
IP rate limiting, also called request throttling, is a commonly used anti-scraping method. Good web scraping practice is to respect the website and scrape it slowly, so you avoid monopolizing its bandwidth and regular users can still browse smoothly while you scrape. IP rate limiting means there is a maximum number of requests the website will accept from one IP address in a given time window; any request over this limit simply will not receive an answer.
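As an illustration, here is a minimal sketch of a polite scraper in Python. It assumes the target returns HTTP 429 when the limit is hit; the URLs and the 2-second delay are placeholders, not values taken from any particular website.

```python
import time
import requests

# Placeholder URLs for illustration only.
urls = ["https://example.com/page/1", "https://example.com/page/2"]

session = requests.Session()
for url in urls:
    response = session.get(url, timeout=10)
    if response.status_code == 429:
        # The server says we hit its rate limit; back off before retrying.
        retry_after = response.headers.get("Retry-After")
        wait = int(retry_after) if retry_after and retry_after.isdigit() else 30
        time.sleep(wait)
        response = session.get(url, timeout=10)
    # ... parse response.text here ...
    time.sleep(2)  # pause between requests so regular users keep a smooth experience
```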
Blocking web scraping
While some websites are okay with simply regulating web scraping, others try to prevent it altogether. They use many techniques to detect and block scrapers: user-agent checks, CAPTCHAs, behavioral analysis, blocking individual IPs or entire IP ranges, AWS Shield, … You can read more about how to scrape a website without being blocked in this article.
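One of the simplest detection signals is the user agent: a request sent with a default HTTP-library identifier is easy to flag. Here is a hedged sketch of sending realistic, rotated User-Agent headers; the strings below are examples only and should be kept up to date in a real project.

```python
import random
import requests

# Example browser User-Agent strings (assumptions, not an authoritative list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch(url: str) -> requests.Response:
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return requests.get(url, headers=headers, timeout=10)

html = fetch("https://example.com").text
```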
Making the data collection harder
Some websites modify their HTML markup every month to protect their data. A scraping bot looks for information in the places where it found it last time; by changing the structure of their HTML, websites try to confuse the scraping tool and make it harder to locate the desired data.
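One common way to cope with this is to avoid relying on a single brittle selector. The sketch below uses BeautifulSoup with a chain of fallback selectors; the class names and the sample HTML are hypothetical, purely for illustration.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical product page markup.
html = """
<div class="product">
  <span class="price-v2" itemprop="price">19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Try several selectors, from the most semantic to older class names,
# so a markup change does not immediately break the scraper.
price_tag = (
    soup.select_one('[itemprop="price"]')
    or soup.select_one(".price-v2")
    or soup.select_one(".price")
)
price = price_tag.get_text(strip=True) if price_tag else None
print(price)  # -> 19.99
```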
In addition, programmers can obfuscate the code. HTML obfuscation consists of making the markup much harder to read while keeping it perfectly functional: the information is still there, but written in an extremely convoluted way.
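A typical pattern is splitting a value across nested elements with meaningless class names so a naive text search fails. The snippet below is a small, hypothetical example of recovering such a value by flattening the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical obfuscated markup: the price is split across nested spans.
html = '<p class="x9f"><span class="a1">1</span><span class="zz">9</span>.<span>99</span></p>'

soup = BeautifulSoup(html, "html.parser")
# get_text() flattens the nested wrappers and restores the readable value.
print(soup.get_text())  # -> 19.99
```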
Another technique is to build a dynamic user interface with JavaScript or AJAX. The page initially loads only part of the content, and the information you want to collect appears only after clicking buttons or scrolling, without reloading the page. A scraper that only fetches the raw HTML will miss that data, or time out waiting for content that never arrives.
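Such pages are usually handled with a headless browser that actually runs the JavaScript. Here is a sketch using Playwright; the URL and the "#price" selector are placeholders for whatever dynamic element you actually need.

```python
# pip install playwright && playwright install
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/product")  # placeholder URL
    # Wait until the JavaScript has injected the element we care about.
    page.wait_for_selector("#price", timeout=15_000)
    print(page.inner_text("#price"))
    browser.close()
```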
Providing fake information
In our article about scraping without getting blocked, we talked about honeypots, those links that only bots will find and visit. Some other techniques are also meant to be seen only by bots, not by regular users. Cloaking is one of them: it is a hiding technique that returns an altered page when the visitor is identified as a bot, while normal users see the real page. The bot still collects information, without knowing that it is fake or incorrect. This method is strongly frowned upon by Google and other search engines, and websites using it risk being removed from their index.
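A rough way to check whether a site might be serving different content to bots and to browsers is to fetch the same URL with different User-Agent headers and compare the responses. The sketch below is only a heuristic; the URL and the 20% size threshold are illustrative assumptions.

```python
import requests

URL = "https://example.com"  # placeholder URL

as_browser = requests.get(
    URL,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    },
    timeout=10,
).text

as_bot = requests.get(
    URL,
    headers={"User-Agent": "python-requests/2.31"},
    timeout=10,
).text

# A large difference in response size is a hint (not proof) of cloaking.
if abs(len(as_browser) - len(as_bot)) > 0.2 * max(len(as_browser), len(as_bot), 1):
    print("Responses differ significantly; the site may be cloaking.")
```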
ScrapingBot takes care of all those struggles for you, so don't hesitate to give it a go if you have a scraping project. There will be an API for your needs ;)