<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Perrine</title>
    <description>The latest articles on DEV Community by Perrine (@perrine).</description>
    <link>https://dev.to/perrine</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F309911%2F26b4a71c-6cc3-405e-bfc6-7a263640155a.jpeg</url>
      <title>DEV Community: Perrine</title>
      <link>https://dev.to/perrine</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/perrine"/>
    <language>en</language>
    <item>
      <title>What are the methods used against web scraping?</title>
      <dc:creator>Perrine</dc:creator>
      <pubDate>Wed, 04 Mar 2020 14:22:57 +0000</pubDate>
      <link>https://dev.to/perrine/what-are-the-methods-used-against-web-scraping-a11</link>
      <guid>https://dev.to/perrine/what-are-the-methods-used-against-web-scraping-a11</guid>
      <description>&lt;p&gt;Even though web scraping is commonly used across most industries, most websites do not appreciate it and new anti-scraping methods are being developed regularly. The main reason is that aggressive web scraping can slow down the website for regular users, and in the worst-case result in a denial of service. To prevent you from scraping their websites, companies are using various strategies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TNgzm2L8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraping-bot.io/wp-content/uploads/2020/02/scrapingbot-methods-used-against-webscraping-394x229.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TNgzm2L8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.scraping-bot.io/wp-content/uploads/2020/02/scrapingbot-methods-used-against-webscraping-394x229.png" alt="Anti Scraping Methods"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Limiting the scraping&lt;/h1&gt;

&lt;p&gt;IP rate limiting, also called request throttling, is a commonly used anti-scraping method. A good practice of web scraping is to respect the website and scrape it slowly; this way, you avoid monopolizing its bandwidth, and regular users can still enjoy a smooth experience alongside your scraping. IP rate limiting means that there is a maximum number of actions an IP address can perform on the website within a certain time. Any request over this limit simply receives no answer.&lt;/p&gt;
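&lt;p&gt;On the scraper's side, the simplest way to live with such a limit is to throttle your own requests. Here is a minimal client-side sketch in Python (the &lt;code&gt;Throttle&lt;/code&gt; class and the 2-second interval are illustrative assumptions, not tied to any particular site):&lt;/p&gt;

```python
import time

class Throttle:
    """Client-side rate limiter: spaces out requests to stay under a site's limit."""

    def __init__(self, min_interval_seconds):
        self.min_interval = min_interval_seconds
        self.last_request = 0.0

    def wait(self):
        """Block just long enough so consecutive requests respect the interval."""
        elapsed = time.monotonic() - self.last_request
        time.sleep(max(0.0, self.min_interval - elapsed))
        self.last_request = time.monotonic()

# At most one request every 2 seconds:
throttle = Throttle(min_interval_seconds=2.0)
# for url in urls:          # hypothetical URL list
#     throttle.wait()
#     fetch(url)            # hypothetical fetch helper
```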

&lt;h1&gt;Blocking the web scraping&lt;/h1&gt;

&lt;p&gt;While some websites are content with simply regulating web scraping, others try to prevent it altogether. They use many techniques to detect and block scrapers: user-agent checks, CAPTCHAs, behavioral analysis technology, blocking individual IP addresses or entire IP ranges, AWS Shield, … You can read more about how to scrape a website without being blocked in &lt;a href="https://www.scraping-bot.io/how-to-scrape-a-website-without-getting-blocked/"&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Making the data collection harder&lt;/h1&gt;

&lt;p&gt;Some websites modify their HTML markup every month to protect their data. A scraping bot looks for a piece of information in the place where it found it last time. By changing the structure of their HTML, these websites try to confuse the scraping tool and make the desired data harder to find.&lt;/p&gt;

&lt;p&gt;In addition, programmers can obfuscate the code. HTML obfuscation consists of making the code much harder to read while keeping it perfectly functional. The information is still there, but written in an extremely convoluted way.&lt;/p&gt;

&lt;p&gt;Another technique is to build a dynamic user interface with JavaScript or AJAX. The page initially loads only part of its content; the information to collect sits behind buttons that fetch data without reloading the page. A scraper that only downloads the static HTML will time out or come back empty-handed.&lt;/p&gt;

&lt;h1&gt;Providing fake information&lt;/h1&gt;

&lt;p&gt;In our article about &lt;a href="https://www.scraping-bot.io/how-to-scrape-a-website-without-getting-blocked/"&gt;scraping without getting blocked&lt;/a&gt;, we talked about honeypots, those links that only bots will find and visit. Some other techniques are likewise meant to be seen only by bots, not by regular users. This is the case with cloaking, a hiding technique that returns an altered page when the visitor is a bot; only regular users see the real pages. The bot still collects information, without knowing that it is fake or incorrect. This method is strongly frowned upon by Google and other search engines, and websites using it risk being removed from their index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scraping-bot.io/"&gt;ScrapingBot&lt;/a&gt; takes care of all those struggles for you, so don't hesitate to give it a go if you have a scraping project. There will be an API for your needs ;)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to scrape a website without getting blocked?</title>
      <dc:creator>Perrine</dc:creator>
      <pubDate>Mon, 06 Jan 2020 14:12:46 +0000</pubDate>
      <link>https://dev.to/perrine/how-to-scrape-a-website-without-getting-blocked-d5a</link>
      <guid>https://dev.to/perrine/how-to-scrape-a-website-without-getting-blocked-d5a</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gPDtATmv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qjfcwbfn9b0yvrkhcrp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gPDtATmv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/qjfcwbfn9b0yvrkhcrp0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you want to collect and analyze data, whether for price comparison, statistics, or to track a general trend, scraping is a great and essential time saver. However, many websites do not appreciate being heavily scraped, and some do not allow it at all, especially in the retail sector. There are some general rules and tricks to follow if you do not want to be blocked from scraping a website, temporarily or permanently.&lt;/p&gt;

&lt;h2&gt;IP rotation&lt;/h2&gt;

&lt;p&gt;Rotating IP addresses is key when scraping websites. Most e-commerce and retail websites do not appreciate being scraped.&lt;/p&gt;

&lt;p&gt;When you’re scraping a website, you want the data to be collected fast. However, when a website receives multiple simultaneous requests from a single IP address, it detects a scraper and blocks it. To avoid being blacklisted, the best approach is to use proxies, which route your requests through a pool of different IP addresses.&lt;/p&gt;
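&lt;p&gt;A round-robin pool is enough to sketch the idea in Python (the proxy addresses below are made-up placeholders; a real pool would come from a proxy provider):&lt;/p&gt;

```python
import itertools

# Hypothetical proxy pool; real addresses come from a proxy provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order, one per request."""
    return next(proxy_cycle)

# Each request then leaves through a different IP address:
# fetch(url, proxy=next_proxy())   # hypothetical fetch helper
```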

&lt;h2&gt;Scrape Slowly&lt;/h2&gt;

&lt;p&gt;The whole point of scraping is to collect data faster than could be done manually. As a result, scrapers browse websites very fast. Websites can see how long you spend on each page, and if the pace is not human-like, they will block you. That’s why, even at the cost of some efficiency, it is worth limiting your speed: find the optimal pace and add some delays between pages and requests. On a retail website, this is key to scraping data.&lt;/p&gt;
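&lt;p&gt;In Python, spacing requests out can be as simple as sleeping a random amount of time between two fetches (the bounds below are illustrative; tune them per site):&lt;/p&gt;

```python
import random
import time

def polite_delay(min_seconds=1.0, max_seconds=4.0):
    """Sleep a random, human-like amount of time between two page fetches."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Between two page fetches:
# polite_delay()   # pauses somewhere between 1 and 4 seconds
```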

&lt;h2&gt;Scraping Patterns&lt;/h2&gt;

&lt;p&gt;Unless told otherwise, a crawler will always take the most efficient route. That seems great, except that it contrasts sharply with human users, who navigate much more slowly. Going fast therefore makes the scraper very easy to spot and block. To avoid being blacklisted, you must mimic a standard user: set some delays between clicks, avoid repetitive browsing behavior, and add some mouse movements and random clicks. Basically, you need to program your robot to look less like a robot and more like a person.&lt;/p&gt;
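&lt;p&gt;One easy piece of that disguise is not crawling pages in a perfectly systematic order. A small sketch (the &lt;code&gt;seed&lt;/code&gt; parameter is only there to make the behavior reproducible):&lt;/p&gt;

```python
import random

def humanize_crawl_order(urls, seed=None):
    """Return the URLs in a shuffled order so the crawl path looks less mechanical."""
    rng = random.Random(seed)
    shuffled = list(urls)
    rng.shuffle(shuffled)
    return shuffled

# pages = humanize_crawl_order(["/p/1", "/p/2", "/p/3"])  # same URLs, random order
```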

&lt;h2&gt;Honeypot traps&lt;/h2&gt;

&lt;p&gt;Honeypot traps are links hidden in the HTML code that are invisible to regular users visiting the website. When one of those links is visited, the website knows a scraper is on the page and blocks its IP address. The scraper therefore needs to detect whether a link is deliberately invisible; for example, a link can be styled in the same colour as the background so that human users cannot see it.&lt;/p&gt;
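&lt;p&gt;A scraper can apply simple heuristics to a link's styling before following it. A minimal sketch (the checks below cover only the most obvious hiding tricks and are illustrative, not exhaustive):&lt;/p&gt;

```python
def looks_hidden(style, link_color=None, background_color=None):
    """Heuristic: does this link's styling hide it from human eyes?"""
    style = (style or "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        return True
    # A link drawn in the page's background colour is invisible to humans too.
    if link_color is not None and link_color == background_color:
        return True
    return False

# A cautious crawler skips links that fail the check:
# if looks_hidden(link_style):
#     continue  # likely a honeypot
```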

&lt;h2&gt;Switch User Agents&lt;/h2&gt;

&lt;p&gt;The user agent is a string of characters telling the website how you are visiting it: which browser, version, and operating system you are using. As with the IP address, a single user agent driven by a human will not send as many requests per minute as a crawler does. It is therefore important to build a list of different user agents and switch between them regularly, to avoid being detected and blocked.&lt;/p&gt;
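&lt;p&gt;A sketch of this rotation in Python (the user-agent strings below are shortened examples; a real pool should be larger and kept up to date):&lt;/p&gt;

```python
import random

# Illustrative pool of user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# headers = random_headers()   # send these headers with every request
```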

&lt;h2&gt;Respect Robots.txt and the website in general&lt;/h2&gt;

&lt;p&gt;The robots.txt file lives at the root of the website. It sets the rules of crawling: which parts of the website should not be scraped, and how frequently it may be crawled. Some websites do not allow anyone to scrape them at all.&lt;/p&gt;
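&lt;p&gt;Python's standard library can parse these rules for you. A sketch using &lt;code&gt;urllib.robotparser&lt;/code&gt; (the robots.txt content and domain below are made up for illustration; in practice you fetch the real file from the site root):&lt;/p&gt;

```python
import urllib.robotparser

# A hypothetical robots.txt for this example.
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyBot", "https://example.com/products"))      # True
print(parser.can_fetch("MyBot", "https://example.com/private/data"))  # False
print(parser.crawl_delay("MyBot"))                                    # 10
```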

&lt;p&gt;If you scrape a website too frequently and send too many requests at a time, you might overload its servers and badly impact its performance. The owners want their site to run smoothly for everyone, so they may block you to restore that balance.&lt;/p&gt;

&lt;p&gt;Even better, you can use &lt;a href="https://www.scraping-bot.io/"&gt;ScrapingBot&lt;/a&gt; and all of this will be handled for you ;)&lt;/p&gt;

</description>
      <category>scraping</category>
      <category>api</category>
      <category>productivity</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
