Hadji Khalil

Web scraping and anti-bot systems.

Nowadays, web service owners widely acknowledge that data scraping is an inevitable challenge they will face. Because some scrapers can slow down or overload website servers, owners use every available means to safeguard their sites.

The main problem

One of the biggest challenges in web scraping is getting blocked, which can happen for many different reasons. All of these reasons, however, come down to one fundamental fact:

web scrapers appear differently to websites and servers than a web browser used by a normal internet user does.

The following paragraphs present some of the factors that security systems use to differentiate between normal users and web scrapers.

Blocking Factors

Blocking factors can be split into three groups: network factors, client/browser factors, and behavior factors.

Network factors

All communication between clients and servers goes through multiple protocols, organized as a stack of layers. Each layer has its own way of formatting and organizing data before sending it, which is what encapsulation means. The data added at each layer identifies both the sender and the receiver, so an automated bot can leave multiple clues and fingerprints that lead to it being identified as a malicious connection and, therefore, blocked.
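A quick way to see one of these clues at the application layer is to compare the headers a plain HTTP client sends with the headers a real browser sends. The sketch below, assuming the `requests` library is installed and using `https://example.com` purely as a placeholder target, prints the library's default headers (whose `User-Agent` alone identifies it as `python-requests`) and then sends a request with browser-like headers instead. Keep in mind this only covers one layer; lower layers such as the TLS handshake and the IP address leave fingerprints of their own.

```python
import requests

# Default headers sent by the requests library: the User-Agent
# ("python-requests/x.y.z") is enough for many servers to flag the client.
print(requests.Session().headers)

# A more browser-like set of headers (values are illustrative,
# modeled on a typical desktop Chrome request).
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

# "https://example.com" is just a placeholder target.
response = requests.get("https://example.com", headers=browser_headers, timeout=10)
print(response.status_code)
```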

Client factors

Browsers are complex software, and most of the time we don't think about how they work; we just use them. Automated browsers are popular for scraping JavaScript-heavy web pages. Using JavaScript, the developer behind a website can execute arbitrary code on the client's machine to collect information that identifies whoever is behind the connection. This way, automated browsers can be easily identified, especially when they are not correctly patched to stop leaking information about themselves.
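One of the simplest leaks an unpatched automated browser exposes is the `navigator.webdriver` property, which client-side scripts can read directly. The minimal sketch below assumes Playwright is installed and uses `https://example.com` as a placeholder; real anti-bot scripts check many more properties (plugins, WebGL renderer, screen dimensions, and so on), so this is only illustrative.

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL

    # Read a well-known automation signal from the page's JavaScript context.
    # It is often True for an unpatched automated launch, while a regular
    # user's browser reports False or undefined.
    webdriver_flag = page.evaluate("() => navigator.webdriver")
    print("navigator.webdriver:", webdriver_flag)

    browser.close()
```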

Behavior factors

When automating a browser, interactions with the web page follow predetermined patterns and take split seconds; human interaction with a web page, by contrast, is fairly slow and follows far less predictable patterns. Because the two are so distinct, some security systems can easily spot and block users that are presumed harmful or suspicious.
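A common mitigation is to make the automation's timing look less mechanical. The sketch below, again assuming Playwright and using a placeholder URL, adds random human-scale pauses, moves the mouse in many small steps instead of jumping instantly, and types with a per-keystroke delay. This is only one idea among many, not a guarantee against detection.

```python
import random
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL

    # Pause for a random, human-scale amount of time before acting.
    time.sleep(random.uniform(1.0, 3.0))

    # Move the mouse gradually (many small steps) rather than teleporting.
    page.mouse.move(random.randint(100, 400), random.randint(100, 400), steps=30)

    # Type with a randomized per-keystroke delay instead of filling instantly.
    page.keyboard.type("hello world", delay=random.randint(80, 200))

    browser.close()
```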

Conclusion

When creating large-scale web scraping software, choosing the right tools and making the right design decisions to avoid being blocked by websites is essential for building robust and reliable web data collection solutions.
