DEV Community

Rajat Thakur

Is Web Scraping Legal Or Not?

From unethical hacking and identity theft to internet scams and social engineering, we regularly hear about regulations that aim to suppress crime and fraud on the net. But the position of internet law on the legality of web scraping remains controversial.

You may well find yourself collecting data from the web, now or in the future, for commercial or personal purposes, just as I collect news data with the help of a news API. So the question that comes to mind is: is web scraping legal? You will soon know.

Notable Historical Legal Issues of Web Scraping

Most past legal battles between companies over web scraping have left more puzzles than answers. With the twists and turns involved, a plaintiff who sues others for scraping its website could even find itself at fault.

Still, some cases shed light on the legality of web scraping, and analyzing them will help you understand where the law stands. Before we go any further, let’s look at a few of them.

Facebook’s Web Scraper Clampdown Quest

Alongside several data breach stories, Facebook has faced repeated backlash for being careless with user data. And when it came to scraping the social network, Cambridge Analytica did not settle for small numbers: it harvested data from tens of millions of Facebook profiles, which was used in 2016 to try to identify undecided voters.

Although the scraping did not technically affect the proper functioning of Facebook or any of its services, lawmakers and regulators concluded that Cambridge Analytica had misused the collected data. Facebook was later fined $5 billion by the Federal Trade Commission in 2019 for its role in violating the privacy of its users.

The penalty, in other words, targeted the abuse of private data rather than the act of scraping itself.

Cambridge Analytica paid a price too. Its reputation was badly damaged, and the company filed for Chapter 7 bankruptcy in 2018 after claiming to have lost many of its political clients.

Having learned that lesson the hard way, Facebook went on to take legal action against a number of web scrapers.

One prominent example is Facebook’s case in 2020 against two Ukrainians who deceptively scraped its users’ data using browser extensions and quiz apps. It is a textbook example of collecting data from the wrong place using the wrong method.

The court ruled in favor of Facebook but did not punish the offenders severely. It did, however, find the activities of these extensions harmful and recommended a permanent injunction against the defendants.

“Malicious” was an apt description of these scrapers’ activity, as they collected personal data from Facebook users without their consent.

When Is Web Scraping Illegal?

As the cases above suggest, the legality of web scraping can seem like a dead end, since no regulation specifically governs it. So it looks like you can scrape the web all you want after all. And looking logically at the salient past cases, web scraping is not illegal in itself.

But your technical approach and the way you use the collected data matter a great deal. The conditions surrounding each scraping activity say more about its legality than the act itself. For example, as with any policy violation, courts have in the past penalized screen scraping that breached a website’s terms.

In short, although screen scraping is not illegal in itself, you can make it illegal by doing it improperly or maliciously. Even if you mean no harm, some tech companies frown on web scraping. And while others let you scrape, they tell you what you may and may not do with the data you collect. Violating these terms could result in a legal injunction. So watch out for red flags, and read a website’s terms and data privacy policy before taking any data from it.

Data Theft VS Data Scraping: What’s the Difference?

Data theft is often the consequence of the many breaches occurring on the internet. When it happens, the credibility of the affected website suffers. Worse still, stolen data has sometimes surfaced on the dark web.

Web scraping, in the true sense of the word, is broad. But fundamentally, it often involves screen scraping, which is the gathering of pre-rendered information from the front end. Such activity is unlikely to affect the technical side of a website. Also, data retrieved this way is usually not protected, so anyone can collect it.
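To make screen scraping concrete, here is a minimal sketch that pulls text out of markup that has already been rendered. The sample page, the `headline` class name, and the helper names are all hypothetical and chosen only for illustration; a real scraper would first download the page and would need to match that site’s actual markup.

```python
from html.parser import HTMLParser

# Hypothetical, already-rendered page markup; a real scraper would download it first.
SAMPLE_HTML = """
<html><body>
  <h1>Daily Headlines</h1>
  <ul>
    <li class="headline">Court rules on data case</li>
    <li class="headline">New privacy bill proposed</li>
    <li class="ad">Buy now!</li>
  </ul>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collects the text of every <li class="headline"> element."""

    def __init__(self):
        super().__init__()
        self._in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # Start a new entry only for list items marked as headlines.
        if tag == "li" and dict(attrs).get("class") == "headline":
            self._in_headline = True
            self.headlines.append("")

    def handle_data(self, data):
        # Text outside a headline item (such as the ad) is ignored.
        if self._in_headline:
            self.headlines[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_headline = False


def extract_headlines(html):
    """Returns the headline texts found in the given HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.headlines]
```

Calling `extract_headlines(SAMPLE_HTML)` returns the two headline strings and skips the ad item. Nothing here touches a database or a private endpoint: the code only reads what the front end already shows to every visitor.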

In some cases, a data scraper can also pull from a database directly by monitoring data streams. When done formally, this approach is usually backed by an agreement between the scraper and the source. Where there is no such agreement, the data must have been made available to the public.

Otherwise, if you are not authorized to connect to a database, retrieving data from it in real time becomes dodgy and amounts to hacking. You can define this kind of data theft as unethical information harvesting.

Data theft, then, aims to obtain confidential information without authorization. It can compromise the integrity of a website, as it sometimes involves hacking into a database. Still, it is partially correct to say that data theft is a misuse of web scraping.

In addition, there are binding laws and regulations on data theft. So even if you claim to be merely scraping, it is theft when you forcibly collect confidential data.

Sometimes data thieves or hackers exploit a vulnerability in a website to perpetrate data theft, and many of these cases have gone unpunished. Even so, be careful not to extract data from places you are clearly not authorized to access.

Is Web Scraping a Result of a Website’s Vulnerabilities?

Security vulnerabilities can undoubtedly lead to a data breach. Web scraping becomes illegal when people misuse scraped data or use unethical technical processes to retrieve information. But scraping itself does not require exploiting any vulnerability. So a website, no matter how secure, seems to have little control over what people can and cannot scrape.

Can You Get Blocked From Scraping a Website?

A robots.txt file is a popular tool that businesses use to tell bots which directories of their website they should not access. Before scraping, you can check whether a website allows a particular page to be crawled by typing websiteurl/robots.txt into your browser’s address bar.
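The same check can be done in code. The sketch below uses Python’s standard `urllib.robotparser` module; the robots.txt content, the `my-scraper` user agent, and the example.com URLs are made up for illustration, and the rules are inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Returns True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example checks against the hypothetical rules above.
public_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/news/today.html")
private_ok = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data.html")
```

Here `public_ok` is `True` and `private_ok` is `False`. To check a live site instead, you could call the parser’s `set_url()` with the site’s robots.txt address and then `read()` before `can_fetch()`.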

Because such a file is only a request and bots can ignore it, some websites write additional security scripts that block suspicious IP addresses to prevent unauthorized access to their content. Despite these efforts, people still manage to get what they want. DOM parsing, along with machine learning techniques such as natural language processing and computer vision, powers some data scrapers today. Some of these techniques are clever enough to slip past a website’s security wall by imitating human browsing behavior.

What Types of Websites Can You Scrape?

You probably know by now that web scraping is only legal when you use it for a good cause. And there are many business ideas built on web scraping. But as stated earlier, some websites don’t like being scraped. So which categories of websites can you collect data from?

1. Social Media

Social media websites are some of the most relied-upon sources for extracting natural-language and sentiment data. Social media giants like Facebook and Twitter even offer APIs that allow developers to connect to them and use their data. This data is usually accessed programmatically and is meant to be integrated into applications for specific solutions. It may therefore not be directly downloadable as CSV or Excel files, as it might be when you extract a large volume of data from open websites.

That said, some of them even allow you to grab and download user comments without revealing who posted them. For Twitter, for example, there is Tweepy, a third-party Python library that wraps the Twitter API and lets you search users’ tweets. Using Tweepy, you can collect tweets that contain a certain keyword.
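As a rough sketch of the keyword search described above: the helpers below are hypothetical names written for this article, not part of Tweepy. They assume a client object that behaves like Tweepy’s `tweepy.Client`, whose `search_recent_tweets` method takes a `query` and a `max_results` argument, and they assume you have valid API credentials and access to the search endpoint.

```python
def build_keyword_query(keyword, lang="en", include_retweets=False):
    """Builds a Twitter API v2 search query string for one keyword or phrase."""
    parts = [f'"{keyword}"', f"lang:{lang}"]
    if not include_retweets:
        # The -is:retweet operator filters out retweets.
        parts.append("-is:retweet")
    return " ".join(parts)


def search_keyword(client, keyword, limit=10):
    """Returns the text of recent tweets that match the keyword.

    `client` is assumed to expose search_recent_tweets(query=..., max_results=...)
    the way tweepy.Client does.
    """
    response = client.search_recent_tweets(
        query=build_keyword_query(keyword), max_results=limit
    )
    # response.data is None when nothing matched.
    return [tweet.text for tweet in (response.data or [])]
```

With Tweepy installed, usage would look roughly like `client = tweepy.Client(bearer_token="YOUR_TOKEN")` followed by `search_keyword(client, "web scraping")`. The bearer token is a placeholder, and the API’s rate limits and terms still apply.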

2. E-Commerce and Directory Websites

E-commerce stores and directory websites are arguably the most reliable sources for gathering market and product data. Walmart, Amazon, and eBay are some of the top e-commerce sites where people search for product information. Some of these websites state whether or not they allow scraping, and some do not, so be careful here to avoid legal consequences. But since product listings are publicly visible on the client side, you should generally be able to scrape them.

3. News and Media Websites

News and media websites are excellent sources of information, and people sometimes scrape them for SEO insights. You can scrape news sites and blogs as long as you don’t reproduce or plagiarise their content. Newsdata.io is a great news API for collecting news data from thousands of reliable news websites around the world in 10+ languages.

4. Job Boards

Many companies turn to popular job boards to recommend the most in-demand skills to their clients. And since many of these websites contain resume examples, they are good sources of resume templates for various types of jobs. LinkedIn, Indeed, and Glassdoor are examples of job sites that job-recommendation companies collect data from. If you don’t cross the line, you should have no problem collecting data from these websites either.

5. Search Engines

Although it may seem overwhelming and laborious, search engines are the best places to look for publicly available data. Content management companies sometimes pull query results from search engines like Google and Bing for keyword and SEO information. In terms of legality, search engines are the safest to scrape because the information they offer is already publicly indexed.

Conclusion

Web scraping is one of the most complex issues to tackle on the internet today. Everyone, including regulators and even those who disapprove of it, scrapes the web in one way or another. It is invaluable in many areas, including market research, artificial intelligence, and SEO.

Although its legality depends on a few key factors, it doesn’t look like scraping will ever face a strict, outright sanction. As long as you do not violate any legal clause, the net is a free world. So feel free to scrape the web as you wish.
