DEV Community: Sam Selman

What is Web scraping?

Sam Selman — Thu, 07 Mar 2024 13:22:37 +0000

The amount of data Google handles is extraordinary; it processes 200 petabytes daily. This points to the sheer volume of often invaluable data on websites, including business contacts, stock prices, product descriptions, sports team stats, and a lot more. Web scraping allows you to tap into that.

Web Data Scraping – An overview

Web scraping, also known as data scraping, involves collecting various types of data from the internet, be it content, numbers, images, etc.

Web scraping replaces the tedious and error-prone process of manual copying and pasting, saving you time and money. Scraped data is usually fed into programs, spreadsheets or databases to be subsequently visualized, processed or used as machine learning training data.

According to Imperva, bots, a category that includes scraping tools, accounted for about 37% of all internet traffic in 2019. Good bots, i.e. the ones performing tasks such as those described below, made up 13% of all traffic, while bad bots, used for spamming, stealing data and other malicious activities, made up 24%.

Web Scraping Applications

Marketers and researchers use web scraping for lead generation, customer behavior analysis, price intelligence, competitor analysis, monitoring, and more. The following are common uses of web scraping tools:

Lead Generation
Lead generation is essential for businesses since their operations depend on a steady supply of prospective customers. They use web scraping tools to get rich business leads without complicated or expensive inbound campaigns.

Price Monitoring
People nowadays look for the lowest prices with the best quality. Let's say you are an online seller with high-quality products or services but do not know the optimum price to charge or which promotion strategy to use. Price scraping allows you to extract prices and other data from your competitors' websites in a structured manner. The data can help you monitor your competitors' pricing and analyze their performance and marketing strategies.

News Monitoring
Also known as news scraping, this strategy involves collecting data from online media and social websites. The specific data includes news articles, the latest information, market trends, and any news or information that can affect your business goals and strategies.

As a business owner, you must keep an eye on changing market trends and the latest news. This news often contains crucial public data and information that can benefit your interest; moreover, you can find data from any industry.

Market Intelligence
Market intelligence is the best way to gain an edge on your competition. Web scraping allows companies to automate the collection of market intelligence by gathering data from sources all over the web and turning it into actionable insights. You can track prices, monitor trends, and collect customer feedback at scale.

Machine Learning
Machine learning engineers need data to train their models on. What better place than the Web for such data? From content classification to natural language processing, a plethora of applications rely on scraping the data that the Web offers for free and in large amounts.

Web Scraping Challenges

At first, web scraping may look straightforward, but not everyone is receptive to strangers trying to access their data. Large scale scraping involves extracting data from hundreds or thousands of pages at a rapid rate, which has the potential to bring servers down.

When faced with a project involving scraping massive amounts of data, developers need to be cognizant of the following roadblocks:

IP blocking
IP blocking is one of the basic techniques employed by site owners for dealing with scrapers. When the server detects a significant number of requests from the same IP address or when a search robot makes many concurrent queries, blocking is triggered. There is also geolocation-based IP filtering. This occurs when the site is secured against data collection efforts from specified geographical areas. The website will either fully prohibit the IP or restrict its access.

The solution to this is using a proxy network to hide your original IP address. This allows data scraping without getting blocked in most cases. However, some proxy servers, especially those hosted in data centers, can be detected with relative ease from their IPs. Residential and mobile proxies, on the other hand, while expensive, are undetectable by IP alone since their IP addresses are those of regular users connecting via their ISP.

It’s also worth noting that when rapidly scraping several pages from the same website, hiding your original IP is only half the work. One also needs to resort to IP rotation to simulate requests coming from different users.
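As a rough illustration of IP rotation, here is a minimal Python sketch that hands out proxies from a pool in round-robin order. The `ProxyRotator` class and the proxy addresses are hypothetical and only meant to show the idea; a commercial proxy network will usually handle rotation for you.

```python
import itertools
import urllib.request


class ProxyRotator:
    """Cycle through a pool of proxy URLs, one per request."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = itertools.cycle(proxies)

    def next_proxy(self):
        # Round-robin: each call returns the next proxy in the pool.
        return next(self._pool)

    def build_opener(self):
        # Route both HTTP and HTTPS traffic through the selected proxy.
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)


# Hypothetical proxy addresses, for illustration only.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
])
```

Calling `rotator.build_opener().open(url)` for each page would then send consecutive requests through different proxies.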

Honey traps
Website owners can use honey traps to catch scrapers: a link that is invisible to human visitors, and so would not be followed by a real user, reveals the IP of any scraper that requests it, and that IP is then blocked. Scrapers need to be written with this in mind.
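One simple precaution is to skip links that are obviously hidden. The sketch below, using only Python's standard library, ignores anchors that carry a `hidden` attribute or an inline style that hides them. It is a simplification: it does not look at external stylesheets or hidden parent elements, so treat it as a starting point rather than a complete defense.

```python
from html.parser import HTMLParser

# Inline-style fragments that commonly hide an element from human visitors.
HIDDEN_STYLE_HINTS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr_map or any(h in style for h in HIDDEN_STYLE_HINTS)
        if not hidden:
            self.links.append(href)


def visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```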

Slow website loading time
Some websites have slow loading times, or throttle traffic from certain geographies or IPs, or when repeated requests are detected. Scrapers must be able to deal with this through proper exception handling, time-spaced repeat attempts and proxy use.
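The "time-spaced repeat attempts" part can be sketched as a retry loop with exponential backoff. In this hypothetical helper, `fetch` is any function you supply that downloads a URL and raises `OSError` (which covers timeouts and `urllib` network errors) on failure; the attempt count and delays are arbitrary defaults.

```python
import time


def fetch_with_retries(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the function can be exercised in tests without actually waiting.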

Dynamic content
Most if not all websites nowadays rely heavily on JavaScript to implement all sorts of UI interactions or render data. With the wide adoption of client-side frameworks like React, Angular and Vue.js, scrapers need to be able to execute JavaScript to get to the content they’re after. This means your scrapers should be able to render content in headless browsers via libraries like Puppeteer.

Cookies and Sessions
Some websites require the visitor to log in to access information. Even once login credentials are provided, these websites also require authentication cookies to be present on all requests, and the requests to originate from the same IP. Scrapers must therefore support cookies and sticky proxy sessions, i.e. be able to tunnel requests via the same IP when using a rotating proxy.
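Here is one way this could look in Python, under the assumption that you manage your own proxy pool: hash a session identifier to pick the same proxy every time, and pair it with a cookie jar. The function names and proxy addresses are hypothetical, and proxy providers typically offer their own sticky-session mechanism instead.

```python
import hashlib
import http.cookiejar
import urllib.request


def sticky_proxy(session_id, proxies):
    """Map a session id to the same proxy every time it is asked."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return proxies[int.from_bytes(digest[:4], "big") % len(proxies)]


def build_session_opener(session_id, proxies):
    """Return an opener that keeps cookies and always exits via one proxy."""
    proxy = sticky_proxy(session_id, proxies)
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPCookieProcessor(jar),  # stores and resends cookies
    )
    return opener, jar
```

Every request made through the returned opener carries the cookies set by earlier responses and leaves through the same proxy for as long as the pool is unchanged.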

Captchas/Anti-bots
CAPTCHAs distinguish humans from robots. For verification, visitors are given logic puzzles or asked to type in characters, tasks that people complete quickly but machines struggle with. Several CAPTCHA solvers can now be integrated into scrapers to keep data collection running, at the expense of a slight slowdown.

Website Layout Changes
Website layout changes can disrupt the web scraping process and hinder the scraper's ability to access the information it needs. It is therefore imperative to build website change detection into your web crawlers to deal with sudden alterations in website layout.
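A crude form of change detection is to fingerprint a page's structure and compare it across runs. The sketch below hashes the sequence of tag names and `class` attributes while ignoring text, so a price update leaves the fingerprint untouched but a redesign changes it. This is just one possible heuristic, and `layout_fingerprint` is a made-up name.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Record tag name plus class attribute; ignore text and other attributes.
        cls = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{cls}")


def layout_fingerprint(html):
    """Return a hash that changes when the page's tag/class structure changes."""
    collector = _TagCollector()
    collector.feed(html)
    return hashlib.sha256("|".join(collector.tags).encode("utf-8")).hexdigest()
```

Storing the fingerprint from the last successful run and alerting when it differs gives an early warning before extraction rules start silently returning nothing.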

Scraping Tools

General purpose Scraping APIs
These tools sit at the core of virtually every successful scraping project. They ensure that website content is retrieved in its entirety as if accessed by a human user and in the shortest time possible. From executing JavaScript to scrolling down automatically to render a page's full content, or using residential proxies to circumvent blocks, these APIs let developers focus on the data extraction process rather than the HTML fetching part of scraper development. These include products such as ScraperAPI, ScrapingBee and Ujeebu Scrape.

Rule-based Scraping
Rule-based extraction consists of pulling data from a page whose HTML structure you already know. Most general purpose scraping APIs can be leveraged to do this since they come with a built-in rule engine which lets developers target specific bits of info inside a page. Tools that offer this include Apify, Browse.ai and Ujeebu.
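To make the idea concrete, here is a toy rule engine in Python. Each rule maps a field name to a hypothetical `(tag, class)` pair, and the extractor returns the text of the first matching element. It is deliberately simplistic (it does not handle a matched element that nests another element with the same tag) and is not how any of the products above are implemented.

```python
from html.parser import HTMLParser


class RuleExtractor(HTMLParser):
    """Extract the text of the first element matching each (tag, class) rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules        # e.g. {"title": ("h1", "product-title")}
        self.result = {}
        self._active = None       # field currently being captured, if any
        self._active_tag = None

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            if field not in self.result and tag == rule_tag and rule_class in classes:
                self._active, self._active_tag = field, tag
                self.result[field] = ""
                break

    def handle_data(self, data):
        if self._active is not None:
            self.result[self._active] += data

    def handle_endtag(self, tag):
        if self._active is not None and tag == self._active_tag:
            self.result[self._active] = self.result[self._active].strip()
            self._active = None


def extract(html, rules):
    parser = RuleExtractor(rules)
    parser.feed(html)
    return parser.result
```

For example, `extract(page_html, {"title": ("h1", "product-title"), "price": ("span", "price")})` returns a dictionary with one entry per rule that matched.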

Layout and Content Agnostic Scraping APIs
Layout and content agnostic scraping is the process of scraping content from websites without prior knowledge of their layout, HTML coding conventions or even content type. This relatively new breed of scrapers uses machine learning and sometimes computer vision techniques to detect and extract content without being provided any parsing rules. They don’t perform as well as rule-based scrapers but they provide very good results most of the time, and save considerable amounts of time, especially when scraping hundreds or thousands of sites with different layouts and little or no use of semantic tags. Some of these tools include Zyte Automatic Extraction API, Diffbot Extract and our very own Ujeebu Extract.

Scraping Browser Extensions
Some web scraper tools are conveniently available as browser extensions to allow users to scrape the web with a simple login and a few clicks. Some of these include Web Scraper and Data Scraper. A quick search on the Chrome Web Store, for example, will bring up a handful. Please note that some of these also have paid services if you would like to run your scrapers in the cloud as opposed to an open browser window.

Open Source Web Scraping Tools For Developers
When faced with a scraping project, developers can choose from a multitude of open source options. What follows is a list of hand-picked tools:

How To Scrape Legally and Ethically?

While scraping is generally legal, it has ethical and legal ramifications that developers should not ignore. Recent history is full of examples of legal cases contesting the scraping of popular websites. It is therefore paramount to adhere to the boundaries of ethical web scraping to avoid issues. It is strongly recommended to:

Follow the instructions in the scraped website’s robots.txt
Abide by the website's terms and conditions
Ask for permission from the website's owner when doing large scale scraping
Check for copyright violations: ensure that you do not reuse or republish the scraped data without verifying the website's license or having explicit permission from the data owner
Don't be greedy; only get the content you need.
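Checking robots.txt can be automated. As a small sketch, Python's standard `urllib.robotparser` module can parse the file and answer whether a given URL may be fetched. The robots.txt content and the `my-scraper` user agent below are made up for the example; a real scraper would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a queryable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Ask before fetching: allowed paths return True, disallowed ones False.
allowed = rules.can_fetch("my-scraper", "https://example.com/products/1")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/data")
```

`rules.crawl_delay("my-scraper")` additionally reports the delay the site asks for between requests, when one is specified.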

Despite being frowned upon by website owners who implement all sorts of mechanisms to protect their content against it, scraping is an essential part of the web ecosystem. After all, were it not for web scraping we wouldn’t have search engines in their current form.

When done correctly and ethically, scraping contributes positively to the state of the Web as an open platform for information exchange.

Conclusion
Scraping data from the Web comes with several challenges. Ujeebu Scrape makes it less of a pain by handling all of these challenges so you can focus on the aspects of your project that matter the most.

Try us out. We handle millions of scraping requests every day and have been doing this for more than 5 years for clients around the world. The first 5000 credits are on us. No credit card required.

This article was first published on ujeebu.com

On The Legality of Web Scraping

Sam Selman — Wed, 29 Nov 2023 15:58:37 +0000

The issues of legality and ethics surrounding web scraping are a massive grey area. While some may be in favor of web scraping, others might not share the same enthusiasm. This is what makes the subject so controversial.

Those in favor argue that web data has the potential to make the world better and that scraping is critical for data analysis and management done right. Critics, on the other hand, object that web scraping gives an unfair advantage to scrapers.

The fact is that web scraping isn't bad as long as it's done properly. It can be beneficial for research purposes whether you want to promote your business or excel at academic projects.

In this post, we'll talk about which types of web scraping may be illegal, and the ruling of different authorities on its legality.

What Types Of Data Are Illegal To Scrape?

Unfortunately, many users are unaware that the final use case of the data has a significant influence on whether scraping is legal. The scraping of a website may be perfectly legal in some cases, but what you intend to do with the information makes it illegal in others.

There are two main types of data we must be concerned about:

Personal Data: Data that can be used directly or indirectly to identify an individual is personal data or personally identifiable information (PII). This includes medical or health records, bank information, date of birth, address, email, and name.

Copyrighted Data: This type of data is owned by businesses or people who have precise control over how it can be copied or captured. This is the same as using copyrighted images and songs. If you take the owner's data without permission, you could be breaking the law. Examples include articles and blogs, pictures, videos, music, and other creative property.

Web Scraping In The Eyes Of The Law

Before you start web scraping, reflect on how far you can go to extract the data you need.

Currently, no legislation addresses web scraping directly, but several legal frameworks and broad principles have been applied in court over the use of scraped web data.

These court cases address illegal access to web data, copyright issues, trade secrets, and breach of contract issues.

Researchers and marketers must be aware of the possible ethical consequences of web scraping.

EU Laws

The GDPR's jurisdiction covers the entire European Economic Area (EEA). The GDPR has rules about protecting PII when data controllers collect it and then pass it on to data processors.

The GDPR requires that if there is a data breach, consumers and data protection authorities must be told about it. If a company collects the PII of an EEA resident, it must follow the GDPR, no matter where it is in the world. There's no way around it.

The lawful bases for processing personal data under Article 6 of the GDPR, as they apply to web scraping, include:

Consent: You are good to go if you have the consent of the people whose personal data you are scraping
Contract: This is when you are required by contract to scrape and process a website's data
Legal obligation: If scraping and processing web data help you fulfill a legal obligation, go ahead
Vital interests: If your scraping efforts can save lives, there is no doubt about their legality
Public tasks: It is perfectly legal when scraping is in the public interest or helps you do your duties as an official
Legitimate interest: As long as your web scraping doesn't override the rights or interests of people, you can argue that it is in your legitimate interest

US Laws

While the U.S. doesn't have any one set of federal privacy laws, it has a vast net of various state and sector-specific laws. That makes web scraping legality murky waters to navigate.

Examples include the California Consumer Privacy Act (CCPA), a state law, and the Computer Fraud and Abuse Act (CFAA), a federal one. Moreover, the Health Insurance Portability and Accountability Act (HIPAA) and the Gramm-Leach-Bliley Act of 1999 (GLBA) are consumer-oriented federal laws.

CCPA: This is a state-wide data privacy law that regulates how businesses all over the country handle the personal information of California residents. This was the pioneering data privacy law of the country
CFAA: It prohibits accessing a computer without authorization or in excess of authorization, and has been applied to data scraping cases, often by analogy with real property norms such as trespass
HIPAA: This is a health insurance and accountability act that has set guidelines regarding patient privacy. A violation of these guidelines could result in federal prosecution
GLBA: This protects consumers' private information. To be GLBA compliant, financial firms need to inform customers of their right to opt out if they don't want their personal information to be shared

The CFAA and similar state laws are the leading legal basis for claims in web scraping disputes. Under the CFAA, access to a website can become unauthorized once the website owner sends a cease and desist letter to the party crawling or scraping it. This is what happened in Craigslist Inc. v. 3Taps Inc. in 2013 and Facebook, Inc. v. Power Ventures, Inc. in 2016. 3Taps is a firm committed to collecting and distributing public data, and it partnered with PadMapper. Craigslist sent 3Taps a cease and desist letter in response to PadMapper using its listings. After the data distribution startup refused to comply, Craigslist filed a complaint with the U.S. District Court for the Northern District of California.

However, the letter alone may not be enough to hold the web scraper responsible under the CFAA in some cases, such as Ticketmaster LLC v. Prestige Entertainment, Inc. in 2018. Ticketmaster took Prestige Entertainment to court over alleged violations of the CFAA and related state laws; however, the defendants were able to counter the claims by pointing out that Prestige had acquired tickets through the Ticketmaster website, something that's permitted in its Terms of Use.

Comparing U.S., E.U., and Latin American Laws

It's a little challenging to compare E.U. and U.S. laws.

Both let people choose not to have their data processed. They can also delete their information or look at it.

In Europe, data protection laws are part of the GDPR, but there has never been a federal user privacy law in the U.S. Each state has tried to fill in the gap as they see fit. The CCPA is an example of this, but other states haven't shown the same amount of resolve. Another difference is that the CCPA requires privacy policies on all websites, whereas the GDPR needs clear and specific user consent.

Data privacy is becoming more of an issue not only in the U.S. and Europe but also in Latin America. In fact, Brazil is leading the way with a new data privacy law that consolidates over 40 different regulations. The Lei Geral de Proteção de Dados (LGPD) took effect in 2020 and puts significant compliance obligations on companies that process data.

How Can You Keep Your Scrapers Ethical?

Don't just pay lip service to ethical web scraping but make it an integral part of your data harvesting efforts.

The only mantra of ethical web scraping is: do no harm.

You have a lot of power as a web scraper because you'll likely come across loads of private user data and personal information of a website's users. That's why it is vital to have a moral code to guide your scraping efforts.

First off, make sure that you have a strict policy about not profiting off private data. Here's what you need to do next:

Use APIs

Some websites offer built-in APIs for accessing their data. Make sure you use them and follow the rules. You could also use a dedicated web scraping API, like the one from Ujeebu.

The website's robots.txt file, which follows the Robots Exclusion Standard, tells you which parts of the site your web-crawling software is and isn't allowed to visit.

Read The Terms And Conditions

This is where you find the rules for using and scraping data from a website. Sure, you could always click 'I agree' without reading and do what you want to do. But it is essential to understand that the terms and conditions are there for a reason. So take your time to figure out how they affect you and what you are trying to do.

Be Kind

Scraping is harsh on web servers. So make sure you begin when there is little to no traffic on the website and be gentle when gathering data. Also, space out the requests so it doesn't look like you are trying to DDoS the servers.
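Spacing out requests can be as simple as enforcing a minimum interval between them. The following is a minimal sketch of such a throttle; the `Throttle` class is hypothetical, and the right interval depends on the site (its robots.txt may suggest one).

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before each request, with for instance `throttle = Throttle(2.0)`, keeps requests at least two seconds apart.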

Say Hi

The website admin will likely notice some unusual traffic when you start scraping. It'd be good to introduce yourself, tell them what you plan to do, and leave your contact info.

In fact, go a step further and courteously ask for permission. This will not only make you look like a nice person but also relieve some of the legal burdens. Besides, the data really doesn't belong to you, so it'd be the right thing to do.
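A lightweight way to introduce yourself is to send a descriptive User-Agent header with every request, so the admin can see who is crawling and how to reach you. The bot name, URL and email address in this sketch are placeholders to be replaced with your own.

```python
import urllib.request

# Hypothetical bot name and contact details; replace with your own.
USER_AGENT = "example-research-bot/1.0 (+https://example.com/bot; bot-admin@example.com)"


def build_request(url):
    # Attach an identifying User-Agent so admins know who is crawling and how to reach them.
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Passing the result to `urllib.request.urlopen` sends the request with that header attached.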

The Bottom Line: Practice Ethical Scraping

The issue of legality boils down to what you scrape and how you go about it. Before embarking on your web scraping mission, be sure to give yourself a little ethics check. Ask yourself if you're about to scrape personal data, copyrighted data, or data that usually sits behind a login.

It only takes good manners and a bit of due diligence to keep your web scraping efforts within ethical and legal confines.

Happy scraping!

This article first appeared here: https://ujeebu.com/blog/is-web-scraping-legal/