Web scraping is a powerful tool you can leverage for a variety of use cases, from aggregating data to generating sales leads or training a machine learning model.
Implementing it properly can be very challenging, though. You have to make sure you get the correct data efficiently and reliably, and you don't break any laws in the process.
In this article, we'll look at 17 best practices covering the following aspects of web scraping:
- Data extraction
- Avoiding getting blocked
- Session and data management
- Legality and ethics
Data extraction
First, let's discuss a few ways to make sure you extract the correct data as efficiently as possible.
1. Check if they provide an API
An API (Application Programming Interface) offers a way for applications to communicate with each other. For scraping, this means you get the data in a structured format (usually JSON for modern services) instead of HTML, which also considerably decreases the number of requests you have to make.
Ideally, they publish an official API. You will usually need to get an API key for access, but it will ensure that:
- you get relevant data in a structured format
- you're legally allowed to get the data
If they don't publish an API, look for whether the page you're trying to scrape loads its data (or part of it) using a "hidden" API. You can find this out if you open DevTools, go to the Network tab, refresh the page, then look for XHR requests that are fired after the page loads.
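For example, if you spot a request like the one below under XHR, you can often call the endpoint directly and skip HTML parsing entirely. Here's a minimal sketch in Node.js (18+ ships a global fetch); the URL, query parameters, and response shape are made up for illustration.

```javascript
// Hypothetical "hidden" JSON endpoint spotted in the Network tab.
// Replace the URL and query parameters with the ones you actually observe.
const url = 'https://example.com/api/products?page=1&pageSize=50';

(async () => {
  const response = await fetch(url, {
    headers: {
      // Some endpoints check these headers, so mirror what the browser sends.
      Accept: 'application/json',
      'X-Requested-With': 'XMLHttpRequest',
    },
  });

  // The response shape depends on the site; inspect it in DevTools first.
  const data = await response.json();
  console.log(data);
})();
```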
2. Prefer CSS selectors over XPath
All right, so they don't provide APIs. This means you need to parse the page to extract data from it.
Page elements can be identified in several ways, the two most common being CSS selectors and XPath.
In general, you should prefer CSS selectors: they are more concise, more readable, and parsed faster by modern browsers and libraries. Their main drawback is that you can't navigate upwards in the DOM tree (from child to parent); XPath can do that.
3. Come up with robust CSS selectors
Craft your CSS selectors in such a way that they point to correct data across all pages.
If you're extracting a single element, see if it has an ID, a unique class, or a unique attribute across the page. Often, there are standard ways to express common page elements, such as a[rel="next"] for the next-page link in the pagination controls.
If you're extracting multiple similar elements, find the common CSS class or attribute value among them.
Try to avoid using pseudo-classes like :nth-child() or :nth-of-type() because they don't always point to the correct elements across pages, so you might get wrong data (which is worse than getting no data).
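To make this concrete, here's a sketch using cheerio, a popular static HTML parser for Node.js; the markup and selectors are invented for the example.

```javascript
const cheerio = require('cheerio');

const html = `
  <ul class="products">
    <li><span class="price">$10</span></li>
    <li class="ad-slot">Sponsored</li>
    <li><span class="price">$12</span></li>
  </ul>`;

const $ = cheerio.load(html);

// Fragile: breaks as soon as an ad or banner shifts element positions.
const fragile = $('.products li:nth-child(3) .price').text();

// More robust: targets the class shared by exactly the elements you want.
const prices = $('.products .price')
  .map((_, el) => $(el).text())
  .get();

console.log(fragile, prices); // "$12" [ "$10", "$12" ]
```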
Of course, you should always test your selectors by scraping several pages and checking that you consistently get the correct data.
4. Don't use headless browsers unless you need to
Some sites fetch part of the data asynchronously after the page is loaded. This will be a problem for static HTML parsers simply because the data is not there initially.
In these cases, you can use libraries such as Puppeteer or Selenium that drive headless browsers. They render JavaScript and let you wait until the page reaches an idle state (no more pending network requests).
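With Puppeteer, for example, you can pass the networkidle0 option to page.goto(), which resolves once there have been no network connections for about half a second; the URL below is a placeholder.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // networkidle0 waits until there have been no network requests for 500 ms,
  // so asynchronously loaded content should be in the DOM by then.
  await page.goto('https://example.com/listings', { waitUntil: 'networkidle0' });

  const html = await page.content();
  console.log(html.length);

  await browser.close();
})();
```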
Another case when you need to use them is if the site employs anti-scraping measures such as Cloudflare or PerimeterX that detect static parsers and will block the requests by serving CAPTCHAs or outright returning 403 (Forbidden).
The problem with headless browsers is performance. They are considerably slower than static parsers and consume orders of magnitude more memory.
So always go with static parsing first and see if you get all data you need.
Avoiding getting blocked
Modern sites use anti-bot measures such as CAPTCHAs and honeypots to prevent you from scraping them. This makes perfect sense because, if abused, a scraper can be detrimental to the site's performance or even to the business itself (if it steals and republishes content, for instance).
These practices will decrease the likelihood of getting blocked, though they won't eliminate it completely.
5. Use real user agents and rotate them
The User-Agent header is used for transmitting information about the browser and the operating system that issued the request.
Passing fake or incomplete values for this header will quickly result in the site detecting you as a bot, so you should always use real values. Use libraries such as random-useragent to generate them.
You should also rotate them regularly to reduce the likelihood of detection.
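Here's a minimal sketch with the random-useragent package mentioned above; applying the value via Puppeteer's page.setUserAgent() is one common option.

```javascript
const randomUseragent = require('random-useragent');

// Pick a real, randomly chosen user agent string for each new session.
const userAgent = randomUseragent.getRandom();
console.log(userAgent);

// With Puppeteer, you would apply it before navigating:
// await page.setUserAgent(userAgent);
```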
6. Use quality proxies and rotate them
Most serious scraping projects will require you to rely on proxies. They will help you bypass rate limits and avoid getting your IP banned. They also give you access to geo-targeted content.
For this, choosing a good proxy provider is crucial. I recommend BrightData. They are one of the leading providers with a huge pool of both datacenter and residential proxies, they rotate them for you out of the box, and they offer powerful configuration options.
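As a sketch, here's one way to route a Puppeteer session through a proxy endpoint; the host, port, and credentials are placeholders you'd replace with the ones from your provider.

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Placeholder endpoint; your provider gives you the real host and port.
    args: ['--proxy-server=http://proxy.example.com:8000'],
  });

  const page = await browser.newPage();

  // Authenticate if the proxy requires username/password credentials.
  await page.authenticate({ username: 'PROXY_USER', password: 'PROXY_PASS' });

  await page.goto('https://example.com');
  await browser.close();
})();
```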
7. Randomize intervals between requests
Sites can also detect that you're a bot if you use the same interval between consecutive requests since that's not really human behavior.
Therefore, make sure you mix things up by using a random delay between requests. Start with something between 2 and 5 seconds, see if you get blocked, and use longer delays if you do.
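A small jittered-delay helper is enough for this; in the sketch below, the URL list and the 2-5 second range are placeholders you'd adapt to your own crawl.

```javascript
// Sleep for a random duration between minMs and maxMs.
const randomDelay = (minMs, maxMs) =>
  new Promise((resolve) =>
    setTimeout(resolve, minMs + Math.random() * (maxMs - minMs))
  );

(async () => {
  // Placeholder URL list; in practice this comes from your crawl queue.
  const urls = ['https://example.com/page/1', 'https://example.com/page/2'];

  for (const url of urls) {
    console.log('scraping', url); // call your scraping function here
    await randomDelay(2000, 5000); // wait 2-5 seconds between requests
  }
})();
```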
8. Leverage evasion techniques
Browsers give away a lot of unique cues to websites: the browser's type, the operating system, the timezone, the screen resolution, whether an ad blocker is used, and many other things. This is called fingerprinting, and it's very reliable: sites can identify you with 90-99% accuracy.
There are popular libraries and plugins such as puppeteer-extra-plugin-stealth that evade many of the common fingerprinting techniques, so I definitely recommend using them.
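If you're already using Puppeteer, enabling the stealth plugin is only a few lines; the snippet below sketches the usual setup with a placeholder URL.

```javascript
// puppeteer-extra is a drop-in wrapper around Puppeteer that supports plugins.
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// The stealth plugin patches many common fingerprinting leaks
// (navigator.webdriver, missing plugins, WebGL vendor strings, etc.).
puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```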
9. Use CAPTCHA solvers
With the practices above, you can reduce the chance of running into CAPTCHAs, but you cannot avoid them completely.
If a CAPTCHA is served, the only way to get around it automatically is to leverage CAPTCHA-solving services, such as 2Captcha. These are paid because they delegate the solving to humans, whom they compensate for it. Even though artificial intelligence is constantly evolving, there are some types of images computer vision just cannot recognize yet.
There is a nice library called puppeteer-extra-plugin-recaptcha that integrates Puppeteer with these services and solves CAPTCHAs automatically for you. Of course, you need to be subscribed to the service and configure the integration with your API key.
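The setup looks roughly like the sketch below; the token and URL are placeholders, and the exact options are best checked against the plugin's documentation.

```javascript
const puppeteer = require('puppeteer-extra');
const RecaptchaPlugin = require('puppeteer-extra-plugin-recaptcha');

puppeteer.use(
  RecaptchaPlugin({
    // Placeholder token; use the API key from your 2Captcha account.
    provider: { id: '2captcha', token: 'YOUR_2CAPTCHA_API_KEY' },
  })
);

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/page-with-captcha');

  // Finds and solves any reCAPTCHAs on the page via the configured provider.
  await page.solveRecaptchas();

  await browser.close();
})();
```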
Session and data management
Next, let's look at some practices for managing scraping sessions and the data they produce. Most modern scraping tools already implement them.
10. Persist often to avoid data loss
Scraping thousands of pages only to lose everything because the scraper failed with an error and you hadn't saved anything along the way would be a bummer.
So make sure you save the data often, either in a file (which you can open in append mode and periodically write into it) or in a relational database such as MySQL.
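For the file-based approach, newline-delimited JSON written in append mode works well; here's a minimal sketch with a placeholder file name and record.

```javascript
const fs = require('fs');

// Open the output file in append mode so each batch is added to the end.
const out = fs.createWriteStream('results.ndjson', { flags: 'a' });

// Write one JSON record per line; a crash only loses the current batch.
function saveBatch(records) {
  for (const record of records) {
    out.write(JSON.stringify(record) + '\n');
  }
}

saveBatch([{ url: 'https://example.com/item/1', title: 'Example item' }]);
```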
DataGrab saves data in small batches to a MongoDB database, a document store ideal for persisting arbitrarily nested data structures.
11. Implement a robust logging system
Say you scraped 100 pages and 20 of them failed. How do you find out what went wrong? Some URLs might have been broken links, or you might have been blocked.
This is where detailed logs can help. At a minimum, you (or the tool you use) should log the URL, status code, and error message for each failed request, along with the timestamp of when it was performed.
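A minimal version of such a log, with one JSON entry per failed request, might look like this sketch (the file name and example values are placeholders):

```javascript
const fs = require('fs');

// Append one JSON entry per failed request so earlier entries survive crashes.
function logFailure(url, statusCode, errorMessage) {
  const entry = {
    timestamp: new Date().toISOString(),
    url,
    statusCode,
    error: errorMessage,
  };
  fs.appendFileSync('failures.log', JSON.stringify(entry) + '\n');
}

logFailure('https://example.com/page/17', 403, 'Blocked: access denied');
```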
12. Use the right data format
For exporting scraped data, make sure you use a format that is suitable based on the structure of the data.
For exporting tabular data, I recommend using CSV. It's one of the most widely used formats with many tools at your disposal for importing, visualizing, and even editing the data.
If your data resembles a tree structure (you scraped social media posts with all of their comments, for instance), JSON is more suitable. It allows expressing nested data in a compact way. It is what API endpoints most often use as well.
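For example, a scraped post with its comments maps naturally onto nested JSON like this (the fields are purely illustrative):

```json
{
  "postId": "abc123",
  "author": "jane_doe",
  "text": "Post body...",
  "comments": [
    { "author": "john", "text": "First comment", "replies": [] },
    {
      "author": "anna",
      "text": "Second comment",
      "replies": [{ "author": "jane_doe", "text": "A reply" }]
    }
  ]
}
```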
Legality and Ethics
Finally, let's cover the legal and ethical side of things. Following these practices will ensure that you won't fall to the dark side of the Force and wake up with a Cease & Desist letter or with a lawsuit.
Disclaimer: I am not a lawyer. These practices give you a simplified view of what's ethical and legal in most jurisdictions. For a detailed analysis of your case, always consult a licensed lawyer.
13. Don't harm the website
One of the worst things you could do is to flood the site with so many concurrent requests that it can't serve legitimate users browsing it. This is called a Denial-of-Service attack, and it's a crime punishable by law.
You can avoid it in three ways:
- Scrape slowly. Do a bit of research to estimate how much traffic the site can handle. Big sites don't break a sweat for 1000 requests/sec, but the same rate can seriously burden a small server.
- Avoid peak hours. Find out their peak hours and avoid scraping at those times. Look up the server's timezone and make an educated guess. Look for users complaining about degraded performance in certain time intervals on forums or social media.
- Divide your scraping into multiple sessions. You don't have to get everything in one run, you know. Partition your scraping sessions and leave a few hours between them.
14. Respect the site's Terms of Service
Many site owners got tired of people scraping their sites, so they protect the data by requiring you to sign up and accept their Terms of Service before getting access. If the ToS contains a clause that forbids using automated tools to extract data, the site is off-limits.
This is important because the Terms of Service document is enforceable in a court of law, provided that the site clearly displays it in the sign-up process and the user has to explicitly accept it.
15. Don't plagiarize content
Stealing data and republishing it in its original form is copyright infringement. Definitely avoid that. Always consider whether your project could be detrimental to the business of the site you're scraping.
Data can be transformed in many innovative ways. Use it to gather insights, calculate statistics, or train an ML model.
16. Don't scrape personal information
Another kind of data you should avoid scraping is PII (Personally Identifiable Information), which is basically any data by which you can infer the identity of an individual. This can be the person's name, address, financial information, login ID, or even video footage.
There are regulations such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act) that require organizations to ask for explicit consent before storing personal information and also to ensure it is kept securely.
17. Adhere to Robots.txt
Robots.txt is a text file that tells crawlers which pages they are allowed to crawl and which ones are off-limits. It is placed in the root directory of the site and follows a special syntax. For example, the following configuration allows crawling any HTML page but disallows crawling anything in the /private directory:
```
User-agent: *
Allow: /*.html$
Disallow: /private/*
```
When scraping, you should always respect the Robots.txt of the site. Disallowed sections are usually internal pages that don't have public data anyway.
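You can also check this programmatically before queueing a URL; here's a sketch using the robots-parser package, with the example rules from above and placeholder URLs.

```javascript
const robotsParser = require('robots-parser');

// In practice you'd fetch https://example.com/robots.txt first;
// here its contents are inlined for brevity.
const robotsTxt = `
User-agent: *
Allow: /*.html$
Disallow: /private/*
`;

const robots = robotsParser('https://example.com/robots.txt', robotsTxt);

console.log(robots.isAllowed('https://example.com/article.html', 'MyScraper'));      // true
console.log(robots.isAllowed('https://example.com/private/data.json', 'MyScraper')); // false
```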
Conclusion
Web scraping can be challenging, but luckily, there are some standard practices you can follow to ensure that you get the correct data in an efficient and ethical manner.
Thanks for reading this article! :)
What other practices or techniques would you add?