Dmitry Narizhnyhkh

Posted on Jun 8, 2020 • Originally published at blog.dataflowkit.com on Jun 7, 2020

What is a present-day web scraper in 2020?

#webscraping #tutorial #discuss #datascience

Many web sites like Twitter, YouTube, or Facebook provide an easy way to access their data through a public API. All the information that you obtained using API is both well structured and normalized. For example, it can be in the format of JSON, CSV, or XML.

3 ways to extract data from any website.

#1 Official API.

First of all, you should always check out if there's an official API that you can use to get the desired data.

Sometimes the official API is not updated accurately, or some of the data are missing from it.

#2 "Hidden API".

The backend might generate data in JSON or XML format, consumed by the frontend.

Investigating XMLHttpRequest (XHR) with a web browser inspector gives us another way to access the data. It would provide us the data in the same way as an official API would do it.

How to get this data? Let's hunt for API endpoint!

For example, let's look at https://covid-19.dataflowkit.com/ resource showing local COVID-19 cases for website visitors.

Call Chrome DevTools by pressing Ctrl+Shift+I
Once the console appears, go to the "Network" tab.
Let's select the XHR filter to catch an API endpoint as the "XHR" request if it is available."
Make sure the "recording" button is enabled.
Refresh the webpage.
Click Stop "recording" when you see the data related content has already appeared on the webpage.

Now you can see a list of requests on the left. Investigate them. The preview tab shows an array of values for the item named "v1."

Press the "Headers" tab to see details of the request. The most important thing for us is the URL. Request URL for "v1" is https://covid-19.dataflowkit.com/v1.

Now, let's just open that URL as another browser tab to see what happens.

Cool! That's what we're looking for.

Taking data either directly from an API or using the technique described above, is the easiest way to download datasets from websites. Of course, theses approaches are not going to be effective for all the website, and that is why the web scraping libraries are still necessary.

Web data extraction or web scraping is the only way to get desired data if owners of a web site don't grant access to their users through API. Web Scraping is the data extraction technique that substitutes manual repetitive typing or copy-pasting.

#3 Website scraping.

Know the rules!

What should you check before scraping a website?

What is a present-day web scraper in 2020? — Photo by Adam Sherez / Unsplash

☑️ Robots.txt is the first thing to check when you plan to scrape website data. Robots.txt file lists the rules on how you or a bot should interact with them. You should always respect and follow all the rules listed in robots.txt.

☑️ Make sure you also look at a site's Terms of use. If terms of use provision do not say that it limits access to bots and spiders and does not prohibit rapid requests of the server, crawling is fine.

☑️ To be compliant with the new EU General Data Protection Regulation, or GDPR , you should first evaluate your web scrapping project.

If you don't scrape personal data, then GDPR does not apply. In this case, you can skip this section and move to the next step.

☑️ Be careful about how you use the extracted data as you may violate the copyrights sometimes. If the terms of use do not provide a limitation on a particular use of the data, anything goes so long as the crawler does not violate copyright.

Find more information: Is web scraping legal or not?

Sitemaps

Typical websites have sitemap files containing a list of links belong to this web site. They help to make it easier for search engines to crawl web sites and index their pages. Getting URLs from sitemaps to crawl is always much faster than gathering it sequentially with a web scraper.

Render JavaScript-driven web sites

JavaScript Frameworks like Angular, React, Vue.js used widely for building modern web applications. In short, a typical web application frontend consists of HTML + JS code + CSS Styles. Usually, source HTML initially does not contain all the actual content. During a web page download, HTML DOM elements are loaded dynamically along with rendering JavaScript code. As a result, we get rendered static HTML.

☑️ You can use Selenium for website scraping, but it is not a good idea. Many tutorials are teaching how to use Selenium for scraping data from websites. Their home page clearly states that Selenium is " for automating web applications for testing purposes ."

☑️ PhantomJS was suitable to take care of such tasks earlier, but since 2018 its development has been suspended.

☑️ Alternatively, Scrapinghub's Splash was an option for Python programmers before Headless Chrome.

Your browser is a website scraper by its nature. The best way nowadays is to use Headless Chrome as it renders web pages "natively."

Puppeteer Node library is the best choice for Javascript developers to control Chrome over DevTools Protocol.

Go developers have an option to choose from either chromedp or cdp to access Chrome via DevTools protocol.

Be smart. Don't let them block you.

Some web sites use anti-scraping techniques to prevent web scrapper tools from harvesting online data. Web scraping is always a "cat and mouse" game. So when building a web scraper, consider the following ways to avoid getting blocked. Or you risk not receiving the desired results.

Tip #1: Make random delays between requests.

When a human visits a web site, the speed of accessing different pages is in times less compared to a web crawler's one. Web scraper, on the opposite, can extract several pages simultaneously in no time. Huge traffic coming to the site in a short period on time looks suspicious.

You should find out the ideal crawling speed that is individual for each website. To mimic human user behavior, you can add random delays between requests.

Don't create excessive load for the site. Be polite to the site that you extract data from so that you can keep scraping it without getting blocked.

Tip #2: Change User-agents.

When a browser connects to a web site, it passes the User-Agent (UA) string in the HTTP header. This field identifies the browser, its version number, and a host operating system.

A typical user agent string looks like this: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36".

If multiple requests to the same domain consist of the same user-agent, the web site can detect and block you very soon.
Some websites block specific requests if they contain User-Agent that differ from a general browser.
If "user-agent" value is missed, many websites won't allow accessing their content.

What is the solution?

You have to build a list of user-agents and rotate them randomly.

Tip #3: Rotate IP addresses. Use Proxy servers.

If you send multiple requests from the same IP address during scraping, the website considers suspicious behavior and blocks you.

For the most straightforward cases, it is enough to use the cheapest Datacenter proxies. But some websites have advanced bot detection algorithms, so you have to use either residential or mobile proxies to scrape them.

For example, someone in Europe wants to extract data from a website with limited access to US users only. It is evident to make requests through a proxy server located in the USA since their traffic seems to be coming from the local to US IP address.

To obtain country-specific versions of target websites, just specify any arbitrary country in request parameters in Dataflow Kit fetch service.

Tip #4: Avoid scraping patterns. Imitate humans behavior.

Humans are not consistent while navigating a website. They do different random actions like clicks on the page and mouse movements.

In opposite, web scraping bots follow specified patterns when crawling a web site.

Teach your scraper to imitate human beings' behavior. This way, website bot detection algorithms don't have any reason to block you from automation your scraping tasks.

Tip #5: Keep eyes on anti-scraping tools.

One of the most frequently used tools for the detection of hacking or web scraping attempts is the "honey pot." The honey pots are not visible to the human eye but can be seen by bots or web scrapers. Right after your scraper clicks such a hidden link, the site blocks you quite easily.

Find out whether a link has the "display: none" or "visibility: hidden" CSS properties set if they do just stop following that link. Otherwise, a site immediately identifies you as a bot or scraper, fingerprints the properties of your requests, and bans you.

Tip #6: Solve online CAPTCHAs.

While scraping a website on a large scale, there is a chance to be blocked by a website. Then you start seeing captcha pages instead of web pages.

CAPTCHA is a test used by websites to battle back against bots and crawlers, asking website visitors to prove they're human before proceeding.

Many websites use reCAPTCHA from Google. The last version v3 of reCAPTCHA analyses human behavior and require them to tick "I'm not a robot" box.

CAPTCHA solving services use two methods for solving CAPTCHAs:

☑️ Human-based CAPTCHA Solving Services

When you send your CAPTCHA to such service, human workers solve a CAPTCHA and send it back.

☑️ OCR (Optical Character Recognition) Solutions

In this case, OCR technology is used to solve CAPTCHAs automatically.

Point-and-click visual selector.

Of course, we don't intend only to download and render JavaScript-driven web pages but to extract structured data from them.

Before starting of data extraction, let's specify patterns of data. Look at the sample screenshot taken from web store selling smartphones. We want to scrape the Image, Title of an item, and its Price.

Google chrome inspect tool does a great job of investigating the DOM structure of HTML web pages.

With the Chrome Inspect tool, you can easily find and copy either CSS Selector or XPath of specified DOM elements on the web page.

Usually, when scraping a web page, you have more than one similar block of data to extract. Often you crawl several pages during one scraping session.

Surely, you can use Chrome Inspector to build a payload for scraping. In some complex cases, it is only a way to investigate particular element properties on a web page.

Though modern online web scrapers, in most cases, offer a more comfortable way to specify patterns (CSS Selectors or XPath) for data scraping, set up pagination rules, and rules for processing detailed pages on its way.

Look at this video to find out how it works.

Manage your Data Storage strategy.

The most well-known simple data formats for storing structured data nowadays include CSV, Excel, JSON (Lines). Extracted data may be encoded to destination format right after parsing a web page. These formats are suitable for use as low sized volumes storages.

Crawling a few pages may be easy, but millions of pages require different approaches.

How to crawl several million pages and extract tens of million records?

What to do if the size of output data is from moderate to huge?

Choose the right format as output data.

Format #1. Comma Separated Values (CSV) format

CSV is the most simple human-readable data exchange format. Each line of the file is a data record. Each record consists of an identical list of fields separated by commas.

Here is a list of families represented as CSV data:

id,father,mother,children
1,Mark,Charlotte,1
2,John,Ann,3
3,Bob,Monika,2

CSV is limited to store two-dimensional, untyped data. There is no way to specify nested structures or types of values like the names of children in plain CSV.

Format #2. JSON

[
   {
      "id":1,
      "father":"Mark",
      "mother":"Charlotte",
      "children":[
         "Tom"
      ]
   },
   {
      "id":2,
      "father":"John",
      "mother":"Ann",
      "children":[
         "Jessika",
         "Antony",
         "Jack"
      ]
   },
   {
      "id":3,
      "father":"Bob",
      "mother":"Monika",
      "children":[
         "Jerry",
         "Karol"
      ]
   }
]

Representing nested structures in JSON files is easy, however.

Nowadays*, JavaScript Object Notation (JSON)* became a de-facto of data exchange format standard, replacing XML in most cases.

One of our projects consists of 3 Millions of parsed pages. As a result, the size of the final JSON is more than 700 Mb.

The problem arises when you have to deal with such sized JSONs. To insert or read a record from a JSON array, you need to parse the whole file every time, which is far from ideal.

Format #3. JSON Lines

Let's look into what JSON Lines format is, and how it compares to traditional JSON. It is already common in the industry to use JSON Lines. Logstash and Docker store logs as JSON Lines.

The same list of families expressed as a JSON Lines format looks like this:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

JSON Lines consists of several lines in which each line is a valid JSON object, separated by the newline character \n.

Since every entry in JSON Lines is a valid JSON, you can parse every line as a standalone JSON document. For example, you can seek within it, split a 10gb file into smaller files without parsing the entire thing. You can read as many lines as needed to get the same amount of records.

Summary

A good scraping platform should:

☑️ Fetch and extract data from web pages concurrently.

We use concurrency features of Golang, and found them fantastic;

☑️ Persist extracted blocks of scraped data in the central database regularly.

This way, you don't have to store much data in the RAM while scraping many pages. Besides, it is easy to export data to different formats several times later. We use MongoDB as our central storage.

☑️ Be web-based.

Online Website scraper is accessible anywhere from any device which can connect to the internet. Different operating systems aren't an issue anymore. It's all about the browser.

☑️ Be cloud-friendly.

It should provide a way to quickly scale up or down cloud capacity according to the current requirement of a web data extraction project.

Conclusion

In this post, I tried to explain how to scrape web pages in the year 2020. But before considering scraping, try to find out official API exists or hunt for some "hidden" API endpoints.

I would appreciate it if you could take a minute to tell me which one of the web scraping methods you use the most in 2020. Just leave me a comment below.

Happy scraping!

Click To Tweet

Top comments (1)

Chetan • May 15 '23

I encourage you to use Bose framework that solves many of scraping problems at omkar.cloud/bose/