Data Scraping and Data Crawling, what are they for?

Web Developer, Creative Thinker, italian expat, founder of @manoweb
Originally published at ma-no.org ・5 min read

We live in an era in which big data has become very important. At this very moment, data is being collected from millions of individual users and companies. In this tutorial we will briefly explain big data, then look in more detail at web crawling and web scraping in the business world.

Many of you will have heard about the importance of big data today. It is closely tied to the creation, collection and analysis of information on the web. What many of you may not know is that all companies today can take advantage of this data and make an economic profit from it.

Recent research has found that organisations that use data-based market research techniques are more successful: they outperform their competitors by 85% in sales growth and, in addition, obtain a 25% gross margin.

Revenue increases are certainly impressive, but long-term growth is also a critical factor in the success of a business. A profitable organisation can better cope with the future and with economic crises. By this account, web crawling and web scraping can bring in between 25 and 30% more profit per year.

Before we get to web crawling and web scraping, let's explain what big data is, which will make both techniques easier to understand.

 

Big data and data collection

 

The transition to the digital world is bringing about many changes in the way we work and in society. Thanks to applications, smartphones, PCs, other devices and websites, the amount of data we generate when we are connected to the Internet is increasing.

Big data can be defined as the capacity to process very large volumes of data with relative ease. The objective is to extract as much information as possible from this data.

It also covers the study of this data to look for patterns. It is a way of processing information to try to discover something useful in it. Working with big data, or macro data, involves the following steps:

  1. We capture and obtain the data.
  2. We order the data and separate it into smaller units, so that it is easier to analyse.
  3. We create an index of the data so that finding the information is quicker and easier.
  4. We store the data.
  5. We analyse the data using a large number of algorithms to find the data we are interested in.
  6. We visualise the results.
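
The six steps above can be sketched in a few lines of Python. This is only a toy illustration, not a real big data system: the sample documents, the function names and the choice of a JSON string as "storage" are all assumptions made for the example.

```python
import json
from collections import Counter, defaultdict

# Step 1 (capture): hypothetical sample data standing in for collected documents.
RAW_DOCUMENTS = {
    "doc1": "Red motorbike for sale, low price",
    "doc2": "Blue motorbike, price negotiable",
    "doc3": "Flight prices to Rome",
}


def tokenize(text):
    """Step 2: separate a document into smaller units (lowercase words)."""
    words = (word.strip(".,").lower() for word in text.split())
    return [word for word in words if word]


def build_index(documents):
    """Step 3: map each word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


def store(index):
    """Step 4: serialise the index (a JSON string here, not a real database)."""
    return json.dumps(index, sort_keys=True)


def analyse(documents):
    """Step 5: count how often each word appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


def visualise(counts, top=3):
    """Step 6: render the most frequent words as a plain-text bar chart."""
    return [f"{word:<10} {'#' * n}" for word, n in counts.most_common(top)]


if __name__ == "__main__":
    index = build_index(RAW_DOCUMENTS)
    print(index["motorbike"])  # documents that mention "motorbike"
    print(store(index))
    print("\n".join(visualise(analyse(RAW_DOCUMENTS))))
```

In practice each step would be handled by dedicated tools running over far larger volumes, but the order of the operations is the same.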

One way to manage this data is through web crawling and web scraping, which we will talk about in detail later. Better hardware, together with these two techniques, has made it possible to use the data we generate for commercial purposes.

 

Web crawling: what it is and how it works

 

Web crawling could be defined as a way to obtain a map of the territory. Let's try to explain this concept with a symbolic example. For a moment, let's imagine a treasure map that marks chests of precious stones.

If we want that treasure map to be valuable, it must be accurate. So we need someone to travel through that unknown area and record every relevant detail on the ground.

On the web, this job falls to bots, and they are the ones that create the map. They scan, index and record websites, including their pages and sub-pages. This information is then stored and retrieved each time a user performs a search related to the topic.
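
As a rough sketch of how such a bot maps a site, the following Python example follows links breadth-first and records which pages link to which. To keep it runnable offline, it crawls a hypothetical in-memory site (`FAKE_SITE`) rather than the real web; the URLs, the `crawl` function and the `fetch` parameter are assumptions for illustration, and a real crawler would also need HTTP requests, politeness delays and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML. A real crawler would fetch over HTTP.
FAKE_SITE = {
    "https://example.com/": '<a href="/bikes">Bikes</a> <a href="/about">About</a>',
    "https://example.com/bikes": '<a href="/bikes/red">Red</a> <a href="/">Home</a>',
    "https://example.com/bikes/red": "<p>Red motorbike</p>",
    "https://example.com/about": '<a href="/">Home</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit pages breadth-first and return a map of URL -> outgoing links."""
    seen = {start_url}
    queue = deque([start_url])
    site_map = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not found: nothing to record
            continue
        parser = LinkExtractor()
        parser.feed(html)
        # Resolve relative links against the page they were found on.
        links = [urljoin(url, href) for href in parser.links]
        site_map[url] = links
        for link in links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return site_map


if __name__ == "__main__":
    for page, links in crawl("https://example.com/", FAKE_SITE.get).items():
        print(page, "->", links)
```

The `seen` set is what stops the bot from looping forever when pages link back to each other, as the "Home" links do here.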

Examples of crawlers used by large companies are:

  • Google has "Googlebot"
  • Microsoft's Bing uses "Bingbot"
  • Yahoo uses "Slurp Bot"

Bots are not exclusive to Internet search engines, although the examples above might suggest so. Other sites also use crawling software, sometimes to update their own web content and sometimes to index the content of other websites.

One thing to keep in mind is that these bots visit websites without asking permission. Site owners who prefer not to be indexed can customise their robots.txt file with requests not to be crawled.
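
To show what honouring such a request looks like from the bot's side, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the URLs are made up for the example, and the `is_allowed` helper is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all bots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data.html"))
```

The check happens inside the bot itself: the file only states the owner's wishes, and it is up to the crawler to consult it before fetching a page.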

 

What is web scraping and how it differs from web crawling

 

Then there is web scraping. Scrapers also travel the Internet as crawler bots do, but they have a more defined purpose, which is to find specific information. Here too, a simple example will help.

Think of a web scraper as an ordinary person who wants to buy a motorbike. They look for information manually and record the details of each item, such as make, model, price and colour, in a spreadsheet. They also see the rest of the content, such as advertisements and company information, but they do not record it: they know exactly what information they want and where to look for it.

Web scraping tools work in the same way, using code or "scripts" to extract specific information from websites they visit.
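
The motorbike example could look like the following sketch, which uses only Python's standard library. The HTML snippet, its class names and the `Acme Roadster` listing are invented for illustration; a real page would have its own structure, and the script would have to be adapted to it.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical listing page: an advert, the motorbike details, and company information.
LISTING_HTML = """
<div class="ad">Buy insurance now!</div>
<div class="listing">
  <span class="make">Acme</span>
  <span class="model">Roadster</span>
  <span class="price">8500</span>
  <span class="colour">Red</span>
</div>
<div class="company-info">About our dealership</div>
"""

# The only details we want to record, as in the spreadsheet example.
WANTED_FIELDS = ("make", "model", "price", "colour")


class ListingScraper(HTMLParser):
    """Records the text of <span> tags whose class is one of WANTED_FIELDS."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current_field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in WANTED_FIELDS:
            self._current_field = css_class

    def handle_data(self, data):
        # Text outside a wanted <span> (adverts, company info) is ignored.
        if self._current_field:
            self.record[self._current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._current_field = None


def scrape(html):
    """Extract the wanted fields from one listing page."""
    scraper = ListingScraper()
    scraper.feed(html)
    return scraper.record


def to_csv(record):
    """Write the record as CSV text, ready to open in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=WANTED_FIELDS)
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(scrape(LISTING_HTML)))
```

Like the person in the example, the script reads the whole page but keeps only the fields it was told to look for.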

We should not forget that the skill of the person searching plays an important role in how much treasure, or how many bargains, they will find. Likewise, the more intelligent the tool is, the higher the quality of the information we will be able to obtain. Better information means a better strategy for the future and greater profits.

 

Who can take advantage of web scraping and its future

 

Whatever business you're in, web scraping can give you an edge over the competition by providing the most relevant industry data.

Uses of web scraping include:

  1. Price intelligence for e-commerce companies to adjust prices in order to beat the competition.
  2. Scanning of competitors' product catalogues and stock inventory to optimise our company's strategy.
  3. Price comparison websites that publish data on products and services from different suppliers.
  4. Travel websites that obtain data on flight and accommodation prices, as well as real-time flight tracking information.
  5. Scanning of public profiles for candidates, to help our company's human resources department.
  6. Tracking of mentions on social networks to mitigate any negative publicity and collect positive reviews.

 
