Richard Jedlička

Web scraping with Node.js and Typescript - the scraper part (1/3)

The internet is full of information these days, and almost every website displays it to the user in a human-readable form. But what if you want to process that data programmatically, do some analysis, present it in a different form or store it in a database to query later? E.g. collect all the product names with a description, image and price from your favorite online store. Well, you could open the pages one by one and copy&paste the data you need, but you won't 🤦‍♂️. What you definitely can and should do is check whether the site has an API which will provide you the data easily. If not, I'm sorry bro, there is no way to ... just kidding! 😝

... the web scraping comes into play. Yay!


👉 In this article series (3 parts) I will guide you through the whole process of building a web scraper in Node.js and Typescript.

In the first article you will learn how to scrape data from a single webpage. In the second article I will teach you how to crawl the website to find and scrape all the wanted pages. And in the last article I will show you how to use a proxy with the scraper (coming soon), which can have some advantages in certain situations.

If you are a beginner, or a more skilled programmer who is new to web scraping, transitioning from a different programming language, or just curious how others do it, you will benefit from this.

Prerequisites

I assume you are familiar with Javascript and Typescript, know HTML and CSS selectors and have Node.js installed.

If not, it is worth brushing up on those first.

What is web scraping exactly?

As you probably know, websites are built using HTML and CSS. HTML describes the structure of the information in the page using tags. What a web scraper does is extract the required information from the specified HTML tags, and CSS selectors are a very good way to tell the scraper which tags to look at.

So the input for the scraper is the URL of a page (e.g. a product detail). The scraper then loads the HTML source code, parses it, filters the tags specified by CSS selectors and extracts text from them. Finally, it outputs the extracted data in a structured way (e.g. JSON). Easy, right?
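To make that flow concrete, here is a minimal sketch using the axios and cheerio libraries we will install in a moment. The store and the selectors are made up purely for illustration:

import axios from 'axios';
import cheerio from 'cheerio';

// Hypothetical example: '.product-name', '.description' and '.price'
// are not selectors of any real store.
async function scrapeProduct(url: string) {
    const response = await axios.get(url);            // 1. load the HTML source
    const $ = cheerio.load(response.data);            // 2. parse it
    return {                                          // 4. output structured data
        name: $('.product-name').text().trim(),       // 3. extract via CSS selectors
        description: $('.description').text().trim(),
        price: $('.price').text().trim(),
    };
}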

Wait! You may ask ... Where do I get the page URLs? Do I have to copy&paste them into the scraper manually?

Of course not! Web scrapers are usually more robust and also contain a "crawler" part to automate the whole process.

The crawler goes through (crawls) the website and searches for the pages which contain the data to be scraped.

Actually, it is a special type of scraper which usually starts at the homepage, looks for hyperlinks according to specific rules, follows them, and repeats the process until it finds the desired pages.

The term "web scraper" is often used interchangeably with "web crawler".


❗❗ An important thing to know is that you should be careful when scraping any website. Web scraping isn't illegal by itself, but you should care about how you do it and what you do with the data. There is also an ethical side to it. Do not harm the website, and check whether you have the rights to use the data the way you intend to. Read more here: https://blog.apify.com/is-web-scraping-legal/. If you are not sure, ask your lawyer.

Disclaimer: I am not taking any responsibility for your web scraping activities. Do it at your own risk.

Let's scrape something!

🎓 As an example, consider we want a list of all European capital cities with basic data like the city name, the name of the country, current population, area and an image of the city flag. Wikipedia can be used as a good source of this information.

First, init the project



npm init
npm install --save-dev typescript ts-node
npx tsc --init



and install the packages we will need.



npm install axios cheerio @types/cheerio



Axios is an HTTP client which we will use for fetching website data. It is a more robust and feature-rich alternative to the Fetch API.

Cheerio is a tool which parses HTML and gives you the ability to run queries on HTML tags and extract data from them. It is similar to jQuery but more suitable for the server side.

💻 See the complete project in the GitHub repository

What are we going to scrape?

Now that we are prepared, we will start with the "scraper" part, so go and have a look at the capital city page we are going to scrape, e.g. https://en.wikipedia.org/wiki/Prague

Highlighted location of city name on the page

Highlighted location of city flag on the page

Highlighted location of other city data on the page

There it is, the data we need. Ah, ok, but how do we know the location of the data in the page's HTML 🤔? Easy, we use dev tools. I'm using the Chrome browser (other modern browsers have dev tools too), so right-click the article's title element and select Inspect.

Inspecting city name HTML element with dev tools

As you can see, the name of the city resides in an <h1> tag with the ID firstHeading. I'm sure you are getting the idea.

First simple scraping

Enough talking, let's write some code!

Create a file index.ts and put this code in it



import axios from 'axios';
import cheerio from 'cheerio';

export class CapitalCityScraper {
    async scrapeCity(url: string) {
        const response = await axios.get(url);
        const html = response.data;

        const $ = cheerio.load(html);

        const cityName = $('#firstHeading').text().trim();
        console.log(cityName);
    }
}

async function main() {
    const scraper = new CapitalCityScraper();
    await scraper.scrapeCity("https://en.wikipedia.org/wiki/Prague");
}

main();



💻 See the commit f13ccee0

Are you excited? Run the code



$ npx ts-node index.ts
Prague



🎉 Congratulations, your first scrape! Isn't it beautiful 🤩?

I think the code is quite self-explanatory, but I will still go through some interesting parts



const response = await axios.get(url);
const html = response.data;



Axios makes an HTTP GET request to the specified URL and returns a promise which will hold the response with the HTML source code (in our case).

If you are not familiar with async/await, check this: https://javascript.info/async-await. Basically, it is a very convenient way to work with JS promises. The code "waits" until the promise is resolved and then returns its data.
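For illustration, these two snippets do the same thing (assuming url is in scope); the async/await version reads top to bottom:

// With .then() callbacks:
axios.get(url).then(response => {
    console.log(response.data);
});

// With async/await (inside an async function):
const response = await axios.get(url);
console.log(response.data);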



const $ = cheerio.load(html);



Cheerio parses the HTML and returns a querying function bound to a document based on that HTML markup. The querying function ($) accepts a CSS selector and finds the corresponding element(s) in the document.
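A tiny self-contained example of how the querying function behaves:

// load a minimal HTML fragment and query it
const $ = cheerio.load('<h1 id="firstHeading">Prague</h1>');
console.log($('#firstHeading').text()); // Prague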



const cityName = $('#firstHeading').text().trim();



Here we find the element with the ID firstHeading and get its text content. It is also good practice to trim the leading and trailing whitespace.

Ok, this was easy, right? Let's move on to something more difficult.

More difficult selector

Inspecting country name HTML element with dev tools

The country's name is in an <a> tag, but the tag has no ID or class. We have to look at its parent elements. The interesting one is the table row <tr> with the class mergedtoprow. But there is a catch. If you look around, there are lots of table rows with the same class. Hmm 🤔, how do we select the correct row? Maybe we can use the row's index? I wouldn't count on that, as other pages may have a different number of info rows. I think there is no easy way with regular CSS selectors. What we can rely on is that the row's label will always be "Country". Cheerio supports the same selectors as jQuery, and it has a special selector :contains() (see the jQuery doc) which checks if the element contains specific text. So the idea is to find the <td> element which follows the <th> element (the row's label) containing the text "Country".
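To see the selector idea in isolation, here is a tiny demo on hand-written HTML mimicking the infobox structure (the rows are made up):

const $ = cheerio.load(`
    <table>
        <tr class="mergedtoprow"><th>Country</th><td>Czech Republic</td></tr>
        <tr class="mergedtoprow"><th>Region</th><td>Bohemia</td></tr>
    </table>
`);
// ':contains(Country)' matches the label cell, '+ td' its adjacent value cell
console.log($('.mergedtoprow th:contains(Country) + td').text()); // Czech Republic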

Add this to the end of the scrapeCity method.



const country = $('.mergedtoprow th:contains(Country) + td').text().trim();
console.log(country);



💻 See the commit 07db4fbe

Run the code again



$ npx ts-node index.ts
Prague
Czech Republic



Nice!

Scraping elements "in context"

When you look at the area and population, the rows we are interested in both have the same label, Capital city, so we can't directly use the same selector as for the country name. We need to find the relevant row according to the preceding label Area or Population. You might be getting the impression that the rows are nested inside a box, but if you look closely, there are actually no boxes.

Inspecting city area HTML element with dev tools

There are "top level" rows with mergedtoprow class which may have a "sub rows" with a class mergedrow. The "sub rows" are placed between two "top level" rows and relate to the first one. This is all we need to know.



const areaRows = $('.mergedtoprow th:contains(Area)').parent().nextUntil('.mergedtoprow');
const area = areaRows.find('th:contains(Capital city) + td').text().trim();
console.log(area);



The first line finds the "Area" label, the parent() method selects the wrapping row, and with nextUntil() we select all the following elements (rows) before the next "top level" row. This gives us a context (areaRows) in which we find the value using the same principle as for the country's name.

The same approach works for the population



const populationRows = $('.mergedtoprow th:contains(Population)').parent().nextUntil('.mergedtoprow');
const population = populationRows.find('th:contains(Capital city) + td').text().trim();
console.log(population);



💻 See the commit 95c3bb09

And after running



$ npx ts-node index.ts
Prague
Czech Republic
496 km2 (192 sq mi)
1,335,084



All right. We have the information, but in a human-formatted shape 🤔. We want numbers!

Parsing the scraped data

It happens quite often when scraping that the information is formatted to be human readable and is not structured very well. So you have to make another step and parse (extract) the right data from the strings you scrape. In our case we want the area as a number of square kilometers and the population as a count of people.

Regular expressions for the win!

Modify the code slightly



const areaText = areaRows.find('th:contains(Capital city) + td').text().trim().replace(/ km2.*$/, '');
const area = parseFloat(areaText.replace(/,/g, ''));

const populationText = populationRows.find('th:contains(Capital city) + td').text().trim();
const population = parseFloat(populationText.replace(/,/g, ''));



Note that this works for the English localization; other languages can format numbers differently.

In the area text, we first drop everything from the unit to the end. And before converting to a number with parseFloat, the commas must be removed.
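If you need this in more places, a small helper could centralize the logic. This is just a sketch: parseEnglishNumber is a hypothetical name, it handles only English-style grouping commas, and it relies on parseFloat reading the leading number and ignoring whatever follows (units, parenthesised conversions):

function parseEnglishNumber(text: string): number {
    // remove grouping commas, then parse the leading numeric part
    return parseFloat(text.trim().replace(/,/g, ''));
}

parseEnglishNumber('1,335,084');           // 1335084
parseEnglishNumber('496 km2 (192 sq mi)'); // 496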

💻 See the commit e88b6bae

Looks better now!



$ npx ts-node index.ts
Prague
Czech Republic
496
1335084



Scraping images

When scraping images, you can either scrape just the image's URL or download the file itself. The URL is fine if you want to display the image on another website or just store a link to it. But if you want to make modifications to the image, or you can't rely on the image's availability on the source website, you need to download it. I will show you the second case.

Still, we need to obtain the image's URL first. Let's analyze the HTML for the city flag.

Inspecting city flag HTML element with dev tools

The image of the flag is wrapped in an <a> tag with the class image, which sits right before the <div> containing the text "Flag". However, the <img> tag doesn't hold the original SVG file, only a small PNG thumbnail. The anchor tag looks like it holds the image's URL.

Actually, it links to another webpage.

Inspecting city flag image with dev tools

There it is. The <a> tag there has the URL we are looking for.

Get the flag image page URL.



const flagPageLink = $('.mergedtoprow a.image + div:contains(Flag)').prev().attr('href')!;
const flagPageUrl = new URL(flagPageLink, url).toString();
const flagImagePath = await this.scrapeImage(flagPageUrl);
console.log(flagImagePath);



The selector finds the <div> containing the text "Flag" that directly follows an a.image element, and prev() then steps back to that anchor. The flagPageLink variable holds a relative path, so the URL object helps us obtain the full URL; the second argument is the base URL, in our case the URL of the city's wiki page.
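For example (assuming the href looks like the one on the Prague page):

// resolve a relative href against the page URL
new URL('/wiki/File:Flag_of_Prague.svg', 'https://en.wikipedia.org/wiki/Prague').toString();
// -> 'https://en.wikipedia.org/wiki/File:Flag_of_Prague.svg'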

To make the scraper more organized, I moved the code for the image scraping into a separate method, scrapeImage. The method can be used to scrape the image from any Wikipedia image detail page.



protected async scrapeImage(url: string) {
    const response = await axios.get(url);
    const html = response.data;

    const $ = cheerio.load(html);

    const imageLink = $('#file a').attr('href')!;
    const imageUrl = new URL(imageLink, url).toString();

    const imagePath = await this.downloadFile(imageUrl, 'flags');

    return imagePath;
}



Everything here should already be familiar to you. And again, the code related to downloading the image is separated into another method, downloadFile.



protected async downloadFile(url: string, dir: string) {
    const response = await axios.get(url, {
        responseType: 'arraybuffer'
    });

    fs.mkdirSync(dir, {recursive: true});

    const filePath = path.join(dir, path.basename(url));
    fs.writeFileSync(filePath, response.data);

    return filePath;
}



This method is universal for downloading any file to a specified directory. The option responseType: 'arraybuffer' is crucial here: Axios will then treat the URL as a source of binary data and won't try to parse the response as text.
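As a side note, for large files you might prefer streaming the response to disk instead of buffering the whole file in memory. A sketch of such an alternative (the method name downloadFileStreamed is mine, and the import goes at the top of the file; stream/promises requires Node 15+):

import { pipeline } from 'stream/promises';

protected async downloadFileStreamed(url: string, dir: string) {
    // responseType 'stream' makes Axios return a readable stream
    const response = await axios.get(url, { responseType: 'stream' });

    fs.mkdirSync(dir, { recursive: true });

    const filePath = path.join(dir, path.basename(url));
    // pipe the response body straight into the file
    await pipeline(response.data, fs.createWriteStream(filePath));

    return filePath;
}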

💻 See the commit 2e0dec92

Now, if you run the code, you will see this



$ npx ts-node index.ts
Prague
Czech Republic
496
1335084
flags/Flag_of_Prague.svg



And if you look into the flags folder, you will find the file Flag_of_Prague.svg there 🥳.

Data model

Great, we can scrape all the data we need. But all of it is just printed to the console the moment it is obtained. This is not good to work with. We want to return it in some form from our scrapeCity method. A plain object is sufficient.

For type safety, we will use an interface. Put it above the scraper class.



interface City {
    name: string;
    country: string;
    area: number;
    population: number;
    flagImagePath: string;
}



Remove all the console.log commands and put this code at the end of the scrapeCity method.



const city: City = {
    name: cityName,
    country,
    area,
    population,
    flagImagePath
};

return city;



Now this is much better: our scraped data has a specific shape and we can manipulate it later. For now, we will just modify our main function to get the city object and print it to the console (as a whole).



async function main() {
    const scraper = new CapitalCityScraper();
    const city = await scraper.scrapeCity("https://en.wikipedia.org/wiki/Prague");
    console.log(city);
}



💻 See the commit 96426d2d

Run the script.



$ npx ts-node index.ts
{
  name: 'Prague',
  country: 'Czech Republic',
  area: 496,
  population: 1335084,
  flagImagePath: 'flags/Flag_of_Prague.svg'
}


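Since scrapeCity now returns a structured object, you can do whatever you like with it. For example, a quick way to persist it to a JSON file (the file name city.json is arbitrary) would be to extend main like this:

import fs from 'fs';

async function main() {
    const scraper = new CapitalCityScraper();
    const city = await scraper.scrapeCity("https://en.wikipedia.org/wiki/Prague");
    // serialize the scraped object with 2-space indentation
    fs.writeFileSync('city.json', JSON.stringify(city, null, 2));
}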

I feel quite satisfied now 😎. What about you?

Final code

💻 See the complete project in the GitHub repository



import fs from 'fs';
import path from 'path';
import axios from 'axios';
import cheerio from 'cheerio';

interface City {
    name: string;
    country: string;
    area: number;
    population: number;
    flagImagePath: string;
}

export class CapitalCityScraper {
    async scrapeCity(url: string) {
        const response = await axios.get(url);
        const html = response.data;

        const $ = cheerio.load(html);

        const cityName = $('#firstHeading').text().trim();

        const country = $('.mergedtoprow th:contains(Country) + td').text().trim();

        const areaRows = $('.mergedtoprow th:contains(Area)').parent().nextUntil('.mergedtoprow');
        const areaText = areaRows.find('th:contains(Capital city) + td').text().trim().replace(/ km2.*$/, '');
        const area = parseFloat(areaText.replace(/,/g, ''));

        const populationRows = $('.mergedtoprow th:contains(Population)').parent().nextUntil('.mergedtoprow');
        const populationText = populationRows.find('th:contains(Capital city) + td').text().trim();
        const population = parseFloat(populationText.replace(/,/g, ''));

        const flagPageLink = $('.mergedtoprow a.image + div:contains(Flag)').prev().attr('href')!;
        const flagPageUrl = new URL(flagPageLink, url).toString();
        const flagImagePath = await this.scrapeImage(flagPageUrl);

        const city: City = {
            name: cityName,
            country,
            area,
            population,
            flagImagePath
        };

        return city;
    }

    protected async scrapeImage(url: string) {
        const response = await axios.get(url);
        const html = response.data;

        const doc = cheerio.load(html);

        const imageLink = doc('#file a').attr('href')!;
        const imageUrl = new URL(imageLink, url).toString();

        const imagePath = await this.downloadFile(imageUrl, 'flags');

        return imagePath;
    }

    protected async downloadFile(url: string, dir: string) {
        const response = await axios.get(url, {
            responseType: 'arraybuffer'
        });

        fs.mkdirSync(dir, { recursive: true });

        const filePath = path.join(dir, path.basename(url));
        fs.writeFileSync(filePath, response.data);

        return filePath;
    }
}

async function main() {
    const scraper = new CapitalCityScraper();
    const city = await scraper.scrapeCity("https://en.wikipedia.org/wiki/Prague");
    console.log(city);
}

main();



Conclusion

Now you know how to scrape a web page in Javascript/Typescript. I hope you agree it is quite easy and fun.

Of course, it depends on the website you want to scrape: the less structured the data you want, the harder it is to get. There are always many ways to achieve the goal; sometimes it is straightforward, sometimes tricky. But if you manage it, the result can be quite satisfying, especially when you bring order to something unorganised 😁.

Currently, we can handle only a single capital city page. In the next article I will teach you how to crawl the Wikipedia website to scrape all of them.
