Hello, I'm glad to see you here again 🙂. From the previous article, you know how to scrape a single webpage. Today, I will teach you how to find all the pages you want to scrape and then scrape them.
❗❗ An important reminder: be careful when scraping any website. Web scraping isn’t illegal by itself, but you should care about how you do it and what you do with the data. There is also an ethical side to it. Do not harm the website, and check whether you have the rights to use the data the way you intend to. Read more here: https://blog.apify.com/is-web-scraping-legal/. If you are not sure, ask your lawyer.
Disclaimer: I take no responsibility for your web scraping activities. Do it at your own risk.
How does web crawling work?
The crawling process uses similar principles to the ones you saw when scraping. You look for HTML link elements on the page, but instead of storing their data, you follow them. You repeat this process until you find the right pages. There are two approaches, and which one to use depends on how the website is structured and what you know about it.
- The first approach: build your crawler so that it starts at some page (e.g. the homepage) and follows every link it finds (probably only internal links, so you stay on the same website). When it detects a wanted page, it scrapes it. This requires you to know how to detect the right pages, and it can be time-consuming because the crawler may visit the entire website, including irrelevant pages (a minimal sketch of this idea follows the list below). Search engines do this because they want to visit every page to scrape and index its metadata.
- The second approach: crawl only the needed minimum of pages. This requires you to know the website's structure. Most websites have some kind of tree-like hierarchy, so you can start somewhere in the middle and only "go down". Take an online store, for example: the homepage links to categories, and categories link to sub-categories and product details. So if you want to scrape only mobile phones' data, you start at the mobile phones category page and follow only the sub-category and product links.
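To make the first approach more concrete, here is a minimal sketch of such a "follow everything" crawler. It is not part of this article's project, and the isTargetPage predicate is a hypothetical function you would write yourself to detect the pages you care about.

import axios from 'axios';
import cheerio from 'cheerio';

// A rough sketch of the first approach: start at one page and follow every
// internal link until pages we want to scrape are detected.
// `isTargetPage` is a hypothetical predicate, not part of the article's code.
async function crawlEverything(
    startUrl: string,
    isTargetPage: (page: ReturnType<typeof cheerio.load>) => boolean
) {
    const origin = new URL(startUrl).origin;
    const queue = [startUrl];
    const visited = new Set<string>();

    while (queue.length > 0) {
        const url = queue.shift()!;
        if (visited.has(url)) continue;
        visited.add(url);

        const html = (await axios.get(url)).data;
        const $ = cheerio.load(html);

        if (isTargetPage($)) {
            console.log('Found a page to scrape:', url);
            // ...scrape it here
        }

        // Follow only internal links so we stay on the same website.
        for (const link of $('a[href]').get()) {
            const next = new URL($(link).attr('href')!, url);
            next.hash = ''; // ignore #fragments so we don't revisit the same page
            if (next.protocol.startsWith('http') && next.origin === origin) {
                queue.push(next.toString());
            }
        }
    }
}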
Let's crawl
🎓 Our goal now is to find all the URLs of European capital city pages on Wikipedia.
Wikipedia is full of user-generated content and doesn't have any strict rules for its structure. It is like a network of pages where each page references other related pages. It seems we need the first approach here. Fortunately, there are some pages that can help us, like the List of European countries by area with its nice table.
Each country is linked to its wiki page, where we can find a link to its capital city's page. This gives us the hierarchy, so the second approach can be used here 🤞.
Finding URLs
Let's start coding! We need to scrape the country links.
💻 See the complete project in the GitHub repository
Take a look at the HTML code.
The links are located in the second <td> of each row of the <table> with class wikitable.
Add a new method crawlCountries to our scraper class.
async crawlCountries(url: string) {
    const response = await axios.get(url);
    const html = response.data;
    const $ = cheerio.load(html);
    const links = $('.wikitable td:nth-child(2) > a').get();
    for (let link of links) {
        const href = $(link).attr('href')!;
        const countryPageUrl = new URL(href, url).toString();
        console.log(countryPageUrl);
    }
}
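A quick note on new URL(href, url): the links in the table are relative (like /wiki/Russia), and the second argument of the URL constructor resolves them against the page we are currently on. For example:

// Relative Wikipedia link resolved against the list page's URL:
const absolute = new URL('/wiki/Russia', 'https://en.wikipedia.org/wiki/List_of_European_countries_by_area');
console.log(absolute.toString()); // https://en.wikipedia.org/wiki/Russia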
And modify the main() function.
async function main() {
    const scraper = new CapitalCityScraper();
    await scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area");
}
💻 See the commit 02efe310
I should explain a bit more how Cheerio works. As I mentioned in the previous article, cheerio.load() creates a querying function bound to a document built from the provided HTML markup. This function, when called with a CSS selector, returns a Cheerio object representing the set of matched elements. This object provides convenient methods for working with that set. Some of them work only with a single element, like .attr(...), which returns the specified attribute of the first matched element. Some of them work with the whole set, like .text(), which returns the combined text content of all matched elements.
To get the matched elements in a raw form, the .get() method is used. When called without an argument, it returns an array of the matched elements. Each element is a DOM-like object; you can work with it directly, or if you pass it to the Cheerio querying function ($), you get a Cheerio object representing that single element (as you can see in the for loop).
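Here is a tiny stand-alone illustration of the difference (not part of the scraper, just a made-up snippet of HTML):

import cheerio from 'cheerio';

const $ = cheerio.load('<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>');
const anchors = $('a');

console.log(anchors.attr('href')); // "/a" (the attribute of the first matched element only)
console.log(anchors.text());       // "AB" (the combined text of all matched elements)

// .get() returns the raw elements; wrapping one with $ gives a Cheerio object for that single element.
for (const el of anchors.get()) {
    console.log($(el).attr('href')); // "/a", then "/b"
}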
Run the code and you will see the list of country page URLs 🚩.
$ npx ts-node index.ts
https://en.wikipedia.org/wiki/Russia
https://en.wikipedia.org/wiki/Ukraine
https://en.wikipedia.org/wiki/France
https://en.wikipedia.org/wiki/Spain
https://en.wikipedia.org/wiki/Sweden
...
What now? Instead of printing each URL, we would like to visit it and find the link to the capital city's page. Let's take one of the URLs and do the inspection 🕵️.
It looks similar to the data we scraped in the previous article. Create another method for that.
async crawlCountry(url: string) {
    const response = await axios.get(url);
    const html = response.data;
    const $ = cheerio.load(html);
    const capitalCityRow = $('.infobox th:contains(Capital)').parent();
    if (capitalCityRow.length === 0) {
        // a city state is the capital city itself
        return url;
    }
    const capitalCityLink = capitalCityRow.find('td > a').attr('href');
    if (!capitalCityLink) {
        // in some cases the capital city cannot be scraped
        // (possible, but too complicated for this article)
        return undefined;
    }
    const capitalCityUrl = new URL(capitalCityLink, url).toString();
    return capitalCityUrl;
}
City states are capital cities by themselves, so we return the same URL. And in some cases, the capital city cannot be scraped (explained below).
🙁️️ Due to the lack of strict structure on Wikipedia, scraping the capital city link is too complicated in some cases, so I decided to sacrifice completeness for code simplicity. If you are interested in more information and possible solutions, ask me in the comments.
Replace the console.log in the crawlCountries method with
const capitalCityPageUrl = await this.crawlCountry(countryPageUrl);
if (capitalCityPageUrl) {
    console.log(capitalCityPageUrl);
}
💻 See the commit f3f4da52
Now run the code and see what's happening.
$ npx ts-node index.ts
https://en.wikipedia.org/wiki/Moscow
https://en.wikipedia.org/wiki/Kyiv
https://en.wikipedia.org/wiki/Paris
https://en.wikipedia.org/wiki/Madrid
https://en.wikipedia.org/wiki/Stockholm
...
The URLs of the capital cities are printed line by line, with a delay between each one. That's because we are making many HTTP requests, one for each country, and it takes time for the server (Wikipedia) to respond to each of them.
❗❗ Think about the server's performance. Try to be gentle and don't hit the server with too many requests in a short time. It can lead to server exhaustion, making the website unusable while you are scraping, and the server may have protection against this and block you. If speed isn't necessary, scrape slowly by adding an additional delay between requests.
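A minimal way to add such a delay is a small helper function; the one-second value below is just an example, not a recommendation from the article.

// A tiny helper that resolves after the given number of milliseconds.
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

// Inside the crawl loop, e.g. right after calling this.crawlCountry(countryPageUrl):
// await sleep(1000); // roughly one request per second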
Generators
Our crawling goal is almost achieved. We are able to find all the URLs of the capital city pages. Now, we need to get the list of URLs out of the crawlCountries method, so we can later loop through them. But how? 🤔
We can store the URLs in an array and return it whole.
async crawlCountries(url: string) {
    const response = await axios.get(url);
    const html = response.data;
    const $ = cheerio.load(html);
    const links = $('.wikitable td:nth-child(2) > a').get();
    const urls: string[] = [];
    for (let link of links) {
        const href = $(link).attr('href')!;
        const countryPageUrl = new URL(href, url).toString();
        const capitalCityPageUrl = await this.crawlCountry(countryPageUrl);
        if (capitalCityPageUrl) {
            urls.push(capitalCityPageUrl);
        }
    }
    return urls;
}
async function main() {
    const scraper = new CapitalCityScraper();
    for (let url of await scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area")) {
        console.log(url);
    }
}
💻 See the commit a3ae3829
This looks fine and it works, but there is a catch. The problem is that we have to build the whole array before we can return it from the function. What does that mean? It means we have to wait until the "crawler" finds all the URLs before we can start using them. In our case it doesn't take that long, but imagine we are crawling hundreds or thousands of pages. It could take a very long time, in the order of minutes or hours.
Try running the code and you will see. The output is the same as before, but the timing has changed. Nothing happens for a while, and then the whole list is printed at once.
Modern versions of JavaScript and TypeScript offer one very useful construct: generators. It allows us to "return" values from a function multiple times, even before it has finished. To make it work, you must mark the function as a generator function with a * (placed after the function keyword, or before the name in the case of a method) and, instead of the return keyword, use yield. The function then actually returns an iterator, an object on which you call .next() to get the next value. You can iterate (loop through) the iterator the same way as an array (e.g. in a for loop). The advantage here is that we have access to the "returned" value immediately after it is yielded. You can imagine it as the function pausing at the yield statement, waiting until someone requests the next value from the iterator, and then continuing. Read more about generators here.
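To see the mechanics in isolation, here is a tiny stand-alone example that has nothing to do with the scraper:

function* numbers() {
    yield 1;
    yield 2; // the function "pauses" here until the next value is requested
    yield 3;
}

const iterator = numbers();
console.log(iterator.next()); // { value: 1, done: false }
console.log(iterator.next()); // { value: 2, done: false }

// Or simply loop over it like an array:
for (const n of numbers()) {
    console.log(n); // 1, 2, 3
}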
So let's try the generators instead.
async *crawlCountries(url: string) {
    const response = await axios.get(url);
    const html = response.data;
    const $ = cheerio.load(html);
    const links = $('.wikitable td:nth-child(2) > a').get();
    for (let link of links) {
        const href = $(link).attr('href')!;
        const countryPageUrl = new URL(href, url).toString();
        const capitalCityPageUrl = await this.crawlCountry(countryPageUrl);
        if (capitalCityPageUrl) {
            yield capitalCityPageUrl;
        }
    }
}

async function main() {
    const scraper = new CapitalCityScraper();
    for await (let url of scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area")) {
        console.log(url);
    }
}
💻 See the commit 5ad62c43
Notice the * before the crawlCountries method's name and the yield keyword. This is what makes it a generator. In the main function, there is a for await construct. Why? Because crawlCountries is async, it returns an async generator (iterator), which means the .next() call is asynchronous (it returns a promise), and for await takes care of that by awaiting each iteration.
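Just to illustrate what for await does for us, this is roughly the manual equivalent (it would have to live inside an async function such as main):

// Roughly equivalent to the for await loop above:
const iterator = scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area");
while (true) {
    const result = await iterator.next(); // each .next() returns a promise, so we await it
    if (result.done) {
        break;
    }
    console.log(result.value);
}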
If you now run the code, the output will be the same as when we printed the URLs directly in the crawlCountries method (printing URLs one by one), but notice we now print them outside of the crawlCountries function. Isn't it cool?
Putting it together
Do you know what's next? Yes, putting everything together: the crawler methods with the scraper ones we built in the previous article. And the change is very simple, just one line. Are you as excited as I am 🤩?
Replace the console.log(url) in the main function with a scrapeCity method call.
async function main() {
    const scraper = new CapitalCityScraper();
    for await (let url of scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area")) {
        console.log(await scraper.scrapeCity(url));
    }
}
💻 See the commit e77e7e27
Go go, run the code! 🙏
$ npx ts-node index.ts
{
  name: 'Moscow',
  country: 'Russia',
  area: NaN,
  population: NaN,
  flagImagePath: 'flags\\Flag_of_Moscow%2C_Russia.svg'
}
{
  name: 'Kyiv',
  country: 'Ukraine',
  area: 839,
  population: 2962180,
  flagImagePath: 'flags\\Flag_of_Kyiv_Kurovskyi.svg'
}
{
  name: 'Paris',
  country: 'France',
  area: NaN,
  population: NaN,
  flagImagePath: 'flags\\Flag_of_Paris_with_coat_of_arms.svg'
}
...
Wow, it works!! Yes, yes ... or ... oh, wait. 😮 What's that NaN?? It doesn't look like a number, and indeed it's not a number. Have you looked in the flags folder? There is a weird file named undefined. What happened? Again, Wikipedia's lack of strict structure. The problem is that some city pages are formatted slightly differently, and our scraper doesn't handle them well.
Don't worry. This is quite common when you scrape a website. Even though you do some analysis beforehand, you will probably encounter unexpected problems when you scrape a bigger portion of the website. Some pages will have missing values, some will have different formatting, some pages cannot be loaded, etc. This is likely to happen, and you need to make your scraper robust enough to handle all these cases. Be prepared that sometimes you won't be able to scrape everything, or only with too much effort for too little outcome.
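One common pattern for making the crawl more robust is to wrap the scraping of each page in a try/catch, so a single broken page doesn't stop the whole run. This is just a sketch, not part of the article's commits:

for await (let url of scraper.crawlCountries("https://en.wikipedia.org/wiki/List_of_European_countries_by_area")) {
    try {
        console.log(await scraper.scrapeCity(url));
    } catch (error) {
        // log the failing URL and keep going with the next one
        console.warn(`Skipping ${url}:`, error);
    }
}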
Fixing the unexpected
I did the hard work for you and analyzed the pages. Before we begin, we need to update our model, because not every value can be scraped.
interface City {
    name: string;
    country?: string;
    area?: number;
    population?: number;
    flagImagePath?: string;
}
Then update the scrapeCity method itself.
After Cheerio parses the HTML, remove the elements with reference links from the infobox (the box with the data). They make the scraping harder and we don't need them.
const $ = cheerio.load(html);
$('.infobox .reference').remove();
City states don't have the country specified, and the row doesn't always have the class mergedtoprow.
const country = $('.infobox th:contains(Country) + td').first().text().trim() || undefined;
Area values don't always have the same label and are not always in the same place. Sometimes the value is even right next to the "Area" label (see Paris). So I decided to find and use the first value in the correct format (a number followed by the km2 unit).
const areaRows = $('.mergedtoprow th:contains(Area)').parent().nextUntil('.mergedtoprow').addBack();
const areaValues = areaRows.find('td').filter((i, el) => !!$(el).text().trim().match(/^[0-9,.]+\s+km2.*$/));
const areaText = areaValues.first().text().trim().replace(/ km2.*$/, '');
const area = parseFloat(areaText.replace(/,/g, '')) || undefined;
Population scraping is similar to the area. Only the matching regular expression is different (a number, optionally followed by text in parentheses, and nothing else).
const populationRows = $('.mergedtoprow th:contains(Population)').parent().nextUntil('.mergedtoprow').addBack();
const populationValues = populationRows.find('td').filter((i, el) => !!$(el).text().trim().match(/^[0-9,.]+(\s+\(.*\))?$/));
const populationText = populationValues.first().text().trim();
const population = parseFloat(populationText.replace(/,/g, '')) || undefined;
Not every city has a flag, and again the row doesn't always have the class mergedtoprow.
const flagPageLink = $('.infobox a.image + div:contains(Flag)').prev().attr('href')!;
const flagPageUrl = flagPageLink && new URL(flagPageLink, url).toString();
const flagImagePath = flagPageUrl && await this.scrapeImage(flagPageUrl);
💻 See the commit 0402d629
Run the code now. Even though the data is not 100% complete, you can be quite satisfied with the result.
Full completeness of scraped data from sites with a loose structure, such as Wikipedia, is very hard to achieve, and sometimes nearly impossible.
Polishing
We could say we are finished, but I would prefer to polish the scraper class a little bit more.
I would rather encapsulate the scraper so it exposes only one method, scrape, which returns a generator of scraped City items.
Make all the methods in the scraper class protected and add a new public method scrape, along with a class property holding the starting URL.
readonly startUrl = "https://en.wikipedia.org/wiki/List_of_European_countries_by_area";

async *scrape() {
    for await (let url of this.crawlCountries(this.startUrl)) {
        yield await this.scrapeCity(url);
    }
}
Actually, we just moved the code from the main function to the scrape method. And in main, we only receive the scraped city objects without caring about the internals.
async function main() {
    const scraper = new CapitalCityScraper();
    for await (let city of scraper.scrape()) {
        console.log(city);
    }
}
💻 See the commit 32f506c9
Hey, we are finished. Let's celebrate! 🥳🎉🍾🥂🍸
Final code
💻 See the complete project in the GitHub repository
import fs from 'fs';
import path from 'path';
import axios from 'axios';
import cheerio from 'cheerio';

interface City {
    name: string;
    country?: string;
    area?: number;
    population?: number;
    flagImagePath?: string;
}

export class CapitalCityScraper {
    readonly startUrl = "https://en.wikipedia.org/wiki/List_of_European_countries_by_area";

    async *scrape() {
        for await (let url of this.crawlCountries(this.startUrl)) {
            yield await this.scrapeCity(url);
        }
    }

    protected async *crawlCountries(url: string) {
        const response = await axios.get(url);
        const html = response.data;
        const $ = cheerio.load(html);
        const links = $('.wikitable td:nth-child(2) > a').get();
        for (let link of links) {
            const href = $(link).attr('href')!;
            const countryPageUrl = new URL(href, url).toString();
            const capitalCityPageUrl = await this.crawlCountry(countryPageUrl);
            if (capitalCityPageUrl) {
                yield capitalCityPageUrl;
            }
        }
    }

    protected async crawlCountry(url: string) {
        const response = await axios.get(url);
        const html = response.data;
        const $ = cheerio.load(html);
        const capitalCityRow = $('.infobox th:contains(Capital)').parent();
        if (capitalCityRow.length === 0) {
            // a city state is the capital city itself
            return url;
        }
        const capitalCityLink = capitalCityRow.find('td > a').attr('href');
        if (!capitalCityLink) {
            // in some cases the capital city cannot be scraped
            // (possible, but too complicated for this article)
            return undefined;
        }
        const capitalCityUrl = new URL(capitalCityLink, url).toString();
        return capitalCityUrl;
    }

    protected async scrapeCity(url: string) {
        const response = await axios.get(url);
        const html = response.data;
        const $ = cheerio.load(html);
        $('.infobox .reference').remove();
        const cityName = $('#firstHeading').text().trim();
        const country = $('.infobox th:contains(Country) + td').first().text().trim() || undefined;
        const areaRows = $('.mergedtoprow th:contains(Area)').parent().nextUntil('.mergedtoprow').addBack();
        const areaValues = areaRows.find('td').filter((i, el) => !!$(el).text().trim().match(/^[0-9,.]+\s+km2.*$/));
        const areaText = areaValues.first().text().trim().replace(/ km2.*$/, '');
        const area = parseFloat(areaText.replace(/,/g, '')) || undefined;
        const populationRows = $('.mergedtoprow th:contains(Population)').parent().nextUntil('.mergedtoprow').addBack();
        const populationValues = populationRows.find('td').filter((i, el) => !!$(el).text().trim().match(/^[0-9,.]+(\s+\(.*\))?$/));
        const populationText = populationValues.first().text().trim();
        const population = parseFloat(populationText.replace(/,/g, '')) || undefined;
        const flagPageLink = $('.infobox a.image + div:contains(Flag)').prev().attr('href')!;
        const flagPageUrl = flagPageLink && new URL(flagPageLink, url).toString();
        const flagImagePath = flagPageUrl && await this.scrapeImage(flagPageUrl);
        const city: City = {
            name: cityName,
            country,
            area,
            population,
            flagImagePath
        };
        return city;
    }

    protected async scrapeImage(url: string) {
        const response = await axios.get(url);
        const html = response.data;
        const doc = cheerio.load(html);
        const imageLink = doc('#file a').attr('href')!;
        const imageUrl = new URL(imageLink, url).toString();
        const imagePath = await this.downloadFile(imageUrl, 'flags');
        return imagePath;
    }

    protected async downloadFile(url: string, dir: string) {
        const response = await axios.get(url, {
            responseType: 'arraybuffer'
        });
        fs.mkdirSync(dir, { recursive: true });
        const filePath = path.join(dir, path.basename(url));
        fs.writeFileSync(filePath, response.data);
        return filePath;
    }
}

async function main() {
    const scraper = new CapitalCityScraper();
    for await (let city of scraper.scrape()) {
        console.log(city);
    }
}

main();
Conclusion
Now you have all you need to build a web scraper, at least a simple one. You saw that sometimes you have to compromise and decide between completeness and code simplicity. But you can still retrieve valuable information even from websites that are not well structured.
I hope you enjoy web scraping. 😉
There are more advanced topics for improving the scraper, like execution management, automatic throttling, etc. In the next article, I will show you one of them: the use of a proxy.
👉 Stay tuned! See you soon.