This is the first in a series of posts teaching how to do web scraping. These posts are aimed mostly at people who have done barely any programming before but would like to get into web scraping. This particular post focuses on web scraping with cheeriojs.
I am going to try to make it as simple and easy to understand as possible. Web scraping will be the focus of this series, not programming.
The tools and getting started
This section I will include in every post of this series. It’s going to go over the tools that you will need to have installed. I’m going to try and keep it to a minimum so you don’t have to add a bunch of things.
Nodejs – This runs javascript. It’s very well supported and generally installs in about a minute. You’ll want to download the LTS version, which is 12.13.0 at this time. I would recommend just hitting next through everything. You shouldn’t need to check any boxes. You don’t need to do anything further with this at this time.
Visual Studio Code – This is just a text editor. 100% free, developed by Microsoft. It should install very easily and does not come with any bloatware.
You will also need the demo code referenced at the top and bottom of this article. You will want to hit the “Clone or download” button and download the zip file and unzip it to a preferred location.
Once you have it downloaded and Nodejs installed, open Visual Studio Code, go to File > Open Folder, and select the folder where you downloaded the code.
We will also be using the terminal to execute the commands that will run the script. To open the terminal in Visual Studio Code, go to the top menu again and choose Terminal > New Terminal. The terminal will open at the bottom, looking something like (but probably not exactly like) this:
It is important that the terminal is opened to the actual location of the code or it won’t be able to find the scripts when we try to run them. In your side navbar in Visual Studio Code, without any folders expanded, you should see a > src folder. If you don’t see it, you are probably at the wrong location and you need to re-open the folder at the correct location.
After you have the package downloaded and you are at the terminal, your first command will be npm install. This will download all of the necessary libraries required for this project.
Enter Cheeriojs
Cheeriojs is a javascript library that makes it extremely easy to parse html. It uses CSS selectors in order to select the text or html properties that you want. You can find all of its detailed code and instructions here.
While I do plan on going over the most common uses of cheeriojs with CSS selectors, I strongly recommend getting familiar with CSS selectors and basic HTML format. CSS selectors are critical to almost any library that does web scraping. The concept is fairly simple and there are abundant resources to help, so I won’t go in depth here. This guide by w3schools is very good and I visit it regularly.
HTML parser
In a normal web scraping project, we’d call out to some external page, get the html, and then pull what we wanted out of that html. In this example we are just isolating the html and testing it locally. I took the html for this example from a beloved site – http://pizza.com. Because I love pizza.
You can see in the src directory that there is a sample-html.ts file. This file contains all of the html from this page in a big string. We can easily use it to simulate actually calling the page. At the top of our src/index.ts file (where we will be doing all of our coding this time) you can see that we import the sample-html with import { sampleHtml } from './sample-html';.
Whenever I go to scrape a website, I am always looking at the html to see how to select the items I want. Developer tools is my best friend and should be yours as well. You can open it with F12 and then see all of the html in there. As you hover over the different parts of the html, the matching part of the page is highlighted on the screen. See this example:
This is how we find which CSS selectors we are going to use to select the items we want.
To the code
Alright, the code section is going to be fairly simple. Remember that you can run your code at any time by typing npm start in the terminal where you ran npm install, and it should output all of our console.logs in src/index.ts.
The first thing we do with cheeriojs is import the cheeriojs library and then load up the html, as follows:
import cheerio from 'cheerio';
const $ = cheerio.load(sampleHtml);
Now we can use the $ throughout our code to select the items we want. The first and easiest portion to select will be the title of our page. The code looks like this:
// Search by element
const title = $('title').text();
console.log('title', title);
Because title is an html element, we can simply select it with 'title' and nothing else. Then we get the text from within that html element.
Within developer tools you can see the title element containing “Pizza.com”. Title is the easiest selector, but a page will rarely contain only one of a given html element. Title is an exception to this rule.
Another helpful tip with developer tools is the arrow button in the top right of the Elements panel. We can use it to select the item we are looking for and it’ll find it within the html for us.
So we can see above that if we wanted to get information from the first nav button, we could find it with the class of “home_link”. The code to do so looks like this:
// Search by class
const homeButton = $('.home_link').text();
console.log('Home button', homeButton);
Whenever we select with a class, we put a single period in front of the class name. In this example, '.home_link' is what we are looking for. This outputs “Home” because it goes and finds all text within this element, including its children. I say children because html is described with familial terms. The parent would be the top level html element while anything within it would be children. Any elements within those children would be grandchildren. We also use siblings and grandparents to help describe their relation to each other.
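To make the family terms a bit more concrete, here is a minimal sketch using cheeriojs' traversal helpers .children(), .parent(), and .siblings(). The html string and the class names in it are made up for illustration; they are not the real pizza.com markup.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical markup for illustration only; not taken from pizza.com.
const familyHtml = `
<ul class="menu">
  <li class="first"><a href="/">Home</a></li>
  <li><a href="/news">News</a></li>
  <li class="last"><a href="/facts">Facts</a></li>
</ul>`;

const $ = cheerio.load(familyHtml);

// Children: the elements directly inside the ul (the three li elements).
const childCount = $('.menu').children().length;

// Parent: climb from a link up to the element that wraps it.
const parentIsListItem = $('a[href="/news"]').parent().is('li');

// Siblings: the other li elements that share the same parent.
const siblingCount = $('.first').siblings().length;

console.log({ childCount, parentIsListItem, siblingCount });
```

Here the ul is the parent, each li is a child, and each a is a grandchild of the ul.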
To highlight this, let’s grab the text from all of the top nav buttons. The html structure is as follows:
The ul is the parent of all of those li elements and the grandparent of any elements within that. And as you can see from what we have highlighted in our website, it represents the whole nav. The code to select those is like this:
// Search by class and child
const topNavButtons = $('.word-only li').text();
console.log('top nav buttons', topNavButtons);
This time we are using the class and then selecting all list elements li that are children (or deeper descendants) of the .word-only class. The log in the terminal for this item looks like this:
Now, what happened here? I know our log is a bit cut off, but there are definitely more items than expected, aren’t there? This is the trickiest bit of web scraping. CSS selectors will find all items that match the selector you use. If we look down a bit within our html, we can see that there is another section that also has the same html set up, with the same class (.word-only) and element (li).
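Here is a small sketch of that over-matching, along with one possible way to narrow it down by starting the selector from a more specific ancestor. The html and the header and footer ids are assumptions made up for this example; the real page may be structured differently.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical markup: two separate lists that reuse the same class.
const html = `
<div id="header"><ul class="word-only"><li>Home</li><li>Pizza News</li></ul></div>
<div id="footer"><ul class="word-only"><li>Contact</li><li>Privacy</li></ul></div>`;

const $ = cheerio.load(html);

// The broad selector matches the li elements in BOTH lists.
const broadCount = $('.word-only li').length;

// Prefixing an ancestor (the made-up #header id) keeps only the first list.
const headerItems = $('#header .word-only li')
  .map((_, element) => $(element).text())
  .get();

console.log('broad count', broadCount);
console.log('header items', headerItems);
```

The idea is simply that the more of the path you spell out in the selector, the fewer unrelated elements can match it.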
So, sometimes we have to use other methods to get more specific about what we want. One of those tools is that you can select items by their properties.
// Search by property
const pizzaNews = $('a[href="/pizza-news"]').text();
console.log('pizza news', pizzaNews);
This will log out the text from this element, which is “Pizza News”. Thus far, everything we have used to find these elements has been using CSS selectors. Remember to look back at that w3schools cheat sheet whenever you need.
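Property selectors do not have to be an exact match. As a rough sketch, CSS also has "starts with" (^=), "ends with" ($=), and "contains" (*=) forms, which cheeriojs understands. The links below are invented for the example and are not the real pizza.com links.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical links for illustration only.
const html = `
<a href="/pizza-news">Pizza News</a>
<a href="/pizza-fun-facts">Fun Facts</a>
<a href="https://example.com/about">About</a>`;

const $ = cheerio.load(html);

// Exact match on the whole href value.
const exact = $('a[href="/pizza-news"]').text();

// ^= matches values that START with the given text.
const startsWithCount = $('a[href^="/pizza"]').length;

// $= matches values that END with the given text.
const endsWith = $('a[href$="facts"]').text();

// *= matches values that CONTAIN the given text anywhere.
const contains = $('a[href*="example.com"]').text();

console.log({ exact, startsWithCount, endsWith, contains });
```

These are handy when the part of the link you care about is only a prefix or a suffix.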
Next we will leverage some of the tools of cheeriojs. Sometimes there is a big list of items and we only want the first in the list. Cheeriojs makes it very simple with something like this:
// Search by element and child, then keep only the first match
const firstNavLink = $('li a').first().text();
console.log('first nav link', firstNavLink);
This finds every element matching the selector li a and then keeps just the first of them. In this case, it logs out “Home”.
You can also do this with the last element.
// Search by element and child, then keep only the last match
const lastNavLink = $('li a').last().text();
console.log('last nav link', lastNavLink);
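If the item you want is somewhere in the middle, cheeriojs also has .eq(), which picks a match by its position. This is a minimal sketch with made-up nav html, not the real sample file.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical nav markup for illustration only.
const navHtml = `
<ul>
  <li><a href="/">Home</a></li>
  <li><a href="/pizza-news">Pizza News</a></li>
  <li><a href="/fun-facts">Fun Facts</a></li>
</ul>`;

const $ = cheerio.load(navHtml);
const links = $('li a');

// .eq() counts from zero, so 1 is the second match.
const secondLink = links.eq(1).text();

// Negative numbers count from the end, so -1 is the same item as .last().
const lastLink = links.eq(-1).text();

console.log({ secondLink, lastLink });
```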
Sometimes, you don’t want the text of the element but something else. Cheeriojs also allows you to grab a property from html elements, like this:
// Get property from element
const funFactsLink = $('.last a').prop('href');
console.log('fun facts link', funFactsLink);
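Links on a page are often relative (they start with a slash and leave off the site name). Below is a small sketch of one way to turn such a link into a full address, using .attr() to read the href and the URL class that is built into Nodejs. The /fun-facts path and the markup are assumptions for the example.

```typescript
import * as cheerio from 'cheerio';

// Assumed base address of the site the html came from.
const baseUrl = 'http://pizza.com';

// Hypothetical markup with a relative link; not the real sample html.
const html = `<ul><li class="last"><a href="/fun-facts">Fun Facts</a></li></ul>`;

const $ = cheerio.load(html);

// .attr() returns undefined when the attribute is missing, so check first.
const relativeLink = $('.last a').attr('href');

// new URL(relative, base) joins the two into one full address.
const fullLink = relativeLink ? new URL(relativeLink, baseUrl).href : undefined;

console.log('relative link', relativeLink);
console.log('full link', fullLink);
```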
Finally, with web scraping you will often want a lot of data from a table that all have the same selector. So you want one piece of code to go and select it all and then you want to do something with each item like push it into a csv, for example. Cheeriojs allows that very easily with this:
// Access each of a list in a loop
$('li').each(function (index, element) {
  console.log('this text', $(element).text());
});
We select all list items and loop through them with .each and then we log out the text of each one, but we certainly could do anything else. The log looks like this:
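Logging is only one option. As a rough sketch of "anything else", the loop can collect each item into rows of csv text instead. The html here is made up, and the quoting rule (wrap each value in double quotes and double up any quotes inside it) is the usual csv convention rather than anything specific to cheeriojs.

```typescript
import * as cheerio from 'cheerio';

// Hypothetical list for illustration only.
const html = `<ul><li>Home</li><li>Pizza News</li><li>Fun "Facts"</li></ul>`;

const $ = cheerio.load(html);
const rows: string[] = [];

$('li').each((index, element) => {
  const text = $(element).text().trim();
  // Wrap the value in quotes and double any quotes inside it.
  const escaped = `"${text.replace(/"/g, '""')}"`;
  rows.push(`${index},${escaped}`);
});

// Put a header row on top and join everything with new lines.
const csv = ['index,text', ...rows].join('\n');

console.log(csv);
```

From there the csv string could be written to a file, which is outside the scope of this post.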
The end of cheeriojs
That will conclude my intro to cheeriojs. It’s a very powerful but simple tool. Should you be feeling more ambitious, I strongly recommend trying it with your own html. Just go to a website, right click, and then hit “View Page Source”. From there you can select all, copy it, and use it to replace the big string in src/sample-html.ts.
If you are looking for some more advanced uses of cheeriojs, I have a blog post where I use cheeriojs when scraping craigslist.
The post Cheeriojs. Jordan Teaches Web Scraping appeared first on JavaScript Web Scraping Guy.