peytono

Posted on Apr 14, 2024 • Edited on Apr 19, 2024

Learn Web Scraping with Cheerio

#webscraping #cheerio #programming #node

An overview

Haven’t heard of web scraping before? If you want to traverse web pages through Node.js, you’re in the right place! Web scraping is the process of loading or connecting to an external web page and extracting data or HTML from it. It is an old practice, but it is now commonly used to gather data for machine learning and is still very useful for building new things. Each web scraping library has its own API. Cheerio’s syntax is very similar to jQuery’s. For example, to grab all img tags from your loaded HTML:

const $img = $('img');

Cheerio is a simpler solution to web scraping, but there are situations where it won’t work: when you need to actually interact with the page, or when the content you want is not accessible through Cheerio.

Is Cheerio right for you?

There are many more options out there than just Cheerio, but if you need HTML from a web page, it’s a great one. If all you need is to select elements and do some DOM manipulation, go with Cheerio. It’s very fast and has a gentle learning curve. Unfortunately, if the site you need uses dynamic content or pagination, that’s a sign Cheerio might not work. In that case, you may want to check out Puppeteer: built as a browser automation tool, it lets you interact with a site and get any data the page fetches. Either way, you’ll want to inspect your intended site with Chrome DevTools to see what you’re working with! While you’re digging, check for a robots.txt file to see whether the site has crawling guidelines to follow.
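If you would rather check those guidelines in code than by eye, here is a rough sketch of reading a robots.txt file. The function name disallowedForAll and the sample file are hypothetical. The parser is deliberately simplified: it treats every User-agent line as the start of a new group and ignores Allow rules and wildcards, so treat it as a starting point rather than a complete implementation.

```javascript
// Collect the Disallow paths that apply to all crawlers ("User-agent: *").
// Simplified on purpose: ignores Allow rules, wildcards, and Crawl-delay.
function disallowedForAll(robotsTxt) {
  const disallowed = [];
  let appliesToAll = false;

  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // drop comments and whitespace
    if (!line) continue;

    const separator = line.indexOf(':');
    if (separator === -1) continue;

    const field = line.slice(0, separator).trim().toLowerCase();
    const value = line.slice(separator + 1).trim();

    if (field === 'user-agent') {
      // Each User-agent line is treated as the start of a new group.
      appliesToAll = value === '*';
    } else if (field === 'disallow' && appliesToAll && value) {
      disallowed.push(value);
    }
  }

  return disallowed;
}

// Hypothetical robots.txt content, not taken from any real site.
const sample = `
User-agent: *
Disallow: /admin/
Disallow: /search

User-agent: ExampleBot
Disallow: /
`;

console.log(disallowedForAll(sample)); // [ '/admin/', '/search' ]
```

In a real script you would download the file from the site’s /robots.txt path first and skip any page whose path starts with one of the returned prefixes.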

Installation

Getting started with Cheerio is quick and easy! Cheerio has support for yarn, npm, and pnpm. To install through npm:

npm install cheerio

Then you can import Cheerio into your file. If you’re using ES modules (ES6 or newer), use:

import * as cheerio from 'cheerio';

or, for CommonJS:

const cheerio = require('cheerio');

Now you’re ready to use Cheerio!

Getting Started

There are several different loading methods in Cheerio, but the most widely supported, and the only one available out of the box, is load. It has one required argument: a string of HTML. Here’s a very basic example call:

const $ = cheerio.load(`
  <body>
    <h1>Heading</h1>
    <p>Some website content</p>
  </body>
`); 
console.log($('p').text());

This logs:

Some website content

Now you know the Cheerio basics!

Functional Usage

That’s all you need if you already have the HTML, but you likely won’t. Let’s say you want to know the most popular Reddit communities. First, go to Reddit and inspect the page in your browser’s dev tools. You can select the element you’d like by clicking the icon with a dotted-line box and a cursor. You’ll then see the element’s tag name along with all of its classes. Now we can write some code!

Since the load method takes a string of HTML, we first need to get the site’s HTML. This is easy to do with Axios (npm install axios). Once you have the response, call load on the response data. Then we can initialize a popular-communities variable set to the result of the selector, passing in the tag name followed by each class, with each one separated by a period (noticing the jQuery?).

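const axios = require('axios');
const cheerio = require('cheerio');

// Download the page, then hand the HTML string to Cheerio.
axios.get('https://www.reddit.com/?feed=home')
  .then(({ data }) => {
    const $ = cheerio.load(data);
    const $popCommunities = $('span.text-neutral-content.block.text-ellipsis.whitespace-nowrap.overflow-hidden');
    // .text() joins every match into one string, so split on the 'r/' prefix.
    console.log($popCommunities.text().split('r/').slice(1));
  })
  .catch((err) => console.error('Failed getting reddit HTML', err));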

Logs to the console:

[
  'DestinyTheGame',   'anime',
  'destiny2',         'FortNiteBR',
  'dndnext',          'buildapc',
  'techsupport',      'jailbreak',
  'LivestreamFail',   'legaladvice',
  'LifeProTips',      'AskCulinary',
  'Philippines',      'memes',
  'Rainbow6',         'Sneakers',
  'learnprogramming', 'RedDeadOnline',
  'jobs'
]

If you’re trying to get data unavailable through Cheerio

We discussed earlier that Cheerio is sometimes not the right fit, so here’s an example. It can be hard to know whether you’re using a new technology incorrectly or it just can’t do what you need! Let’s look at www.neworleans.com to find some upcoming events.

Let’s try the same approach as before. Inspecting the page, we’d want the div with the classes ‘shared-item’ and ‘item’. Spoiler: that query doesn’t find anything! So I’ll back out to the parent, with the class ‘shared-items-container’, and look at the HTML inside it.

axios.get(
  'https://www.neworleans.com/events/upcoming-events/?skip=0&categoryid=40&startDate=04%2F11%2F2024&endDate=05%2F11%2F2024&sort=title',
)
  .then(({ data }) => {
    const $ = cheerio.load(data);
    const $sharedItems = $('div.shared-items-container');

    // Print whatever HTML is inside the parent container.
    console.log($sharedItems.html());
  })
  .catch((err) => console.error('Failed getting events HTML', err));

The log we see is:

<div class="shared-items">
  <div 
    class="container" 
    data-sv-items-wrapper="" 
    data-sv-items=""
  >
  </div>
</div>

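Uh oh, the container div is empty! The events are almost certainly added by JavaScript that runs in the browser after the page loads, and Cheerio only parses the HTML the server returns; it never runs the page’s scripts. So this is data that Cheerio does not have access to. In this case, you may want to check out Puppeteer.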

Outro

If you’re familiar with APIs but weren’t finding the information you needed, just wanted to learn about a new topic, or were confused about why you weren’t getting the data you expected, I hope this helped! Now you know when to use Cheerio and how to get started. I’m pretty amazed at how quickly Cheerio can load and traverse the entire DOM of a site. Happy web scraping!

Resources

Cheerio Docs
Cheerio vs Puppeteer
What are the best practices for ensuring accurate and reliable web scraping results?
Web Scraping With JavaScript And NodeJS
Puppeteer Docs
