Abayomi Ogunnusi

Web Scraping using axios and Cheerio

Hello folks, today I will be sharing information on web scraping. Web scraping is simply the process of extracting content and data from a website. This post is for educational purposes only❗

Prerequisites

👨‍💻 Node.js
👨‍💻 Knowledge of the browser developer tools (DevTools)
👨‍💻 Knowledge of the Document Object Model (DOM)


Let's Start

🥦 Make a new directory (in my case, nodescraping) and initialize a Node.js app:
npm init -y

🎯 Result: this creates your package.json file.

🥦 Install dependencies:
npm i express axios cheerio

🎯 Result: express, axios, and cheerio are added to the dependencies in package.json.

🥦 Install dev dependencies (for development purposes). nodemon restarts our Node app automatically when files change.
npm i nodemon --save-dev

🎯 Result: nodemon is added to the devDependencies in package.json.

🥦 Edit the scripts section of package.json:

  "start": "node app.js",
  "dev": "nodemon app.js"

🎯 Result: npm run dev now starts the app with nodemon, and npm start runs it with plain node.


🥦 Create a file app.js and import the packages:

const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');

const port = process.env.PORT || 4000;

const app = express();

🥦 I will be using the axios package to fetch the website. I will be scraping Dev.to, but feel free to use any website of your choice. We will scrape the page and export our result to a CSV file.


🥦 Right-click the page and choose Inspect to find the elements you want to select (by class or id) and their tags (a, li).

🎯 This lets us see the class names we want to target.

🥦 I want to target the following: blog title, link, author, and read time.


Side note:

Always put a . before the class name you want to target (and a # before an id).

axios.get('https://dev.to/')
    .then(res => {
        const $ = cheerio.load(res.data)
        $('.crayons-story').each((index, element) => {
            // Title text of the current story card
            const blogTitle = $(element).find('.crayons-story__title').text()
            console.log(blogTitle)
        })
    }).catch(err => console.error(err))

In the logic above, I am targeting a child element inside each element with the class crayons-story.

The .text() method returns the text content of the selected element.

🥦 I repeated the whole process to select the blog link, author, and read time.


🥦 The final logic is:

const axios = require('axios');
const cheerio = require('cheerio');
const express = require('express');
// dotenv is not part of the install step above; run npm i dotenv first
require('dotenv').config();
const fs = require('fs');
// Stream that the scraped records are written to
const writeStream = fs.createWriteStream('devBlog.csv');

const port = process.env.PORT || 4000;

const app = express();

// Write the header line
writeStream.write(`author, BlogTitle, bloglink, readtime \n`);

// Fetch the Dev.to home page and write one record per story card
axios.get('https://dev.to/')
    .then(res => {
        const $ = cheerio.load(res.data)
        $('.crayons-story').each((index, element) => {
            // Strip the whitespace runs around the author and title text
            const author = $(element).find('.profile-preview-card__trigger').text().replace(/\s\s+/g, '')
            const blogTitle = $(element).find('.crayons-story__title').text().replace(/\s\s+/g, '')
            // href of the first link inside the card
            const blogLink = $(element).find('a').attr('href');
            const readTime = $(element).find('.crayons-story__tertiary').text()
            const dev = 'https://dev.to'
            const joinedBlogLink = `${dev}${blogLink}`;
            writeStream.write(`Author: ${author}, \n Blog title is : ${blogTitle} ,\n Blog link: ${joinedBlogLink}, \n Blog read time : ${readTime} \n`);
        });
    }).catch(err => console.error(err))

// Start the server
app.listen(port, () => {
    console.log(`Server Established and running on Port ⚡${port}`)
})
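
The two cleanup steps in the logic above (stripping whitespace and joining the link to the site address) can be tried on their own. This is a small sketch with helper names of my own choosing (stripWhitespaceRuns, normalizeWhitespace, toAbsoluteUrl). The first helper mirrors the regex used above; the other two are alternatives I am assuming you might prefer, not what the code above does.

```javascript
// Remove every run of two or more whitespace characters, as the regex above does.
// The whole run is dropped, so "a \n b" becomes "ab"; single spaces are kept.
function stripWhitespaceRuns(text) {
  return text.replace(/\s\s+/g, '');
}

// Alternative (an assumption, not the approach above): squeeze each whitespace
// run down to one space and trim the ends, which keeps words apart.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, ' ').trim();
}

// Alternative to string concatenation: resolve an href against the site
// address with the built-in URL class. Absolute hrefs pass through unchanged.
function toAbsoluteUrl(href, base = 'https://dev.to') {
  return new URL(href, base).href;
}

const rawAuthor = '\n      Jeremy Friesen\n    ';
console.log(stripWhitespaceRuns(rawAuthor));
console.log(normalizeWhitespace('Oct 8\n            5 min read'));
console.log(toAbsoluteUrl('/jeremyf/some-post'));
```

normalizeWhitespace would also tidy the read time field, which the logic above writes out with its line break and indentation intact.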

View the source code here.


Notes and explanation

  • The fs module writes the final result to the devBlog.csv file.
  • \n is the newline character.
  • .replace(/\s\s+/g, '') removes every run of two or more whitespace characters, such as the line breaks and indentation around the author and title text. Single spaces between words are kept.
  • axios fetches the markup from the URL.
  • cheerio parses the HTML that axios returns. Cheerio is a tool for parsing HTML and XML in Node.js.
  • The cheerio.load method loads the website markup and returns a query function, which I stored in the variable $.
  • The .each method loops through the selected elements.
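
One thing to be aware of: the file is named devBlog.csv, but each record is written as labeled lines rather than comma-separated values. If you want rows that a spreadsheet can open, here is a minimal sketch of a row formatter. The helper names csvField and csvRow are hypothetical, and the quoting follows the usual CSV convention (RFC 4180) of wrapping a field in double quotes when it contains a comma, a quote, or a line break.

```javascript
// Quote one CSV field when it contains a comma, a double quote, or a line
// break, doubling any embedded double quotes.
function csvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Join the fields into a single CSV line.
function csvRow(fields) {
  return fields.map(csvField).join(',') + '\n';
}

// Sample record built from values that appear in the result output.
const row = csvRow([
  'Jeremy Friesen',
  "Trick or Treat, I've Joined the DEV Team",
  'https://dev.to/jeremyf/trick-or-treat-i-ve-joined-the-dev-team-4283',
  '5 min read',
]);
console.log(row);
```

In the scraper, writeStream.write(csvRow([author, blogTitle, joinedBlogLink, readTime])) would take the place of the labeled string, and the header could be written the same way.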

🥦 Run the server: npm run dev

🎯 Result:

author, BlogTitle, bloglink, readtime 
Author: Gracie Gregory (she/her), 
 Blog title is : What was your win this week? ,
 Blog link: https://dev.to/devteam/what-was-your-win-this-week-5h25, 
 Blog read time :  for Oct 8
            1 min read

Author: Jeremy Friesen, 
 Blog title is : Trick or Treat, I've Joined the DEV Team ,
 Blog link: https://dev.to/jeremyf/trick-or-treat-i-ve-joined-the-dev-team-4283, 
 Blog read time : Oct 8
            5 min read

Author: Michael, 
 Blog title is : How To See Which Branch Your Teammate Is On In Android Studio ,
 Blog link: https://dev.to/gitlive/how-to-see-which-branch-your-teammate-is-on-in-android-studio-2n3i, 
 Blog read time :  for Oct 8
            1 min read

Author: Iain Freestone, 
 Blog title is : 🚀10 Trending projects on GitHub for web developers - 8th October 2021 ,
 Blog link: https://dev.to/iainfreestone/10-trending-projects-on-github-for-web-developers-8th-october-2021-102e, 
 Blog read time : Oct 8
            3 min read

Author: AM, 
 Blog title is : Django Cloud Task Queue ,
 Blog link: https://dev.to/txiocoder/django-cloud-task-queue-27g2, 
 Blog read time : Oct 8
            1 min read

Author: Ankit Anand ✨, 
 Blog title is : AWS X-Ray vs Jaeger - key features, differences and alternatives ,
 Blog link: https://dev.to/signoz/aws-x-ray-vs-jaeger-key-features-differences-and-alternatives-322, 
 Blog read time :  for Oct 8
            6 min read

Author: Raquel Román-Rodriguez, 
 Blog title is : Algo Logging: the Longest Substring of Unique Characters in JavaScript ,
 Blog link: https://dev.to/raquii/algo-logging-the-longest-substring-of-unique-characters-in-javascript-4i3, 
 Blog read time : Oct 8
            3 min read

Author: Shaher Shamroukh, 
 Blog title is : Working With Folders & Files In Ruby ,
 Blog link: https://dev.to/shahershamroukh/working-with-folders-files-in-ruby-2l97, 
 Blog read time : Oct 8
            3 min read

Author: Roberto Ruiz, 
 Blog title is : Untangling Your Logic Using State Machines ,
 Blog link: https://dev.to/robruizr/untangling-your-logic-using-state-machines-2epj, 
 Blog read time : Oct 8
            5 min read

Author: Cubite, 
 Blog title is : How To Manage Open edX® Environment Variables Using Doppler and Automating The Deployment ,
 Blog link: https://dev.to/corpcubite/how-to-manage-open-edx-environment-variables-using-doppler-and-automating-the-deployment-4c5e, 
 Blog read time : Oct 8
            5 min read

Author: OpenReplay Tech Blog, 
 Blog title is : Building an Astro Website with WordPress as a Headless CMS ,
 Blog link: https://dev.to/asayerio_techblog/building-an-astro-website-with-wordpress-as-a-headless-cms-47mo, 
 Blog read time : Oct 8
            9 min read

Author: Anamika, 
 Blog title is : How to setup Appwrite on Ubuntu ,
 Blog link: https://dev.to/noviicee/how-to-setup-appwrite-on-ubuntu-3j67, 
 Blog read time : Oct 8
            4 min read

Author: Bryan Robinson, 
 Blog title is : Building server-rendered search for static sites with 11ty Serverless, Netlify, and Algolia ,
 Blog link: https://dev.to/algolia/building-server-rendered-search-for-static-sites-with-11ty-serverless-netlify-and-algolia-13e2, 
 Blog read time :  for Oct 8
            8 min read

Author: bhupendra, 
 Blog title is : Understanding Redux without React ,
 Blog link: https://dev.to/bhupendra1011/understanding-redux-without-react-223n, 
 Blog read time : Oct 8
            4 min read

Author: Rizel Scarlett, 
 Blog title is : Add Fuzzy Search to Your Web App with this Open Source Tool ,
 Blog link: https://dev.to/github/add-fuzzy-search-to-your-web-app-with-this-open-source-tool-22d7, 
 Blog read time :  for Oct 8
            6 min read

Author: Marcelo Sousa, 
 Blog title is : Ship / Show / Ask With Reviewpad ,
 Blog link: https://dev.to/reviewpad/ship-show-ask-with-reviewpad-47jh, 
 Blog read time :  for Oct 8
            5 min read

Author: Shantanu Jana, 
 Blog title is : Random Gradient Generator using JavaScript & CSS ,
 Blog link: https://dev.to/shantanu_jana/random-gradient-generator-using-javascript-css-529c, 
 Blog read time : Oct 8
            6 min read

Author: Miles Watson, 
 Blog title is : URL Shortener with Rust, Svelte, & AWS (6/): Deploying to AWS ,
 Blog link: https://dev.to/mileswatson/url-shortener-with-rust-svelte-aws-6-deploying-to-aws-2gi0, 
 Blog read time : Oct 8
            4 min read

Author: Jon Deavers, 
 Blog title is : Publishing my first NPM package ,
 Blog link: https://dev.to/lucsedirae/publishing-my-first-npm-package-200g, 
 Blog read time : Oct 8
            3 min read

Author: Anjan Shomooder, 
 Blog title is : CSS positions: Everything you need to know ,
 Blog link: https://dev.to/thatanjan/css-positions-everything-you-need-to-know-2ng4, 
 Blog read time : Oct 8
            4 min read

Author: Alvaro Montoro, 
 Blog title is : Divtober Day 8: Growing ,
 Blog link: https://dev.to/alvaromontoro/divtober-day-8-growing-1182, 
 Blog read time : Oct 8
            1 min read

Author: Jambang J, 
 Blog title is : Deploying an discordjs bot to Qovery ,
 Blog link: https://dev.to/jambang067/deploying-an-discordjs-bot-to-qovery-51e, 
 Blog read time : Oct 8
            7 min read

Author: Sadee, 
 Blog title is : How to create responsive navbar {twitter clone} with HTML CSS ,
 Blog link: https://dev.to/codewithsadee/how-to-create-responsive-navbar-twitter-clone-with-html-css-6fa, 
 Blog read time : Oct 8
            1 min read

Author: Jeremy Grifski, 
 Blog title is : Support The Sample Programs Repo This Hacktoberfest ,
 Blog link: https://dev.to/renegadecoder94/support-the-sample-programs-repo-this-hacktoberfest-42ad, 
 Blog read time : Oct 8
            5 min read

Author: Sebastian Rindom, 
 Blog title is : Making your store more powerful with Contentful ,
 Blog link: https://dev.to/medusajs/making-your-store-more-powerful-with-contentful-3efk, 
 Blog read time :  for Oct 8
            7 min read

Author: Shalvah, 
 Blog title is : A practical tracing journey with OpenTelemetry on Node.js ,
 Blog link: https://dev.to/shalvah/a-practical-tracing-journey-with-opentelemetry-on-node-js-5706, 
 Blog read time : Oct 8
            16 min read

Author: Kingsley Ubah, 
 Blog title is : How to build an Accordion Menu using HTML, CSS and JavaScript ,
 Blog link: https://dev.to/ubahthebuilder/how-to-build-an-accordion-menu-using-html-css-and-javascript-3omb, 
 Blog read time : Oct 7
            6 min read

Author: mike1237, 
 Blog title is : Create Proxmox cloud-init templates for use with Packer ,
 Blog link: https://dev.to/mike1237/create-proxmox-cloud-init-templates-for-use-with-packer-193a, 
 Blog read time : Oct 8
            3 min read

Author: Prosper Yong, 
 Blog title is : Get Paid Writing ,
 Blog link: https://dev.to/yongdev/get-paid-writing-2i2j, 
 Blog read time : Oct 8
            1 min read

Author: Debbie O'Brien, 
 Blog title is : Understanding TypeScript ,
 Blog link: https://dev.to/debs_obrien/understanding-typescript-378g, 
 Blog read time : Oct 8
            5 min read

Author: Matias D, 
 Blog title is : Show me your portfolio ,
 Blog link: https://dev.to/matiasdandrea/show-me-your-portfolio-1l9h, 
 Blog read time : Oct 8
            1 min read

Author: Marcos Henrique, 
 Blog title is : You should use Buildpacks instead Dockerfile and I'll tell you why ,
 Blog link: https://dev.to/wakeupmh/you-should-use-buildpack-instead-dockerfile-and-i-ll-tell-you-why-2n6, 
 Blog read time : Oct 8
            2 min read

Author: Gaurav Gupta, 
 Blog title is : Smart Notes - A Build-in Public Product. BuildLog[1] ,
 Blog link: https://dev.to/gauravgupta/smart-notes-a-build-in-public-product-buildlog-1-kj6, 
 Blog read time : Oct 8
            4 min read

Author: Andrea Giammarchi, 
 Blog title is : About bitwise operations ,
 Blog link: https://dev.to/webreflection/about-bitwise-operations-29mm, 
 Blog read time : Oct 8
            10 min read

Author: AbcSxyZ, 
 Blog title is : Business models of Free and Open Source software ,
 Blog link: https://dev.to/abcsxyz/business-models-of-free-and-open-source-software-2cg8, 
 Blog read time : Oct 8
            4 min read

Author: Saharsh Laud, 
 Blog title is : Face Detection in just 15 lines of Code! (ft. Python and OpenCV) ,
 Blog link: https://dev.to/saharshlaud/face-detection-in-just-15-lines-of-code-ft-python-and-opencv-37ci, 
 Blog read time : Oct 8
            4 min read

Author: Kaustubh Joshi, 
 Blog title is : Hello, I'm HTTP and these are my request methods👋🏻 ,
 Blog link: https://dev.to/elpidaguy/hello-i-m-http-and-these-are-my-request-methods-co, 
 Blog read time : Oct 8
            3 min read

Author: SilvenLEAF, 
 Blog title is : Easiest way to create a ChatBOT from Level 0 ,
 Blog link: https://dev.to/silvenleaf/easiest-way-to-create-a-chatbot-from-level-0-31pf, 
 Blog read time : Oct 8
            6 min read

Author: whykay 👩🏻‍💻🐈🏳️‍🌈 (she/her), 
 Blog title is : 👏 New EuroPython Fellows ,
 Blog link: https://dev.to/europython/new-europython-fellows-2ob2, 
 Blog read time :  for Oct 8
            1 min read

Author: Zaw Zaw Win, 
 Blog title is : How to pass props object from child component to parent ,
 Blog link: https://dev.to/hareom284/how-to-pass-props-object-from-child-component-to-parent-2a8d, 
 Blog read time : Oct 8
            2 min read

Author: Zack DeRose, 
 Blog title is : The "DeRxJSViewModel Pattern": The E=mc^2 of State Management [Part 1] ,
 Blog link: https://dev.to/zackderose/the-derxjsviewmodel-pattern-the-e-mc-2-of-state-management-part-1-3dka, 
 Blog read time : Oct 8
            23 min read

Author: john methew, 
 Blog title is : Serverless Cloud Application Development with AWS Lambda ,
 Blog link: https://dev.to/johnmethew18/serverless-cloud-application-development-with-aws-lambda-3o7l, 
 Blog read time : Oct 8
            1 min read

Author: Antonio-Bennett, 
 Blog title is : Hacktoberfest Week 1 ,
 Blog link: https://dev.to/antoniobennett/hacktoberfest-week-1-4ebc, 
 Blog read time : Oct 8
            2 min read

Author: ZigRazor, 
 Blog title is : Hacktoberfest Beginners and Advanced Repos to Contribute to ,
 Blog link: https://dev.to/zigrazor/hacktoberfest-beginners-and-advanced-repos-to-contribute-to-p1, 
 Blog read time : Oct 8
            1 min read

Author: Rahul kumar, 
 Blog title is : Added option to share the blog on any social media | @dsabyte.com ,
 Blog link: https://dev.to/ats1999/added-option-to-share-the-blog-on-any-social-media-dsabyte-com-57oo, 
 Blog read time : Oct 8
            2 min read

Author: Kavindu Santhusa, 
 Blog title is : Top 10 trending github repos of the week💜. ,
 Blog link: https://dev.to/ksengine/top-10-trending-github-repos-of-the-week-k7, 
 Blog read time : Oct 8
            1 min read

Author: Andre Willomitzer, 
 Blog title is : OpenAQ - My first open source PR :) ,
 Blog link: https://dev.to/andrewillomitzer/openaq-my-first-open-source-pr-3k32, 
 Blog read time : Oct 8
            2 min read

Author: Kinanee Samson, 
 Blog title is : Observables Or Promises ,
 Blog link: https://dev.to/kalashin1/observables-or-promises-29a8, 
 Blog read time : Oct 8
            9 min read

Author: Amador Criado, 
 Blog title is : How to enable versioning in Amazon S3 ,
 Blog link: https://dev.to/aws-builders/how-to-enable-versioning-in-amazon-s3-17m8, 
 Blog read time :  for Oct 8
            2 min read

Author: Bartosz Zagrodzki, 
 Blog title is : React Context - jak efektywnie go używać? ,
 Blog link: https://dev.to/bartek532/react-context-jak-efektywnie-go-uzywac-41l, 
 Blog read time : Oct 8
            8 min read


Conclusion:

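This is a quick guide on how to scrape websites. Other packages can do the same job, such as puppeteer, fetch, and request (now deprecated). Keep in mind that axios and cheerio only see the HTML the server returns, so content rendered by client-side JavaScript needs a headless browser such as Puppeteer.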

Thanks for reading

Top comments (11)

Rahul Parakkat

Thanks for putting this together Abayomi.

Abayomi Ogunnusi

I am glad you found it useful

pastuh

For dynamic content this not works :)

sadevapp

try to use selenium webdriver, this will work perfectly

Abayomi Ogunnusi

Thanks for your contribution

Anaturuchi

Hi, please how did this come about: require('dotenv').config()?
It is giving me error of module not found

Abayomi Ogunnusi

You will install it. npm install dotenv

Anaturuchi

Thanks Abayomi.

Abayomi Ogunnusi

You are welcome

igdev

How can I get data in the next page if that page has pagination? Thanks for your article!

Abayomi Ogunnusi

Thanks for the feedback will read on it