Web Scraper
A web scraper is a software tool or program that automates the process of collecting data from websites. It uses automated scripts or bots to extract data from web pages by reading and analyzing the HTML code of the page. Web scrapers can be used to extract a wide range of data, such as product prices, reviews, social media posts, and more.
Web scraping has become increasingly popular in recent years as a means of gathering data for research, market analysis, and business intelligence. However, it is important to note that some websites explicitly prohibit web scraping in their terms of service, and scraping certain types of data may be illegal in some jurisdictions. As such, it is important to ensure that you are not violating any laws or policies before using a web scraper.
Here we will create a very basic version of a web scraper which will allow us to scrape HTML data from websites.
Steps
- Make sure you have the latest version of Node.js installed:
node --version
- Start by creating a directory and run
npm init
This will create a package.json file, which helps with dependency and script management. Next, create an index.js file. Then install four packages using npm:
npm i express
npm i cheerio
npm i axios
npm i nodemon
- Now in your package.json, change the "start" script under "scripts" to
nodemon index.js
Your package.json should now look something like this:
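(A minimal sketch; the name, version, and exact dependency versions below are placeholders and will differ on your machine.)

```json
{
  "name": "web-scraper",
  "version": "1.0.0",
  "main": "index.js",
  "scripts": {
    "start": "nodemon index.js"
  },
  "dependencies": {
    "axios": "^1.6.0",
    "cheerio": "^1.0.0",
    "express": "^4.18.0",
    "nodemon": "^3.0.0"
  }
}
```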
- Now in your index.js, load the modules using:
const express = require('express')
const cheerio = require('cheerio')
const axios = require('axios')
- Create an Express server listening on a port like this:
const PORT = 8000
const app = express()
app.listen(PORT, () => console.log(`Server running on port ${PORT}`))
- Now take the URL of the website you want to scrape and store it in a variable. Then call axios with the url; it returns a promise that resolves with the page's HTML.
axios(url)
    .then(response => {
        // The raw HTML of the page
        const html = response.data
        // Load the HTML into cheerio so we can query it like the DOM
        const $ = cheerio.load(html)
        const articles = []
        // For every <li> element, grab its text and the href of any <a> inside it
        $('li', html).each(function() {
            const title = $(this).text()
            const url = $(this).find('a').attr('href')
            articles.push({
                title,
                url
            })
        })
        console.log(articles)
    })
    .catch(err => console.log('Error occurred'))
Here articles is an array which will store the scraped data. In the above code I have scraped a website to get the text of all the <li> tags and, if they contain an <a> tag, their links as well. Finally, all the data is pushed into the articles array.
The final code will look like this:
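(This is a sketch assembled from the snippets above; the url value is a placeholder, so replace it with the site you actually want to scrape.)

```js
const express = require('express')
const cheerio = require('cheerio')
const axios = require('axios')

const PORT = 8000
const app = express()

// Placeholder url; replace with the website you want to scrape
const url = 'https://example.com'

axios(url)
    .then(response => {
        const html = response.data
        const $ = cheerio.load(html)
        const articles = []
        // Collect the text of every <li> and the href of any <a> inside it
        $('li', html).each(function() {
            const title = $(this).text()
            const url = $(this).find('a').attr('href')
            articles.push({ title, url })
        })
        console.log(articles)
    })
    .catch(err => console.log('Error occurred'))

app.listen(PORT, () => console.log(`Server running on port ${PORT}`))
```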
- Now in the terminal run
npm start
to see the results.
Output of the above code:
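Something like the following array is printed to the console (the values here are purely illustrative; the actual titles and links depend on the site you scrape):

```js
// Illustrative output shape, not real scraped data
[
  { title: 'Some list item text', url: '/some-link' },
  { title: 'Another list item', url: '/another-link' }
]
```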