
Robert Chen

Originally published at Medium.


How to Scrape a Static Website

A really quick tutorial

Prerequisites: Knowledge of React.js will be required for this tutorial.

Let’s say you want to pull data from the frontend of a website because there’s no API available. You inspect the page and see that the data is present in the HTML, so how do you gather that information for use in your app? It’s rather simple: we’ll install two libraries and write fewer than 50 lines of code to demonstrate scraping a website. To keep this tutorial simple, we’ll use https://pokedex.org/ as our example.

1) In terminal:

npx create-react-app scraping-demo
cd scraping-demo
npm i request request-promise
npm i cheerio

2) We’re going to start by using request-promise to get the HTML from https://pokedex.org/ into a console log.

In App.js:

import React, { Component } from "react";
import rp from "request-promise";
import "./App.css";

class App extends Component {
  state = {};

  componentDidMount() {
    // use the request-promise library to fetch the HTML from pokedex.org
    rp("https://pokedex.org/")
      .then(html => console.log(html))
      .catch(err => console.log("fetch failed", err));
  }

  render() {
    return (
      <div>
        <p>hello</p>
      </div>
    );
  }
}

export default App;

3) Sometimes you may come across a CORS error blocking you from fetching. For demonstration purposes, try fetching pokemon.com

rp("https://www.pokemon.com/us/pokedex/")

You should see an error like this in the console:

(screenshot: CORS error message in the browser console)

4) You can get around CORS by using https://cors-anywhere.herokuapp.com. Simply add that URL before your desired fetch URL like so:

rp("https://cors-anywhere.herokuapp.com/https://www.pokemon.com/us/pokedex/")

Now you should be able to see the HTML from pokemon.com show up in your console.
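If you end up proxying several URLs, you could wrap the prefixing in a small helper. This is just a minimal sketch using the same request-promise library; the fetchHtml name and the useProxy flag are mine, not part of any library:

import rp from "request-promise";

const PROXY = "https://cors-anywhere.herokuapp.com/";

// Hypothetical helper (fetchHtml and useProxy are illustrative names):
// prepends the CORS proxy only when useProxy is true.
const fetchHtml = (url, useProxy = false) =>
  rp(useProxy ? PROXY + url : url);

// pokemon.com needs the proxy in the browser; pokedex.org does not
fetchHtml("https://www.pokemon.com/us/pokedex/", true)
  .then(html => console.log(html))
  .catch(err => console.log("fetch failed", err));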

5) But we won’t have to use cors-anywhere for rp("https://pokedex.org/"), so let’s proceed:

(screenshot: the pokedex.org HTML logged in the console)

6) Now that we have the HTML, let’s use the cheerio library to help us grab the exact data we want from specific elements. In this example, we’ll grab all the Pokémon names and display them in a list.

In App.js:

import React, { Component } from "react";
import rp from "request-promise";
import cheerio from "cheerio";
import "./App.css";

class App extends Component {
  state = { names: [] };

  componentDidMount() {
    // use the request-promise library to fetch the HTML from pokedex.org
    rp("https://pokedex.org/")
      .then(html => {
        let names = [];
        let $ = cheerio.load(html);
        // find which element ids, classes, or tags you want by inspecting the page in the browser
        // the cheerio library lets you select elements much like querySelector
        $("#monsters-list li span").each(function(i, element) {
          names.push($(element).text().trim());
        });
        this.setState({ names });
      })
      .catch(function(err) {
        console.log("crawl failed", err);
      });
  }

  render() {
    return (
      <div>
        <ul>
          {this.state.names.map(name => {
            return <li key={name}>{name}</li>;
          })}
        </ul>
      </div>
    );
  }
}

export default App;

7) You should see a list of all the Pokémon names displayed on your screen:

(screenshot: the list of Pokémon names rendered in the app)

It’s that simple! You scraped those names from the HTML without having to directly access any backend. Now try scraping the examples on http://toscrape.com/ for practice. Enjoy your new abilities!
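For instance, here’s a minimal sketch of how you might scrape http://quotes.toscrape.com/ (one of the toscrape.com examples) with the same two libraries. The .quote, .text, and .author selectors are my assumptions from inspecting that page, so confirm them in your own browser’s dev tools first:

import rp from "request-promise";
import cheerio from "cheerio";

// Sketch: fetch the quotes page and log each quote with its author.
// The selectors below are assumptions -- verify them against the live page.
rp("http://quotes.toscrape.com/")
  .then(html => {
    const $ = cheerio.load(html);
    $(".quote").each(function(i, element) {
      const text = $(element).find(".text").text().trim();
      const author = $(element).find(".author").text().trim();
      console.log(`${text} - ${author}`);
    });
  })
  .catch(err => console.log("crawl failed", err));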


Bring your friends and come learn JavaScript in a fun, never-before-seen way! waddlegame.com

