
Robert Chen

Originally published at Medium.


How to Scrape a Static Website

A really quick tutorial

Prerequisites: Knowledge of React.js will be required for this tutorial.

Let’s say you want to pull data from the frontend of a website because there’s no API available. You inspect the page and see that the data is present in the HTML, so how do you gather that information for use in your app? It’s rather simple: we’ll install two libraries and write fewer than 50 lines of code to demonstrate scraping a website. To keep this tutorial simple, we’ll use https://pokedex.org/ as our example.

1) In terminal:

npx create-react-app scraping-demo
cd scraping-demo
npm i request request-promise
npm i cheerio

2) We’re going to start by using request-promise to get the HTML from https://pokedex.org/ into a console log.

In App.js:

import React, { Component } from "react";
import rp from "request-promise";
import "./App.css";

class App extends Component {
  state = {};

  componentDidMount() {
    // use the request-promise library to fetch the HTML from pokedex.org
    rp("https://pokedex.org/")
      .then(html => console.log(html))
      .catch(err => console.log("fetch failed", err));
  }

  render() {
    return (
      <div>
        <p>hello</p>
      </div>
    );
  }
}

export default App;

3) Sometimes you may come across a CORS error blocking you from fetching. For demonstration purposes, try fetching pokemon.com

rp("https://www.pokemon.com/us/pokedex/")

You should see an error like this in the console:

(screenshot: CORS error message in the browser console)

4) You can get around CORS by using https://cors-anywhere.herokuapp.com. Simply add that URL before your desired fetch URL like so:

rp("https://cors-anywhere.herokuapp.com/https://www.pokemon.com/us/pokedex/")

Now you should be able to see the HTML from pokemon.com show up in your console.
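If you end up proxying several URLs, you could wrap the prefixing in a small helper. This is just a minimal sketch using the same request-promise library; the fetchHtml name and the useProxy flag are mine, not part of any library:

import rp from "request-promise";

const PROXY = "https://cors-anywhere.herokuapp.com/";

// Hypothetical helper (fetchHtml and useProxy are illustrative names):
// prepends the CORS proxy only when useProxy is true.
const fetchHtml = (url, useProxy = false) =>
  rp(useProxy ? PROXY + url : url);

// pokemon.com needs the proxy in the browser; pokedex.org does not
fetchHtml("https://www.pokemon.com/us/pokedex/", true)
  .then(html => console.log(html))
  .catch(err => console.log("fetch failed", err));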

5) But we won’t have to use cors-anywhere for rp("https://pokedex.org/"), so let’s proceed:

(screenshot: the pokedex.org HTML logged in the console)

6) Now that we have the HTML, let’s use the cheerio library to help us grab the exact data we want from specific elements. In this example, we’ll grab all the Pokémon names and display them in a list.

In App.js:

import React, { Component } from "react";
import rp from "request-promise";
import cheerio from "cheerio";
import "./App.css";

class App extends Component {
  state = { names: [] };

  componentDidMount() {
    // use the request-promise library to fetch the HTML from pokedex.org
    rp("https://pokedex.org/")
      .then(html => {
        let names = [];
        let $ = cheerio.load(html);
        // find which element ids, classes, or tags you want by inspecting the page in the browser
        // the cheerio library lets you select elements much like querySelector
        $("#monsters-list li span").each(function(i, element) {
          names.push($(element).text().trim());
        });
        this.setState({ names });
      })
      .catch(function(err) {
        console.log("crawl failed", err);
      });
  }

  render() {
    return (
      <div>
        <ul>
          {this.state.names.map(name => {
            return <li key={name}>{name}</li>;
          })}
        </ul>
      </div>
    );
  }
}

export default App;

7) You should see a list of all the Pokémon names displayed on your screen:

(screenshot: the list of Pokémon names rendered in the app)

It’s that simple! You scraped those names from the HTML without having to directly access any backend. Now try scraping the examples on http://toscrape.com/ for practice. Enjoy your new abilities!
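For instance, here’s a minimal sketch of how you might scrape http://quotes.toscrape.com/ (one of the toscrape.com examples) with the same two libraries. The .quote, .text, and .author selectors are my assumptions from inspecting that page, so confirm them in your own browser’s dev tools first:

import rp from "request-promise";
import cheerio from "cheerio";

// Sketch: fetch the quotes page and log each quote with its author.
// The selectors below are assumptions -- verify them against the live page.
rp("http://quotes.toscrape.com/")
  .then(html => {
    const $ = cheerio.load(html);
    $(".quote").each(function(i, element) {
      const text = $(element).find(".text").text().trim();
      const author = $(element).find(".author").text().trim();
      console.log(`${text} - ${author}`);
    });
  })
  .catch(err => console.log("crawl failed", err));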


Bring your friends and come learn JavaScript in a fun, never-before-seen way! waddlegame.com

