Peter Hansen

Posted on Feb 12, 2020 • Edited on May 20, 2021

How to scrape HTML from a website built with JavaScript?

Hello World ✌️,

In this article, I would like to show how you can scrape HTML content from a website built with a JavaScript framework.

But why is it even a problem to scrape a JS-based website? 🤔

Problem Definition:

You need a browser environment to execute the JavaScript code that renders the HTML.

If you open this website (https://web-scraping-playground-site.firebaseapp.com) in your browser, you will see a simple page with some content.

However, if you send an HTTP GET request to the same URL in Postman, you will see a different response.

The response to a GET request to ‘https://web-scraping-playground-site.firebaseapp.com’ made in Postman.

What? Why does the response contain no HTML? It happens because there is no browser environment when we send requests from a server or the Postman app.

🎓 We need a browser environment for executing Javascript code and rendering content — HTML.
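To see the same thing outside Postman, here is a minimal sketch of a plain GET request in Node. It assumes Node 18 or newer, where `fetch` is available globally, and `fetchRawHTML` is a name made up for this example. Nothing in it executes the page's scripts, so the returned string is only what the server sent before any rendering.

```javascript
// Plain HTTP GET: downloads whatever the server sends.
// No browser is involved, so none of the page's scripts are executed.
// Assumes Node 18+ (global fetch); fetchRawHTML is a hypothetical helper name.
async function fetchRawHTML(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (requires network access):
// fetchRawHTML('https://web-scraping-playground-site.firebaseapp.com')
//   .then((html) => console.log(html));
```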

It sounds like an easy and fun problem to solve! In the section below 👇 I will show two ways to solve it, using:

Puppeteer — a Node library developed by Google.
Proxybot — an API service for web scraping.

Let's get started 👨‍💻

For people who prefer watching videos, there is a quick video 🎥 demonstrating how to get the HTML content of a JS-based website.

Solution using Puppeteer

The idea is simple: use Puppeteer on our server to simulate a browser environment, render the HTML of the page, and use it for scraping or something else 😉.

See the code snippet below.

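	const puppeteer = require('puppeteer');
	const express = require('express');

	const app = express();
	const port = 3000;

	// GET /?url=<page to render> responds with the rendered HTML of that page.
	app.get('/', async (req, res) => {
		const { url } = req.query;
		if (!url) {
			res.status(400).send("Bad request: 'url' param is missing!");
			return;
		}

		try {
			const html = await getPageHTML(url);
			res.status(200).send(html);
		} catch (error) {
			// Send the message text; an Error object would serialize to an empty JSON object.
			res.status(500).send(error.message);
		}
	});

	// Opens the page in a headless browser, lets its JavaScript run, and returns the resulting HTML.
	const getPageHTML = async (pageUrl) => {
		const browser = await puppeteer.launch();
		try {
			const page = await browser.newPage();
			await page.goto(pageUrl);

			// Serialize the doctype plus the rendered DOM into one string.
			return await page.evaluate('new XMLSerializer().serializeToString(document.doctype) + document.documentElement.outerHTML');
		} finally {
			// Always close the browser, even when navigation or evaluation fails.
			await browser.close();
		}
	};

	app.listen(port, () => console.log(`Example app listening on port ${port}!`));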

This code simply:

Accepts a GET request
Reads the ‘url’ query param
Returns the result of the ‘getPageHTML’ function
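The handler only checks that ‘url’ is present and then passes it straight to the browser. Below is a small sketch of a stricter check, assuming you only want to accept absolute http(s) addresses. `isHttpUrl` is a hypothetical helper and not part of the original script.

```javascript
// Returns true only for absolute http:// or https:// URLs.
// isHttpUrl is a hypothetical helper, not part of Express or Puppeteer.
function isHttpUrl(value) {
  if (typeof value !== 'string') return false;
  try {
    const parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch (error) {
    return false; // new URL() throws on strings it cannot parse
  }
}

console.log(isHttpUrl('https://web-scraping-playground-site.firebaseapp.com')); // true
console.log(isHttpUrl('file:///etc/passwd')); // false
```

In the handler, this could replace the `if(!url)` condition, so that anything other than a web address gets the 400 response.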

The ‘getPageHTML’ function is the most interesting for us because that’s where the magic happens.

The ‘magic’ is, however, pretty simple. The function performs the following steps:

Launches Puppeteer
Opens the desired URL
Lets the page execute its JS
Extracts the HTML of the page
Returns the HTML
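The same steps can also be sketched as a function that receives an already launched browser, which lets one browser serve many requests. Two details here are my assumptions rather than the approach in the script above: it passes `waitUntil: 'networkidle0'` to `page.goto` so content that loads late has a chance to render, and it reads the HTML with `page.content()`. `getRenderedHTML` is a hypothetical name.

```javascript
// Renders a page in an already launched Puppeteer browser and returns its HTML.
// getRenderedHTML is a hypothetical helper; `browser` is whatever puppeteer.launch() returned.
async function getRenderedHTML(browser, pageUrl) {
  const page = await browser.newPage();
  try {
    // 'networkidle0' resolves once there have been no network connections for 500 ms.
    await page.goto(pageUrl, { waitUntil: 'networkidle0' });
    // page.content() returns the full serialized HTML, doctype included.
    return await page.content();
  } finally {
    // Close the tab even if navigation failed, so pages do not pile up.
    await page.close();
  }
}

// Usage (inside an async function, with puppeteer installed):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch();
// const html = await getRenderedHTML(browser, 'https://web-scraping-playground-site.firebaseapp.com');
// await browser.close();
```

Waiting for the network to go idle is slower than the default, so it is worth using only for pages that fetch their data after the initial load.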

Easy-peasy 👏

Let’s run the script and send a request to http://localhost:3000?url=https://web-scraping-playground-site.firebaseapp.com in the Postman app.
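If you call the local server from code instead of Postman, it is safer to percent-encode the target address, because it contains characters such as ‘:’ and ‘/’ that have their own meaning inside a query string. A minimal sketch, assuming the server above is listening on port 3000; `buildLocalRequestUrl` is a hypothetical helper.

```javascript
// Builds the request URL for the local Express server, percent-encoding the target.
// buildLocalRequestUrl is a hypothetical helper; port 3000 matches the script above.
function buildLocalRequestUrl(targetUrl, port = 3000) {
  const requestUrl = new URL(`http://localhost:${port}/`);
  // searchParams.set() takes care of the encoding.
  requestUrl.searchParams.set('url', targetUrl);
  return requestUrl.toString();
}

console.log(buildLocalRequestUrl('https://web-scraping-playground-site.firebaseapp.com'));
```

Express decodes the query string again, so `req.query.url` in the handler still holds the original address.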

The screenshot below shows the response from our local server.

Yaaaaay 🎉🎉🎉 We did it! Great job, guys! We got HTML back!

It was easy, but it can be even easier. Let’s have a look at the second approach.

Solution using Proxybot

With this approach, we only need to send an HTTP GET request. The API service will run a virtual browser internally and send back the HTML.

https://proxybot.io/api/v1/API_KEY?render_js=true&url=your-url-here

Let’s try to call the API in the Postman app.

Yaaay 🎊🎊🎊 More HTML!

There is not much to say about the request because it is pretty straightforward. However, I want to emphasize a small detail: when calling the API, remember to include the render_js=true URL param.

Otherwise, the service will not execute JavaScript 🤓
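To avoid forgetting the parameter, the request URL can be assembled in one place. This is only a sketch that follows the URL pattern shown above. `buildProxybotUrl` is a hypothetical helper, the API key is a placeholder, and percent-encoding the target address is my assumption about what the service expects.

```javascript
// Builds a Proxybot request URL following the pattern shown above.
// buildProxybotUrl is a hypothetical helper; the key and target are placeholders.
function buildProxybotUrl(apiKey, targetUrl) {
  const requestUrl = new URL(`https://proxybot.io/api/v1/${encodeURIComponent(apiKey)}`);
  // Without render_js=true the service will not execute JavaScript.
  requestUrl.searchParams.set('render_js', 'true');
  // Assumption: the target address is passed percent-encoded.
  requestUrl.searchParams.set('url', targetUrl);
  return requestUrl.toString();
}

console.log(buildProxybotUrl('API_KEY', 'https://web-scraping-playground-site.firebaseapp.com'));
```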

Congratulations 🥳 Now you can scrape websites built with JavaScript frameworks like Angular, React, Ember, etc.

I hope this article was interesting and useful.

Proxybot is just one of the services that let you proxy your requests. If you are looking for proxy providers, here you can find a list of the best proxy providers.

Top comments (1)

cliffgold • Jan 24 '22

I want to do something similar, but for an external website. Is there a way to data scrape an ember-built site? The IDs change every build, so I can't use those.
