Hello World ✌️,
But why is it even a problem to scrape a JS-based website? 🤔
If you open this website (https://web-scraping-playground-site.firebaseapp.com) in your browser, you will see a simple page with some content.
However, if you send an HTTP GET request to the same URL in Postman, you will see a different response.
The response to a GET request to ‘https://web-scraping-playground-site.firebaseapp.com’ made in Postman.
What? Why does the response contain none of the HTML we saw in the browser? It happens because there is no browser environment when we send requests from a server or the Postman app, so the page's JavaScript, which builds that HTML, never runs.
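To make the problem concrete, here is a minimal sketch of the same plain GET request made from Node instead of Postman. It assumes Node 18+ (which ships a global `fetch`); the `getRawHTML` name and the injectable `fetchImpl` parameter are made up for illustration.

```javascript
// A plain HTTP GET only returns the markup the server sends.
// Nothing here executes the page's <script> tags, so any content
// that the page builds with JavaScript is missing from the result.
// `fetchImpl` defaults to Node's global fetch (assumes Node 18+).
async function getRawHTML(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  return response.text();
}
```

Calling `getRawHTML("https://web-scraping-playground-site.firebaseapp.com").then(console.log)` should print the same kind of response Postman shows, not the page you see in the browser.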
It sounds like an easy and fun problem to solve! In the section below 👇 I will show 2 ways to solve it: running puppeteer on our own server, and calling an API service that renders the page for us.
Let's get started 👨💻
For people who prefer watching videos, there is a quick video 🎥 demonstrating how to get the HTML content of a JS-based website.
The idea is simple. Use puppeteer on our server to simulate the browser environment, render the HTML of a page, and use it for scraping or something else 😉.
See the code snippet below.
This code simply:
- Accepts a GET request
- Receives the ‘url’ param
- Returns the response of the ‘getPageHTML’ function
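The three steps above can be sketched roughly like this. It is a minimal version built on Node's built-in `http` module, which is an assumption on my side (a framework such as Express would work just as well). The `extractTargetUrl` and `createServer` names are hypothetical, and ‘getPageHTML’ is passed in as a function that resolves to an HTML string.

```javascript
const http = require("http");

// Hypothetical helper: pull the 'url' query param out of the request path.
function extractTargetUrl(requestUrl) {
  // req.url is a relative path like "/?url=...", so URL needs a base to parse it.
  const parsed = new URL(requestUrl, "http://localhost");
  return parsed.searchParams.get("url");
}

// Builds a server that answers GET requests with whatever getPageHTML resolves to.
function createServer(getPageHTML) {
  return http.createServer(async (req, res) => {
    const targetUrl = req.method === "GET" ? extractTargetUrl(req.url) : null;
    if (!targetUrl) {
      res.writeHead(400, { "Content-Type": "text/plain" });
      res.end("Expected a GET request with a 'url' query param");
      return;
    }
    try {
      const html = await getPageHTML(targetUrl);
      res.writeHead(200, { "Content-Type": "text/html; charset=utf-8" });
      res.end(html);
    } catch (err) {
      // Rendering can fail (bad URL, timeout), so report it instead of hanging.
      res.writeHead(500, { "Content-Type": "text/plain" });
      res.end(`Failed to render page: ${err.message}`);
    }
  });
}
```

To serve it on port 3000, call `createServer(getPageHTML).listen(3000)`.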
The ‘getPageHTML’ function is the most interesting part for us because that’s where the magic happens.
The ‘magic’ is, however, pretty simple. The function performs the following steps:
- Launches puppeteer
- Opens the desired url
- Executes the page's JS internally
- Extracts the HTML of the page
- Returns the HTML
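Here is a minimal sketch of those steps using puppeteer's standard calls (`launch`, `newPage`, `goto`, `content`, `close`). The `waitUntil: "networkidle0"` option and the `puppeteerLib` parameter are my assumptions: the first gives the page's JS time to finish loading data, the second only exists so the function can be tried with a stand-in object. It needs `npm install puppeteer` to run for real.

```javascript
// `puppeteerLib` defaults to the real package and is loaded lazily,
// only when the function is called without a stand-in.
async function getPageHTML(url, puppeteerLib = require("puppeteer")) {
  // Launch a headless browser.
  const browser = await puppeteerLib.launch();
  try {
    const page = await browser.newPage();
    // Open the URL. The page's JS runs inside the headless browser.
    // "networkidle0" waits until there are no network connections for 500 ms.
    await page.goto(url, { waitUntil: "networkidle0" });
    // Extract and return the rendered HTML.
    return await page.content();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

Closing the browser in `finally` matters on a server, because a leaked browser process per failed request adds up quickly.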
Let’s run the script and send a request to http://localhost:3000?url=https://web-scraping-playground-site.firebaseapp.com in the Postman app.
The screenshot below shows the response from our local server.
Yaaaaay 🎉🎉🎉 We Did it! Great job guys! We got HTML back!
It was easy, but it can be even easier. Let’s have a look at the second approach.
With this approach, we only need to send an HTTP GET request. The API service will run a virtual browser internally and send you back the HTML.
Let’s try to call the API in the Postman app.
Yaaay 🎊🎊🎊 More HTML!
There is not much to say about the request, because it is pretty straightforward. However, I want to emphasize a small detail: when calling the API, remember to include the render_js=true url param.
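To show where that param goes, here is a small sketch that builds the request URL. Only `render_js=true` comes from this article; the `https://api.example.com/scrape` endpoint and the `url` param name are placeholders, so check your API service's docs for the real ones (and for things like an API key).

```javascript
// Builds the GET URL for a rendering API.
// `apiEndpoint` and the "url" param name are hypothetical placeholders.
function buildRenderRequestUrl(apiEndpoint, targetUrl) {
  const requestUrl = new URL(apiEndpoint);
  // searchParams percent-encodes the target URL for us.
  requestUrl.searchParams.set("url", targetUrl);
  // Ask the service to execute the page's JS before returning the HTML.
  requestUrl.searchParams.set("render_js", "true");
  return requestUrl.toString();
}

const exampleRequestUrl = buildRenderRequestUrl(
  "https://api.example.com/scrape",
  "https://web-scraping-playground-site.firebaseapp.com"
);
```

The resulting string can be pasted into Postman or passed to any HTTP client as a plain GET request.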
I hope this article was interesting and useful.