I find that using Puppeteer or any headless browser for scraping is, in most cases, overkill. It's good for automated end-to-end testing, but for scraping data there are simpler and much more performant approaches.
In your case, you're grabbing data from Kayak. After a quick inspection of the network tab and some playing around with the website, it turns out they return all the data we need in the initial document HTML, and we can use their routing as an API:
https://www.kayak.co.uk/flights/LON-NYC/2024-03-23/2024-03-30?sort=bestflight_a
The above URL returns flights between London and New York for the two dates specified, and we can also sort the data the way we want.
A simple fetch of the initial HTML is therefore sufficient; this way we avoid everything that loads after the initial page (analytics, client-side fetches, CSS and JS scripts, etc.).
That initial document HTML has JavaScript code baked into it, with all the data hydrated in JSON format, which we can easily extract using any HTML parsing library.
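The fetch-and-extract idea above can be sketched like this. The script marker (`window.__DATA__`) and the JSON shape are made-up placeholders for illustration; for the real site you'd inspect the page source to find where the payload actually lives, and fetch the page with `await fetch(url).then(r => r.text())` instead of the inline sample:

```javascript
// Sketch: pull JSON that a server bakes into the initial HTML.
// The marker name and payload shape below are assumptions, not
// Kayak's real structure -- inspect the page source to find them.
function extractEmbeddedJson(html, marker) {
  // Find an assignment like `window.__DATA__ = {...};` and capture
  // the object literal (non-greedy up to the closing `};`).
  const re = new RegExp(marker + "\\s*=\\s*(\\{[\\s\\S]*?\\});");
  const match = html.match(re);
  return match ? JSON.parse(match[1]) : null;
}

// Tiny stand-in for a fetched document (a real page would come from
// `await fetch(url).then(r => r.text())`).
const sampleHtml = `
  <html><body>
    <script>window.__DATA__ = {"flights":[{"price":321}]};</script>
  </body></html>`;

const data = extractEmbeddedJson(sampleHtml, "window\\.__DATA__");
console.log(data.flights[0].price); // 321
```

A regex is enough when the payload is a single global assignment; for anything messier, an HTML parser like cheerio to isolate the right `<script>` tag first is more robust.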
This doesn't always work. For instance, you cannot do this with the Craigslist gallery, because Craigslist builds the DOM dynamically. In my project I ended up using Puppeteer.
Do you mean the carousel on the Craigslist page?
The swiper element looks like some jQuery thing that dynamically adds images.
All the data for the images URLs are baked in the initial HTML:
In any case, if that weren't so, you would listen for XHR calls in the network tab to see where the images come from on the server side and try to "hack" around it.
In 90% of cases Puppeteer is overkill; in the other 10% it isn't.
I was talking about the gallery, which is a list of posts for a given category. Fetching used to return just this:
(details: github.com/juliomalegria/python-cr...)
Looks like this has changed in the past few months, and I am now able to get the list of posts just with `curl`, so as you say using Puppeteer for this is overkill (and it is slow). But a few months ago my `curl` request would only return the HTML above.

When you use `curl`, no JavaScript runs, so the page never sends the API request you are looking for.
With a fetch request from something like Node.js, you can set your User-Agent header (or play around in Postman or similar) and avoid the no-JS response.
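As a sketch of the header trick: some sites serve a stripped "no-JS" fallback to unknown clients, and sending browser-like headers often gets the full document back. The specific header strings below are illustrative assumptions; copy real values from your own browser's network tab:

```javascript
// Sketch: build fetch options with a browser-like User-Agent so a
// site doesn't serve its stripped "no-JS" fallback. The header
// values are illustrative assumptions, not guaranteed to work.
function browserLikeOptions(extraHeaders = {}) {
  return {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
      Accept: "text/html,application/xhtml+xml",
      ...extraHeaders,
    },
  };
}

// Usage (no network call made here):
// const html = await fetch(url, browserLikeOptions()).then((r) => r.text());
console.log(browserLikeOptions().headers["User-Agent"].startsWith("Mozilla/5.0")); // true
```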
The HTML shown in the browser via View Page Source was the same. I found scripts that downloaded a bunch of cryptic JSON files and used them to build the DOM.
Bonus to the above: I would remove zustand and not store the data client-side like that; then the component that displays flights doesn't need to be a client component. We can get all the data with server components and make the app stateless, relying on the backend for the data.
Actually, if you play with metasearch sites like Kayak, Expedia, etc. for a long time, you'll probably find their web apps very tricky for such hacks: you'll either be blocked by Cloudflare or rate-limited frequently. I'm not saying Playwright works around this completely, but it does get through at a higher rate.
Anyway, I found the author's solution kind of great for a homelab showcase, though it definitely needs lots of polishing for serious usage. Am I understanding right, Kishan?
This is a great in-depth article, brother! Thanks! 🙌
Thanks ❤️
How come jobs are never added to the `importQueue`?