DEV Community

Next.js 14 Booking App with Live Data Scraping using Scraping Browser

Kishan Sheth on February 22, 2024

Patryk Maron

I find that using Puppeteer or any other headless browser for scraping is overkill in most cases. It's good for automated end-to-end testing, but for scraping data there are simpler and much more performant approaches.

In your case, you're grabbing data from Kayak. After a quick inspection in the network tab and some playing around with the website, it turns out they return all the data we need in the initial document HTML, and we can use their routing as an API:

https://www.kayak.co.uk/flights/LON-NYC/2024-03-23/2024-03-30?sort=bestflight_a

The URL above returns flights between London and New York for the two dates specified. We can also sort the results the way we want.
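
A minimal sketch of assembling such a URL in code, assuming the route pattern in the example above holds for other cities and dates. The `FlightSearch` shape and the `buildSearchUrl` helper are hypothetical names used only for illustration; they are not part of Kayak or of the article's code.

```typescript
// Hypothetical search parameters, mirroring the parts of the example URL.
interface FlightSearch {
  origin: string; // e.g. "LON"
  destination: string; // e.g. "NYC"
  departDate: string; // YYYY-MM-DD
  returnDate: string; // YYYY-MM-DD
  sort?: string; // e.g. "bestflight_a"
}

// Builds a URL following the pattern /flights/{from}-{to}/{depart}/{return}?sort=...
// The pattern is inferred from the single example URL, so treat it as an assumption.
function buildSearchUrl(
  search: FlightSearch,
  baseUrl: string = "https://www.kayak.co.uk"
): string {
  const route =
    "/flights/" +
    encodeURIComponent(search.origin) +
    "-" +
    encodeURIComponent(search.destination) +
    "/" +
    search.departDate +
    "/" +
    search.returnDate;
  const url = new URL(route, baseUrl);
  if (search.sort) {
    url.searchParams.set("sort", search.sort);
  }
  return url.toString();
}
```

Calling it with `LON`, `NYC`, the two dates and `bestflight_a` reproduces the example URL.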

Now a simple fetch of the initial HTML is sufficient. This way we avoid everything else that comes through after the initial page load (analytics, client-side fetches, CSS and JS files, etc.).

That initial document HTML has JavaScript code baked into it, with all the data hydrated in JSON format, which we can extract easily using any HTML parsing library.
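
As a rough sketch of that extraction step: the snippet below assumes the hydrated data sits in inline `<script type="application/json">` tags, which may not match the real markup (it could just as well be assigned to a variable inside a regular script). To stay dependency-free it uses a regular expression as a stand-in for a proper HTML parser; the `extractEmbeddedJson` name is made up for this example.

```typescript
// Collects the contents of every inline <script type="application/json"> tag
// that parses as JSON. A real HTML parser would be more robust than this regex.
function extractEmbeddedJson(html: string): unknown[] {
  const pattern =
    /<script\b[^>]*type=["']application\/json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const payloads: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      payloads.push(JSON.parse(match[1]));
    } catch {
      // The script body was not valid JSON, so skip it.
    }
  }
  return payloads;
}
```

Each returned payload still has to be inspected by hand to find where the flight data actually lives.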

Pawel Kadluczka

This doesn't always work. For instance, you cannot do this with the Craigslist gallery, as Craigslist builds the DOM dynamically. In my project, I ended up using Puppeteer.

Patryk Maron

Do you mean the carousel on the Craigslist page?

The swiper element looks like some jQuery thing that dynamically adds images.

All the data for the image URLs is baked into the initial HTML:

[screenshot]

In any case, if that weren't so, you would watch the XHR calls in the network tab to see where the images are coming from on the server side, and try to "hack" around it.

In 90% of cases Puppeteer is overkill; in the other 10% it isn't.

Pawel Kadluczka

I was talking about the gallery, which is a list of posts for a given category. Fetching used to return just this:

<noscript id="no-js"><div>
<p>We've detected that JavaScript is not enabled in your browser.</p>
<p>You must enable JavaScript to use craigslist.</p>
</div></noscript>
<div id="unsupported-browser">
<p>We've detected you are using a browser that is missing critical features.</p>
<p>Please visit craigslist from a modern browser.</p>
</div>

(details: github.com/juliomalegria/python-cr...)

Looks like this has changed in the past few months, and I am now able to get the list of posts just with curl, so as you say using Puppeteer for this is overkill (and it is slow). But a few months ago my curl request would only return the HTML above.

Patryk Maron

When you use curl, no JavaScript is involved, so it does not send the API request you are looking for.

With a fetch request from something like Node.js, you can set the User-Agent header (or experiment in Postman or similar) and you will avoid the no-js response.
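
A minimal sketch of such a request, assuming Node.js 18+ where `fetch` is built in and the User-Agent header can be set freely. The `fetchHtml` helper and the `FetchLike` type are hypothetical names for this example, and whether a given site actually serves full content in response is not guaranteed.

```typescript
// Narrow description of the parts of fetch this sketch relies on,
// so a stub can be passed in for testing.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> }
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

// Requests a page with an explicit User-Agent and returns the raw HTML.
async function fetchHtml(
  url: string,
  userAgent: string,
  fetchImpl: FetchLike = fetch
): Promise<string> {
  const response = await fetchImpl(url, {
    headers: { "User-Agent": userAgent, Accept: "text/html" },
  });
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.text();
}
```

The User-Agent string to pass is up to you; copying the one your own browser sends is a common starting point.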

Pawel Kadluczka

The HTML shown in the browser via View Page Source was the same. I found scripts that downloaded a bunch of cryptic JSON files and used them to build the DOM.

Patryk Maron

As a bonus to the above, I would remove Zustand and not store the data client side like that; then the component that displays flights doesn't need to be a client component. We can fetch all the data with server components and make the app stateless, relying on the backend to get the data.

Vincent

Actually, if you play with metasearch sites like Kayak, Expedia, etc. for long enough, you'll probably find their web apps very tricky to handle with such hacks. You'd either be blocked by Cloudflare or rate-limited frequently. I am not saying that using Playwright would work around this completely, but it does get through at a higher rate.

Anyway, I find the author's solution great for a homelab showcase, though it definitely needs a lot of polishing for serious use. Am I understanding that right, Kishan?

Arjun Vijay Prakash

This is a great in-depth article, brother! Thanks! 🙌

Kishan Sheth

Thanks  ❤️

Norberto Cáceres

How come jobs are never added to the importQueue?