I find that using Puppeteer or any headless browser for scraping is, in most cases, overkill. It's good for automated end-to-end testing, but for scraping data there are simpler and much more performant approaches.
In your case, you're grabbing data from Kayak. After a quick inspection of the network tab and some playing around with the website, it turns out they return all the data we need in the initial document HTML, and we can use their routing as an API:
https://www.kayak.co.uk/flights/LON-NYC/2024-03-23/2024-03-30?sort=bestflight_a
The above URL returns flights between London and New York for the two dates specified, and we can also sort the data the way we want.
A simple fetch of the initial HTML is therefore sufficient; this way we avoid everything that loads after the initial page (analytics, client-side fetches, CSS and JS scripts, etc.).
That initial document HTML has JavaScript code baked into it, with all the data hydrated in JSON format, which we can easily extract using any HTML parsing library.
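The fetch-and-extract idea above can be sketched like this. The script marker (`window.__DATA__`) and the JSON shape are made-up placeholders for illustration; for the real site you'd inspect the page source to find where the payload actually lives, and fetch the page with `await fetch(url).then(r => r.text())` instead of the inline sample:

```javascript
// Sketch: pull JSON that a server bakes into the initial HTML.
// The marker name and payload shape below are assumptions, not
// Kayak's real structure -- inspect the page source to find them.
function extractEmbeddedJson(html, marker) {
  // Find an assignment like `window.__DATA__ = {...};` and capture
  // the object literal (non-greedy up to the closing `};`).
  const re = new RegExp(marker + "\\s*=\\s*(\\{[\\s\\S]*?\\});");
  const match = html.match(re);
  return match ? JSON.parse(match[1]) : null;
}

// Tiny stand-in for a fetched document (a real page would come from
// `await fetch(url).then(r => r.text())`).
const sampleHtml = `
  <html><body>
    <script>window.__DATA__ = {"flights":[{"price":321}]};</script>
  </body></html>`;

const data = extractEmbeddedJson(sampleHtml, "window\\.__DATA__");
console.log(data.flights[0].price); // 321
```

A regex is enough when the payload is a single global assignment; for anything messier, an HTML parser like cheerio to isolate the right `<script>` tag first is more robust.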
This doesn't always work. For instance, you cannot do this with the Craigslist gallery, because Craigslist builds the DOM dynamically. In my project I ended up using Puppeteer.
Do you mean the carousel on the Craigslist page?
The swiper element looks like some jQuery thing that dynamically adds images.
All the data for the images URLs are baked in the initial HTML:
In any case, if that weren't so, you would listen for XHR calls in the network tab to see where the images come from on the server side and try to "hack" around it.
In 90% of cases Puppeteer is overkill; in the other 10% it isn't.
I was talking about the gallery, which is a list of posts for a given category. Fetching used to return just this:
(details: github.com/juliomalegria/python-cr...)
Looks like this has changed in the past few months, and I am now able to get the list of posts just with `curl`, so as you say using Puppeteer for this is overkill (and it is slow). But a few months ago my `curl` request would only return the HTML above.

When you use `curl`, no JavaScript runs, so the page never sends the API request you are looking for.
With a fetch request from something like Node.js, you can set your User-Agent header (or play around in Postman or similar) and avoid the no-JS response.
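As a sketch of the header trick: some sites serve a stripped "no-JS" fallback to unknown clients, and sending browser-like headers often gets the full document back. The specific header strings below are illustrative assumptions; copy real values from your own browser's network tab:

```javascript
// Sketch: build fetch options with a browser-like User-Agent so a
// site doesn't serve its stripped "no-JS" fallback. The header
// values are illustrative assumptions, not guaranteed to work.
function browserLikeOptions(extraHeaders = {}) {
  return {
    headers: {
      "User-Agent":
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
      Accept: "text/html,application/xhtml+xml",
      ...extraHeaders,
    },
  };
}

// Usage (no network call made here):
// const html = await fetch(url, browserLikeOptions()).then((r) => r.text());
console.log(browserLikeOptions().headers["User-Agent"].startsWith("Mozilla/5.0")); // true
```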
The HTML shown in the browser via View Page Source was the same. I found scripts that downloaded a bunch of cryptic JSON files and used them to build the DOM.
Bonus to the above: I would remove zustand and not store the data client-side like that; then the component that displays flights doesn't need to be a client component. We can get all the data with server components and make the app stateless, relying on the backend for the data.
Actually, if you play with metasearch sites like Kayak, Expedia, etc. for a long time, you'll probably find their web apps very tricky for such hacks: you'll either be blocked by Cloudflare or rate-limited frequently. I'm not saying Playwright works around this completely, but it does get through at a higher rate.
Anyway, I found the author's solution kind of great for a homelab showcase, though it definitely needs lots of polishing for serious usage. Am I understanding right, Kishan?
This is a great in-depth article, brother! Thanks! 🙌
Thanks ❤️
How come jobs are never added to the `importQueue`?