DEV Community

Kishan Sheth

Next.js 14 Booking App with Live Data Scraping using Scraping Browser

Introduction

In the ever-evolving landscape of web development, the ability to efficiently gather, process, and display data from external sources has become increasingly valuable. Whether for market research, competitive analysis, or customer insights, web scraping plays a crucial role in unlocking the vast potential of the internet's data.

This blog post is a comprehensive guide to building a robust Next.js application that scrapes flight data from Kayak, one of the leading travel search engines, by leveraging Next.js alongside modern technologies such as BullMQ, Redis, and Puppeteer.

Tech Stack

Features

  • 🚀 Next.js 14 App Directory with Tailwind CSS - Experience the sleek and modern UI powered by the latest Next.js 14 and styled with Tailwind CSS for that perfect look and feel.
  • 🔗 API Routes & Server Actions - Dive into seamless backend integration with Next.js 14's API routes and server actions, ensuring efficient data handling and server-side logic execution.
  • 🕷 Scraping with Puppeteer, Redis, and BullMQ - Harness the power of Puppeteer for advanced web scraping, with Redis and BullMQ managing queues and jobs for robust backend operations.
  • 🔑 JWT Tokens for Authentication and Authorization - Secure your app with JWT tokens, providing a reliable method for authentication and authorization across your platform.
  • 💳 Stripe for Payment Gateways - Integrate Stripe for seamless payment processing, enabling secure and easy transactions for booking trips, flights, and hotels.
  • ✈️ Book Trips, Flights, and Hotels with Stripe Payment Gateway - Make your travel booking experience effortless with our Stripe-powered payment system.
  • 📊 Scrape Live Data from Multiple Websites - Stay ahead with real-time data scraping from multiple sources, keeping your app updated with the latest information.
  • 💾 Store the Scraped Data in PostgreSQL with Prisma - Leverage PostgreSQL and Prisma for efficient storage and management of your scraped data, ensuring reliability and speed.
  • 🔄 Zustand for State Management - Enjoy smooth and manageable state management in your app with Zustand, simplifying state logic and enhancing performance.
  • 😈 Best Feature of the App - Scraping the Unscrapable Data with Bright Data's Scraping Browser.
  • 😈 Best Feature of the App - Scraping the Unscrapable Data with Bright Data's Scraping Browser.

Scraping Browser Meme

Bright Data's Scraping Browser provides us with an automatic captcha-solving feature that helps us scrape the un-scrapable data.

Step 1: Setting Up the Next.js Application

  1. Create a Next.js App: Start by creating a new Next.js app if you haven't already. You can do this by running the following command in your terminal:


npx create-next-app@latest booking-app


  2. Navigate to Your App Directory: Change into your newly created app directory:


cd booking-app



Step 2: Installing Required Packages

You'll need to install several packages, including Redis, BullMQ, and Puppeteer Core. Run the following command to install them:



npm install ioredis bullmq puppeteer-core


  • ioredis is a robust Redis client for Node.js, enabling communication with Redis.
  • bullmq manages job and message queues with Redis as the backend.
  • puppeteer-core allows you to control an external browser for scraping purposes.

Step 3: Setting Up Redis Connection

Create a file (e.g., redis.js) in a suitable directory (e.g., lib/) to configure the Redis connection:



// lib/redis.js
import Redis from 'ioredis';

// Use REDIS_URL from environment or fallback to localhost
const REDIS_URL = process.env.REDIS_URL || 'redis://localhost:6379';
const connection = new Redis(REDIS_URL);

export { connection };



Step 4: Configuring BullMQ Queue

Set up the BullMQ queue by creating another file (e.g., queue.js) in the same directory as your Redis configuration:



// lib/queue.js
import { Queue } from 'bullmq';
import { connection } from './redis';

export const importQueue = new Queue('importQueue', {
  connection,
  defaultJobOptions: {
    attempts: 2,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
  },
});


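Note that nothing in this step actually adds jobs to importQueue; that happens wherever a scrape is triggered (in this app, an API route called from the search component). As a minimal sketch, assuming a Kayak-style URL pattern, the enqueue step could look like this (buildFlightJobData is a hypothetical helper, not part of the article's code):

```javascript
// Hypothetical helper: turn search parameters into the payload the worker
// will receive as job.data. The Kayak URL pattern is an assumption.
function buildFlightJobData(source, destination, date) {
  return {
    url: `https://www.kayak.com/flights/${source}-${destination}/${date}?sort=bestflight_a`,
  };
}

// In an API route handler you would then enqueue the job and return its id:
//
//   import { importQueue } from "@/lib/queue";
//   const job = await importQueue.add("importQueue", buildFlightJobData(source, destination, date));
//   return Response.json({ id: job.id });

console.log(buildFlightJobData("LON", "NYC", "2024-03-23").url);
// https://www.kayak.com/flights/LON-NYC/2024-03-23?sort=bestflight_a
```

The job id returned to the client is what the frontend later polls against to learn when scraping has finished.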

Step 5: Next.js Instrumentation Setup

Next.js allows instrumentation, which can be enabled in your Next.js configuration. You'll also need to create a worker file for job processing.

1. Enable Instrumentation in Next.js: Add the following to your next.config.js to enable instrumentation:



// next.config.js
module.exports = {
  experimental: {
    instrumentationHook: true,
  },
};



2. Create a Worker for Job Processing: In your application, create a file (instrumentation.js) to handle job processing. This worker will use Puppeteer for scraping tasks:



// instrumentation.js
export const register = async () => {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    const { Worker } = await import('bullmq');
    const puppeteer = await import('puppeteer-core');
    const { connection } = await import('./lib/redis');
    const { importQueue } = await import('./lib/queue');

    new Worker('importQueue', async (job) => {
      // Job processing logic with Puppeteer goes here
    }, {
      connection,
      concurrency: 10,
      removeOnComplete: { count: 1000 },
      removeOnFail: { count: 5000 },
    });
  }
};



Step 6: Setting up Bright Data's Scraping Browser

Before setting up Bright Data's Scraping Browser, let's talk about what a scraping browser is.

What is Bright Data's scraping browser?

Bright Data's Scraping Browser is a cutting-edge tool for automated web scraping, designed to seamlessly integrate with Puppeteer, Playwright, and Selenium. It offers a suite of website unblocking features, including proxy rotation, CAPTCHA solving, and more, to enhance scraping efficiency. Ideal for complex web scraping requiring interactions, it allows scalability by hosting unlimited browser sessions on Bright Data’s infrastructure. For more details, visit Bright Data.

Bright Data Scraping Browser

Step 1: Navigate to Bright Data's Website

Begin by heading over to Brightdata.com. This is your gateway to accessing the wealth of web scraping resources and tools offered by Bright Data.

Bright Data home page

Step 2: Create an Account

Once you're on Bright Data's website, sign up to create a new account. You'll be prompted to enter essential information to get your account up and running.

Login/Signup Bright Data

Step 3: Select Your Product

On the product selection page, look for the Proxies & Scraping Infrastructure product. This product is specifically designed to meet your web scraping needs, offering powerful tools and features for data extraction.

Bright Data products

Step 4: Add a New Proxy

Within the Proxies & Scraping Infrastructure page, you'll find an "Add new" button. Click it to start the process of adding a new scraping browser to your toolkit.

New Proxy

Step 5: Choose the Scraping Browser

A dropdown list will appear, from which you should select the scraping browser option. This tells Bright Data that you intend to set up a new scraping browser environment.

Choose Scraping Browser

Step 6: Name Your Scraping Browser

Give your new scraping browser a unique name. This helps in identifying and managing it later, especially if you plan to use multiple browsers for different scraping projects.

Scraping Browser Name

Step 7: Add the Browser

After naming your browser, click on the "add" button. This action finalizes the creation of your new scraping browser.

Add the scraping browser

Step 8: View Your Scraping Browser Details

Upon adding your scraping browser, you will be directed to a page where you can see all the details of your newly created scraping browser. This information is crucial for integration and use.

Scraping Browser Details

Step 9: Access Code and Integration Examples

Look for the "check out code and integration examples" button. Clicking this will provide you with a comprehensive view of how to integrate and use your scraping browser across multiple programming languages and libraries. This resource is invaluable for developers looking to customize their scraping setup.

Code and Integration Examples Button

Step 10: Integrate Your Scraping Browser

Finally, copy the SBR_WS_ENDPOINT variable. This is a critical piece of information that you will need to integrate into your source code, allowing your applications to communicate with the scraping browser you've just set up.

Scraping Browser Endpoint

By following these detailed steps, you have successfully created a scraping browser within Bright Data's platform, ready to tackle your web scraping tasks. Remember, Bright Data offers extensive documentation and support to help you maximize your scraping projects' efficiency and effectiveness. Whether you're gathering market intelligence, conducting research, or monitoring competitive landscapes, your newly set up scraping browser is a powerful tool in your data collection arsenal.

Step 7: Implementing the Scraping Logic with Puppeteer

Continuing from where we left off in setting up our Next.js application for scraping flight data, the next critical step is to implement the actual scraping logic. This process involves utilizing Puppeteer to connect to a browser instance, navigate to the target URL (in our case, Kayak), and scrape the necessary flight data. The code snippet provided outlines a sophisticated method for achieving this goal, seamlessly integrating with our previously established BullMQ worker setup. Let's break down the components of this scraping logic and understand how it fits into our application.

Establishing a Connection to the Browser

The first step in our scraping process is to establish a connection to the browser through Puppeteer. This is accomplished by utilizing the puppeteer.connect method, which connects to an existing browser instance using a WebSocket endpoint (SBR_WS_ENDPOINT). This environment variable should be set to the WebSocket URL of the scraping browser service you're using, such as Bright Data:



const browser = await puppeteer.connect({
  browserWSEndpoint: SBR_WS_ENDPOINT,
});



Opening a New Page and Navigating to the Target URL

Once connected, we create a new page in the browser and navigate to the target URL specified in the job data. This URL is the specific Kayak search result page from which we intend to scrape flight data:



const page = await browser.newPage();
await page.goto(job.data.url);



Scraping Flight Data

The core of our logic lies in scraping the flight data from the page. We achieve this by using page.evaluate, a Puppeteer method that allows us to run scripts in the context of the browser. Within this script, we wait for the necessary elements to load and then proceed to collect flight information:

  • Flight Selector: We target elements with the class .nrc6-wrapper, which contain flight details.
  • Data Extraction: For each flight element, we extract details such as the airline logo, departure and arrival times, flight duration, airline name, and price. The departure and arrival times are cleaned to remove unnecessary numeric values at the end, ensuring we capture the time accurately.
  • Price Processing: The price is extracted as an integer after removing all non-numeric characters, ensuring it can be used for numerical operations or comparisons.

The extracted data is structured into an array of flight objects, each containing the details mentioned above:



const scrappedFlights = await page.evaluate(async () => {
  // Data extraction logic
  const flights = [];
  // Process each flight element
  // ...
  return flights;
});


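The time-cleanup and price-processing rules described above can be exercised outside the browser context. The sketch below mirrors that logic; the sample strings are assumptions about what Kayak's DOM text looks like, not captured output:

```javascript
// Strip a trailing day-offset marker such as "+1" (plus any surrounding
// whitespace) from a raw time string, e.g. "10:15 pm+1" -> "10:15 pm".
const extractTime = (rawTime) => rawTime.replace(/[0-9+\s]+$/, "").trim();

// Reduce a displayed price such as "$1,234" to an integer for comparisons.
const parsePrice = (rawPrice) => parseInt(rawPrice.replace(/[^\d]/g, ""), 10);

// Sample text mimicking the ".vmXl" element: departure and arrival
// separated by an en dash.
const [rawDeparture, rawArrival] = "7:30 pm – 10:15 pm+1".split("–");

console.log(extractTime(rawDeparture)); // "7:30 pm"
console.log(extractTime(rawArrival)); // "10:15 pm"
console.log(parsePrice("$1,234")); // 1234
```

Keeping these as small pure functions also makes the scraper's parsing rules testable without launching a browser at all.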

Error Handling and Cleanup

Our scraping logic is wrapped in a try-catch block to handle any potential errors gracefully during the scraping process. Regardless of the outcome, we ensure the browser is closed properly in the finally block, maintaining resource efficiency and preventing potential memory leaks:



try {
  // Scraping logic
} catch (error) {
  console.log({ error });
} finally {
  await browser.close();
  console.log("Browser closed successfully.");
}



The entire code



const SBR_WS_ENDPOINT = process.env.SBR_WS_ENDPOINT;

export const register = async () => {

  if (process.env.NEXT_RUNTIME === "nodejs") {


    const { Worker } = await import("bullmq");
    const puppeteer = await import("puppeteer-core");
    const { connection } = await import("./lib/redis");
    const { importQueue } = await import("./lib/queue");

    new Worker(
      "importQueue",
      async (job) => {
        const browser = await puppeteer.connect({
          browserWSEndpoint: SBR_WS_ENDPOINT,
        });

        try {
          const page = await browser.newPage();

          console.log("in flight scraping");
          console.log("Connected! Navigating to " + job.data.url);
          await page.goto(job.data.url);
          console.log("Navigated! Scraping page content...");
          const scrappedFlights = await page.evaluate(async () => {
            await new Promise((resolve) => setTimeout(resolve, 5000));

            const flights = [];

            const flightSelectors = document.querySelectorAll(".nrc6-wrapper");

            flightSelectors.forEach((flightElement) => {
              const airlineLogo = flightElement.querySelector("img")?.src || "";
              const [rawDepartureTime, rawArrivalTime] = (
                flightElement.querySelector(".vmXl")?.innerText || ""
              ).split("–"); // the en dash separates departure and arrival times

              // Function to extract time and remove numeric values at the end
              const extractTime = (rawTime) => {
                const timeWithoutNumbers = rawTime
                  .replace(/[0-9+\s]+$/, "")
                  .trim();
                return timeWithoutNumbers;
              };

              const departureTime = extractTime(rawDepartureTime);
              const arrivalTime = extractTime(rawArrivalTime);
              const flightDuration = (
                flightElement.querySelector(".xdW8")?.children[0]?.innerText ||
                ""
              ).trim();

              const airlineName = (
                flightElement.querySelector(".VY2U")?.children[1]?.innerText ||
                ""
              ).trim();

              // Extract price
              const price = parseInt(
                (
                  flightElement.querySelector(".f8F1-price-text")?.innerText ||
                  ""
                )
                  .replace(/[^\d]/g, "")
                  .trim(),
                10
              );

              flights.push({
                airlineLogo,
                departureTime,
                arrivalTime,
                flightDuration,
                airlineName,
                price,
              });
            });

            return flights;
          });

          // Use the results here — e.g. persist scrappedFlights in
          // PostgreSQL via Prisma (as the full repo does); otherwise the
          // scraped data is discarded when the job ends.
          console.log({ scrappedFlights });
        } catch (error) {
          console.log({ error });
        } finally {
          await browser.close();
          console.log("Browser closed successfully.");
        }
      },
      {
        connection,
        concurrency: 10,
        removeOnComplete: { count: 1000 },
        removeOnFail: { count: 5000 },
      }
    );
  }
};




Step 8: Flight Search Feature

Building upon our flight data scraping functionality, let's integrate a comprehensive flight search feature into our Next.js application. This feature will provide users with a dynamic interface to search for flights by specifying the source, destination, and date. Leveraging the powerful Next.js framework alongside a modern UI library and state management, we create an engaging and responsive flight search experience.

Key Components of the Flight Search Feature

  1. Dynamic City Selection: The feature includes an autocomplete functionality for source and destination inputs, powered by a pre-defined list of city-airport codes. As users type, the application filters and displays matching cities, enhancing the user experience by making it easier to find and select airports.

  2. Date Selection: Users can select their intended flight date through a date input, providing flexibility in planning their travel.

  3. Scraping Status Monitoring: After initiating a scraping job, the application monitors the job's status through periodic API calls. This asynchronous checking allows the app to update the UI with the status of the scraping process, ensuring users are informed of the progress and results.
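The autocomplete behaviour in point 1 boils down to a small pure filter over the city-airport map. A sketch, using a tiny stand-in for the real cityAirportCode mapping:

```javascript
// Tiny stand-in for the app's cityAirportCode map (IATA code -> city name).
const cityAirportCode = {
  LHR: "London",
  JFK: "New York",
  DEL: "New Delhi",
  CDG: "Paris",
};

// Return up to five { code, city } entries whose city name contains the
// query, case-insensitively — the shape the component stores in state.
function matchCities(query) {
  return Object.entries(cityAirportCode)
    .filter(([, city]) => city.toLowerCase().includes(query.toLowerCase()))
    .map(([code, city]) => ({ code, city }))
    .slice(0, 5);
}

console.log(matchCities("new"));
// [ { code: 'JFK', city: 'New York' }, { code: 'DEL', city: 'New Delhi' } ]
```

Capping the result at five entries keeps the dropdown short; for a much larger city list you might swap the linear filter for an indexed search, but for a few hundred airports this is plenty fast.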

Complete Code for the Flight Search Component



"use client";
import { useAppStore } from "@/store";
import { USER_API_ROUTES } from "@/utils/api-routes";
import { cityAirportCode } from "@/utils/city-airport-codes";
import { Button, Input, Listbox, ListboxItem } from "@nextui-org/react";
import axios from "axios";
import Image from "next/image";
import { useRouter } from "next/navigation";
import React, { useEffect, useRef, useState } from "react";
import { FaCalendarAlt, FaSearch } from "react-icons/fa";

const SearchFlights = () => {
  const router = useRouter();
  const { setScraping, setScrapingType, setScrappedFlights } = useAppStore();
  const [loadingJobId, setLoadingJobId] = useState<number | undefined>(undefined);
  const [source, setSource] = useState("");
  const [sourceOptions, setSourceOptions] = useState<
    { city: string; code: string; }[]
  >([]);
  const [destination, setDestination] = useState("");
  const [destinationOptions, setDestinationOptions] = useState<
    { city: string; code: string; }[]
  >([]);
  const [flightDate, setFlightDate] = useState(() => {
    const today = new Date();
    return today.toISOString().split("T")[0];
  });
  // Holds the polling interval id so it can be cleared on completion/unmount
  const jobIntervalRef = useRef<ReturnType<typeof setInterval>>();

  const handleSourceChange = (query: string) => {
    const matchingCities = Object.entries(cityAirportCode)
      .filter(([, city]) => city.toLowerCase().includes(query.toLowerCase()))
      .map(([code, city]) => ({ code, city }))
      .splice(0, 5);

    setSourceOptions(matchingCities);
  };

  const destinationChange = (query: string) => {
    const matchingCities = Object.entries(cityAirportCode)
      .filter(([, city]) => city.toLowerCase().includes(query.toLowerCase()))
      .map(([code, city]) => ({ code, city }))
      .splice(0, 5);

    setDestinationOptions(matchingCities);
  };

  const startScraping = async () => {
    if (source && destination && flightDate) {
      const data = await axios.get(`${USER_API_ROUTES.FLIGHT_SCRAPE}?source=${source}&destination=${destination}&date=${flightDate}`);
      if (data.data.id) {
        setLoadingJobId(data.data.id);
        setScraping(true);
        setScrapingType("flight");
      }
    }
  };

  useEffect(() => {
    if (loadingJobId) {
      const checkIfJobCompleted = async () => {
        try {
          const response = await axios.get(`${USER_API_ROUTES.FLIGHT_SCRAPE_STATUS}?jobId=${loadingJobId}`);
          if (response.data.status) {
            setScrappedFlights(response.data.flights);
            clearInterval(jobIntervalRef.current);
            setScraping(false);
            setScrapingType(undefined);
            router.push(`/flights?date=${flightDate}`);
          }
        } catch (error) {
          console.log(error);
        }
      };
      jobIntervalRef.current = setInterval(checkIfJobCompleted, 3000);
    }

    return () => clearInterval(jobIntervalRef.current);
  }, [loadingJobId]);

  return (
    <div className="h-[90vh] flex items-center justify-center">
      <div className="absolute left-0 top-0 h-[100vh] w-[100vw] max-w-[100vw] overflow-hidden overflow-x-hidden">
        <Image src="/flight-search.png" fill alt="Search" />
      </div>
      <div className="absolute h-[50vh] w-[60vw] flex flex-col gap-5">
        {/* UI and functionality for flight search */}
      </div>
    </div>
  );
};

export default SearchFlights;


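The scraping-status polling in the component (an interval that re-checks the job status every 3 seconds) can be distilled into a framework-free helper. This is a sketch: the { status, flights } response shape follows the component, while pollUntilDone and the stub API are illustrative names:

```javascript
// Repeatedly invoke an async check() until it reports completion, then
// resolve with the scraped flights — the same pattern the useEffect uses.
function pollUntilDone(check, intervalMs) {
  return new Promise((resolve, reject) => {
    const timer = setInterval(async () => {
      try {
        const result = await check();
        if (result.status) {
          clearInterval(timer); // stop polling once the job reports done
          resolve(result.flights);
        }
      } catch (error) {
        clearInterval(timer);
        reject(error);
      }
    }, intervalMs);
  });
}

// Usage with a stub status API that "completes" on its third call:
let calls = 0;
const fakeStatusApi = async () =>
  ++calls >= 3 ? { status: true, flights: ["BA123"] } : { status: false };

pollUntilDone(fakeStatusApi, 10).then((flights) =>
  console.log(flights.length) // logs 1 once the stub reports completion
);
```

In the real component the interval id lives in a ref so the effect's cleanup can clear it if the user navigates away before the job finishes.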

Step 9: Flight Search Page UI

Flight Search Page

Displaying Flight Search Results

After successfully scraping flight data, the next crucial step is to present these results to the users in a user-friendly manner. The Flights component in your Next.js application is designed for this purpose.



"use client";

import { useAppStore } from "@/store";
import { USER_API_ROUTES } from "@/utils/api-routes";
import { Button } from "@nextui-org/react";
import axios from "axios";
import Image from "next/image";
import { useRouter, useSearchParams } from "next/navigation";
import React from "react";
import { FaChevronLeft } from "react-icons/fa";
import { MdOutlineFlight } from "react-icons/md";

const Flights = () => {
  const router = useRouter();
  const searchParams = useSearchParams();
  const date = searchParams.get("date");
  const { scrappedFlights, userInfo } = useAppStore();
  const getRandomNumber = () => Math.floor(Math.random() * 41);

  const bookFLight = async (flightId: number) => {};

  return (
    <div className="m-10 px-[20vw] min-h-[80vh]">
      <Button
        className="my-5"
        variant="shadow"
        color="primary"
        size="lg"
        onClick={() => router.push("/search-flights")}
      >
        <FaChevronLeft /> Go Back
      </Button>
      <div className="flex-col flex gap-5">
        {scrappedFlights.length === 0 && (
          <div className="flex items-center justify-center py-5 px-10 mt-10 rounded-lg text-red-500 bg-red-100 font-medium">
            No Flights found.
          </div>
        )}
        {scrappedFlights.map((flight: any) => {
          const seatsLeft = getRandomNumber();
          return (
            <div
              key={flight.id}
              className="grid grid-cols-12 border bg-gray-200 rounded-xl font-medium drop-shadow-md"
            >
              <div className="col-span-9 bg-white rounded-l-xl p-10 flex flex-col gap-5">
                <div className="grid grid-cols-4 gap-4">
                  <div className="flex flex-col gap-3 font-medium">
                    <div>
                      <div className="relative w-20 h-16">
                        <Image src={flight.logo} alt="airline name" fill />
                      </div>
                    </div>
                    <div>{flight.name}</div>
                  </div>
                  <div className="col-span-3 flex justify-between">
                    <div className="flex flex-col gap-2">
                      <div className="text-blue-600">From</div>
                      <div>
                        <span className="text-3xl">
                          <strong>{flight.departureTime}</strong>
                        </span>
                      </div>
                      <div>{flight.from}</div>
                    </div>
                    <div className="flex flex-col items-center justify-center gap-2">
                      <div className="bg-violet-100 w-max p-3 text-4xl text-blue-600 rounded-full">
                        <MdOutlineFlight />
                      </div>
                      <div>
                        <span className="text-lg">
                          <strong>Non-stop</strong>
                        </span>
                      </div>
                      <div>{flight.duration}</div>
                    </div>

                    <div className="flex flex-col gap-2">
                      <div className="text-blue-600">To</div>
                      <div>
                        <span className="text-3xl">
                          <strong>{flight.arrivalTime}</strong>
                        </span>
                      </div>
                      <div>{flight.to}</div>
                    </div>
                  </div>
                </div>
                <div className="flex justify-center gap-10 bg-violet-100 p-3 rounded-lg">
                  <div className="flex">
                    <span>Airplane&nbsp;&nbsp;</span>
                    <span className="text-blue-600 font-semibold">
                      Boeing 787
                    </span>
                  </div>
                  <div className="flex">
                    <span>Travel Class:&nbsp;&nbsp;</span>
                    <span className="text-blue-600 font-semibold">Economy</span>
                  </div>
                </div>
                <div className="flex justify-between font-medium">
                  <div>
                    Refundable <span className="text-blue-600"> $5 ecash</span>
                  </div>
                  <div
                    className={`${
                      seatsLeft > 20 ? "text-green-500" : "text-red-500"
                    }`}
                  >
                    Only {seatsLeft} Seats Left
                  </div>
                  <div className="cursor-pointer">Flight Details</div>
                </div>
              </div>
              <div className="col-span-3 bg-violet-100 rounded-r-xl h-full flex flex-col items-center justify-center gap-5">
                <div>
                  <div>
                    <span className="line-through font-light">
                      ${flight.price + 140}
                    </span>
                  </div>
                  <div className="flex items-center gap-2">
                    <span className="text-5xl font-bold">${flight.price}</span>
                    <span className="text-blue-600">20% OFF</span>
                  </div>
                </div>
                <Button
                  variant="ghost"
                  radius="full"
                  size="lg"
                  color="primary"
                  onClick={() => {
                    if (userInfo) bookFLight(flight.id);
                  }}
                >
                  {userInfo ? "Book Now" : "Login to Book"}
                </Button>
              </div>
            </div>
          );
        })}
      </div>
    </div>
  );
};

export default Flights;



Flight Search Results


Discover the Complete Guide and Codebase

The sections and code snippets shared above represent just a fraction of the full functionality and code necessary to build a robust flight data scraping and search application using Next.js. To grasp the entirety of this project, including advanced features, optimizations, and best practices, I invite you to dive deeper through my comprehensive resources available online.

Watch the Detailed Explanation on YouTube

For a step-by-step video guide that walks you through the development process, coding nuances, and functionality of this application, check out my YouTube video. This tutorial is designed to provide you with a deeper understanding of the concepts, allowing you to follow along at your own pace and gain valuable insights into Next.js application development.

Explore the Full Code on GitHub

If you're eager to explore the code in its entirety, head over to my GitHub repository. There, you'll find the complete codebase, including all the components, utilities, and setup instructions you need to get this application running on your own machine.

Arklyte Travel Planner

project-image

Travel Planner App with Live Web Scraping from various sources using Bright Data scraping browser.

Project Screenshots:

project-screenshot





Conclusion

Building a comprehensive application like the flight data scraping and search tool with Next.js showcases the power and versatility of modern web development tools and frameworks. Whether you're a seasoned developer looking to refine your skills or a beginner eager to dive into web development, these resources are tailored to support your journey. Watch the detailed tutorial on YouTube, explore the full code on GitHub, and join the conversation to enhance your development expertise and contribute to the vibrant developer community.

Top comments (11)

Patryk Maron

I find that using Puppeteer or any headless browser for scraping is, in most cases, overkill. It's good for automated end-to-end testing, but for scraping data there are simpler and much more performant approaches.

In your case, you're grabbing data from Kayak. After a quick inspection in the network tab and playing around with the website, they return us all the data we need in the initial document HTML and we can use their routing as an API:

https://www.kayak.co.uk/flights/LON-NYC/2024-03-23/2024-03-30?sort=bestflight_a

The above url gives us back flights between London and New York, between the two dates specified. We can also sort the data the way we want it.

Now, a simple fetch to get the initial HTML is sufficient; this way we avoid all the other data that comes through after the initial page load (analytics, client-side fetches, CSS and JS scripts, etc.)

That initial document HTML has JavaScript code baked into it, with all the data hydrated in JSON format, which we can extract easily using any HTML parsing library.

Pawel Kadluczka

This doesn't always work. For instance, you cannot do this with Craigslist gallery as Craigslist builds the DOM dynamically. In my project, I ended up using Puppeteer.

Patryk Maron

Do you mean the carousel on the craigslist page?

The swiper element looks like some jQuery thing that dynamically adds images.

All the data for the image URLs is baked into the initial HTML:

Image description

In any case, if the above weren't a thing, then you would listen to XHR calls in the network tab to see where the images come from server-side and try to "hack" around it.

In 90% of cases Puppeteer is overkill; the other 10% of the time it isn't.

Pawel Kadluczka

I was talking about the gallery, which is a list of posts for a given category. Fetching used to return just this:

<noscript id="no-js"><div>
<p>We've detected that JavaScript is not enabled in your browser.</p>
<p>You must enable JavaScript to use craigslist.</p>
</div></noscript>
<div id="unsupported-browser">
<p>We've detected you are using a browser that is missing critical features.</p>
<p>Please visit craigslist from a modern browser.</p>
</div>

(details: github.com/juliomalegria/python-cr...)

Looks like this has changed in the past few months, and I am now able to get the list of posts just with curl, so as you say using Puppeteer for this is overkill (and it is slow). But a few months ago my curl request would only return the HTML above.

Patryk Maron

When you use curl, no Javascript is involved, and then it does not send the API request you are looking for.

Doing a fetch request with something like Node.js, you can set your User-Agent header or play around with Postman or similar, and you will avoid the no-JS response.

Thread Thread
 
moozzyk profile image
Pawel Kadluczka

HTML shown in the browser via View Page Source was the same. I found scripts that downloaded a bunch of cryptic JSON files and used them to build the DOM.

Patryk Maron

Bonus to the above: I would remove Zustand and not store the data client-side like that; then the component that displays flights doesn't need to be a client-side component. We can get all the data with server components and make the app stateless, relying on the backend for the data.

Vincent

Actually, if you play with metasearch sites like Kayak, Expedia, etc. for a long time, you'd probably find their web apps very tricky for such hacks. You'd either be blocked by Cloudflare or rate-limited frequently. I'm not saying using Playwright would work around that completely, but it does get through at a higher ratio.

Anyway, I found the author's solution kind of great for a homelab showcase, though it definitely needs lots of polishing for serious usage. Am I understanding right, Kishan?

Arjun Vijay Prakash

This is a great in-depth article, brother! Thanks! 🙌

Kishan Sheth

Thanks  ❤️

Norberto Cáceres

How come jobs are never added to the importQueue?