Playwright Scraping infinite loading & pagination

Before you start

Web data scraping itself isn’t illegal, but it can be illegal (or in a grey area) depending on these three things:

  • The type of data you are scraping (personal, copyrighted…)
  • How you plan to use the scraped data
  • How you extracted the data from the website

So, before you start scraping data, double-check your plan to make sure you're conducting legal and ethical web scraping by asking these questions: Am I scraping personal or copyrighted data? Am I scraping data from behind a login? Am I violating the Terms and Conditions?

If your answer to all of these questions is No, then your web scraping is legal. However, if you answer Yes to any of them, then you should take a step back and do a full legal review of your web scraping to make sure you're not scraping the web illegally.

Check out this for more info.

Introduction

Playwright is a powerful tool developed by Microsoft that allows developers to write reliable end-to-end tests and perform browser automation tasks with ease. What sets Playwright apart is its ability to work seamlessly across multiple browsers (Chromium, Firefox, and WebKit), providing a consistent and efficient way to interact with web pages, extract data, and automate repetitive tasks. Moreover, it supports several programming languages, such as Node.js, Python, Java, and .NET, which makes it a versatile choice for web scraping projects.
Whether you're scraping public data for analysis, building a web crawler, or automating manual workflows, Playwright has you covered.
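As a quick illustration of that cross-browser support, here is a minimal sketch (not part of the original project) that launches each of the three engines and prints the page title; the URL is just a placeholder:


const { chromium, firefox, webkit } = require("playwright");

(async () => {
    // Run the same automation against all three browser engines.
    for (const browserType of [chromium, firefox, webkit]) {
        const browser = await browserType.launch({ headless: true });
        const page = await browser.newPage();
        await page.goto("https://example.com"); // placeholder URL
        console.log(`${browserType.name()}: ${await page.title()}`);
        await browser.close();
    }
})();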

Scraping process

In this post, we'll explore how to harness the power of Playwright for web scraping using JavaScript. You can find the full code in the GitHub repository.
In our scenario, we will extract public products from an e-commerce website. Note that scraping is not explicitly prohibited in its terms of service, and to avoid sending a lot of requests that could strain the server's resources, we will only extract the first three pages of products (just for learning purposes).
The scraping process involves the following steps:

  1. Waiting for the page to load.
  2. Scrolling down to load more content.
  3. Scraping products from the current page.
  4. Clicking on the next page link and repeating the process.
  5. Saving the scraped data to a CSV or JSON file, and downloading the photos.

[Diagram: overview of the scraping process]

Coding with Playwright

Let's dive in and discover how to use Playwright to extract content from multiple pages by following the next page link.
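The snippets below rely on a few imports, configuration constants (url, productType, searchContainerPath, paginationPath, nextPageLinkPath, resultsDir, numberOfPagesToScrape, withDownloadingImages) and a log helper that are defined in the repository but not shown in this post. Here is a minimal sketch of what that setup might look like; the concrete values are assumptions, so adapt them to your target site:


const { chromium } = require("playwright");
const fs = require("fs");
const path = require("path");

// Assumed configuration, the real values live in the GitHub repository.
const url = "https://www.example-ecommerce.com/search"; // placeholder URL
const productType = "products";
const numberOfPagesToScrape = 3; // only the first three pages, for learning purposes
const withDownloadingImages = true;
const resultsDir = path.join(__dirname, "results");

// CSS selectors used by the snippets below (assumed, adapt to the target site).
const searchContainerPath = ".search .v-lazy:nth-child(4) .search-view-item";
const paginationPath = ".v-pagination";
const nextPageLinkPath = ".v-pagination__next";

// Simple logging helper with an optional timestamp.
const log = (message, withTimestamp = true) =>
    console.log(withTimestamp ? `[${new Date().toISOString()}] ${message}` : message);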

1. scrollDownUntilNoMoreContent

This function is designed to scroll down a web page until no more content is loaded. Here's a breakdown of its functionality:

  • Presses the "End" key to scroll to the bottom of the page.
  • Waits for a search container element ".search .v-lazy:nth-child(4) .search-view-item" to appear on the page.
  • Scrolls down the page incrementally until the bottom is reached, using a timer-based approach (setInterval) to simulate scrolling behavior.
  • Waits for the page to reach a stable state with no network activity "networkidle".


/**
 * Scrolls down the page until no more content is loaded.
 * @param {Page} page - The Playwright page object.
 * @param {number} nbPage - The page number for logging purposes.
 */
const scrollDownUntilNoMoreContent = async (page, nbPage) => {
    log(`scroll down until no more content in page : ${nbPage}`);

    // Presses the "End" key to scroll to the bottom of the page.
    await page.keyboard.press("End");

    // Waits for the search container element to appear on the page.
    await page.waitForSelector(searchContainerPath);

    // Scrolls down the page until the bottom is reached.
    await page.waitForFunction(() => {
        let totalHeight = 0,
            scrollHeight = 0;
        // Return the promise so waitForFunction resolves once scrolling reaches the bottom.
        return new Promise((resolve) => {
            const distance = 100,
                delay = 100;
            const timer = setInterval(() => {
                scrollHeight = document.body.scrollHeight;
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= scrollHeight) {
                    clearInterval(timer);
                    resolve(true);
                }
            }, delay);
        });
    });

    // Waits for the page to reach a stable state with no network activity.
    await page.waitForLoadState("networkidle");
};



2. scrapePageContent

This function is responsible for scraping the content of a web page, particularly products, and extracting relevant information:

  • Calls the scrollDownUntilNoMoreContent function to ensure all content is loaded before scraping.
  • Utilizes Playwright's page.evaluate method to execute JavaScript code within the context of the page, extracting product data based on CSS selectors.
  • Constructs an array of objects containing product information, including ID, image URL, title, price, and more.
  • Returns the array of scraped product data.


/**
 * Scrapes the content of the current page.
 * @param {Page} page - The Playwright page object.
 * @param {number} nbPage - The page number for logging purposes.
 * @returns {Array} - An array containing the scraped products.
 */
const scrapePageContent = async (page, nbPage) => {
    // Scrolls down the page until no more content is loaded.
    await scrollDownUntilNoMoreContent(page, nbPage);

    // Logs the page URL being scraped.
    log(`Scraping content of page ${nbPage}: ${await page.url()}`);

    // Evaluates JavaScript in the context of the page to scrape product data.
    const currentPageProducts = await page.evaluate(
        /**
         * Extracts product information from the page.
         * @param {string} searchContainerPath - The CSS selector for the search items.
         * @returns {Array} - An array containing the scraped product data.
         */
        ({ searchContainerPath }) => {
            const items = document.querySelectorAll(searchContainerPath);
            return [...items].map((item) => {
                const product = item.querySelector(".v-card.o-announ-card");
                if (product) {
                    const imgEl = product.querySelector(
                        ".o-announ-card-image img"
                    );
                    const titleEl = product.querySelector(
                        "h3.o-announ-card-title"
                    );
                    const priceEl = product.querySelectorAll(
                        "span.price > span > div"
                    );
                    const infosEl = product.querySelectorAll(
                        "div.col.py-0.px-0.my-1 > span.v-chip"
                    );
                    const othersEl = product.querySelectorAll(
                        ".mb-1.d-flex.flex-column.flex-gap-1.line-height-1 > span"
                    );
                    return {
                        id: item.getAttribute("id"),
                        img: imgEl.getAttribute("src"),
                        title: titleEl.textContent.trim(),
                        price: {
                            value: priceEl?.[0]?.textContent.trim() ?? "",
                            unit: priceEl?.[1]?.textContent.trim() ?? "",
                        },
                        infos: Array.from(infosEl).map((info) =>
                            info.textContent.trim()
                        ),
                        state: othersEl?.[0]?.textContent.trim() ?? "",
                        timeAgo: othersEl?.[1]?.textContent.trim() ?? "",
                    };
                }
            }).filter(Boolean); // remove entries where no product card was found
        },
        { searchContainerPath }
    );

    // Logs the number of scraped results.
    log(`Scraped results: ${currentPageProducts.length}`);

    // Returns the scraped product data.
    return currentPageProducts;
};



3. navigateToNextPage

This function is responsible for navigating to the next page of search results:

  • Waits for the next page link to appear on the page.
  • Locates the next page link element “.v-pagination__next” using the provided CSS selector.
  • If the next page link is found:
    • Clicks the link to navigate to the next page.
    • Waits for the search container element to appear, ensuring the page is fully loaded.
  • If the next page link is not found, it throws an error indicating the inability to navigate to the next page.


/**
 * Navigates to the next page of search results.
 * @param {Page} page - The Playwright page object.
 * @param {number} nbPage - The page number for logging purposes.
 * @throws {Error} - If unable to navigate to the next page.
 */
const navigateToNextPage = async (page, nbPage) => {
    // Waits for the next page link to appear on the page.
    await page.waitForSelector(nextPageLinkPath);

    // Locates the next page link element on the page (locator() is synchronous).
    const nextPageLink = page.locator(nextPageLinkPath);

    // Checks if the next page link is actually present (a Locator object itself is always truthy).
    if ((await nextPageLink.count()) > 0) {
        // Clicks the next page link and waits for the search container element to appear.
        await Promise.all([
            nextPageLink.click(),
            page.waitForSelector(searchContainerPath),
        ]);

        // Logs the navigation to the next page.
        log(`Navigating to the next page ${nbPage}: ${await page.url()}`);
    } else {
        // Throws an error if unable to navigate to the next page.
        throw new Error(
            `Can't navigate to the next page: ${nbPage}: ${await page.url()}`
        );
    }
};



4. saveData

This function provides a convenient way to organize and store the scraped data in multiple formats (JSON, CSV), facilitating further analysis and processing. It can also optionally download the images associated with the scraped products:

  • Creates a directory to store the results and images, ensuring it exists recursively.
  • Constructs the filename for the JSON and CSV files based on the product type.
  • Saves the scraped data to a JSON file using the saveToJsonFile function.
  • Saves the scraped data to a CSV file using the saveToCsvFile function.
  • If the withDownloadingImages flag is set to true:
    • Constructs the path for the images directory within the results directory.
    • Downloads images associated with the scraped products to the images directory using the downloadImages function.


/**
 * Saves scraped data to JSON, CSV, and optionally downloads images.
 * @param {string} productType - The type of product being saved.
 * @param {Array} products - An array containing the scraped product data.
 */
const saveData = async (productType, products) => {
    // Creates a directory to store results, including images.
    fs.mkdirSync(resultsDir, { recursive: true });
    const imagesPath = path.join(resultsDir, "images");
    fs.mkdirSync(imagesPath, { recursive: true });

    // Constructs the filename for JSON and CSV files.
    const filename = path.join(resultsDir, `${productType}`);

    // Saves scraped data to JSON file.
    saveToJsonFile(`${filename}.json`, products);

    // Saves scraped data to CSV file.
    await saveToCsvFile(path.join(`${filename}.csv`), products);

    // Optionally downloads images associated with the scraped data.
    if (withDownloadingImages) {
        await downloadImages(imagesPath, products);
    }
};


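The saveData function relies on three helpers, saveToJsonFile, saveToCsvFile, and downloadImages, that live in the repository and are not shown here. Below is a rough sketch of what they might look like; the CSV column layout and the image naming scheme are assumptions rather than the repository's actual implementation (the image download uses the global fetch available in Node 18+):


/**
 * Writes the products array to a JSON file (sketch).
 */
const saveToJsonFile = (filePath, products) => {
    fs.writeFileSync(filePath, JSON.stringify(products, null, 2), "utf-8");
};

/**
 * Writes the products array to a CSV file (sketch, flattens nested fields).
 */
const saveToCsvFile = async (filePath, products) => {
    const header = ["id", "title", "price", "unit", "state", "timeAgo", "img"];
    const escape = (value) => `"${String(value ?? "").replace(/"/g, '""')}"`;
    const rows = products.map((p) =>
        [p.id, p.title, p.price?.value, p.price?.unit, p.state, p.timeAgo, p.img]
            .map(escape)
            .join(",")
    );
    fs.writeFileSync(filePath, [header.join(","), ...rows].join("\n"), "utf-8");
};

/**
 * Downloads each product image into the given directory (sketch).
 */
const downloadImages = async (imagesPath, products) => {
    for (const product of products) {
        if (!product?.img) continue;
        try {
            const response = await fetch(product.img);
            const buffer = Buffer.from(await response.arrayBuffer());
            // Name each file after the product id (assumed naming scheme).
            fs.writeFileSync(path.join(imagesPath, `${product.id}.jpg`), buffer);
        } catch (err) {
            console.error(`Failed to download image for ${product.id}:`, err);
        }
    }
};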

5. Main

This is the main function, it orchestrates the web scraping process by performing the following tasks:

  • Launches a new Chromium browser instance, opens a new page in the browser context, and navigates to the products URL.
  • Retrieves the total number of pages to scrape from the pagination element on the page.
  • Iterates over each page to scrape content, accumulating products in an array.
  • Navigates to the next page if not the last page.
  • Saves scraped data to JSON, CSV, and optionally downloads images.
  • Closes the browser and exits the process.


/**
 * Main function to orchestrate the web scraping process.
 */
(async () => {
    // Launches a new Chromium browser instance.
    const browser = await chromium.launch({ headless: false }); // Consider making headless: true for production

    // Creates a new browser context.
    const context = await browser.newContext();

    // Opens a new page in the browser context.
    const page = await context.newPage();

    try {
        // Navigates to the specified URL and waits for the page to load.
        await page.goto(url);
        await page.waitForSelector(paginationPath);

        // Retrieves the total number of pages to scrape.
        const totalPages = await page.evaluate(
            /**
             * Retrieves the total number of pages from the pagination element.
             * @param {string} paginationPath - The CSS selector for the pagination element.
             * @returns {number} - The total number of pages to scrape.
             */
            ({ paginationPath }) => {
                const paginationElement =
                    document.querySelector(paginationPath);
                return paginationElement
                    ? Number(paginationElement.getAttribute("length"))
                    : 1;
            },
            { paginationPath }
        );

        // Logs the URL, number of pages to scrape, and separator.
        log(
            `
            #######################################################
            URL: ${url}
            Pages to scrape: ${totalPages}
            #######################################################
        `,
            false
        );

        // Array to store scraped products.
        let products = [];
        const nbPages = Math.min(numberOfPagesToScrape, totalPages);
        // Iterates over each page to scrape content.
        for (let i = 1; i <= nbPages; i++) {
            const currentPageProducts = await scrapePageContent(page, i);
            products.push(...currentPageProducts);
            // Navigates to the next page if not the last page.
            if (i < nbPages) {
                await navigateToNextPage(page, i);
            }
        }

        // Logs the total number of scraped products.
        log(
            `
            #######################################################
            Total scraped products : ${products.length}
            #######################################################
        `,
            false
        );

        // Saves scraped data to JSON, CSV, and optionally downloads images.
        await saveData(productType, products);
    } catch (error) {
        // Logs any errors that occur during scraping and marks the run as failed.
        console.error("Error while scraping:", error);
        process.exitCode = 1;
    } finally {
        // Closes the browser before the process exits.
        await browser.close();
    }
})();



You can find the full code in the GitHub repository.

Conclusion

Web scraping with Playwright opens up a world of possibilities for developers seeking to extract data from the web efficiently and responsibly. By harnessing Playwright's automation capabilities and cross-browser compatibility, developers can streamline their web scraping workflows and extract valuable insights from a wide range of websites. It's essential, however, to navigate this process with legal and ethical considerations in mind.
