Playwright Amazon Scraper: Products & Reviews (Javascript)

Web Automation and Data Collection with Playwright (Node.js Version)

Playwright is a library for testing and automating web pages, supporting browsers like Chromium, Firefox, and WebKit. Developed by Microsoft, it is efficient, reliable, and fast, enabling cross-browser web automation tasks.

Collecting Amazon Product Information with Playwright

We can use Playwright to simulate user behavior, such as visiting Amazon and collecting product information and reviews. By using CSS selectors or XPath, we can precisely locate web page elements and extract their text or attributes.
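
As a rough illustration of the two selector styles, the sketch below reads a product title through a CSS selector and an image URL through an XPath expression. It expects an already opened Playwright page. The helper name readTitleAndImage and the landingImage id are assumptions made for this example and may not match Amazon's current markup.

```javascript
// Sketch: locate elements by CSS and by XPath, then read text or an attribute.
// `page` is a Playwright Page that has already navigated to a product page.
async function readTitleAndImage(page) {
    // CSS selector: the element with id "productTitle"
    const rawTitle = await page.locator('#productTitle').first().textContent();

    // XPath selector: an <img> with id "landingImage" (hypothetical id, adjust as needed)
    const imageUrl = await page
        .locator('xpath=//img[@id="landingImage"]')
        .first()
        .getAttribute('src');

    // textContent() can return null, so guard before trimming
    return { title: rawTitle ? rawTitle.trim() : null, imageUrl };
}
```

Both calls go through page.locator(), so switching between CSS and XPath only changes the selector string, not the surrounding code.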

Example: Crawling the Amazon Best Sellers List

We will use Playwright to collect the international best-sellers list on Amazon. The steps are as follows:

  1. Visit the target page (the Amazon Best Sellers list)
  2. Select all book elements (with the class names a-section and a-spacing-base)
  3. Iterate through the book elements and extract information such as titles, prices, ratings, and the number of reviews

Deploying a Playwright Example on Leapcell

Playwright Deployment Example on Leapcell

This guide provides a streamlined approach to deploying Playwright tests on Leapcell. Follow the link above for a step-by-step tutorial.

Node.js Implementation Code

The following is the implementation of data collection using Node.js and Playwright:

const { chromium } = require('playwright');

(async () => {
    // Launch the browser
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();

    // Visit the Amazon home page, which hosts the search box (URL assumed)
    await page.goto('https://www.amazon.com/');

    // Search for the keyword "laptop" and submit the search
    await page.fill('#twotabsearchtextbox', 'laptop');
    await page.press('#twotabsearchtextbox', 'Enter');

    // Wait for the results page to finish loading
    await page.waitForLoadState('networkidle');

    // Get the list of product links
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.s-result-item h2 a'))
            .map(a => a.href);
    });

    // Collect product details data
    const results = [];
    for (const link of links) {
        const productPage = await context.newPage();
        await productPage.goto(link, { waitUntil: 'networkidle' });

        const title = await productPage.textContent('#productTitle');
        // Ratings and review counts may be missing, so fall back to 'N/A'
        const rating = await productPage.textContent('#averageCustomerReviews .a-icon-alt').catch(() => 'N/A');
        const reviewCount = await productPage.textContent('#acrCustomerReviewText').catch(() => 'N/A');

        results.push({ title: title.trim(), rating, reviewCount });

        await productPage.close();
    }

    // Output the collected data
    console.log(results);

    // Close the browser
    await browser.close();
})();

Code Analysis

  • Initializing Playwright: Use chromium.launch({ headless: true }) to launch the browser.
  • Navigating to the Amazon Search Page: Use page.goto() to visit the website, fill in the search box, and submit the search.
  • Extracting Product Links: Use document.querySelectorAll() to get the URLs of all products.
  • Collecting Product Details:
    • Open each product's page.
    • Get the product title (#productTitle).
    • Get the rating (#averageCustomerReviews .a-icon-alt).
    • Get the number of reviews (#acrCustomerReviewText).
  • Outputting Data and Closing the Browser
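
The rating and review count are collected as raw text, so a small post-processing step can help before analysis. The sketch below assumes the text looks like "4.5 out of 5 stars" and "1,234 ratings"; those formats, and the helper names parseRating and parseReviewCount, are assumptions for this example and can differ by locale.

```javascript
// Convert rating text such as "4.5 out of 5 stars" to a number.
// Returns null when the text contains no number (for example 'N/A').
function parseRating(text) {
    const match = /(\d+(?:\.\d+)?)/.exec(text || '');
    return match ? Number(match[1]) : null;
}

// Convert review-count text such as "1,234 ratings" to an integer.
// Returns null when the text contains no digits (for example 'N/A').
function parseReviewCount(text) {
    const digits = (text || '').replace(/[^\d]/g, '');
    return digits ? Number.parseInt(digits, 10) : null;
}
```

Both helpers return null for the 'N/A' fallback, which keeps missing values distinguishable from a real zero.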

Code Optimization

  1. Error Handling: Some products may not have ratings or review counts. Use .catch(() => 'N/A') to prevent the code from crashing.
  2. Automation Efficiency: Open each product page with context.newPage() so that all pages share one browser context (and its cookies and cache) instead of starting a new browser for every product.
  3. Avoiding Being Blocked:
    • You can use proxy access (such as Playwright's proxy option).
    • You can adjust the userAgent to make it more like a real user.
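
The points above could be wired up roughly as shown below. This is a sketch under stated assumptions: the helper names buildBrowserOptions and safeText, the 5000 ms default timeout, and any proxy address or user agent string you pass in are illustrative, not values from this article. The proxy launch option, the userAgent context option, and the timeout option of page.textContent() are part of Playwright's API.

```javascript
// Build options for chromium.launch() and browser.newContext() from optional settings.
function buildBrowserOptions({ proxyServer, userAgent } = {}) {
    const launchOptions = { headless: true };
    if (proxyServer) {
        // Route browser traffic through the given proxy server
        launchOptions.proxy = { server: proxyServer };
    }
    const contextOptions = {};
    if (userAgent) {
        // Override the default user agent for every page in the context
        contextOptions.userAgent = userAgent;
    }
    return { launchOptions, contextOptions };
}

// Read an element's text with a short timeout and return a fallback instead of throwing.
async function safeText(page, selector, fallback = 'N/A', timeout = 5000) {
    try {
        const text = await page.textContent(selector, { timeout });
        return text === null ? fallback : text.trim();
    } catch (error) {
        // Covers missing elements (timeout); note that other errors are hidden too
        return fallback;
    }
}
```

A possible usage is chromium.launch(launchOptions) followed by browser.newContext(contextOptions), then safeText(productPage, '#acrCustomerReviewText') for each optional field. The shorter timeout matters because textContent() otherwise waits for the default timeout before the fallback applies.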

Using Playwright and Node.js, we can efficiently automate Amazon web page data collection, which is suitable for scenarios such as e-commerce data analysis and competitor research.

Leapcell: The Next-Gen Serverless Platform for Web Hosting, Async Tasks, and Redis

Finally, I would like to recommend the best platform for deploying Playwright: Leapcell

1. Multi-Language Support

  • Develop with JavaScript, Python, Go, or Rust.

2. Deploy unlimited projects for free

  • Pay only for usage: no requests, no charges.

3. Unbeatable Cost Efficiency

  • Pay-as-you-go with no idle charges.
  • Example: $25 supports 6.94M requests at a 60ms average response time.

4. Streamlined Developer Experience

  • Intuitive UI for effortless setup.
  • Fully automated CI/CD pipelines and GitOps integration.
  • Real-time metrics and logging for actionable insights.

5. Effortless Scalability and High Performance

  • Auto-scaling to handle high concurrency with ease.
  • Zero operational overhead — just focus on building.

Explore more in the documentation!

Leapcell Twitter: