DEV Community

Alex Aslam

Web Scraping with JavaScript and Playwright: A Modern Approach with Code Examples

Web scraping has evolved to tackle the challenges of modern web applications, where content is often loaded dynamically via JavaScript. Playwright is an open-source browser automation library from Microsoft that makes such sites practical to scrape. It drives Chromium, Firefox, and WebKit through a single API, and it can handle single-page applications (SPAs), authenticated sessions, and elements inside open shadow DOM.

In this guide, you’ll learn how to scrape websites using JavaScript and Playwright, complete with practical code examples.


Why Playwright?

  • Cross-browser support: Scrape with Chromium, Firefox, or WebKit.
  • Auto-waiting: Actions such as click() and fill() wait for the target element to be ready, so most manual sleep() calls are unnecessary.
  • Mobile emulation: Test responsive sites or mimic mobile devices.
  • Stealth options: Playwright has no built-in stealth mode, but launch flags and init scripts can mask some headless-browser fingerprints.
  • Rich API: Handle file downloads, network interception, and more.

Setup

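First, initialize a Node.js project and install Playwright. Recent versions of the playwright package do not download browser binaries on install, so fetch them with a separate command:

npm init -y
npm install playwright
npx playwright install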

Basic Scraping: Extracting Data

Let’s scrape book titles and prices from a demo e-commerce site (https://books.toscrape.com).

const { chromium } = require('playwright');

(async () => {
  // Launch a headless browser
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to the target page
  await page.goto('https://books.toscrape.com');

  // Extract book titles and prices
  const books = await page.$$eval('.product_pod', (items) => {
    return items.map(item => ({
      title: item.querySelector('h3 a').getAttribute('title'),
      price: item.querySelector('.price_color').innerText,
    }));
  });

  console.log(books);
  await browser.close();
})();

Explanation:

  • chromium.launch() starts a headless browser instance.
  • page.$$eval() runs the callback in the browser context against every element matching the selector and returns the serializable result.
  • The selector .product_pod targets each book container, and nested queries extract the data.
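
The prices above come back as display strings that include a currency symbol, which makes sorting or filtering awkward. A minimal sketch of a clean-up step follows; the parsePrice helper and the sample records are hypothetical and only illustrate the shape of the scraper's output.

```javascript
// Hypothetical helper: turn a scraped price string such as "£12.50" into a number.
// Returns null when no numeric value can be recovered.
function parsePrice(text) {
  const cleaned = text.replace(/[^0-9.]/g, ''); // drop currency symbols, commas, spaces
  const value = Number.parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
}

// Made-up sample records shaped like the { title, price } objects returned by $$eval.
const sampleBooks = [
  { title: 'Example Book A', price: '£12.50' },
  { title: 'Example Book B', price: '£1,234.00' },
];

// Replace each price string with its numeric value.
const normalizedBooks = sampleBooks.map((book) => ({
  ...book,
  price: parsePrice(book.price),
}));

console.log(normalizedBooks);
```

In the scraper itself, the same map() call could be applied to the books array before logging it.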

Handling Dynamic Content

Modern sites often load data via AJAX or in response to user interactions such as scrolling or clicking "Load More". The example below handles an infinite-scroll page by scrolling until the page height stops growing:

const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example-infinite-scroll.com');

  // Scroll to the bottom repeatedly until no more content loads
  let previousHeight;
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForTimeout(2000); // Wait for content to load
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break;
  }

  // Extract all loaded items
  const items = await page.$$eval('.item', elements => 
    elements.map(el => el.innerText)
  );

  console.log(`Loaded ${items.length} items.`);
  await browser.close();
})();
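
A page that never stops growing would keep the while (true) loop above running forever. One possible safeguard, sketched here under the assumption that a fixed cap is acceptable, is to move the loop into a helper with a maximum number of scrolls. The scrollUntilStable name and its maxScrolls and pauseMs options are illustrative, not part of Playwright.

```javascript
// Hypothetical helper: scroll to the bottom until the page height stops
// changing or maxScrolls is reached. `page` is a Playwright Page.
// Returns the number of scrolls that loaded new content.
async function scrollUntilStable(page, { maxScrolls = 20, pauseMs = 1000 } = {}) {
  let previousHeight = await page.evaluate(() => document.body.scrollHeight);

  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(pauseMs); // give lazy-loaded content time to arrive

    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    if (newHeight === previousHeight) return i; // nothing new loaded: stop
    previousHeight = newHeight;
  }

  return maxScrolls; // cap reached; the page may still have more content
}
```

Calling await scrollUntilStable(page, { maxScrolls: 50 }) would then replace the loop in the script above.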

Advanced Techniques

1. Authentication & Sessions

Log into a site and reuse cookies for future sessions:

const { webkit } = require('playwright');

(async () => {
  const browser = await webkit.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to login page
  await page.goto('https://example.com/login');
  await page.fill('#username', 'user123');
  await page.fill('#password', 'pass123');
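  // Click submit and wait for the resulting navigation together, so the
  // navigation cannot finish before the wait starts
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit'),
  ]);

  // Read the session cookies so they can be persisted for reuse
  const cookies = await page.context().cookies();
  console.log('Cookies captured:', cookies);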

  await browser.close();
})();
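
To actually reuse the session, the cookies need to be written somewhere and loaded into a later browser context. The sketch below shows one way to do that with a JSON file; saveCookies, loadCookies, and the file path are hypothetical names, while context.cookies() and context.addCookies() are the Playwright calls doing the work.

```javascript
const fs = require('fs/promises');

// Hypothetical helper: write the context's cookies to a JSON file.
// `context` is a Playwright BrowserContext, e.g. page.context().
async function saveCookies(context, filePath) {
  const cookies = await context.cookies();
  await fs.writeFile(filePath, JSON.stringify(cookies, null, 2));
  return cookies.length;
}

// Hypothetical helper: read cookies from the JSON file and add them to a
// fresh context so later pages start out logged in.
async function loadCookies(context, filePath) {
  const cookies = JSON.parse(await fs.readFile(filePath, 'utf8'));
  await context.addCookies(cookies);
  return cookies.length;
}
```

After logging in, await saveCookies(page.context(), 'cookies.json') stores the session. A later run can create a context with browser.newContext(), call await loadCookies(context, 'cookies.json'), and then open pages from that context. Sessions that depend on localStorage rather than cookies would need a different approach.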

2. Avoiding Detection

Playwright has no built-in stealth mode, but a Chromium launch flag and an init script can reduce some common automation signals:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    headless: false,
    // Stop Chromium from advertising that it is automation-controlled
    args: ['--disable-blink-features=AutomationControlled']
  });
  const page = await browser.newPage();

  // navigator.webdriver is a getter on the prototype, so `delete` on the
  // instance has no effect; override the getter before any page script runs
  await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
  });

  await page.goto('https://example-protected-site.com');
  // ... proceed with scraping

  await browser.close();
})();

3. Intercepting Network Requests

Capture API responses to scrape data directly from XHR/Fetch calls:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

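  // Start waiting for the API response before navigating, so it cannot be missed
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes('/api/data')
  );

  await page.goto('https://example-spa.com');

  // Parse the body before closing the browser
  const response = await responsePromise;
  const data = await response.json();
  console.log('API Data:', data);

  await browser.close();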
})();
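
When only the API data matters, the same network layer can also skip downloads the scraper does not need. The following is a rough sketch using page.route(); the blockResourceTypes helper and its default list of types are assumptions chosen for illustration.

```javascript
// Hypothetical helper: abort requests for resource types the scraper does not
// need and let everything else through. `page` is a Playwright Page.
async function blockResourceTypes(page, blockedTypes = ['image', 'font', 'media']) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    return blockedTypes.includes(type) ? route.abort() : route.continue();
  });
}
```

Calling await blockResourceTypes(page) before page.goto() should cut bandwidth and load time. Some sites break when stylesheets or scripts are blocked, so those types are left out of the default list.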

Best Practices

  1. Rate Limiting: Use page.waitForTimeout() to space out requests.
  2. Error Handling: Wrap actions in try/catch blocks.
  3. Selectors: Prefer user-facing locators such as page.getByRole() and page.getByText() (or the text= and role= selector engines) over brittle CSS paths.
  4. Headless Mode: Use headless: false for debugging.
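
The first two practices can be combined in a small retry wrapper. This is a minimal sketch rather than a library API: withRetry, sleep, and the retries and delayMs options are hypothetical names, and the linear backoff is just one reasonable choice.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run an async action, retrying on failure with a
// growing pause between attempts. Rethrows the last error if all attempts fail.
async function withRetry(action, { retries = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await action(attempt);
    } catch (error) {
      lastError = error;
      // Linear backoff: wait longer after each failed attempt
      if (attempt < retries) await sleep(delayMs * attempt);
    }
  }
  throw lastError;
}
```

A navigation could then be wrapped as await withRetry(() => page.goto(url), { retries: 3, delayMs: 2000 }), with await sleep(1000) between pages to space out requests.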

Ethical Considerations

  • Respect robots.txt and website terms of service.
  • Avoid scraping personally identifiable information (PII).
  • Throttle requests and limit concurrency to avoid overloading servers.
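
As a rough illustration of the robots.txt point, the sketch below checks a path against the Disallow rules that apply to all user agents. It is deliberately simplified: isPathDisallowed is a hypothetical helper that uses plain prefix matching and ignores Allow rules and wildcards, so a dedicated parser is a better fit for real projects.

```javascript
// Hypothetical, simplified check: is `path` covered by a Disallow rule in a
// "User-agent: *" group? Ignores Allow rules and wildcard patterns.
function isPathDisallowed(robotsTxt, path) {
  let appliesToAll = false; // true while inside a group that names "*"
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const idx = line.indexOf(':');
    if (!line || idx === -1) continue;

    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      appliesToAll = lastWasAgent ? appliesToAll || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      if (field === 'disallow' && appliesToAll && value && path.startsWith(value)) {
        return true;
      }
    }
  }
  return false;
}
```

The robots.txt text could be fetched once per site and the check run before each page.goto() call.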

Conclusion

Playwright is well suited to web scraping: it handles dynamic content, authenticated sessions, and network-level data capture through one API that works across Chromium, Firefox, and WebKit.

Next Steps:


Call to Action

Got stuck? Check out Playwright’s debugging guide or drop a comment below!
