Alex Aslam

Posted on Mar 13

Web Scraping with JavaScript and Playwright: A Modern Approach with Code Examples

#webdev #programming #javascript #beginners

Web scraping has had to adapt to modern web applications, where content is often loaded dynamically via JavaScript. Playwright is an open-source browser automation library from Microsoft that drives Chromium, Firefox, and WebKit through a single API. That makes it well suited to scraping single-page applications, pages behind a login, and content inside shadow DOM.

In this guide, you’ll learn how to scrape websites using JavaScript and Playwright, complete with practical code examples.

Why Playwright?

Cross-browser support: Scrape with Chromium, Firefox, or WebKit.
Auto-waiting: Playwright waits for elements to be ready before acting on them, so most manual sleep() calls become unnecessary.
Mobile emulation: Test responsive sites or mimic mobile devices.
Detection avoidance: Playwright has no built-in stealth mode, but launch flags, init scripts, and community plugins can mask some headless-browser fingerprints.
Rich API: Handle file downloads, network interception, and more.
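
Mobile emulation, for example, is configured per browser context. A minimal sketch, assuming a hypothetical helper named newMobilePage and a device descriptor taken from Playwright's devices registry (the 'iPhone 13' entry is used here only as an illustration):

```javascript
// Hypothetical helper: open a page in a context that emulates a mobile device.
// `browser` is a Playwright Browser; `device` is a descriptor such as
// require('playwright').devices['iPhone 13'].
async function newMobilePage(browser, device) {
  // Spreading the descriptor applies its viewport, user agent, touch support, etc.
  const context = await browser.newContext({ ...device });
  return context.newPage();
}

// Usage (requires Playwright and its browsers to be installed):
// const { webkit, devices } = require('playwright');
// const browser = await webkit.launch();
// const page = await newMobilePage(browser, devices['iPhone 13']);
```

Taking the browser and descriptor as arguments keeps the helper independent of which engine is launched.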

Setup

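First, initialize a Node.js project and install Playwright. Recent Playwright releases do not download browsers during npm install, so fetch them explicitly:

npm init -y
npm install playwright
npx playwright install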

Basic Scraping: Extracting Data

Let’s scrape book titles and prices from a demo e-commerce site (https://books.toscrape.com).

const { chromium } = require('playwright');

(async () => {
  // Launch a headless browser
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Navigate to the target page
  await page.goto('https://books.toscrape.com');

  // Extract book titles and prices
  const books = await page.$$eval('.product_pod', (items) => {
    return items.map(item => ({
      title: item.querySelector('h3 a').getAttribute('title'),
      price: item.querySelector('.price_color').innerText,
    }));
  });

  console.log(books);
  await browser.close();
})();

Explanation:

chromium.launch() starts a headless browser instance.
page.$$eval() runs a function in the browser context to query DOM elements.
The selector .product_pod targets each book container, and nested queries extract the data.
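
The scraped prices come back as strings such as "£51.77". If you want to sort or filter by price, one option is to normalize them after extraction. A small sketch, where parsePrice and normalizeBooks are hypothetical helper names rather than part of Playwright:

```javascript
// Hypothetical helper: convert a scraped price string such as "£51.77" to a number.
// Returns null when no numeric value can be found.
function parsePrice(text) {
  const match = String(text).replace(/,/g, '').match(/-?\d+(?:\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Hypothetical helper: clean up the { title, price } objects returned by $$eval.
function normalizeBooks(books) {
  return books.map(({ title, price }) => ({
    title: title.trim(),
    price: parsePrice(price),
  }));
}

// Usage: const cleaned = normalizeBooks(books);
```

This runs in Node.js after page.$$eval() returns, so it does not need access to the DOM.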

Handling Dynamic Content

Modern sites often load data via AJAX or user interactions (e.g., clicking "Load More"). Playwright makes this straightforward:

const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example-infinite-scroll.com');

  // Scroll to the bottom repeatedly until no more content loads
  let previousHeight;
  while (true) {
    previousHeight = await page.evaluate('document.body.scrollHeight');
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    await page.waitForTimeout(2000); // Wait for content to load
    const newHeight = await page.evaluate('document.body.scrollHeight');
    if (newHeight === previousHeight) break;
  }

  // Extract all loaded items
  const items = await page.$$eval('.item', elements => 
    elements.map(el => el.innerText)
  );

  console.log(`Loaded ${items.length} items.`);
  await browser.close();
})();
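
The loop above runs until the page height stops changing, which never happens on a truly endless feed. A hedged variant caps the number of scroll rounds; scrollUntilStable, maxRounds, and pauseMs are hypothetical names chosen for this sketch:

```javascript
// Hypothetical helper: scroll to the bottom until the height stops growing
// or maxRounds is reached. `page` is a Playwright Page.
// Returns the number of scroll rounds performed.
async function scrollUntilStable(page, { maxRounds = 20, pauseMs = 1000 } = {}) {
  let rounds = 0;
  let previousHeight = await page.evaluate(() => document.body.scrollHeight);

  while (rounds < maxRounds) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(pauseMs); // Give lazy-loaded content time to arrive
    const newHeight = await page.evaluate(() => document.body.scrollHeight);
    rounds += 1;
    if (newHeight === previousHeight) break; // Nothing new was loaded
    previousHeight = newHeight;
  }
  return rounds;
}

// Usage: await scrollUntilStable(page, { maxRounds: 10, pauseMs: 2000 });
```

The fixed pause is still a guess about how long the site takes to load more items, so tune pauseMs for the target site.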

Advanced Techniques

1. Authentication & Sessions

Log into a site and reuse cookies for future sessions:

const { webkit } = require('playwright');

(async () => {
  const browser = await webkit.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the login page and fill in the credentials
  await page.goto('https://example.com/login');
  await page.fill('#username', 'user123');
  await page.fill('#password', 'pass123');

  // Start waiting for the navigation before clicking, so the event is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#submit'),
  ]);

  // Save cookies and local storage to disk so a later run can reuse the session
  await page.context().storageState({ path: 'auth.json' });
  console.log('Session saved to auth.json');

  await browser.close();
})();
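
A saved session is only useful if a later run loads it. A minimal sketch, assuming the session was written to disk with context.storageState({ path: 'auth.json' }); the file name and the helper name openAuthenticatedPage are assumptions for illustration:

```javascript
// Hypothetical helper: open a page in a new context that starts with the
// cookies and local storage saved in a storage state file.
// `browser` is a Playwright Browser.
async function openAuthenticatedPage(browser, statePath = 'auth.json') {
  const context = await browser.newContext({ storageState: statePath });
  return context.newPage();
}

// Usage (requires Playwright and an existing auth.json):
// const { webkit } = require('playwright');
// const browser = await webkit.launch();
// const page = await openAuthenticatedPage(browser);
// await page.goto('https://example.com/account');
```

Sessions expire, so a real scraper would likely fall back to the login flow when the saved state is rejected.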

2. Avoiding Detection

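Playwright does not ship a stealth mode or stealth plugins. The example below uses a launch flag and an init script to hide the most obvious automation signal. Community packages such as playwright-extra can load puppeteer-extra-plugin-stealth, but no approach guarantees that a site will not detect automation:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    headless: false,
    // Disables the Blink feature that flags the browser as automation-controlled
    args: ['--disable-blink-features=AutomationControlled']
  });
  const page = await browser.newPage();

  // navigator.webdriver is a getter on Navigator.prototype, so
  // `delete navigator.webdriver` has no effect; override the getter instead
  await page.addInitScript(() => {
    Object.defineProperty(Navigator.prototype, 'webdriver', { get: () => undefined });
  });

  await page.goto('https://example-protected-site.com');
  // ... proceed with scraping

  await browser.close();
})();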

3. Intercepting Network Requests

Capture API responses to scrape data directly from XHR/Fetch calls:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

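  // Start waiting for the API response before navigating, so it is not missed
  const responsePromise = page.waitForResponse(
    (response) => response.url().includes('/api/data') && response.ok()
  );

  await page.goto('https://example-spa.com');

  // Read the body while the browser is still open
  const response = await responsePromise;
  const data = await response.json();
  console.log('API Data:', data);

  await browser.close();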
})();
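
Network interception can also cut down what the browser downloads. The sketch below uses page.route() to abort requests for images, fonts, and media, which often speeds up text-only scraping; blockHeavyResources and its default list are hypothetical choices for this example:

```javascript
// Hypothetical helper: abort requests for resource types the scraper does not need.
// `page` is a Playwright Page; call this before page.goto().
async function blockHeavyResources(page, blockedTypes = ['image', 'font', 'media']) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    // Abort unwanted resource types and let everything else through unchanged
    return blockedTypes.includes(type) ? route.abort() : route.continue();
  });
}

// Usage:
// await blockHeavyResources(page);
// await page.goto('https://books.toscrape.com');
```

Some sites change their layout or behavior when resources fail to load, so check that the data you need still appears.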

Best Practices

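Rate Limiting: Space out requests with a delay between page loads. page.waitForTimeout() works for this, although the Playwright docs discourage it outside of debugging.
Error Handling: Wrap actions in try/catch blocks and close the browser in a finally block so failed runs do not leave processes behind.
Selectors: Prefer user-facing locators such as page.getByRole() and page.getByText() over brittle CSS paths.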
Headless Mode: Use headless: false for debugging.
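
The rate limiting and error handling advice can be combined in a small retry wrapper. This is a sketch under stated assumptions, not a Playwright feature: withRetries, retries, and baseDelayMs are hypothetical names, and the exponential backoff schedule is just one reasonable choice:

```javascript
// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run an async action, retrying on failure with
// exponential backoff. Rethrows the last error once retries are exhausted.
async function withRetries(action, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await action(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Usage:
// await withRetries(() => page.goto('https://books.toscrape.com'), { retries: 2 });
```

Backing off between attempts also reduces pressure on a server that is already struggling.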

Ethical Considerations

Respect robots.txt and website terms of service.
Avoid scraping personally identifiable information (PII).
Throttle your requests so you do not overload servers. Proxies and rotating IPs spread traffic across addresses but do not reduce the load you generate.

Conclusion

Playwright is a strong fit for web scraping: it handles dynamic content, authenticated sessions, and network interception through one API, and it runs the same script against Chromium, Firefox, and WebKit. That combination makes it a useful tool to have in your scraping toolkit.

Next Steps:

Explore Playwright’s official documentation.
Build a price tracker or social media sentiment analyzer.

Call to Action

Got stuck? Check out Playwright’s debugging guide or drop a comment below!
