DEV Community

agenthustler
Playwright Web Scraping Tutorial in 2026: JavaScript-Rendered Pages Made Easy

Playwright has quickly become the go-to tool for scraping JavaScript-rendered pages in 2026. If you've been wrestling with Selenium or hitting walls with requests + BeautifulSoup on SPAs, this tutorial will show you why Playwright is worth the switch — and how to use it effectively.

Why Playwright for Web Scraping?

Playwright is an open-source browser automation library built by Microsoft. Compared to Selenium, it offers several meaningful advantages for scraping:

| Feature | Playwright | Selenium |
| --- | --- | --- |
| Auto-wait | Built-in, smart | Manual sleeps required |
| Async support | Native asyncio | Bolted on |
| Browser contexts | Lightweight isolation | Full browser per session |
| Speed | Faster | Slower |
| Network interception | First-class | Limited |

The auto-wait feature alone saves hours of debugging flaky scrapers. Playwright waits for elements to be visible and actionable before interacting — no more time.sleep(3) guesswork.

Setup

Install Playwright and its browser binaries:

pip install playwright
playwright install

This downloads Chromium, Firefox, and WebKit. For most scraping tasks, Chromium is the sensible default; if it's all you need, run playwright install chromium instead.

Here's the basic async Python setup:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()

asyncio.run(main())

Basic Scraping Example

Let's extract the page title and some text content from a real page:

import asyncio
from playwright.async_api import async_playwright

async def scrape_basic():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto("https://news.ycombinator.com")

        # Get the page title
        title = await page.title()
        print(f"Title: {title}")

        # Extract all story titles
        stories = await page.query_selector_all(".titleline > a")
        for story in stories[:10]:
            text = await story.inner_text()
            href = await story.get_attribute("href")
            print(f"- {text}: {href}")

        await browser.close()

asyncio.run(scrape_basic())

This gives you structured data from a server-rendered page. But the real power comes with JavaScript-heavy sites.

Handling JavaScript-Rendered Pages

Static HTML scrapers break on React, Vue, and Angular apps because the content is injected by JavaScript after page load. Playwright handles this natively.

Consider scraping a product listing page built with React:

import asyncio
from playwright.async_api import async_playwright

async def scrape_spa_products():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Placeholder URL — substitute your target React app
        await page.goto("https://your-spa-site.com/products", wait_until="networkidle")

        # Wait for a specific selector to appear
        await page.wait_for_selector(".product-card", timeout=10000)

        # Now extract the dynamically loaded content
        products = await page.query_selector_all(".product-card")
        results = []

        for product in products:
            name = await product.query_selector(".product-name")
            price = await product.query_selector(".product-price")

            results.append({
                "name": await name.inner_text() if name else "N/A",
                "price": await price.inner_text() if price else "N/A"
            })

        print(f"Found {len(results)} products")
        for r in results:
            print(r)

        await browser.close()

asyncio.run(scrape_spa_products())

Key methods for JS-heavy pages:

  • wait_until="networkidle" — wait until no network requests for 500ms
  • wait_for_selector() — wait for a CSS selector to appear in DOM
  • wait_for_load_state("domcontentloaded") — lighter than networkidle

For infinite scroll, you can trigger it programmatically:

# Scroll down to trigger lazy loading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(1000)  # brief pause for content to load
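
A single scroll rarely loads everything. Here's a small helper (my own sketch, assuming that a stable scroll height means no more content) that keeps scrolling until the page stops growing:

```python
async def scroll_to_bottom(page, pause_ms=1000, max_rounds=20):
    """Scroll until the page height stops growing, or max_rounds is hit."""
    last_height = await page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give lazy content time to load
        new_height = await page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; assume we're at the bottom
        last_height = new_height
    return last_height
```

The max_rounds cap matters: an endless feed will otherwise trap the scraper forever.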

Taking Screenshots

Screenshots are invaluable for debugging scrapers and visual verification:

async def capture_screenshots():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")

        # Full page screenshot
        await page.screenshot(path="full_page.png", full_page=True)

        # Screenshot of a specific element
        element = await page.query_selector("h1")
        if element:
            await element.screenshot(path="heading.png")

        await browser.close()

asyncio.run(capture_screenshots())

This is especially useful when your selectors stop matching — a screenshot tells you exactly what the page looks like at scrape time.
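
You can make that automatic. A small wrapper (my own pattern, not a Playwright API) saves a screenshot whenever a scrape step throws, then re-raises:

```python
async def safe_step(page, coro, name):
    """Await a scraping step; on failure, save a screenshot before re-raising."""
    try:
        return await coro
    except Exception:
        await page.screenshot(path=f"error_{name}.png", full_page=True)
        raise

# Usage:
#   await safe_step(page, page.click(".next-page"), "pagination")
```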

Intercepting Network Requests

This is the technique that separates beginner scrapers from pro ones. Most modern SPAs fetch data from a JSON API. Instead of parsing the rendered DOM, you can intercept those API calls directly.

import asyncio
import json
from playwright.async_api import async_playwright

async def intercept_api():
    captured_data = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Register the response listener BEFORE navigating
        async def handle_response(response):
            if "api/products" in response.url and response.status == 200:
                try:
                    data = await response.json()
                    captured_data.extend(data)
                    print(f"Captured {len(data)} items from {response.url}")
                except Exception:
                    pass

        page.on("response", handle_response)

        # Navigate — the page will trigger API calls automatically
        await page.goto("https://your-target-site.com/products",
                        wait_until="networkidle")

        print(f"Total items captured: {len(captured_data)}")
        print(json.dumps(captured_data[:2], indent=2))

        await browser.close()

asyncio.run(intercept_api())

The advantage: API responses are clean JSON, not messy HTML. No brittle CSS selectors. This approach is also significantly faster since you're not parsing the DOM.
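
Once you've identified the endpoint, you can often skip the DOM entirely and call it through the page's request context, which reuses the browser's cookies and headers. A minimal sketch (the URL in the usage comment is a placeholder):

```python
async def fetch_api_json(page, url):
    """Hit the site's own JSON API via the page's request context."""
    response = await page.request.get(url)
    if response.status != 200:
        raise RuntimeError(f"API returned {response.status} for {url}")
    return await response.json()

# Usage:
#   products = await fetch_api_json(page, "https://your-target-site.com/api/products")
```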

You can also block requests you don't need to speed things up:

async def handle_route(route):
    if route.request.resource_type in ["image", "stylesheet", "font"]:
        await route.abort()
    else:
        await route.continue_()

await page.route("**/*", handle_route)

Handling Authentication & Cookies

Many scraping targets require authentication. Playwright lets you save and restore browser state so you don't have to log in on every run:

import asyncio
from playwright.async_api import async_playwright

# First run: log in and save state
async def save_auth():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # visible for login
        page = await browser.new_page()

        await page.goto("https://example.com/login")
        await page.fill("#email", "your@email.com")
        await page.fill("#password", "yourpassword")
        await page.click("[type=submit]")

        # Wait for redirect after login
        await page.wait_for_url("**/dashboard")

        # Save auth state to file
        await page.context.storage_state(path="auth_state.json")
        print("Auth state saved")
        await browser.close()

# Subsequent runs: restore state
async def scrape_authenticated():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        # Load saved auth state — no login needed
        context = await browser.new_context(storage_state="auth_state.json")
        page = await context.new_page()

        await page.goto("https://example.com/protected-page")
        # You're already logged in
        content = await page.inner_text("main")
        print(content[:500])

        await browser.close()

asyncio.run(scrape_authenticated())

This is clean and reliable for sites with JWT tokens, session cookies, or OAuth flows.

Scaling Up

When moving from a one-off script to a production scraper, these patterns matter:

Use browser contexts, not new browsers:

# Efficient: one browser, many lightweight contexts
browser = await p.chromium.launch(headless=True)

async def scrape_url(url):
    context = await browser.new_context()
    page = await context.new_page()
    await page.goto(url)
    data = await page.inner_text("body")
    await context.close()
    return data

# Run multiple contexts concurrently
results = await asyncio.gather(*[scrape_url(url) for url in urls])
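
One caveat: an unbounded gather over hundreds of URLs will open hundreds of contexts at once. A semaphore keeps memory in check — a generic sketch, with scrape_url standing in for the function above:

```python
import asyncio

async def gather_bounded(items, worker, limit=5):
    """Run worker(item) for every item, at most `limit` concurrently."""
    sem = asyncio.Semaphore(limit)

    async def run_one(item):
        async with sem:
            return await worker(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[run_one(i) for i in items])

# results = await gather_bounded(urls, scrape_url, limit=5)
```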

Block unnecessary resources:

await page.route("**/*.{png,jpg,gif,svg,woff,woff2}", lambda r: r.abort())

Set realistic timeouts:

page = await context.new_page()
page.set_default_timeout(15000)  # 15 seconds max per action

Rotate user agents:

context = await browser.new_context(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
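
That sets one static string. To actually rotate, pick from a pool per context. The UA strings below are illustrative examples — refresh them periodically, since stale strings are themselves a fingerprint:

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def pick_user_agent():
    """Return a random user agent from the pool, one per new context."""
    return random.choice(USER_AGENTS)

# context = await browser.new_context(user_agent=pick_user_agent())
```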

When to Use Managed Solutions

Running Playwright at scale on your own infrastructure means dealing with:

  • IP bans and CAPTCHAs
  • Proxy rotation and residential IPs
  • Browser fingerprinting
  • Infrastructure maintenance

For production workloads, managed services take the ops burden off your plate:

ScrapeOps is a scraping operations platform that handles proxy rotation, scheduling, monitoring, and alerting. It integrates cleanly with Playwright scrapers and gives you observability across all your scraping jobs — useful once you're running dozens of scrapers.

ThorData provides residential proxies sourced from real devices, which are significantly harder for anti-bot systems to detect than datacenter IPs. If you're hitting blocks on major e-commerce or social platforms, residential proxies are often the fix.

Apify actors let you run managed Playwright scrapers in the cloud without managing infrastructure. Apify handles browser rendering, scheduling, and output storage — you just write your scraping logic. Good option if you want serverless scale without the DevOps overhead.

Conclusion

Playwright is the right tool for modern web scraping in 2026. The auto-wait behavior eliminates most flakiness, async support makes concurrent scraping clean, and network interception is a genuinely powerful technique that most scrapers overlook.

The path from prototype to production typically goes:

  1. Playwright script locally → works on the target
  2. Add error handling, retries, and logging
  3. Containerize and schedule
  4. Add proxy rotation when you hit IP limits
  5. Move to managed infrastructure for scale
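
Step 2 can start as simple as a retry wrapper with exponential backoff — a sketch, not a library API:

```python
import asyncio
import logging

async def with_retries(coro_fn, *args, attempts=3, base_delay=1.0):
    """Call coro_fn(*args), retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await coro_fn(*args)
        except Exception as exc:
            if attempt == attempts:
                raise  # out of retries; let the caller handle it
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            logging.warning("attempt %d failed (%s); retrying in %.1fs",
                            attempt, exc, delay)
            await asyncio.sleep(delay)

# data = await with_retries(scrape_url, "https://example.com")
```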

Start with the basics from this tutorial, and reach for managed solutions when the ops complexity starts costing more than the service fees. Happy scraping.
