Grace Evans

Scraping Smarter with Python, Playwright 1.53, and SQLite

A practical, copy‑paste‑ready guide to building a headless scraper that survives modern websites.


Why Playwright?

Playwright's auto‑waiting, cross‑browser coverage, and steady monthly releases make it a rock‑solid bet for production scraping in 2025. Version 1.53 added helpful upgrades such as partitioned cookies and improved HTML report controls.


What we'll build

  1. Launch Chromium in headless mode

  2. Visit a list of URLs

  3. Extract the page title and any email strings

  4. Store results in an SQLite database

  5. Run everything concurrently with asyncio for speed


Prerequisites

python -m venv venv && source venv/bin/activate   # Windows: .\venv\Scripts\activate
pip install playwright aiosqlite
playwright install


Project structure

scraper/
├── scraper.py
└── scraped.db      # created automatically


The code

# scraper.py
import asyncio
import re
from pathlib import Path
from playwright.async_api import async_playwright
import aiosqlite

URLS = [
    "https://example.com",
    "https://python.org",
    # add more...
]

EMAIL_RE = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-.]+")
DB_PATH = Path("scraped.db")

async def save_result(db, url, title, emails):
    await db.execute(
        "INSERT INTO results (url, title, emails) VALUES (?, ?, ?)",
        (url, title, ",".join(emails)),
    )
    await db.commit()

async def scrape_page(page, url):
    await page.goto(url, timeout=30_000)
    await page.wait_for_load_state("networkidle")
    html = await page.content()
    title = await page.title()
    emails = EMAIL_RE.findall(html)
    return title, set(emails)

async def worker(playwright, db, url):
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context(
        locale="en-US",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        java_script_enabled=True,
    )

    page = await context.new_page()
    try:
        title, emails = await scrape_page(page, url)
        await save_result(db, url, title, emails)
        print(f"[+] {url} -> {title} ({len(emails)} emails)")
    except Exception as exc:
        print(f"[!] {url} failed: {exc}")
    finally:
        await context.close()
        await browser.close()

async def main():
    async with aiosqlite.connect(DB_PATH) as db:
        await db.execute(
            """
            CREATE TABLE IF NOT EXISTS results (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT,
                title TEXT,
                emails TEXT
            )
            """
        )
        await db.commit()

        async with async_playwright() as pw:
            tasks = [worker(pw, db, url) for url in URLS]
            await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())


Key techniques explained

1. Async with isolated browsers

Each task launches its own browser and a fresh context, so cookies and localStorage never leak between URLs. Concurrency is bounded only by CPU and RAM, so with a long URL list it is worth capping the number of simultaneous browsers, as in the sketch below.
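
A minimal sketch, reusing the worker() function from the code above; the MAX_CONCURRENCY value is an arbitrary starting point, not something from the original script:

# Cap how many browsers run at once.
MAX_CONCURRENCY = 4
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def bounded_worker(playwright, db, url):
    async with semaphore:  # at most MAX_CONCURRENCY workers proceed at a time
        await worker(playwright, db, url)

# In main(): tasks = [bounded_worker(pw, db, url) for url in URLS]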

2. Partitioned cookies

If you scrape several sites that inspect document.cookie, partitioned (CHIPS) cookies keep third‑party cookies scoped to the top‑level site. The code above does not use them; you opt in by setting the partitionKey field on the cookies you pass to context.add_cookies(), provided your installed Playwright release supports it. A hedged sketch follows.
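
A minimal sketch, assuming your installed release accepts a partitionKey field on the cookies passed to context.add_cookies() (check the changelog for the exact cookie shape); the cookie name, value, and sites here are placeholders:

# Register a third-party cookie partitioned under a specific top-level site.
await context.add_cookies([
    {
        "name": "session",                      # placeholder cookie name
        "value": "abc123",                      # placeholder value
        "domain": "thirdparty.example",         # domain the cookie belongs to
        "path": "/",
        "secure": True,                         # partitioned cookies must be Secure
        "sameSite": "None",
        "partitionKey": "https://example.com",  # top-level site the cookie is scoped to
    }
])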

3. Auto‑waiting

page.goto(...) followed by page.wait_for_load_state("networkidle") removes the need for sleep() calls and prevents empty screenshots. On pages that poll or stream, waiting for a concrete element is often more reliable than networkidle, as in the sketch below.
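
A minimal sketch; the h1 selector is a placeholder for whatever element signals that the page is ready:

await page.goto(url, timeout=30_000)
# Wait until the element we actually need has rendered.
await page.wait_for_selector("h1", timeout=10_000)
title = await page.title()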

4. SQLite for quick persistence

No server and no ORM. For larger volumes, swap in Postgres with asyncpg while keeping the rest of the pipeline largely unchanged, as sketched below.
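
A minimal sketch of what save_result could look like on asyncpg, assuming the results table already exists in Postgres and treating the connection string as a placeholder:

import asyncpg

async def save_result_pg(pool, url, title, emails):
    # asyncpg uses $1-style placeholders instead of sqlite's "?"
    await pool.execute(
        "INSERT INTO results (url, title, emails) VALUES ($1, $2, $3)",
        url, title, ",".join(emails),
    )

# In main(), replace the aiosqlite connection with a pool:
# pool = await asyncpg.create_pool("postgresql://user:pass@localhost/scraper")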


Hardening your scraper

  • CAPTCHA fallback -- detect common CAPTCHA selectors and queue those URLs for manual review or solve with an API

  • Retry logic -- wrap scrape_page in exponential backoff (a sketch follows this list)

  • Proxy rotation -- inject proxy={"server": "...", "username": "...", "password": "..."} into launch()

  • Headful debugging -- set headless=False and add slow_mo=50 during development
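
A minimal backoff wrapper around scrape_page from the code above; the attempt count and delays are arbitrary starting points:

async def scrape_with_retry(page, url, attempts=3):
    # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
    for attempt in range(attempts):
        try:
            return await scrape_page(page, url)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)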


Scaling up

The Python side of Playwright runs in a single process (each browser it launches is a separate process), so true horizontal scaling means spawning multiple Python workers or using containers. Official Docker images stay in sync with each Playwright release. One simple way to split work across workers is to shard the URL list, as in the sketch below.
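
A hypothetical sketch; the --shard and --num-shards flags are made up for illustration and would replace the hard-coded URLS list:

# Run as, e.g.:  python scraper.py --shard 0 --num-shards 4
import argparse

def shard_urls(urls, shard, num_shards):
    # Each worker takes every num_shards-th URL, offset by its shard index.
    return urls[shard::num_shards]

parser = argparse.ArgumentParser()
parser.add_argument("--shard", type=int, default=0)
parser.add_argument("--num-shards", type=int, default=1)
args = parser.parse_args()
URLS = shard_urls(URLS, args.shard, args.num_shards)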


Where to go next

  • Build a CLI wrapper that reads targets from a CSV (a sketch follows this list)

  • Store screenshots with page.screenshot() for quick visual diffing

  • Export to JSON and pipe into an Elastic or ClickHouse cluster for fast querying
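
A minimal sketch of reading targets from a CSV; the targets.csv filename and the url column name are placeholders:

import csv

def load_urls(path="targets.csv"):
    # One URL per row, taken from the "url" column.
    with open(path, newline="") as fh:
        return [row["url"] for row in csv.DictReader(fh)]

# URLS = load_urls()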


Conclusion

With fewer than 100 lines of clean Python, you now have a concurrent, headless scraper that:

  • Handles JavaScript‑heavy sites

  • Can be extended with partitioned cookies to limit third‑party tracking

  • Writes durable results to SQLite

Fork it, tweak it, and publish something cool on dev.to. Happy scraping!
