Grace Evans

Scraping Smarter with Python, Playwright 1.53, and SQLite

A practical, copy‑paste‑ready guide to building a headless scraper that survives modern websites.


Why Playwright?

Playwright's auto‑waiting, cross‑browser coverage, and steady monthly releases make it a rock‑solid bet for production scraping in 2025. Version 1.53 added helpful upgrades such as partitioned cookies and improved HTML report controls.


What we'll build

  1. Launch Chromium in headless mode

  2. Visit a list of URLs

  3. Extract the page title and any email strings

  4. Store results in an SQLite database

  5. Run everything concurrently with asyncio for speed


Prerequisites

python -m venv venv && source venv/bin/activate   # Windows: .\venv\Scripts\activate
pip install playwright aiosqlite
playwright install


Project structure

scraper/
├── scraper.py
└── scraped.db      # created automatically


The code

# scraper.py
import asyncio
import re
from pathlib import Path
from playwright.async_api import async_playwright
import aiosqlite

URLS = [
    "https://example.com",
    "https://python.org",
    # add more...
]

EMAIL_RE = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-.]+")
DB_PATH = Path("scraped.db")

async def save_result(db, url, title, emails):
    await db.execute(
        "INSERT INTO results (url, title, emails) VALUES (?, ?, ?)",
        (url, title, ",".join(emails)),
    )
    await db.commit()

async def scrape_page(page, url):
    await page.goto(url, timeout=30_000)
    await page.wait_for_load_state("networkidle")
    html = await page.content()
    title = await page.title()
    emails = EMAIL_RE.findall(html)
    return title, set(emails)

async def worker(playwright, db, url):
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context(
        locale="en-US",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        java_script_enabled=True,
    )

    page = await context.new_page()
    try:
        title, emails = await scrape_page(page, url)
        await save_result(db, url, title, emails)
        print(f"[+] {url} -> {title} ({len(emails)} emails)")
    except Exception as exc:
        print(f"[!] {url} failed: {exc}")
    finally:
        await context.close()
        await browser.close()

async def main():
    async with aiosqlite.connect(DB_PATH) as db:
        await db.execute(
            """
            CREATE TABLE IF NOT EXISTS results (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT,
                title TEXT,
                emails TEXT
            )
            """
        )
        await db.commit()

        async with async_playwright() as pw:
            tasks = [worker(pw, db, url) for url in URLS]
            await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())


Key techniques explained

1. Async with isolated browsers

Each task launches its own browser and a fresh context, so cookies and localStorage never leak between URLs. Concurrency is bounded only by CPU and RAM, so with a long URL list it is worth capping the number of simultaneous browsers, as in the sketch below.
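
A minimal sketch, reusing the worker() function from the code above; the MAX_CONCURRENCY value is an arbitrary starting point, not something from the original script:

# Cap how many browsers run at once.
MAX_CONCURRENCY = 4
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def bounded_worker(playwright, db, url):
    async with semaphore:  # at most MAX_CONCURRENCY workers proceed at a time
        await worker(playwright, db, url)

# In main(): tasks = [bounded_worker(pw, db, url) for url in URLS]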

2. Partitioned cookies

If you scrape several sites that inspect document.cookie, partitioned (CHIPS) cookies keep third‑party cookies scoped to the top‑level site. The code above does not use them; you opt in by setting the partitionKey field on the cookies you pass to context.add_cookies(), provided your installed Playwright release supports it. A hedged sketch follows.
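
A minimal sketch, assuming your installed release accepts a partitionKey field on the cookies passed to context.add_cookies() (check the changelog for the exact cookie shape); the cookie name, value, and sites here are placeholders:

# Register a third-party cookie partitioned under a specific top-level site.
await context.add_cookies([
    {
        "name": "session",                      # placeholder cookie name
        "value": "abc123",                      # placeholder value
        "domain": "thirdparty.example",         # domain the cookie belongs to
        "path": "/",
        "secure": True,                         # partitioned cookies must be Secure
        "sameSite": "None",
        "partitionKey": "https://example.com",  # top-level site the cookie is scoped to
    }
])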

3. Auto‑waiting

page.goto(...) followed by page.wait_for_load_state("networkidle") removes the need for sleep() calls and prevents empty screenshots. On pages that poll or stream, waiting for a concrete element is often more reliable than networkidle, as in the sketch below.
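
A minimal sketch; the h1 selector is a placeholder for whatever element signals that the page is ready:

await page.goto(url, timeout=30_000)
# Wait until the element we actually need has rendered.
await page.wait_for_selector("h1", timeout=10_000)
title = await page.title()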

4. SQLite for quick persistence

No server and no ORM. For larger volumes, swap in Postgres with asyncpg while keeping the rest of the pipeline largely unchanged, as sketched below.
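
A minimal sketch of what save_result could look like on asyncpg, assuming the results table already exists in Postgres and treating the connection string as a placeholder:

import asyncpg

async def save_result_pg(pool, url, title, emails):
    # asyncpg uses $1-style placeholders instead of sqlite's "?"
    await pool.execute(
        "INSERT INTO results (url, title, emails) VALUES ($1, $2, $3)",
        url, title, ",".join(emails),
    )

# In main(), replace the aiosqlite connection with a pool:
# pool = await asyncpg.create_pool("postgresql://user:pass@localhost/scraper")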


Hardening your scraper

  • CAPTCHA fallback -- detect common CAPTCHA selectors and queue those URLs for manual review or solve with an API

  • Retry logic -- wrap scrape_page in exponential backoff (a sketch follows this list)

  • Proxy rotation -- inject proxy={"server": "...", "username": "...", "password": "..."} into launch()

  • Headful debugging -- set headless=False and add slow_mo=50 during development
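
A minimal backoff wrapper around scrape_page from the code above; the attempt count and delays are arbitrary starting points:

async def scrape_with_retry(page, url, attempts=3):
    # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
    for attempt in range(attempts):
        try:
            return await scrape_page(page, url)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)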


Scaling up

The Python side of Playwright runs in a single process (each browser it launches is a separate process), so true horizontal scaling means spawning multiple Python workers or using containers. Official Docker images stay in sync with each Playwright release. One simple way to split work across workers is to shard the URL list, as in the sketch below.
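
A hypothetical sketch; the --shard and --num-shards flags are made up for illustration and would replace the hard-coded URLS list:

# Run as, e.g.:  python scraper.py --shard 0 --num-shards 4
import argparse

def shard_urls(urls, shard, num_shards):
    # Each worker takes every num_shards-th URL, offset by its shard index.
    return urls[shard::num_shards]

parser = argparse.ArgumentParser()
parser.add_argument("--shard", type=int, default=0)
parser.add_argument("--num-shards", type=int, default=1)
args = parser.parse_args()
URLS = shard_urls(URLS, args.shard, args.num_shards)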


Where to go next

  • Build a CLI wrapper that reads targets from a CSV (a sketch follows this list)

  • Store screenshots with page.screenshot() for quick visual diffing

  • Export to JSON and pipe into an Elastic or ClickHouse cluster for fast querying
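
A minimal sketch of reading targets from a CSV; the targets.csv filename and the url column name are placeholders:

import csv

def load_urls(path="targets.csv"):
    # One URL per row, taken from the "url" column.
    with open(path, newline="") as fh:
        return [row["url"] for row in csv.DictReader(fh)]

# URLS = load_urls()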


Conclusion

With fewer than 100 lines of clean Python, you now have a concurrent, headless scraper that:

  • Handles JavaScript‑heavy sites

  • Can be extended with partitioned cookies to limit third‑party tracking

  • Writes durable results to SQLite

Fork it, tweak it, and publish something cool on dev.to. Happy scraping!
