Scraping Smarter with Python, Playwright 1.53, and SQLite
A practical, copy‑paste‑ready guide to building a headless scraper that survives modern websites.
Why Playwright?
Playwright's auto‑waiting, cross‑browser coverage, and steady monthly releases make it a rock‑solid bet for production scraping in 2025. Version 1.53 added helpful upgrades such as partitioned cookies and improved HTML report controls.
What we'll build
Launch Chromium in headless mode
Visit a list of URLs
Extract the page title and any email strings
Store results in an SQLite database
Run everything concurrently with asyncio for speed
Prerequisites
python -m venv venv && source venv/bin/activate # Windows: .\venv\Scripts\activate
pip install playwright aiosqlite
playwright install
Project structure
scraper/
├── scraper.py
└── scraped.db # created automatically
The code
# scraper.py
import asyncio
import re
from pathlib import Path

import aiosqlite
from playwright.async_api import async_playwright

URLS = [
    "https://example.com",
    "https://python.org",
    # add more...
]

EMAIL_RE = re.compile(r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+\.[A-Za-z0-9-.]+")
DB_PATH = Path("scraped.db")


async def save_result(db, url, title, emails):
    await db.execute(
        "INSERT INTO results (url, title, emails) VALUES (?, ?, ?)",
        (url, title, ",".join(emails)),
    )
    await db.commit()


async def scrape_page(page, url):
    await page.goto(url, timeout=30_000)
    await page.wait_for_load_state("networkidle")
    html = await page.content()
    title = await page.title()
    emails = EMAIL_RE.findall(html)
    return title, set(emails)


async def worker(playwright, db, url):
    browser = await playwright.chromium.launch(headless=True)
    context = await browser.new_context(
        locale="en-US",
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        java_script_enabled=True,
    )
    page = await context.new_page()
    try:
        title, emails = await scrape_page(page, url)
        await save_result(db, url, title, emails)
        print(f"[+] {url} -> {title} ({len(emails)} emails)")
    except Exception as exc:
        print(f"[!] {url} failed: {exc}")
    finally:
        await context.close()
        await browser.close()


async def main():
    async with aiosqlite.connect(DB_PATH) as db:
        await db.execute(
            """
            CREATE TABLE IF NOT EXISTS results (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT,
                title TEXT,
                emails TEXT
            )
            """
        )
        await db.commit()
        async with async_playwright() as pw:
            tasks = [worker(pw, db, url) for url in URLS]
            await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
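To sanity-check a run, you can read the rows back with the standard-library sqlite3 module. A quick inspection snippet, assuming the default scraped.db path from the script above:

import sqlite3

con = sqlite3.connect("scraped.db")
for url, title, emails in con.execute("SELECT url, title, emails FROM results"):
    print(url, title, emails)
con.close()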
Key techniques explained
1. Async with isolated browsers
Each task launches its own browser and a fresh context, so cookies and localStorage never leak between URLs. Concurrency is limited only by CPU and RAM; if you run out of either, cap it with a semaphore as sketched below.
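A minimal sketch of that cap, reusing the worker coroutine from the script above (the limit of 5 is an arbitrary choice):

SEM = asyncio.Semaphore(5)  # at most 5 browsers alive at once

async def bounded_worker(playwright, db, url):
    # acquire a slot before launching a browser; release it when the task finishes
    async with SEM:
        await worker(playwright, db, url)

# in main(): tasks = [bounded_worker(pw, db, url) for url in URLS]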
2. Partitioned cookies
If you scrape several sites that inspect document.cookie, you can give the cookies you set a partitionKey so each one stays scoped to a single top-level site instead of leaking across sites. The script above does not set any cookies, so treat this as optional hardening rather than something shown in the code.
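For illustration, setting a partitioned cookie could look roughly like this. It leans on the partitionKey cookie field described in the 1.53 release notes; the domain, value, and partition key are placeholders, not part of the original script:

# hypothetical: scope a third-party cookie to one top-level site
await context.add_cookies([{
    "name": "session",
    "value": "abc123",
    "domain": ".thirdparty.example",
    "path": "/",
    "secure": True,          # partitioned cookies must be Secure
    "sameSite": "None",
    "partitionKey": "https://example.com",  # top-level site the cookie is partitioned under
}])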
3. Auto‑waiting
page.goto(...) followed by page.wait_for_load_state("networkidle") removes the need for sleep() calls and prevents empty screenshots.
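When networkidle is too blunt (for example on pages that poll in the background), waiting for a concrete element tends to be more reliable. A small alternative sketch, assuming the page you scrape renders an h1:

await page.goto(url, timeout=30_000)
# wait for a specific element instead of network silence
await page.locator("h1").first.wait_for(state="visible", timeout=10_000)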
4. SQLite for quick persistence
No server and no ORM. For larger volumes, swap in Postgres with asyncpg while keeping the rest unchanged.
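For reference, the swap is mostly a change to save_result. A rough sketch on asyncpg, assuming a connection pool and the same results table (the DSN is a placeholder):

import asyncpg  # pip install asyncpg

async def save_result_pg(pool, url, title, emails):
    # Postgres uses $1, $2, ... placeholders instead of SQLite's ?
    async with pool.acquire() as conn:
        await conn.execute(
            "INSERT INTO results (url, title, emails) VALUES ($1, $2, $3)",
            url, title, ",".join(emails),
        )

# pool = await asyncpg.create_pool("postgresql://user:pass@localhost/scraper")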
Hardening your scraper
CAPTCHA fallback -- detect common CAPTCHA selectors and queue those URLs for manual review or solve with an API
Retry logic -- wrap scrape_page in exponential backoff (see the sketch after this list)
Proxy rotation -- inject proxy={"server": "...", "username": "...", "password": "..."} into launch()
Headful debugging -- set headless=False and add slow_mo=50 during development
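As one example of the retry item, here is a small exponential-backoff wrapper around scrape_page; the attempt count and delays are arbitrary choices:

async def scrape_with_retry(page, url, attempts=3):
    # retry scrape_page with exponential backoff: 1 s, 2 s, 4 s, ...
    for attempt in range(attempts):
        try:
            return await scrape_page(page, url)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)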
Scaling up
The Python side of Playwright runs in a single process, so true horizontal scaling means spawning multiple Python worker processes or running several containers. The official Docker images (mcr.microsoft.com/playwright/python) are published in step with each Playwright release.
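One way to fan out across processes is plain multiprocessing, with each worker process running its own event loop. A sketch, assuming main() is refactored into a process_urls(urls) coroutine that scrapes only the list it is given (that refactor is not shown above):

from multiprocessing import Pool

def run_chunk(urls):
    # each process gets its own event loop and its own Playwright driver
    asyncio.run(process_urls(urls))

if __name__ == "__main__":
    chunks = [URLS[i::4] for i in range(4)]  # round-robin split across 4 workers
    with Pool(processes=4) as pool:
        pool.map(run_chunk, chunks)

Keep in mind that several processes writing to the same SQLite file will contend for its single write lock; per-process database files or a Postgres instance avoid that.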
Where to go next
Build a CLI wrapper that reads targets from a CSV
Store screenshots with page.screenshot() for quick visual diffing (see the sketch after this list)
Export to JSON and pipe into an Elastic or ClickHouse cluster for fast querying
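The screenshot idea fits inside scrape_page as a couple of extra lines; the filename scheme below is just an example and assumes a screens/ directory already exists:

# inside scrape_page, after the page has loaded
safe_name = url.replace("://", "_").replace("/", "_")
await page.screenshot(path=f"screens/{safe_name}.png", full_page=True)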
Conclusion
With fewer than 100 lines of clean Python, you now have a concurrent, headless scraper that:
Handles JavaScript‑heavy sites
Can isolate third‑party cookies with partitioned cookies when you add them
Writes durable results to SQLite
Fork it, tweak it, and publish something cool on dev.to. Happy scraping!