Railse Xu

I built two web-page APIs with Playwright — screenshots/PDF + clean article extraction

I kept hitting two annoying needs in side projects:

  1. Turn a web page into a screenshot or PDF (link previews, thumbnails, archiving, reports).
  2. Pull the clean article text out of a page buried in ads and navigation (for LLMs/RAG, reader apps, content pipelines).

Existing services were either pricey or fiddly, so I built two small APIs with Playwright, put them together under Renderly, and listed them on RapidAPI. Here's the build + a few gotchas.

1. Screenshot & PDF API

Give it a URL, get a full-page screenshot (PNG/JPEG) or a PDF. The key is real Chromium, so modern CSS, web fonts, and JS-rendered content all show up.

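from playwright.async_api import async_playwright

async def capture(url: str) -> tuple[bytes, bytes]:
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(args=["--no-sandbox", "--disable-dev-shm-usage"])
        context = await browser.new_context(viewport={"width": 1280, "height": 800})
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")  # wait until network activity settles
        img = await page.screenshot(full_page=True, type="png")  # whole scrollable page
        pdf = await page.pdf(print_background=True)  # PDF output needs headless Chromium
        await browser.close()
        return img, pdf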

The differentiator: clean output for AI

Most cheap screenshot APIs choke on cookie banners and ads, and in 2026 a lot of screenshots are fed to vision models, where banners waste tokens and confuse layout. So I added a block_cookie_banners option that hides common consent banners (OneTrust, Cookiebot, Quantcast…), ads, and chat widgets. You can also pass hide_selectors to hide specific elements, and pass cookies/headers to capture pages behind a login.
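One plausible way to get this kind of hiding is to inject a stylesheet before taking the screenshot. The sketch below is an illustration under assumptions, not the actual Renderly implementation: the selector list is a small hand-picked sample, and build_hide_css and hide_elements are hypothetical helper names. The only Playwright call it relies on is page.add_style_tag(content=...).

```python
# Illustrative sample of consent-banner selectors (an assumption, not the
# list the API actually uses).
DEFAULT_BANNER_SELECTORS = [
    "#onetrust-consent-sdk",   # OneTrust wrapper
    "#CybotCookiebotDialog",   # Cookiebot dialog
    ".qc-cmp2-container",      # Quantcast Choice container
]


def build_hide_css(selectors):
    """Return one CSS rule hiding every selector, or "" if none are given."""
    cleaned = [s.strip() for s in selectors if s and s.strip()]
    if not cleaned:
        return ""
    return ",\n".join(cleaned) + " { display: none !important; }"


async def hide_elements(page, extra_selectors=()):
    """Inject the hiding rule into a Playwright page before capturing it.

    `page` is expected to be a playwright.async_api.Page; `extra_selectors`
    plays the role of a user-supplied hide_selectors list.
    """
    css = build_hide_css([*DEFAULT_BANNER_SELECTORS, *extra_selectors])
    if css:
        await page.add_style_tag(content=css)
```

Call hide_elements(page, [".newsletter-popup"]) after page.goto(...) and before page.screenshot(...). Hiding with CSS leaves the page's scripts alone, so a banner that also locks scrolling on the body may need extra handling.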

2. Article Extraction API

Give it a URL, get the clean main content (Markdown / text / HTML) plus title, author, and word count. It renders the page with Chromium first, then extracts with trafilatura, so JS-loaded content works. Clean Markdown can cut LLM token usage by roughly 70% in RAG pipelines.
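The render-then-extract flow can be sketched roughly like this. It is a minimal sketch under assumptions, not the service's real code: the function names are hypothetical, metadata such as title and author is left out, and output_format="markdown" assumes a reasonably recent trafilatura release.

```python
import re


def word_count(text):
    """Count whitespace-separated tokens in the extracted text."""
    return len(re.findall(r"\S+", text))


async def render_html(url):
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the pure helpers work without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await browser.close()


def extract_markdown(html):
    """Run trafilatura on rendered HTML; return "" if nothing is extracted."""
    import trafilatura

    text = trafilatura.extract(html, output_format="markdown", include_comments=False)
    return text or ""


async def extract_article(url):
    """Render first, then extract, so JS-loaded content is included."""
    html = await render_html(url)
    markdown = extract_markdown(html)
    return {"markdown": markdown, "word_count": word_count(markdown)}
```

Running asyncio.run(extract_article(url)) on an article URL would return the Markdown body and its word count. Handing trafilatura the rendered HTML from page.content(), rather than letting it download the page itself, is what makes JS-loaded content visible to the extractor.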

A few gotchas

  1. Headless cold starts are slow (~30s with Fly.io auto-stop) → keep one machine always on.
  2. Playwright Docker image + pip: the playwright Python package wasn't importable inside the image; add it to your requirements explicitly.
  3. SSRF: reject private/loopback IPs from user-supplied URLs.
  4. Don't build billing yourself: RapidAPI handles keys + billing; verify X-RapidAPI-Proxy-Secret so only proxied requests get through.
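Gotchas 3 and 4 can both be sketched with the standard library. This is one way to do it under stated assumptions, not the exact checks the APIs run: is_blocked_ip, url_is_safe, and is_proxied are hypothetical names, and the secret is assumed to come from your own configuration.

```python
import hmac
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_ip(address):
    """True for private, loopback, link-local and other non-public addresses."""
    ip = ipaddress.ip_address(address.split("%")[0])  # drop any IPv6 zone id
    mapped = getattr(ip, "ipv4_mapped", None)  # unwrap ::ffff:a.b.c.d
    if mapped is not None:
        ip = mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def url_is_safe(url):
    """Allow only http(s) URLs whose host resolves to public addresses."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, None)
    except (socket.gaierror, UnicodeError):
        return False
    # Every resolved address must be public, not just the first one.
    return bool(infos) and all(not is_blocked_ip(info[4][0]) for info in infos)


def is_proxied(headers, expected_secret):
    """Constant-time check of the X-RapidAPI-Proxy-Secret request header."""
    lowered = {k.lower(): v for k, v in headers.items()}
    received = lowered.get("x-rapidapi-proxy-secret", "")
    return bool(expected_secret) and hmac.compare_digest(
        received.encode(), expected_secret.encode()
    )
```

A request handler would reject the call when is_proxied(request_headers, secret) is false, and refuse to render when url_is_safe(target_url) is false. Note that the browser resolves the hostname again when it loads the page, so this check alone does not stop DNS rebinding or redirects to internal hosts; treat it as a first filter.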

Try it

Both have a free tier on RapidAPI.

Feedback welcome — residential-proxy support, mobile viewports, and dark mode are on my list.
