I used to think self-hosting Puppeteer was the right way to handle browser automation. Full control, no vendor lock-in, runs anywhere. What could go wrong?
Everything, it turns out.
The Setup That Seemed Fine
It started with a simple requirement: generate PDF invoices and take screenshots of rendered pages for a SaaS product. Puppeteer seemed like the obvious choice. Install it, spin up a headless Chrome, done.
The first week was great. Screenshots worked. PDFs rendered beautifully. I shipped it and moved on.
The First Sign
Three weeks later, my staging server ran out of memory.
I traced it back to Puppeteer. Specifically: Chrome processes that didn't get cleaned up properly when requests errored out. Each screenshot attempt spawned a browser instance. Error cases — network timeouts, bad URLs, malformed HTML — left zombie Chrome processes dangling in memory.
Fix: wrap everything in try/catch, always call browser.close(). Got it.
Except browser.close() doesn't always work when the browser crashes mid-render.
The Leak You Can't Patch
Here's the version of the code that was running in production:
```javascript
// Self-hosted Puppeteer — looked fine, wasn't fine
const puppeteer = require('puppeteer');

async function takeScreenshot(url) {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
    headless: true,
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });
  const screenshot = await page.screenshot({ type: 'png' });
  await browser.close(); // ← doesn't run if page.goto throws
  return screenshot;
}
// ^ Each crash = orphaned Chrome process eating 150-200MB RAM
// ^ No concurrency limiting = 5 simultaneous requests = OOM
// ^ 'networkidle2' means a page with active WebSockets never resolves
```
The real problem: browser.close() sits after the await that throws. When page.goto times out or throws, you never reach the close call. You can fix this with a try/finally block — but then you're also wrapping puppeteer.launch() in case that fails, and the code starts to look like defensive-programming soup.
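For reference, the try/finally version can be factored into a small wrapper so the cleanup logic lives in one place. This is a sketch, not the original production code — the `withBrowser` helper and its injectable launcher are my own naming (injection lets the cleanup path be exercised without launching Chromium):

```javascript
// Generic "launch, use, always close" wrapper.
// In production, launch would be () => puppeteer.launch({ headless: true, ... })
// and fn would do the newPage/goto/screenshot work.
async function withBrowser(launch, fn) {
  let browser;
  try {
    browser = await launch();
    return await fn(browser);
  } finally {
    if (browser) {
      try {
        await browser.close(); // runs even when fn throws
      } catch (_) {
        // browser crashed mid-render; close() itself failed — nothing left to do
      }
    }
  }
}
```

The point of the shape: every early exit — goto timeout, screenshot failure, even a crash inside close() itself — funnels through the same finally block, so a thrown error can no longer strand a Chrome process.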
I spent two days hardening this. Added connection pools, process reaping, health checks. It worked better. Still leaked occasionally on very long pages.
CI Started Failing
Once Puppeteer was in the repo, CI became unpredictable.
Chromium binaries are large (~300MB). Different engineers had different Chromium versions installed locally. The CI container had yet another version. Screenshot tests that passed locally failed on CI because antialiasing differed between Chrome 119 and Chrome 121.
I added --font-render-hinting=none and a dozen other flags I found in GitHub issues. Some tests stabilized. New ones broke on the next Chromium bump.
The snapshot tests became a maintenance burden. Every dependency update was a gamble.
3am
The page that broke me: a customer's page had an embedded map that used long polling. networkidle2 only fires once there have been no more than 2 network connections in flight for at least 500ms. The map's overlapping long-poll requests kept resetting that window, so navigation never settled.
The screenshot job hung for exactly 30 seconds (timeout), then crashed. The next job started immediately. And hung. And crashed. Until the queue backed up, memory spiked, and the entire service went down.
I got paged at 3am. Fixed it by killing the service and restarting. Spent the next day adding per-request timeout enforcement outside of Puppeteer, because you can't fully trust Puppeteer's own timeouts.
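That outer enforcement can be as simple as racing the whole job against your own timer, then killing the Chrome process yourself when the timer wins. A hedged sketch — the helper name `withHardTimeout` is mine:

```javascript
// Races any promise against a hard deadline. If the deadline wins, the
// caller gets a rejection immediately and can do its own cleanup, e.g.
// browser.process().kill('SIGKILL'), instead of trusting Puppeteer's
// internal timeouts to fire.
function withHardTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`hard timeout after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

The key is to wrap the entire launch → goto → screenshot → close sequence, not just page.goto, so a hang anywhere in the pipeline still trips the deadline.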
The Question I Should Have Asked Earlier
Why am I running a browser process in my API server?
I needed screenshots and PDFs. I didn't need to operate browser infrastructure. Those are different problems.
Browser automation is genuinely hard infrastructure. Sandboxing, memory management, concurrency limits, Chromium version pinning, font rendering consistency — these are solved problems, but only if you dedicate real engineering time to them.
I wasn't building browser infrastructure. I was building a SaaS product.
What the API Version Looks Like
```bash
# SnapAPI — same result, 3 lines
curl -X POST https://api.opspawn.com/screenshot \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "png"}'
# Returns: { "url": "https://...signed-s3-url...", "cached": false }
```
No Chromium binary in your repo. No zombie processes. No 3am pages about memory leaks. The API handles concurrency, sandboxing, and cleanup.
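The same call from Node is a single function. This is a sketch based on the curl example above: the injectable `fetchImpl` parameter is my own convenience for testing, and any authentication headers the API requires are omitted here.

```javascript
// Minimal client for the screenshot endpoint shown above.
// fetchImpl defaults to the global fetch (Node 18+); injectable for tests.
async function requestScreenshot(url, { fetchImpl = fetch } = {}) {
  const res = await fetchImpl('https://api.opspawn.com/screenshot', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, format: 'png' }),
  });
  if (!res.ok) throw new Error(`screenshot API returned ${res.status}`);
  return res.json(); // e.g. { url: <signed S3 URL>, cached: <bool> }
}
```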
When Self-Hosting Still Makes Sense
I'm not saying Puppeteer is bad. It's excellent for what it's designed for:
- Scraping at volume where you need custom cookies, sessions, and JS execution with full control
- E2E testing where you want to test your app's actual rendering in CI
- Interactive automation — clicking, form-filling, navigating multi-step flows
For those use cases, you want the control. The complexity is justified.
For generating screenshots and PDFs as a feature of your product — one endpoint, one use case — you're paying the full complexity cost of browser infrastructure for a minor feature.
The Economics
The time I spent on Puppeteer memory fixes, CI stabilization, and on-call incidents added up to roughly 3 days of engineering time. That's before factoring in the mental load of "is the screenshot service leaking again?" during every deploy.
A hosted screenshot API costs less than $20/month at moderate volume. The math isn't close.
Where it tips the other way: high volume (thousands of screenshots/hour) where per-request pricing becomes expensive. At that scale, dedicated browser farm infrastructure makes sense — but that's a different engineering conversation than "I need to add screenshot export to my app."
What I Actually Built
After moving screenshots off to an API, I had some leftover browser infrastructure code sitting around. I cleaned it up, added PDF export, and turned it into SnapAPI.
It's a hosted screenshot and PDF API that handles the Chromium lifecycle, retries, and output storage. Free trial, no credit card, simple REST API.
→ SnapAPI — free trial at opspawn.com/snapapi
If you're maintaining a Puppeteer setup for screenshot/PDF use cases and fighting the same battles I was, it might be worth a look.
If you've solved the Puppeteer memory leak problem a different way, I'm curious what approach worked for you — leave a comment.