If you've ever tried to save a web article as a PDF, you know the pain. You hit Ctrl+P, and Chrome hands you a document with:
- The site's sticky navbar repeated on every page
- A cookie consent banner covering the first paragraph
- Ads wedged between paragraphs
- Lazy-loaded images rendered as blank rectangles
- An "article" that's 40% sidebar widgets and related links
I save 10-20 articles a day for research — policy documents, legal analyses, long-form journalism. After months of manually cleaning up garbage PDFs and uploading them to Google Drive, I decided to build something better.
The problem is deeper than it looks
My first instinct was to write a Python script with BeautifulSoup. Find the article container, strip the junk, pipe it through WeasyPrint, upload to Drive. Simple, right?
It wasn't. Every site uses different class names for its article body (.article-body, .post-content, .entry-content, .story-body), and that's just the English-language sites. I wrote 30+ selectors and still missed half the pages I needed. Government and legal sites were the worst.
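To make the brittleness concrete, here is a minimal sketch of the selector-list approach. The original script was Python with BeautifulSoup; this TypeScript version only illustrates the idea and is not the author's code. The names `QueryRoot`, `CANDIDATE_SELECTORS`, and `findArticleBody` are hypothetical; only the four class names come from the text above.

```typescript
// Hypothetical sketch of the selector-list approach, not the author's code.
// Anything with a querySelector method works, such as a browser Document.
interface QueryRoot<T> {
  querySelector(selector: string): T | null;
}

// The four class names mentioned above. A real list grows with every new site.
const CANDIDATE_SELECTORS: string[] = [
  ".article-body",
  ".post-content",
  ".entry-content",
  ".story-body",
];

// Return the first matching container, or null when no selector matches.
function findArticleBody<T>(
  root: QueryRoot<T>,
  selectors: string[] = CANDIDATE_SELECTORS
): T | null {
  for (const selector of selectors) {
    const match = root.querySelector(selector);
    if (match !== null) {
      return match;
    }
  }
  return null;
}
```

The `null` branch is the failure described above: a site whose markup uses none of the listed class names yields nothing, and the only fix is to add yet another selector.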
Then there's the dynamic content problem. requests.get() returns only the HTML the server sends; it never runs the page's JavaScript. Lazy images, infinite scroll, client-rendered SPAs: all invisible to a server-side scraper. I tried Selenium, but running a headless browser just to grab an article felt like driving a truck to the corner store.
The real insight was that the user's browser already has everything I need. The page is fully rendered, images are loaded, the user is authenticated — a browser extension can just reach in and grab it.
What I built
Pooch PDF is a Chrome extension with three modes:
Quick PDF captures the full page as-is — a faithful snapshot of whatever you're looking at.
Clean PDF is the magic one. It extracts just the article content, strips all the clutter, and re-renders it as a properly formatted document with clean typography, intelligent page breaks, and an auto-generated table of contents. The output looks like it was typeset for print.
Save to Drive takes either mode and uploads the PDF to a dedicated folder in your Google Drive with a structured filename, searchable metadata, and an AI-generated abstract. One click from article to organized archive.
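The article does not say what the structured filename looks like, so the following is only a sketch of one plausible scheme: date, site, then a slug of the title. The layout and the helpers `pad2`, `slugify`, and `buildFilename` are assumptions made for illustration.

```typescript
// Hypothetical filename scheme: <date>_<site>_<title-slug>.pdf
// The real format used by the extension is not documented in the article.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Lowercase ASCII slug. Note: this simple pattern drops non-ASCII letters.
function slugify(text: string, maxLength: number = 60): string {
  const slug = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each non-alphanumeric run into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
  // Cut long titles, then drop a hyphen left dangling by the cut.
  return slug.slice(0, maxLength).replace(/-+$/, "");
}

function buildFilename(title: string, pageUrl: string, savedAt: Date): string {
  const date = [
    savedAt.getUTCFullYear(),
    pad2(savedAt.getUTCMonth() + 1),
    pad2(savedAt.getUTCDate()),
  ].join("-");
  // Pull the host out with a regex so the sketch has no runtime dependencies;
  // in a browser, new URL(pageUrl).hostname would be the more usual choice.
  const hostMatch = /^[a-z]+:\/\/([^\/:?#]+)/i.exec(pageUrl);
  const rawHost = hostMatch ? hostMatch[1] : undefined;
  const host = rawHost ? rawHost.replace(/^www\./, "") : "unknown-site";
  const slug = slugify(title) || "untitled";
  return date + "_" + slugify(host) + "_" + slug + ".pdf";
}
```

With these assumptions, a page titled "Why PDFs Are Hard: A Field Guide" at https://www.example.com/posts/1, saved on 5 January 2024 (UTC), becomes `2024-01-05_example-com_why-pdfs-are-hard-a-field-guide.pdf`. Leading with the date keeps a Drive folder sorted chronologically by name.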
The hard parts
I won't go deep on implementation, but a few things surprised me:
Content extraction is a solved problem, in the browser. I wasted time on custom heuristics before discovering that Mozilla had already built exactly what I needed for Firefox Reader View: its open-source Readability library. The challenge was adapting it to work within Chrome's extension architecture.
Chrome's Manifest V3 is hostile to PDF generation. Most JavaScript PDF libraries use patterns that MV3's security model treats as code injection; its content security policy forbids eval(), new Function(), and remotely hosted code. Getting a PDF library to pass Chrome Web Store review took more engineering than the rest of the extension combined. If you're building an MV3 extension that generates files, budget time for this.
Single-page PDFs are harder than they sound. Rendering a 15,000-word article as one continuous page (no breaks) requires measuring rendered content height precisely and working around PDF library dimension limits. Getting this right across different screen sizes took a lot of iteration.
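The arithmetic behind that can be sketched as follows. CSS pixels are 1/96 inch and PDF points are 1/72 inch, so a measured height (for example from an element's `scrollHeight`) converts at 0.75 pt per px. The 14,400-unit ceiling below is Acrobat's documented maximum page side, used here as an assumed stand-in; the library the extension uses may enforce a different limit. `planSinglePage` and `SinglePageLayout` are hypothetical names, and uniform scaling is just one possible workaround.

```typescript
const PX_TO_PT = 72 / 96; // CSS pixels are 1/96 in, PDF points are 1/72 in

// Assumed limit: Acrobat's documented maximum of 14,400 units per page side.
// The PDF library actually in use may enforce a different number.
const MAX_PAGE_SIDE_PT = 14400;

interface SinglePageLayout {
  widthPt: number;
  heightPt: number;
  scale: number; // factor to apply when drawing content; 1 means unscaled
}

// Size one continuous page for the content, shrinking it uniformly if the
// natural size would exceed the maximum page side.
function planSinglePage(
  contentWidthPx: number,
  contentHeightPx: number,
  maxSidePt: number = MAX_PAGE_SIDE_PT
): SinglePageLayout {
  if (!(contentWidthPx > 0) || !(contentHeightPx > 0)) {
    throw new RangeError("content dimensions must be positive");
  }
  const naturalWidthPt = contentWidthPx * PX_TO_PT;
  const naturalHeightPt = contentHeightPx * PX_TO_PT;
  const longestSidePt = Math.max(naturalWidthPt, naturalHeightPt);
  const scale = longestSidePt > maxSidePt ? maxSidePt / longestSidePt : 1;
  return {
    widthPt: naturalWidthPt * scale,
    heightPt: naturalHeightPt * scale,
    scale,
  };
}
```

An 800 px wide article that renders 40,000 px tall would naturally need a 600 x 30,000 pt page, so this sketch scales it by 0.48 to fit. Because the measured height changes with viewport width and font loading, the measurement has to happen after layout has settled, which is likely part of why this takes iteration across screen sizes.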
On-device AI is real and useful. Chrome ships with an on-device summarizer. I use it to generate abstracts that get embedded as PDF metadata, making every saved article searchable in Drive months later. It's not available on all machines, so the extension has a graceful fallback chain.
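A fallback chain of this kind might be structured like the sketch below. The article does not list the actual steps, so everything here is an assumption: each step is an async function that returns an abstract or `null`, and the last resort simply keeps the first sentences of the article. In the extension, the first step would presumably wrap Chrome's built-in summarizer; it is left out here because it only exists on some machines. `AbstractStep`, `leadingSentences`, and `generateAbstract` are hypothetical names.

```typescript
// A step returns an abstract, or null when it is unavailable on this machine.
type AbstractStep = (articleText: string) => Promise<string | null>;

// Last-resort step that needs no model: keep the first few sentences.
// The sentence split is naive and will trip on abbreviations like "U.S.".
function leadingSentences(text: string, maxSentences: number = 2): string {
  const sentences = text.match(/[^.!?]+[.!?]+/g);
  if (sentences === null) {
    return text.trim();
  }
  return sentences.slice(0, maxSentences).join("").trim();
}

// Try each step in order; a step that throws or returns nothing is skipped.
async function generateAbstract(
  articleText: string,
  steps: AbstractStep[]
): Promise<string> {
  for (const step of steps) {
    try {
      const result = await step(articleText);
      if (result !== null && result.trim() !== "") {
        return result.trim();
      }
    } catch {
      // Treat any failure (model missing, download blocked) as "unavailable".
    }
  }
  return leadingSentences(articleText);
}
```

Treating a thrown error the same as "not available" keeps the save flow from failing just because the abstract could not be generated; the PDF still gets a usable, if cruder, description.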
Privacy as architecture
This was a conscious design choice, not a marketing angle. Every piece of the pipeline runs client-side:
- Content extraction: in the browser
- PDF generation: in the browser
- AI abstracts: on-device
- OAuth: Chrome's native API (no auth server)
- Storage: the user's own Google Drive
I don't run a single server for the core product. The only network calls are the Drive upload (when the user chooses it) and share link creation for Pro users. Article content leaves the browser only when the user sends it to their own Drive; none of it ever reaches a server of mine.
This isn't just a privacy win — it means zero marginal cost per user. The free tier can be genuinely generous because it costs me nothing to provide.
Try it
Pooch PDF is on the Chrome Web Store. Works on Chrome, Edge, Brave, and Arc.
If you read long-form content and need to keep it — for research, legal work, academia, or just because you're a hoarder like me — give it a try. The free tier is real.