Arham_Q

Posted on Apr 24 • Edited on Apr 25

I Built a PDF Toolkit as a Student (And Deployed It for Free)

#flask #webdev #python

Flask, PyMuPDF, Groq, and a lot of jugaad engineering

Every student has been there. It's 11 PM. You need to compress a PDF before submitting it, convert a JPEG to PDF for a form, or quickly summarize a 40-page document before an exam. You open some sketchy website, it watermarks your file, asks you to pay, and uploads your documents to who-knows-where.

I got tired of it. So I built my own.

DocFlask is an all-in-one document toolkit built with Flask. It handles PDF merging, splitting, conversion, compression, image conversion, and even AI-powered summarization and quiz generation — all for free, hosted on the internet for other broke students.

Here's the honest story of how it got built, the problems I ran into, and the "good enough" solutions I used to ship it anyway.

What It Does

Before the war stories, here's the feature set:

Merge & Split PDFs: combine multiple PDFs or split by page ranges
Convert: PDF → DOCX and DOCX → PDF
Compress: reduce file size for both PDFs and DOCX files
AI Summarize: structured summary from any PDF
Quiz Generator: flashcards and MCQs generated from PDF content
JPEG to PDF: batch convert up to 30 images into one PDF
Image Convert: JPEG ↔ PNG with alpha-safe handling

The Stack

Backend:     Flask
PDF engine:  PyMuPDF (fitz)
PDF→DOCX:    pdf2docx
DOCX→PDF:    python-docx + fpdf2
NLP:         sumy + nltk
AI:          Groq API
Images:      Pillow
Frontend:    Jinja2 + Vanilla JS + Tailwind CDN
Hosting:     Vercel (free)

Nothing fancy. No Docker, no Celery, no Redis. Just Flask doing Flask things.

Challenge 1: The Ghostscript Problem

PDF compression was supposed to use Ghostscript, a battle-tested tool with real compression presets (/screen, /ebook, /printer, and /prepress, each trading quality for file size). The plan was clean:

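import subprocess

def compress_with_ghostscript(input_path, output_path, preset="ebook"):
    # preset must be a Ghostscript name: screen, ebook, printer, or prepress
    cmd = [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dBATCH", "-dQUIET",
        f"-sOutputFile={output_path}",
        input_path
    ]
    # check=True raises CalledProcessError if gs exits with an error
    subprocess.run(cmd, check=True)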

The problem? Vercel's serverless runtime doesn't have Ghostscript installed. And installing system packages on Vercel isn't really a thing.

The jugaad: Silent fallback to PyMuPDF compression.

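# Map UI quality labels to Ghostscript preset names ("medium" is not one)
GS_PRESETS = {"low": "screen", "medium": "ebook", "high": "printer"}

def compress_pdf(input_path, output_path, quality="medium"):
    preset = GS_PRESETS.get(quality, "ebook")
    try:
        compress_with_ghostscript(input_path, output_path, preset)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # Ghostscript missing or failed: fall back to PyMuPDF
        compress_with_pymupdf(input_path, output_path)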

Is PyMuPDF compression as good as Ghostscript? No. Is it good enough for a student compressing a form submission? Yes. The tradeoff was acceptable for the target use case.

Lesson: Know your user. A student compressing a 5-page form doesn't need the same quality as a print shop.
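
A fallback like this does not have to stay silent. Below is a minimal sketch of how the app could report which engine handled a request. The names `detect_engine` and `engine_notice` are hypothetical and not taken from the DocFlask codebase, and the notice wording is only a placeholder.

```python
import shutil


def detect_engine(which=shutil.which):
    """Name the compression engine that would handle a request.

    `which` is injectable so the check can be tested without Ghostscript.
    """
    # shutil.which returns None when no `gs` binary is on PATH.
    return "ghostscript" if which("gs") else "pymupdf"


def engine_notice(engine):
    """Build a short, user-facing note for the result page."""
    if engine == "ghostscript":
        return "Compressed with Ghostscript."
    return "Ghostscript unavailable; compressed with PyMuPDF (results may be larger)."
```

Passing the notice to the template costs one extra variable, and the user learns why the file did not shrink as much as expected.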

Challenge 2: Async Tasks on a Serverless Platform

Summarization and quiz generation take time — sometimes 15-30 seconds depending on PDF size. My original plan used a TaskManager with background threads:

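import threading

class TaskManager:
    def __init__(self):
        self.tasks = {}

    def create_task(self, task_id):
        self.tasks[task_id] = {"status": "pending", "result": None}

    def _run(self, task_id, fn, *args):
        # Minimal worker (assumed): store the outcome for the polling endpoint
        try:
            self.tasks[task_id] = {"status": "complete", "result": fn(*args)}
        except Exception as exc:
            self.tasks[task_id] = {"status": "failed", "message": str(exc)}

    def run_in_background(self, task_id, fn, *args):
        thread = threading.Thread(target=self._run, args=(task_id, fn, *args))
        thread.start()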

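This works perfectly on a real server. On Vercel's serverless functions? The instance is typically frozen as soon as the HTTP response is sent, so the thread never finishes. And because the task dictionary lives in one instance's memory, the polling endpoint may land on a different instance and find nothing.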

The jugaad: Switch summarize and quiz to synchronous execution on Vercel. The user waits. The UI shows a spinner. The function either completes or hits Vercel's 60-second timeout.

For small PDFs (under ~15 pages), it completes fine. For large ones, it times out. The fix? Enforce a soft page limit on upload and set honest expectations in the UI.
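
A soft page limit can be a small pure function that returns both a verdict and the message to show. This is a sketch under assumptions: `check_page_limit` and `MAX_PAGES` are hypothetical names, and the limit of 15 simply mirrors the rough figure above.

```python
MAX_PAGES = 15  # assumed soft limit; tune per route


def check_page_limit(page_count, max_pages=MAX_PAGES):
    """Return (ok, message) for an uploaded PDF's page count."""
    if page_count <= 0:
        return (False, "This PDF has no readable pages.")
    if page_count > max_pages:
        return (
            False,
            f"This PDF has {page_count} pages; the limit is {max_pages}. "
            "Try splitting it first.",
        )
    return (True, "")


# With PyMuPDF, the count would come from the opened document, e.g.:
#   import fitz
#   with fitz.open(path) as doc:
#       ok, message = check_page_limit(doc.page_count)
```

Rejecting the upload before any AI call means the user gets an answer in milliseconds instead of a timeout a minute later.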

Not elegant. Ships though.

Challenge 3: DOCX → PDF Is Harder Than It Looks

I assumed converting a DOCX to PDF would be straightforward. python-docx reads the file, fpdf2 renders it. Simple.

It is not simple.

The combination of python-docx + fpdf2 produces acceptable output for plain text documents. The moment your DOCX has tables, custom fonts, images, or complex formatting — it falls apart. Columns collapse, fonts substitute weirdly, images disappear.

The honest truth: good DOCX→PDF conversion requires either LibreOffice (headless) or a paid API. Neither was available to me for free on Vercel.

What I did: kept the feature, documented the limitation clearly. For simple documents it works. For complex ones, the README tells users to use LibreOffice locally.

Sometimes the right answer is just being transparent about what your tool can't do.
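
For anyone following that advice locally, here is a rough sketch of driving LibreOffice in headless mode from Python. It assumes the LibreOffice binary is on PATH as `soffice` (some Linux packages install it as `libreoffice` instead), and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def libreoffice_cmd(docx_path, out_dir, binary="soffice"):
    """Build the headless LibreOffice command that converts a DOCX to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir):
    """Run the conversion and return the expected output path."""
    subprocess.run(libreoffice_cmd(docx_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Keeping the command builder separate from the `subprocess` call makes it easy to check the arguments without LibreOffice installed.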

The Async Polling Flow (For Features That Need It)

For quiz and summarize, even in sync mode, the frontend uses a polling pattern that was originally designed for async. Here's the simplified version:

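function pollStatus(taskId) {
  const interval = setInterval(async () => {
    try {
      const res = await fetch(`/api/status/${taskId}`);
      if (!res.ok) throw new Error(`Status check failed: ${res.status}`);
      const data = await res.json();

      if (data.status === "complete") {
        clearInterval(interval);
        fetchResult(taskId);
      } else if (data.status === "failed") {
        clearInterval(interval);
        showError(data.message);
      }
    } catch (err) {
      // Stop polling on network or server errors instead of retrying forever
      clearInterval(interval);
      showError(err.message);
    }
  }, 2000);
}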

Even running synchronously, the task ID pattern means the frontend and backend are cleanly decoupled. If I ever move to a real server with proper async, the frontend needs zero changes.
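
On the backend side, the synchronous path can return the same kind of payload the poller reads. The sketch below is an assumption about that shape, inferred from the `status` and `message` fields in the JavaScript above. `run_task_sync` is a hypothetical helper, and returning `result` inline (rather than through a separate result endpoint) is a simplification.

```python
import uuid


def run_task_sync(fn, *args):
    """Run `fn` immediately and return a status payload.

    The keys mirror what the polling code reads: `status`, plus
    `result` on success or `message` on failure.
    """
    task_id = uuid.uuid4().hex
    try:
        result = fn(*args)
    except Exception as exc:
        return {"task_id": task_id, "status": "failed", "message": str(exc)}
    return {"task_id": task_id, "status": "complete", "result": result}
```

Because the payload looks the same whether the work ran inline or in a worker, swapping the execution model later would not change what the frontend parses.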

Deployment: Why Vercel (And Why It Kind Of Works)

Everyone told me to use Render or Railway for a Flask app. They're right — those platforms give you a real Linux environment with persistent processes, system packages, and no cold start issues.

But Render's free tier sleeps after 15 minutes of inactivity. Railway has usage limits. For a portfolio project targeting last-minute student use cases, I needed something that just stays up.

Vercel with a vercel.json config works for Flask if you accept the constraints:

No system packages (hence the Ghostscript fallback)
No persistent background threads (hence synchronous AI features)
60-second function timeout (hence the page limits)

For small files and quick tasks? It handles it fine. That's exactly the use case.

What I'd Do Differently

1. Use LibreOffice headless for DOCX→PDF
It produces near-perfect output. The challenge is hosting — it's a heavy dependency. But for a proper deployment, it's worth it.

2. Add explicit file size and page limits on every route
I added them on some routes (compression: 12 pages, quiz: similar). I should have added them everywhere with clear user-facing messages from day one.

3. Show compression method in the UI
When compression falls back from Ghostscript to PyMuPDF, the user should know. A silent fallback that delivers different quality than advertised is a trust issue.

4. Use a proper task queue
Redis + Celery or even a simple SQLite-backed queue would make the async story clean. In-memory task state means a server restart wipes all pending tasks.
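
As a rough idea of the SQLite option, here is a sketch of the storage half of such a queue: task status kept in a database file instead of a Python dict. It assumes a host with a persistent disk (so not Vercel), and `SqliteTaskStore` with its table layout is hypothetical.

```python
import json
import sqlite3


class SqliteTaskStore:
    """Persist task status in SQLite so a restart does not lose it."""

    def __init__(self, path="tasks.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            "id TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)"
        )
        self.conn.commit()

    def create(self, task_id):
        self.conn.execute(
            "INSERT INTO tasks (id, status, result) VALUES (?, 'pending', NULL)",
            (task_id,),
        )
        self.conn.commit()

    def finish(self, task_id, result):
        # Results are stored as JSON text so any serializable value fits.
        self.conn.execute(
            "UPDATE tasks SET status = 'complete', result = ? WHERE id = ?",
            (json.dumps(result), task_id),
        )
        self.conn.commit()

    def get(self, task_id):
        row = self.conn.execute(
            "SELECT status, result FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            return None
        status, result = row
        parsed = json.loads(result) if result is not None else None
        return {"status": status, "result": parsed}
```

A worker thread or process would open its own connection to the same file, since a `sqlite3` connection is bound to the thread that created it by default.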

Try It

Live demo: https://ihatepdf-tau.vercel.app
GitHub: https://github.com/Arham-Qureshi/I-hate-PDFs

Best for files under 10-15 pages. It's hosted on a free tier, so the first load might take a moment.

Final Thought

This project taught me that shipping something imperfect but functional is better than architecting something perfect that never ships. Every "jugaad" in this codebase is a real constraint I hit, a decision I made, and a tradeoff I understood.

That's engineering. Especially when you're broke.

Built with Flask, PyMuPDF, Groq, and the spirit of jugaad. If you found this useful, drop a ⭐ on GitHub.

Tags: #python #flask #webdev #beginners

DEV Community