Mohamed Elsayed Ali
# How I Built a PDF Processing Pipeline Designed to Scale Toward 10k Users — FastAPI, Inngest & Smart Async Design

I recently built a PDF processing pipeline inside my project Playground.

The goal was simple:

Users upload PDFs, the API responds immediately, and heavy processing happens in the background without blocking request latency.

That sounds straightforward until traffic increases.

A naive implementation works fine for a few users — but starts falling apart when many uploads arrive at once.

In this article, I’ll walk through the architecture, the async design decisions, and a few lessons I learned while building it.

Repository: Playground on GitHub


The Problem

A common first implementation looks something like this:

  • receive upload
  • parse PDF inside the request
  • extract text
  • store results
  • return response

It works — until multiple users upload files concurrently.

The biggest issue is that PDF parsing is not the kind of I/O-bound work async is built for.

Libraries like pdfplumber do CPU-heavy parsing. If that runs directly inside an async route, it blocks the event loop.

That creates a chain reaction:

  • slower response times
  • reduced concurrency
  • request pileups under load

In short:

Async helps with I/O. It does not automatically solve CPU-bound work.
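Here is a minimal, self-contained sketch of that difference. `time.sleep` stands in for CPU-heavy parsing such as pdfplumber: two concurrent requests run back-to-back when the blocking call sits on the event loop, and overlap once it moves to a thread.

```python
import asyncio
import time

def cpu_heavy_parse() -> str:
    # Stand-in for CPU-bound work like PDF parsing;
    # time.sleep blocks the calling thread just as real parsing would.
    time.sleep(0.2)
    return "parsed"

async def blocking_handler() -> str:
    # Runs the blocking call directly on the event loop.
    return cpu_heavy_parse()

async def offloaded_handler() -> str:
    # Moves the blocking call to a worker thread.
    return await asyncio.to_thread(cpu_heavy_parse)

async def timed(handler) -> float:
    start = time.perf_counter()
    await asyncio.gather(handler(), handler())
    return time.perf_counter() - start

blocked = asyncio.run(timed(blocking_handler))    # ~0.4s: the two "requests" serialize
offloaded = asyncio.run(timed(offloaded_handler)) # ~0.2s: the threads overlap
print(f"blocking: {blocked:.2f}s, to_thread: {offloaded:.2f}s")
```

The numbers make the point: async only helps once the blocking work is off the loop.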


My Solution

Instead of processing inside the request lifecycle, I split the pipeline into two stages.

Request path

The API only handles lightweight operations:

  • validate file type
  • perform idempotency checks
  • upload raw file to S3
  • create database record
  • trigger background job

Then it returns immediately.
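The request path above can be sketched as plain functions. Everything here is an illustrative stand-in, not the repository's actual code: dicts replace S3 and the database, a list replaces the event queue, and the names are invented for the example.

```python
import hashlib
import uuid

ALLOWED_TYPES = {"application/pdf"}

def handle_upload(filename: str, content_type: str, raw: bytes,
                  seen_keys: dict, storage: dict, records: dict, queue: list) -> dict:
    """Lightweight request path: validate, dedupe, store, record, enqueue."""
    # 1. Validate file type before doing anything expensive.
    if content_type not in ALLOWED_TYPES:
        return {"status": "rejected", "message": "Only PDF uploads are accepted."}

    # 2. Idempotency check: a content hash makes a retried upload a no-op.
    key = hashlib.sha256(raw).hexdigest()
    if key in seen_keys:
        return seen_keys[key]

    # 3. Upload the raw file to object storage (dict stands in for S3).
    object_key = f"uploads/{uuid.uuid4()}/{filename}"
    storage[object_key] = raw

    # 4. Create a database record (dict stands in for the DB).
    file_id = str(uuid.uuid4())
    records[file_id] = {"object_key": object_key, "status": "PROCESSING"}

    # 5. Trigger the background job (list stands in for the event queue).
    queue.append({"name": "pdf/upload.requested", "data": {"file_id": file_id}})

    response = {"status": "PROCESSING", "file_id": file_id}
    seen_keys[key] = response
    return response
```

Every step is cheap, so the handler's latency stays flat no matter how expensive parsing gets.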

Background worker

The worker performs heavy tasks:

  • download from S3
  • scan file
  • parse PDF
  • extract structured data
  • save final result

Architecture

```
Client
  ↓
FastAPI upload endpoint
  ↓
S3 object storage
  ↓
Inngest event
  ↓
Background worker
  ↓
Database
```

PDF pipeline

That separation keeps the API responsive even when parsing becomes expensive.


Why Idempotency Matters

One subtle issue in async pipelines is duplicate execution.

  • a user may retry the upload
  • a network timeout may happen
  • a background worker may retry after failure

Without idempotency, the same file can be processed multiple times.

My upload endpoint checks for an existing idempotency key before doing any heavy work.

```python
result = await db.execute(
    select(IdempotencyKey).filter(IdempotencyKey.key == idempotency_key)
)
existing_key = result.scalars().first()

if existing_key and existing_key.response:
    return json.loads(existing_key.response)
```

That small check prevents duplicate parsing and duplicate storage.

In production systems, this becomes surprisingly important.
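The full pattern is check, run once, persist. A minimal sketch, using an in-memory dict in place of the IdempotencyKey table (a real version also needs a unique constraint on the key column so two concurrent requests cannot both miss the cache):

```python
import json

def run_idempotent(key: str, store: dict, work) -> dict:
    """Return the cached response for key, or run work once and cache it."""
    cached = store.get(key)
    if cached is not None:
        return json.loads(cached)    # replay the saved response
    result = work()                  # heavy work runs at most once per key
    store[key] = json.dumps(result)  # persist before returning
    return result
```

Calling it twice with the same key returns the same response and runs the work only once.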


Upload Endpoint: Return Fast

The upload route stores the file and immediately schedules background processing.

```python
await inngest_client.send(
    inngest.Event(
        name="pdf/upload.requested",
        data=event_payload,
    )
)

return {
    "file_id": file_record.id,
    "status": FileStatus.PROCESSING.value,
    "message": "PDF uploaded. Processing in background.",
}
```

The client receives a response immediately and can poll later for status updates.

That means the API is optimized for fast acknowledgement, not heavy computation.
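On the client side, polling can stay simple. A sketch, with illustrative names and status values:

```python
import time

def poll_status(fetch_status, file_id: str, interval: float = 0.05,
                timeout: float = 2.0) -> str:
    """Poll until the file leaves PROCESSING or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(file_id)
        if status != "PROCESSING":
            return status
        time.sleep(interval)
    return "TIMEOUT"
```

In a real client, `fetch_status` would be a GET against the file's status endpoint.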


CPU-bound Parsing Needs a Thread Boundary

The worker downloads the file and parses it in the background.

The important part is this:

```python
# Both calls block: the S3 SDK does synchronous network I/O,
# and PDF parsing is CPU-bound, so each runs in a worker thread.
file_bytes = await asyncio.to_thread(blocking_work)
pdf_data = await asyncio.to_thread(_parse_pdf_bytes, file_bytes)
```

Why?

Because:

  • S3 SDK calls are blocking
  • PDF parsing is CPU-heavy

If those ran directly inside async functions, they would block the event loop.

asyncio.to_thread() moves that work off the main async execution path.

That was one of the most useful lessons from building this pipeline.


Background Workflow

The worker itself follows a predictable sequence:

  1. Download file from S3
  2. Scan file
  3. Parse PDF pages
  4. Extract structured metadata
  5. Save results to database

The parsing step returns:

  • page-by-page text
  • extracted tables
  • full combined text

That makes the output usable for both search and downstream LLM enrichment.
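Assembling that output might look like the following sketch, where `pages` is whatever a per-page parsing loop yields. The field names are illustrative, not the repository's actual schema:

```python
def build_parse_result(pages):
    """Assemble the worker's output: per-page text, tables, combined text.

    `pages` is a list of (text, tables) tuples, one per PDF page,
    e.g. what a loop over pdfplumber pages might produce.
    """
    page_texts = [text for text, _ in pages]
    tables = [t for _, page_tables in pages for t in page_tables]
    return {
        "pages": page_texts,                 # page-by-page text, for precise lookup
        "tables": tables,                    # extracted tables, kept as row lists
        "full_text": "\n".join(page_texts),  # combined text for search / LLM input
    }
```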


Error Handling and Audit Logging

One thing I cared about early was observability.

Every major step writes audit events:

  • upload received
  • validation failed
  • S3 stored
  • processing started
  • worker completed
  • worker failed

That made debugging much easier than relying only on exception traces.

When async systems grow, visibility becomes just as important as the code itself.
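An audit writer can be this small. A sketch, with a plain list standing in for an audit table and event names following the list above:

```python
from datetime import datetime, timezone

def write_audit_event(log: list, file_id: str, event: str, detail: str = "") -> None:
    """Append one structured audit event; `log` stands in for an audit table."""
    log.append({
        "file_id": file_id,
        "event": event,        # e.g. "upload_received", "worker_failed"
        "detail": detail,
        "at": datetime.now(timezone.utc).isoformat(),
    })
```

The payoff is a per-file timeline you can read top to bottom when something goes wrong.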


Lessons Learned

1. Async is not magic

Async improves I/O concurrency.

It does not automatically make CPU-heavy workloads scalable.


2. Keep request handlers thin

Request handlers should do only what is necessary to accept work.

Heavy processing belongs in workers.


3. Idempotency is underrated

Retries are normal in distributed systems.

Designing for safe retries makes the whole pipeline more reliable.


4. Background jobs improve system stability

Moving parsing out of request-response flow improves:

  • latency
  • resilience
  • operational predictability

What I’d Improve Next

A few things I’d like to add next:

  • dedicated worker concurrency limits
  • chunked parsing for very large PDFs
  • metrics around parse duration and queue latency
  • smarter retry classification for transient failures

Final Thoughts

This project was a useful reminder that scaling is often less about “making code faster” and more about putting work in the right place.

The biggest architectural decision was simple:

return fast, process later.

That one decision changed the entire behavior of the system.


Full repository:

Playground GitHub repository
