A 60 MB FastAPI Service That Extracts Text From PDFs (and Why It Beats Tika for the 90% Case)
pdftxt-api is a ~200-line FastAPI service built around pypdf. One endpoint — POST /extract — accepts a PDF and returns JSON with the text, page boundaries, and metadata. It's not trying to beat Tika at everything. It's trying to be the right answer for the 90% case: PDFs that already have a text layer, a team that doesn't want to own a JVM, and an ops budget of "one Docker container".
📦 GitHub: https://github.com/sen-ltd/pdftxt-api
The problem nobody wants to solve twice
Every team I've worked with eventually hits the same request: "we need to pull the text out of this PDF." The use case is usually boring and important — search indexing, RAG pipelines feeding a chat model, compliance review, invoice processing, contract diffing. The underlying task is one function call: PDF bytes in, text out.
The answers the ecosystem hands you are all somehow wrong for this shape of problem:
- Apache Tika is the obvious thing. It's mature, it handles dozens of formats, and it's a reference implementation. It is also a JVM service. The official apache/tika image is around 200 MB, cold-starts on the order of seconds, and gives you a JRE you're now responsible for patching. If PDFs are the only format you care about, you're paying for Word, PowerPoint, HWP, and fifty mail formats you'll never touch.
- AWS Textract / Google Document AI are wonderful at what they do — actual OCR on scanned documents, structured form extraction, table detection. They are also priced per page, rate-limited, and send your documents to a third party. For an internal search index over PDFs that already have a text layer, you're paying cloud OCR prices to do text = pdf.extract_text().
- Roll your own around pdfminer.six and you end up with a heavier dependency tree, slower extraction, and code that still has to handle multipart uploads, error shaping, health checks, and logging.
What most teams actually want — and don't realize they want until they see it — is a small HTTP service with one endpoint. Pull it as a sidecar or a deployment. Give it PDFs. Get back text. Don't think about it.
So I wrote one, deliberately tiny, and turned each design choice into a question: is this piece really necessary? The result fits in ~200 lines, ships in a 71 MB Docker image, and has a test suite that takes less than a second.
Design judgment
FastAPI over Flask
If this were 2017, I'd reach for Flask and call it a day. In 2026, FastAPI is the right answer even for a one-endpoint service, and the reason has nothing to do with async.
The killer feature is the automatic OpenAPI spec. A service whose whole existence is "accept a multipart upload and return JSON" will be called from other services, by other teams, maybe in other languages. FastAPI gives me /docs and /openapi.json for free, derived from my Pydantic response models. When someone on the team wants to generate a TypeScript client, or a Go client, or just paste the endpoint into Bruno, the spec is already there. No flask-smorest, no apispec, no manual YAML.
@app.post(
    "/extract",
    response_model=ExtractResponse,
    responses={
        413: {"model": ErrorResponse, "description": "Upload exceeds size limit"},
        415: {"model": ErrorResponse, "description": "Not a PDF"},
        422: {"model": ErrorResponse, "description": "Malformed PDF"},
    },
)
async def extract(
    request: Request,
    file: UploadFile = File(..., description="PDF file to extract"),
    format: str = Query("json", pattern="^(json|plain)$"),
) -> Union[ExtractResponse, PlainTextResponse, JSONResponse]:
    ...
Those responses={} entries aren't just docs — they end up in openapi.json with the right ErrorResponse schema, so a generated client knows exactly what 413 and 415 look like.
pypdf over pdfminer.six or pdfplumber
pypdf is the maintained fork of the original PyPDF2. It's pure Python, has surprisingly complete PDF spec coverage, and its text extraction is good enough for documents that have an embedded text layer. Its install footprint is small, it has no native code, and — critically for a service that runs in a container — it works on Alpine without a C toolchain.
pdfminer.six is more thorough for certain layouts, but it's slower, pulls a bigger dep tree, and the API is notoriously fiddly. pdfplumber is a nicer API on top of pdfminer.six, so it inherits those costs. For a service that promises "text in, text out" — and is honest about not doing layout analysis — pypdf is the right trade. The pitch to users is clear: we'll give you the text stream, in page order, and tell you how many pages we processed. If you need tables reconstructed, use a different tool.
uvicorn without gunicorn
The Flask world trained us to always put gunicorn in front of the app. For this workload that's unnecessary overhead. There is no fork-per-request model here; this is a single Python process handling multipart uploads, doing CPU-bound PDF parsing, and returning JSON. Preforking buys you exactly nothing, and it adds a process manager you have to configure. uvicorn alone, with the orchestrator (Kubernetes, ECS, your laptop) handling replicas, is simpler and lighter.
Magic bytes, not MIME types
One small thing I care about: never trust client-reported content types for security-relevant guards. A browser or a curl invocation can claim Content-Type: application/pdf for any payload. The actual check is whether the bytes start with the PDF magic sequence %PDF-:
PDF_MAGIC = b"%PDF-"

def looks_like_pdf(data: bytes) -> bool:
    """Check PDF magic bytes. Client-reported MIME type is not trusted."""
    return data[:5] == PDF_MAGIC
Five bytes. That's the entire content-type guard. Combined with a hard size limit (see next section), it means a caller can't send 19 MB of random bytes and have the server hand them off to pypdf to discover they're garbage. The rejection happens before we even touch the parser.
Size limits as memory protection
The MAX_UPLOAD_MB env var defaults to 20 MB and causes the server to return 413 payload_too_large for anything larger. This isn't a UX preference, it's memory protection. pypdf needs the full document in memory to build its cross-reference table — you can't meaningfully stream PDF parsing without a different engine. If I don't cap uploads, a single malicious request can OOM the container.
20 MB is the default because that covers essentially every invoice, contract, academic paper, and report you'll care about. Ops can bump it if they're indexing 500-page filings.
Structured JSON logging in the middleware
One of my strong opinions is that services should emit structured logs from day one, not "we'll add that in sprint three". For a small service, a full logging config is overkill. A tiny ASGI middleware is plenty:
class JsonRequestLogger:
    def __init__(self, app: ASGIApp) -> None:
        self.app = app

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        status_code = 500
        response_bytes = 0

        async def send_wrapper(message):
            nonlocal status_code, response_bytes
            if message["type"] == "http.response.start":
                status_code = int(message.get("status", 500))
            elif message["type"] == "http.response.body":
                response_bytes += len(message.get("body", b"") or b"")
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            duration_ms = (time.perf_counter() - start) * 1000.0
            sys.stdout.write(json.dumps({
                "method": scope.get("method"),
                "path": scope.get("path"),
                "status": status_code,
                "duration_ms": round(duration_ms, 2),
                "bytes": response_bytes,
            }, separators=(",", ":")) + "\n")
            sys.stdout.flush()
One line per request, one stdout write, zero external deps beyond starlette.types. Any log shipper (Vector, Fluent Bit, CloudWatch agent) can pick these up directly because they're already valid JSON.
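For reference, a single emitted record looks like this (field values here are made up for illustration) and round-trips losslessly through json.loads:

```python
import json

# One request, one line: the compact separators used by the middleware
# produce a record like this. Values are illustrative.
record = {"method": "POST", "path": "/extract", "status": 200,
          "duration_ms": 41.37, "bytes": 18422}
line = json.dumps(record, separators=(",", ":"))
print(line)
# → {"method":"POST","path":"/extract","status":200,"duration_ms":41.37,"bytes":18422}

# Any JSON-aware shipper can parse it straight back:
assert json.loads(line) == record
```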
Implementation highlights
The service separates cleanly into two modules: extractor.py (pure, no FastAPI) and main.py (the web layer). The split matters because I can unit-test extraction against PDF bytes without spinning up a TestClient, and I can reuse the extraction logic in a different transport (a CLI, a batch job, an SQS consumer) later with zero changes.
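The shapes the extractor returns look roughly like this — sketched here as dataclasses for brevity (the repo uses Pydantic models; the field names are taken from the extraction code):

```python
from dataclasses import dataclass
from typing import List, Optional

# Result shapes, sketched as dataclasses for illustration only;
# the service itself defines Pydantic models with these field names.
@dataclass
class PageText:
    page: int   # 1-based page number
    text: str   # extracted text for that page ("" if none)

@dataclass
class PdfMetadata:
    pages: int
    title: Optional[str] = None
    author: Optional[str] = None

@dataclass
class ExtractionResult:
    text: str               # all non-empty pages joined with blank lines
    page_count: int
    pages: List[PageText]
    metadata: PdfMetadata

result = ExtractionResult(
    text="hello", page_count=1,
    pages=[PageText(page=1, text="hello")],
    metadata=PdfMetadata(pages=1),
)
print(result.pages[0].page)  # 1
```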
The core extraction function:
def extract_text(data: bytes) -> ExtractionResult:
    if not looks_like_pdf(data):
        raise NotAPdfError("payload does not start with %PDF- magic bytes")
    try:
        reader = PdfReader(io.BytesIO(data), strict=False)
    except (PdfReadError, PdfStreamError, ValueError, OSError) as exc:
        raise CorruptPdfError(f"pypdf failed to parse document: {exc}") from exc

    pages: List[PageText] = []
    for idx, page in enumerate(reader.pages, start=1):
        try:
            page_text = page.extract_text() or ""
        except Exception as exc:
            raise CorruptPdfError(f"pypdf failed on page {idx}: {exc}") from exc
        pages.append(PageText(page=idx, text=page_text))

    full_text = "\n\n".join(p.text for p in pages if p.text)
    meta = PdfMetadata(pages=len(pages))
    raw_meta = reader.metadata
    if raw_meta is not None:
        meta.title = _clean(raw_meta.get("/Title"))
        meta.author = _clean(raw_meta.get("/Author"))
    return ExtractionResult(text=full_text, page_count=len(pages), pages=pages, metadata=meta)
Two things worth pointing out:

- Page-by-page accumulation, not whole-document. I explicitly walk reader.pages and keep each page's text as its own record. That's what makes the API useful for RAG pipelines — you can map embeddings back to page numbers — and for citation UIs. It also means if one page is garbage, I know which page.
- Catch broad and re-raise narrow. pypdf's error types have churned over versions, and bad pages occasionally raise things that aren't subclasses of PdfReadError. Catching Exception inside the page loop and re-raising as CorruptPdfError is a deliberate choice: at the API boundary I want exactly three categories — not a PDF, a corrupt PDF, or a successful extraction — and everything else is an internal detail.
Error shaping at the route layer is a tiny helper plus three except blocks:
def _error(status: int, error: str, detail: str) -> JSONResponse:
    return JSONResponse(
        status_code=status,
        content=ErrorResponse(error=error, detail=detail).model_dump(),
    )

try:
    result = extract_text(data)
except NotAPdfError as exc:
    return _error(415, "unsupported_media_type", str(exc))
except CorruptPdfError as exc:
    return _error(422, "unprocessable_pdf", str(exc))
Every error has the same JSON shape: {"error": "stable_code", "detail": "human message"}. The error field is the stable contract (clients switch on it), detail is for humans reading logs.
Tradeoffs and learnings
No OCR is a feature, not a bug. The most important thing to be upfront about is that this service does not do OCR. A scanned PDF with no embedded text layer will come back with an empty text field. That's not a regression to fix later; it's the service's scope. Trying to bolt Tesseract in would triple the image size, add a native dep, and blur the service's identity. If you need OCR, run a separate service and route scanned docs there — don't hybridize.
Layout is hard, and pypdf doesn't pretend otherwise. Tables, multi-column articles, and fancy brochures come out in whatever reading order pypdf's text extraction decides on, which is not always what a human would choose. If you care about layout reconstruction, you need a real layout analyzer (unstructured, pdfplumber, PyMuPDF), all of which cost you image size, native dependencies, or licensing complexity. The lesson I keep re-learning: a service is most valuable when it's honest about what it can't do. Every response includes a metadata.pages field, and if text is empty that's a signal, not a bug report.
Memory, not streaming. I considered streaming the upload and processing incrementally. That's not really possible with PDF: the format stores the cross-reference table at the end of the file, so you need the whole document to parse anything. A streaming-capable engine would be a different project. The pragmatic answer is a hard size cap plus clear documentation.
Errors as first-class citizens. Early drafts of the API returned 500 when pypdf choked. That's wrong: a malformed PDF is a client problem, not a server problem. Teaching the extractor to raise two specific exception types, and teaching the route to map them to 415 and 422, made the API predictable. Clients can now write simple retry logic: 5xx → retry, 4xx → log and move on.
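That contract makes the client's retry policy a three-branch decision. A sketch (not code from the repo):

```python
# Status-code triage the 415/422 mapping enables: only server faults
# are worth retrying; 4xx means the document itself is the problem.
def classify(status: int) -> str:
    if 200 <= status < 300:
        return "ok"
    if 500 <= status < 600:
        return "retry"   # transient server fault: back off and retry
    return "skip"        # 413/415/422: fix the input, not the request

print(classify(200))  # ok
print(classify(503))  # retry
print(classify(422))  # skip
```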
Try it in 30 seconds
git clone https://github.com/sen-ltd/pdftxt-api
cd pdftxt-api
docker build -t pdftxt-api .
docker run --rm -p 8000:8000 pdftxt-api
# In another shell
curl -F "file=@tests/fixtures/sample.pdf" http://localhost:8000/extract | jq
open http://localhost:8000/docs
The image is ~72 MB. docker run to first response is under a second on my laptop. The test suite runs inside the image via docker run --rm --entrypoint pytest pdftxt-api -q.
If you have a one-endpoint need that Tika is currently solving for you, try swapping in pdftxt-api for a week and watch what happens to your image size, your cold starts, and the pile of JVM config you no longer have to care about. It's the kind of tool whose whole pitch is that you stop thinking about it.
