Olebeng

Why we only accept .txt for document uploads - and why that is the right call for now

IntentGuard lets users upload specification documents alongside their repository when submitting an audit. The Intent Agent uses these documents — a product requirements document, an architecture spec, an API reference — to build a higher-confidence model of what the codebase was supposed to do before reading a single line of code.

Currently, we only accept .txt files.

Every few days someone asks why. The honest answer is worth a post.

PDF is not a text format

When you open a PDF in a viewer, you see clean, readable text. What the viewer is actually doing is interpreting a stream of rendering instructions — glyph positions, font mappings, coordinate transforms — and reconstructing what looks like text from absolute positions on a page.
pdfminer.six, the standard Python library for PDF text extraction, reverses this process. It reads the rendering instructions, maps glyphs to Unicode characters using whatever font encoding the PDF creator chose, and attempts to reconstruct reading order from the x/y coordinates of each glyph.
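A toy sketch makes the coordinate-reconstruction step concrete. The glyph positions below are invented, and the sorting rule is a deliberately simplified version of what a real extractor does, but it shows why the approach is fragile:

```python
def reconstruct_text(glyphs):
    """Order glyphs top-to-bottom, then left-to-right, and join them.

    Each glyph is (x, y, char), with y increasing upward as in PDF
    coordinate space. This naive rule works for single-column text.
    """
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))
    return "".join(g[2] for g in ordered)

# Single-column text reconstructs correctly:
single = [(0, 10, "H"), (1, 10, "i"), (0, 0, "!")]
print(reconstruct_text(single))  # Hi!

# Two columns sharing a baseline interleave line by line instead of
# reading the left column first -- the column-swap failure mode:
two_col = [
    (0, 10, "A"), (1, 10, "B"),    # left column, line 1
    (50, 10, "X"), (51, 10, "Y"),  # right column, line 1
    (0, 0, "C"), (50, 0, "Z"),     # line 2 of each column
]
print(reconstruct_text(two_col))  # ABXYCZ
```

The second output is the point: nothing errors, the text just comes out in the wrong order.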

This works well for simple, single-column, machine-generated PDFs. For anything more complex — multi-column layouts, tables, scanned documents, PDFs exported from tools that embed fonts as bitmaps — the extracted text can look plausible while being subtly corrupted. Column order gets swapped. Table cells merge. Headers appear in the middle of paragraphs.
Corrupted structure passed to an intent analysis pipeline does not produce an obvious error. It produces quietly wrong intent claims — which is worse.

The security concern

PDFs can contain embedded JavaScript, OpenAction triggers that fire on open, malicious stream objects, and external URI references. Processing untrusted PDFs without a purpose-built sandboxed parser is a real attack surface. pdfminer has had CVEs. Handling untrusted binary formats in a pipeline that processes proprietary codebases is not a decision to make under time pressure.
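As a rough illustration of what pre-screening looks like (this is a sketch, not IntentGuard's code, and crude byte matching is no substitute for a sandboxed parser — it misses tokens inside compressed streams and can false-positive on literal strings):

```python
# Name tokens associated with active content in PDFs. Scanning raw
# bytes for them is a cheap reject-early check, nothing more.
RISKY_TOKENS = (b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile")

def flag_risky_pdf(data: bytes) -> list[str]:
    """Return the risky name tokens present in the raw PDF bytes."""
    return [t.decode() for t in RISKY_TOKENS if t in data]

benign = b"%PDF-1.4\n1 0 obj << /Type /Catalog >> endobj"
hostile = b"%PDF-1.4\n1 0 obj << /OpenAction << /S /JavaScript >> >> endobj"
print(flag_risky_pdf(benign))   # []
print(flag_risky_pdf(hostile))  # ['/JavaScript', '/OpenAction']
```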

DOCX has a different surface: Office Open XML relationships to external resources, embedded objects, and macro containers. python-docx handles the common case cleanly but edge cases involving embedded objects or external references require careful sanitisation before any content reaches the analysis layer.
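A DOCX file is a zip of XML parts, so the external-reference check is inspectable with the standard library alone. The sketch below (function name and synthetic archive are mine, not from any real pipeline) lists relationship targets marked `TargetMode="External"` — exactly the kind of thing python-docx will not flag for you:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace for OOXML package relationships (.rels parts).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

def external_targets(docx_bytes: bytes) -> list[str]:
    """List relationship targets marked TargetMode="External"."""
    targets = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        for name in z.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(z.read(name))
            for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    targets.append(rel.get("Target"))
    return targets

# Build a minimal stand-in archive with one external relationship:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(
        "word/_rels/document.xml.rels",
        f'<Relationships xmlns="{RELS_NS}">'
        '<Relationship Id="rId1" Type="t" '
        'Target="http://attacker.example/x" TargetMode="External"/>'
        "</Relationships>",
    )
print(external_targets(buf.getvalue()))  # ['http://attacker.example/x']
```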

Why .txt is not a cop-out

A plain text file is deterministic. There is no binary parsing, no font mapping, no coordinate reconstruction, no embedded objects. It goes into the chunker directly. Its encoding is validated at upload. Its size is enforced client-side at 50KB per file, up to five files.
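The whole validation path fits in a few lines. The 50KB-per-file and five-file limits are the ones stated above; the function name is illustrative, and re-checking server-side is an assumption (client-side limits should always be mirrored on the server):

```python
MAX_FILE_BYTES = 50 * 1024  # 50KB per file, as enforced at upload
MAX_FILES = 5               # up to five files per audit

def validate_txt_uploads(files: list[bytes]) -> list[str]:
    """Check count and size limits, then decode each upload as UTF-8."""
    if len(files) > MAX_FILES:
        raise ValueError(f"at most {MAX_FILES} files allowed")
    texts = []
    for i, data in enumerate(files):
        if len(data) > MAX_FILE_BYTES:
            raise ValueError(f"file {i} exceeds {MAX_FILE_BYTES} bytes")
        try:
            texts.append(data.decode("utf-8"))
        except UnicodeDecodeError as exc:
            raise ValueError(f"file {i} is not valid UTF-8: {exc}") from exc
    return texts

print(validate_txt_uploads([b"product spec: users can submit audits"]))
```

That is the entire parsing story for .txt: decode, or reject with a clear error.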

The result is that a founder who pastes their product spec into a .txt file gets more reliable intent analysis than one who uploads a beautifully formatted PDF that extracts poorly. Readable structure matters more than file format.

What is coming

PDF and DOCX upload support is in the Phase D roadmap. The correct approach is a purpose-built extraction pipeline with: sandboxed processing, content validation before the text reaches the chunker, encoding normalisation, and its own test suite. It deserves a dedicated build session and a security review — not a quick dependency add before launch.
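Of those steps, encoding normalisation is the one that is easy to show in isolation. A minimal sketch with Python's standard library (the exact rules a real pipeline applies would be its own design decision):

```python
import unicodedata

def normalise(text: str) -> str:
    """NFC-normalise and strip control characters, keeping \n and \t."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        c for c in text
        if c in "\n\t" or not unicodedata.category(c).startswith("C")
    )

# A combining accent is composed, and a stray NUL byte is dropped:
print(normalise("cafe\u0301\u0000 spec"))  # café spec
```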

Until then: .txt, and it works well.

Building IntentGuard in public from Johannesburg 🇿🇦. If you have built document ingestion pipelines that handle untrusted binary input safely, I'd like to hear how you approached the sandboxing problem.

The concepts discussed are my own; the presentation and formatting of this post were enhanced by an AI assistant.

Olebeng · Founder, IntentGuard · intentguard.dev
