DEV Community

networkingguru
How I Built a Web Interface for 1.4 Million Government Documents with FastAPI, HTMX, and SQLite

When government agencies release the same document multiple times with different redaction patterns — which happens more often than you'd think across FOIA batches, Congressional releases, and litigation disclosures — it's possible to cross-reference the releases and algorithmically recover the hidden text. I built a tool called Unobfuscator to do exactly this with the entire Epstein corpus.

The problem is, Unobfuscator's output is a SQLite database. This is annoying but workable if you're a developer. It's useless if you're a journalist, investigator, or anyone else who actually needs to find things in 1.4 million documents.

So I built TEREDACTA — a web interface that makes those recoveries searchable and explorable.

The Dataset

The current deployment covers the Congressional Epstein/Maxwell releases (the DOJ volumes and the House Oversight releases): 1.4 million documents, 15,220 document match groups, and 5,600+ substantive recovered passages.

Some of what's been recovered is genuinely significant — internal BOP/MCC emails about Epstein's case, staff interview lists mapping shifts at the Manhattan Correctional Center, FBI evidence logs with 113 recovered passages, and Ghislaine Maxwell's own PR response drafts.

The Stack

Here's the part that might surprise you: FastAPI + HTMX + Jinja2 + SQLite. That's it. No React, no Vue, no webpack, no npm, no build step. The entire frontend is HTMX with vendored JS and server-rendered templates.

It sounds like it shouldn't work for an interactive investigation tool, but it does.

Why HTMX

I'll be honest — when I started this project, I assumed I'd end up reaching for React or at least Alpine.js. The feature requirements looked like they needed a proper frontend framework: boolean search with real-time results, an entity relationship explorer, a document viewer with highlighted recovered passages, progress indicators for long-running operations.

HTMX handles all of it. Partial page updates via hx-get and hx-swap. Server-sent events for real-time progress on operations that take more than a moment. The result is an interface that feels reactive without shipping a single line of application JavaScript.
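The key shift with HTMX is that the server returns HTML fragments instead of JSON. Here's a minimal sketch of the server side of an hx-get search box, using a hypothetical render_results_fragment helper (TEREDACTA renders with Jinja2 templates; plain string formatting here just keeps the sketch self-contained):

```python
from html import escape

def render_results_fragment(rows):
    """Render only the <ul> that HTMX swaps into the page.

    Illustrative shape of what a handler bound to an hx-get search
    box returns: a bare HTML fragment, not a full page. hx-swap then
    replaces the old #results element with this one.
    """
    items = "\n".join(
        f'  <li><a href="/doc/{escape(str(r["id"]))}">{escape(r["title"])}</a></li>'
        for r in rows
    )
    return f'<ul id="results">\n{items}\n</ul>'
```

Everything else stays server-side: the browser issues a plain GET, swaps the fragment in, and there's no client state to keep in sync.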

The advantage isn't just simplicity — it's debuggability. When something breaks, there's no component tree to inspect, no state management layer to untangle, no build pipeline to suspect. It's HTTP requests and HTML responses. The browser's network tab tells you everything.

The Performance Problem

Boolean search across 1.4 million documents needs to return results fast enough that investigators don't lose their train of thought. "Fast enough" in this context means under 2 seconds, and ideally under half a second.

SQLite is the database, and SQLite is single-writer. For a read-heavy investigation tool, this is actually fine — reads are concurrent and the dataset is static (new documents get added in batches, not continuously). But the query planning needed careful attention.
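Since the app never writes to the corpus, the connection itself can be opened read-only, and read-only connections can be opened concurrently from many workers. A sketch using Python's sqlite3 URI mode (the path and helper name are illustrative):

```python
import sqlite3

def open_corpus(path="corpus.db"):
    """Open the document database read-only via a URI connection.

    mode=ro makes any write fail at the SQLite level, and multiple
    read-only connections can serve concurrent searches without
    ever contending for the single writer slot.
    """
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    conn.row_factory = sqlite3.Row  # dict-style column access in templates
    return conn
```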

The entity index — people, organizations, locations, emails, phone numbers extracted from recovered text — lives in a separate SQLite database. This was a deliberate choice for query isolation. The entity queries (which involve relationship traversal) have very different access patterns than the document search queries, and separating them means neither workload contaminates the other's page cache.
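The isolation falls out of using two connections to two files: each SQLite connection keeps its own page cache, so a heavy entity-graph traversal can't evict hot document-search pages. A sketch of the idea, with an invented schema (a mentions table of doc_id/entity_id pairs) that is not TEREDACTA's actual layout:

```python
import sqlite3

def open_split(corpus_path, entity_path):
    # Two files, two read-only connections, two independent page caches.
    docs = sqlite3.connect(f"file:{corpus_path}?mode=ro", uri=True)
    ents = sqlite3.connect(f"file:{entity_path}?mode=ro", uri=True)
    return docs, ents

def co_mentions(ents, entity_id, limit=20):
    """Entities appearing in the same documents as entity_id, ranked by
    how many documents they share. Runs entirely against the entity DB,
    never touching the corpus connection's cache."""
    return ents.execute(
        """SELECT b.entity_id, COUNT(*) AS shared
           FROM mentions a JOIN mentions b
             ON a.doc_id = b.doc_id AND b.entity_id != a.entity_id
           WHERE a.entity_id = ?
           GROUP BY b.entity_id
           ORDER BY shared DESC LIMIT ?""",
        (entity_id, limit),
    ).fetchall()
```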

Cold-cache performance was the real challenge. First query after a restart could take 10+ seconds as SQLite populated its page cache. The fix was careful index design and strategic PRAGMA tuning — mmap_size for memory-mapped I/O, cache_size for the page cache — rather than adding an external caching layer. Adding Redis or Memcached to what is otherwise a zero-dependency Python app would've been architectural vandalism.
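The tuning amounts to a few statements run once per connection at startup. A sketch along those lines, where the specific values are illustrative starting points rather than TEREDACTA's exact settings:

```python
import sqlite3

def tune(conn: sqlite3.Connection) -> None:
    # Memory-map the database file so warm reads skip the syscall path.
    conn.execute("PRAGMA mmap_size = 1073741824")   # 1 GiB
    # Negative cache_size is interpreted as KiB: 256 MiB of page cache.
    conn.execute("PRAGMA cache_size = -262144")
    # Keep sort/temp B-trees in memory instead of on disk.
    conn.execute("PRAGMA temp_store = MEMORY")
    # Belt and braces on top of mode=ro: this connection refuses writes.
    conn.execute("PRAGMA query_only = ON")
```

Because PRAGMAs are per-connection, the document and entity databases can be tuned differently to match their access patterns.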

Security Considerations

This is a tool that lets people search through government documents. The security model needs to be airtight not because the data is secret (it's publicly released), but because the tool could be a target.

Authentication uses signed cookies with CSRF tokens. The Unobfuscator database is read-only — the application has no write access to the source data. Input validation includes regex backtracking prevention (a real attack vector against search tools that accept user-supplied patterns). The whole thing runs behind Caddy with automatic TLS.
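Python's re module backtracks, so a hostile pattern like (a+)+$ against a long non-matching string can pin a CPU. One common mitigation is to cap pattern length and reject the classic nested-quantifier shape before compiling. This is a sketch of that idea, not TEREDACTA's exact validator:

```python
import re

# Rejects a quantified group that is itself quantified, e.g. (a+)+ or
# (\d*){2,} -- the canonical catastrophic-backtracking shapes.
NESTED_QUANTIFIER = re.compile(r"\([^)]*[+*][^)]*\)\s*[+*{]")

def safe_pattern(user_pattern: str, max_len: int = 128) -> re.Pattern:
    """Validate a user-supplied search pattern before compiling it."""
    if len(user_pattern) > max_len:
        raise ValueError("pattern too long")
    if NESTED_QUANTIFIER.search(user_pattern):
        raise ValueError("nested quantifiers are not allowed")
    try:
        return re.compile(user_pattern)
    except re.error as exc:
        raise ValueError(f"invalid pattern: {exc}") from exc
```

A blocklist like this is deliberately conservative; pairing it with a hard timeout on the worker gives defense in depth.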

What I'd Do Differently

The SSE implementation for real-time progress went through three iterations before I got the connection management right. Server-sent events sound simple, but handling client disconnection, reconnection, and the inevitable proxy buffering issues (Caddy, Cloudflare, etc.) required more thought than I expected. I'd document this pattern better from the start.

Try It

Everything is MIT licensed. The methodology is transparent and auditable — this is algorithmically recovered text from publicly released documents, not guessed or AI-generated content.

[Live site] | [GitHub]
