Two audits of my own knowledge graph found two unrelated silent failures

#ai #dataengineering #rag #rust

I'm a systems administrator in the Dominican Republic, and for the past several months I've been building ANIMUS on the side — a Rust-based autonomous knowledge graph system running over a real regulatory corpus (Dominican banking regulation PDFs, 800+ documents from the Superintendencia de Bancos). No team, no funding, just evenings and weekends after a day job in application support.

Twice now, auditing my own pipeline has turned up a failure mode I wasn't looking for. Neither one threw an error. Both looked "successful" right up until I actually counted things I had assumed were fine.

I think the pattern is more interesting than either bug on its own, so here are both.

Audit #1 — 52% of the graph was duplicate nodes

I'd been re-running the ingestion pipeline after fixes, with no uniqueness check in place. The same regulatory circular got added to the graph multiple times under slightly different labels — once from the original ingestion, again after a re-run, sometimes a third time after a partial reprocess.

Node count looked great. Over half of it was redundant. The graph looked bigger and more capable than it actually was, and retrieval was quietly skewed toward whichever duplicate happened to match on a given query.

The fix was a deduplication pass keyed on document similarity rather than exact filename matches, since duplicates didn't always share a name. I wrote up the full debugging story when I found it: "How I Found Out 52% of My Knowledge Graph Was Duplicates."
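
To make "keyed on document similarity" concrete, here is a minimal sketch of one common approach: word shingles compared with Jaccard similarity. This is an illustration, not the actual ANIMUS code. The function names, the 5-word shingle size, and the 0.8 threshold are all assumptions chosen for the example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace so formatting noise is ignored."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping k-word windows taken from the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared shingles divided by total distinct shingles; 0.0 if either side is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Compare every pair of documents (O(n^2), acceptable for a few hundred files)."""
    sigs = {name: shingles(text) for name, text in docs.items()}
    names = sorted(sigs)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = jaccard(sigs[a], sigs[b])
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs
```

Because the comparison runs on content, two copies of the same circular are flagged even when their labels differ. The threshold would need tuning against known duplicates before trusting it on a real corpus.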

Audit #2 — 32% of the source corpus never made it into the graph at all

This one was worse, because the failure was invisible by design.

My PDF ingestion script marks a document as successfully processed if the extracted text is longer than 100 characters. That's a reasonable-looking sanity check — empty or near-empty extractions usually mean something went wrong. But:

- Anything at or under that threshold is silently dropped and never reaches the graph.
- The source file is still moved into the "processed" folder, whether or not extraction succeeded.

So the folder said "done" for every single file, while the graph was quietly missing 262 of 817 documents (32% of the corpus). There was no log entry explaining why, because as far as the pipeline was concerned, nothing had failed.
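
The failure described above comes down to two missing pieces: the destination folder did not depend on the outcome, and nothing compared the source list against the graph. A minimal sketch of both follows. The names `ingest_one` and `reconcile`, the `failed` folder, and the injected `extract` callable are hypothetical and not taken from the real script. Only the 100-character threshold comes from the pipeline described here.

```python
import logging
import shutil
from pathlib import Path
from typing import Callable

log = logging.getLogger("ingest")
MIN_CHARS = 100  # the threshold described in the post


def ingest_one(pdf: Path, extract: Callable[[Path], str],
               processed: Path, failed: Path) -> bool:
    """Route the file by outcome: 'processed' only when enough text came back."""
    try:
        text = extract(pdf)
    except Exception:
        log.exception("extraction raised for %s", pdf.name)
        text = ""
    n_chars = len(text.strip())
    ok = n_chars > MIN_CHARS
    if not ok:
        # The log entry that was missing: every drop leaves a trace.
        log.warning("only %d chars extracted from %s; moving to %s",
                    n_chars, pdf.name, failed)
    dest = processed if ok else failed
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf), str(dest / pdf.name))
    return ok


def reconcile(source_names: set[str], graph_names: set[str]) -> set[str]:
    """Source documents that have no corresponding entry in the graph."""
    return source_names - graph_names
```

Running something like `reconcile` after every ingestion turns "count the things that should be fine" into a routine check instead of an occasional audit.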

Root cause: my text extraction (PyMuPDF) only reads text that's already embedded in the PDF. A meaningful chunk of the corpus turned out to be scanned or digitally signed documents with no real text layer — old circulars digitized as images, signed PDFs with the signature flattening the page into a scan. Those come back nearly empty, fail the 100-character check, and vanish without a trace.

Adding an OCR fallback (pytesseract + Tesseract, Spanish language pack) recovered 259 of the 262. The first pass used a low render resolution and produced text with a lot of noise; bumping the DPI and adding basic image preprocessing before OCR helped, though I haven't fully solved the noise problem — more on that below.
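
A rough sketch of what such a fallback can look like, assuming PyMuPDF, pytesseract, and Pillow are installed along with the Tesseract Spanish language data. The 300 DPI value and the grayscale plus autocontrast step are assumptions standing in for "bumping the DPI" and "basic image preprocessing". The function names are hypothetical.

```python
import io

MIN_CHARS = 100  # same threshold as the native-text check
OCR_DPI = 300    # assumed value; the post only says the DPI was raised


def native_text(path: str) -> str:
    """Text already embedded in the PDF, read with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the fallback logic is testable without it
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def ocr_text(path: str, dpi: int = OCR_DPI, lang: str = "spa") -> str:
    """Render each page to an image and run Tesseract on it."""
    import fitz
    import pytesseract
    from PIL import Image, ImageOps
    parts = []
    with fitz.open(path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            # Basic preprocessing: grayscale, then stretch the contrast.
            img = ImageOps.autocontrast(ImageOps.grayscale(img))
            parts.append(pytesseract.image_to_string(img, lang=lang))
    return "\n".join(parts)


def extract_with_fallback(path: str, native=native_text, ocr=ocr_text,
                          min_chars: int = MIN_CHARS) -> tuple[str, str]:
    """Return (text, method). OCR runs only when the native text is too short."""
    text = native(path)
    if len(text.strip()) > min_chars:
        return text, "native"
    return ocr(path), "ocr"
```

Returning the method alongside the text lets the graph record which documents came from OCR, which is useful later when judging text quality. The OCR result should still go through the same length check, since a few documents may stay empty even after the fallback.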

The follow-up problem I haven't solved yet

Some documents that passed the 100-character check — meaning my pipeline considered them fine — turned out to have corrupted native text anyway. My best guess is that the original PDF was digitized with low-quality OCR years ago, by whoever scanned it originally, and that corrupted text got permanently baked into the file. PyMuPDF extracts it faithfully; the extraction isn't broken, the source already was.

One real example from the corpus: a circular about reporting suspicious transactions to the financial intelligence unit extracts with "Entidades" rendered as "Entídodes" and "financiero" as "finonciero." A human can read past that. Keyword-based retrieval can't — the words in the question never match the corrupted words in the document, so the system reports "no information found" about a document that's sitting right there in the graph.

Detecting "text exists but is garbage" turns out to be a meaningfully different problem than detecting "text doesn't exist." I don't have a clean check for it yet — my first instinct (looking for repeated consecutive words, a common OCR artifact) caught exactly one case out of hundreds. I'd genuinely like to hear how other people doing retrieval over scanned or historical document corpora have approached this, because it feels like it should be a solved problem somewhere and I just haven't found the right keyword to search for it.
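
One candidate check, offered as an untested idea and not a solution: measure the fraction of words that do not appear in a reference Spanish vocabulary. Corruptions like "Entídodes" and "finonciero" are not real words, so a document full of them should score high. In this sketch the `vocabulary` set is an assumption. It could come from a Spanish word list or from the clean portion of the corpus, and the cutoff that separates "noisy" from "fine" would have to be calibrated on real documents.

```python
import re
import unicodedata

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def oov_rate(text: str, vocabulary: set[str], min_len: int = 4) -> float:
    """Fraction of words with at least min_len letters that are missing from the vocabulary."""
    # NFC keeps accented letters as single characters so words are not split.
    text = unicodedata.normalize("NFC", text)
    words = [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_len]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in vocabulary)
    return unknown / len(words)


# Tiny illustrative vocabulary; a real one would hold many thousands of words.
vocab = {"entidades", "financiero", "sistema", "deben", "reportar"}
clean = "Las entidades del sistema financiero deben reportar"
garbled = "Las Entídodes del sistema finonciero deben reportar"
print(oov_rate(clean, vocab), oov_rate(garbled, vocab))
```

Regulatory text is heavy on proper nouns, acronyms, and legal terms that a general word list will miss, so comparing each document's rate against the corpus median is probably more informative than using an absolute cutoff.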

What this did to actual benchmark numbers

I run a 36-question evaluation set against the corpus — factual questions answerable from a single document, multi-source questions requiring two or more related documents, and adversarial "trap" questions where the correct answer is "this information isn't in the corpus," designed to catch hallucination.

Corpus coverage went from 68.5% to 99.6% of source documents after the OCR recovery pass. Overall benchmark score went from 58.3% to 63.9%. Honesty on the trap questions hit 100% — zero hallucinated answers.

Multi-source questions, interestingly, didn't move much in aggregate. Several individually went from completely failing to partially succeeding (finding one of two required documents instead of zero), but that was offset by a handful of regressions traced to a separate, unrelated issue: low-value "reflection" nodes — essentially the system's own past answers, including "I couldn't find anything" responses — were crowding out real source documents in a retrieval step that's biased toward recent nodes. That's a different rabbit hole than the one in this post, but it's a good reminder that fixing one silent failure can expose another one sitting right behind it.

Disclosure

This post was drafted with the help of an AI assistant, working from my own debugging notes and logs. The investigation, the fixes, and the numbers are mine — I'm not a native English speaker and writing this entirely from scratch in English isn't something I can do quickly, so I used AI assistance for the writing itself.

Paper and dataset: Zenodo
Code: github.com/ernestoariasdiaz/animus-ai

If you've hit either of these — silent ingestion failures, or "passed the check but the content is garbage" — I'd like to hear about it in the comments.

Top comments (1)

Tae Kim • Jun 29

The duplicate-node pattern is one of the sneakiest because it stays invisible to the retriever until you count. What made dedup reliable in my experience was treating node identity as a two-stage gate: entity normalization first (surface form to canonical ID), then a uniqueness constraint at write time. The second stage alone catches re-ingestion, but without the first you're still mapping different surface forms of the same entity to different nodes.