
John Wade

When the Audit Contradicts Itself: Running a Knowledge Extraction Pipeline Under Load


A coverage audit graded my knowledge extraction B+. It also contradicted itself. The audit said all four processing passes were complete while its own phase field showed only three had run. It identified five extractions as high-quality, then noted that three of them were about the review methodology rather than the subject matter. And it had missed the conversation's most important finding entirely: the recommended ontological foundation for the entire project had no extraction at all.

That's the audit working. Not failing. Working.

In the same session, I ran a validation script against 247 knowledge nodes in my vault. It flagged 25 terminality violations. Twenty-three of them turned out to be the script reading its own documentation — quality checklists, protocol examples, table rows containing the very rules the scanner enforced, all flagged as violations of the rules they described. After tuning, the count dropped to two genuine issues. Those two were real.

I'd built a protocol-governed pipeline for extracting structured knowledge from LLM conversations. Not summaries — structured nodes with frameworks, cross-references, open questions, and adversarial review. I ran it on two conversations: one was forty messages of dense epistemological analysis, the other eight messages of cross-disciplinary synthesis. Ten knowledge nodes came out the other end. The protocol loaded fifteen thousand characters of governance documents before any work began, dispatched parallel agents, ran multi-pass extraction with triage and verification, and hit the context limit six times in a single session.

Everything worked. Everything also had problems. That turned out to be the point.

Part 6 of Building at the Edges of LLM Tooling. If you're running protocol-governed extraction — multi-pass processing, parallel agents, coverage audits — the governance documents that make the work reliable also consume the context window that makes it possible. Start here.


Why It Breaks

Protocol-governed LLM work has a structural tension at its center. The governance documents that make the work reliable also consume the context window that makes the work possible.

My extraction protocol — a skill called /process — required loading a processing protocol, extraction guidelines, an epistemic governance note, a concept registry, framework selection rules, and coverage audit templates before it touched any source material. That's the context tax: fifteen thousand characters of instructions eaten before the first conversation message enters the window. The more governance you add, the less room remains for the actual work. The less governance you add, the more the extraction drifts.
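The context tax is easy to measure before a session starts. A minimal sketch, with hypothetical file names standing in for the actual governance bundle (the real protocol's documents and window size will differ):

```python
from pathlib import Path

# Illustrative governance bundle for a /process-style skill.
# File names are assumptions, not the actual protocol's.
GOVERNANCE = [
    "processing-protocol.md",
    "extraction-guidelines.md",
    "epistemic-governance.md",
    "concept-registry.md",
    "framework-selection.md",
    "coverage-audit-template.md",
]

def context_tax(root: Path, window_chars: int = 200_000) -> tuple[int, float]:
    """Characters consumed by governance docs before any source
    material enters the window, and the fraction of the window
    they eat."""
    tax = sum(
        len((root / name).read_text())
        for name in GOVERNANCE
        if (root / name).exists()
    )
    return tax, tax / window_chars
```

Fifteen thousand characters against a 200,000-character window sounds cheap — until the same bundle gets re-loaded after every compaction.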

And the context window isn't just finite — it resets. My session hit six compaction events. Six times the context filled up, six times the system compressed everything into a summary and continued from there. Each compaction wiped the tool state. The model had to re-read every file it wanted to edit because the prior reads were gone. Protocol documents that had been loaded at the start needed re-loading after each compression. The governance layer was being rebuilt from scratch every time the window refilled.

The coverage audit has its own version of this problem. The audit is written by the same model, in the same session, under the same context pressure as the extraction it's evaluating. When the audit contradicted itself about how many passes had run, that wasn't a logic error — it was the audit operating at the edge of its own context. The earlier passes had been compressed, and the model was reconstructing completion status from a summary rather than from the actual execution record. The auditor and the work being audited share a context window, and that window is always losing information.


What I Tried

The protocol used three mechanisms to manage this: multi-pass processing, parallel agents, and operator overrides.

Multi-pass broke the work into phases. Pass one triaged the conversation into tiered blocks — which segments warranted standalone extraction, which were supporting material, which could be skipped. Pass two ran deep extraction on the top-tier blocks. Pass three verified coverage and cross-checked the extractions against the source. Each pass produced a persistent artifact — a coverage audit file, updated after each phase — so the protocol's state lived in the filesystem, not in the model's memory. When the context compressed, the audit file survived.
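The key move is that pass state lives on disk, not in the model's memory. A sketch of the idea, assuming a hypothetical `coverage_audit.json` schema (the real artifact is a markdown audit document, not this exact layout):

```python
import json
from pathlib import Path

AUDIT = Path("coverage_audit.json")

def load_audit() -> dict:
    """Rebuild protocol state from the filesystem, not from context."""
    if AUDIT.exists():
        return json.loads(AUDIT.read_text())
    return {"phase": 0, "blocks": [], "extractions": []}

def complete_pass(audit: dict, phase: int, results: list) -> dict:
    """Record a finished pass and persist immediately, so a context
    compaction between passes cannot erase completion status."""
    audit["phase"] = phase
    audit["extractions"].extend(results)
    AUDIT.write_text(json.dumps(audit, indent=2))
    return audit
```

When the window compresses mid-session, the next pass reads the audit file instead of trusting a reconstructed summary — which is exactly the failure mode the self-contradicting phase field exposed.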

Parallel agents scaled the extraction. Once the triage identified targets, three independent agents ran simultaneously, each reading the conversation independently and producing a complete knowledge node. This wasn't a performance optimization — it was a context management strategy. Each agent got a fresh context window with the protocol documents and the source material, unburdened by the other agents' work. Parallelism bought context space.
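The dispatch shape can be sketched like this — `run_agent` is a stand-in for a real agent invocation, and the thread pool is only an analogy for the orchestrator; the point is that each call is seeded with protocol plus source and never sees the other agents' output:

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(block: str) -> dict:
    """Placeholder for one extraction agent: fresh context,
    protocol documents + source block in, one knowledge node out."""
    return {"source_block": block, "node": f"extraction of {block!r}"}

def extract_parallel(blocks: list[str], max_agents: int = 3) -> list[dict]:
    """Fan triaged blocks out to independent agents. Isolation is the
    feature: no agent's window carries another agent's work."""
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        return list(pool.map(run_agent, blocks))
```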

Operator overrides handled what the protocol couldn't anticipate. The standard process capped extraction at three nodes per conversation, a density guardrail to prevent over-extraction. But one conversation was unusually dense — the triage identified seven extractable concepts. I looked at the triage output, opened the coverage audit, and overrode the cap: "we're going to perform the exhaustive processing on this one so continue." The term "exhaustive processing" wasn't in the protocol. The system adopted it. By session end, it appeared in the coverage audit as the official processing label.

The validation scripts needed a different kind of fix. The terminality scanner was correctly identifying the patterns it was designed to catch, but the protocol documentation contained examples of those patterns as part of its own rules. The fix was exclusion logic: skip blockquotes, quality checklists, and table rows. Twenty-five flags became two. A framework-coverage script had a similar false positive — a template section triggering a 71% coverage alert for a framework that didn't exist in the actual nodes. Same principle: teach the scanner to distinguish content from documentation.
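The exclusion logic amounts to a few line-shape checks in front of the pattern match. A minimal reconstruction, with an invented violation pattern standing in for the real terminality rules:

```python
import re

# Hypothetical stand-in for a terminality rule: flag lines that
# end work with a placeholder marker. The real scanner's patterns
# and rule names differ.
VIOLATION = re.compile(r"\b(TODO|TBD)\b")

def is_documentation(line: str) -> bool:
    """Skip lines that quote the rules rather than break them."""
    stripped = line.lstrip()
    return (
        stripped.startswith(">")          # blockquotes quoting protocol text
        or stripped.startswith("- [ ]")   # quality-checklist items
        or stripped.startswith("|")       # table rows describing the rules
    )

def scan(text: str) -> list[int]:
    """Return line numbers of genuine violations, excluding
    lines that merely document the pattern being enforced."""
    return [
        i for i, line in enumerate(text.splitlines(), start=1)
        if VIOLATION.search(line) and not is_documentation(line)
    ]
```

The same three checks cover all twenty-three false-positive classes from the session: the scanner was never wrong about the pattern, only about whether the pattern was content or documentation.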


What It Revealed

Protocol governance for LLM work doesn't need to be flawless. It needs to make problems visible.

The coverage audit graded the extraction B+. The node quality was genuinely high — A-range work on what was written. But three of five extractions were about how the source conversation conducted its analysis, not what the analysis actually found. Methodology over substance. And two major findings had no node at all. One was the conversation's explicit recommendation to adopt perspectival realism as the project's ontological foundation — a foundational design decision. The other was a concept the conversation identified as one of three genuinely novel elements in the entire project. Both were mentioned in passing inside other nodes. Neither had its own extraction.

The audit caught this. The same audit that contradicted itself about its own completion status correctly identified the substantive gap in the extraction. Quality control with its own quality problems still surfaced the real issue. That's what the session confirmed.

The validation script told the same story from a different angle. Twenty-three false positives sound like a broken tool. But the two genuine violations were real terminality issues in older nodes that I hadn't caught manually. The scanner worked. It also needed calibration. Both of those things are true, and both matter.

There was a third thing the session revealed, legible only in retrospect. When I said "we're going to perform the exhaustive processing on this one so continue," I was solving a tactical problem — the triage had identified seven concepts in an unusually dense conversation, and the standard cap wasn't right for this one. "Exhaustive processing" wasn't a protocol term. By session end it appeared in the coverage audit as the official label for that processing mode, as if it had always been there. The model had adopted my ad hoc terminology and integrated it into a persistent artifact without ceremony. I noticed something specific: operator language introduced under pressure was propagating into durable system state. The coverage audit survives compaction. My impromptu phrasing was now in a document that would inform future sessions. I hadn't edited the protocol. I had co-authored it, without intending to, by choosing particular words at a particular moment.

The protocol's role turned out to be less like a rulebook and more like a type system. A type checker doesn't prevent all bugs — it prevents certain classes of bugs at compile time so they don't become runtime failures. The extraction protocol didn't prevent methodology-over-substance imbalance, and it didn't prevent its own audit from contradicting itself. What it did produce was a structured artifact — the coverage audit — where those problems were visible, catchable, and fixable before the extraction entered the knowledge base as verified. The operator who looked at the B+ grade and said "Spawn nodes for Perspectival Realism and Interference Zones, fix the coverage audit contradictions" was doing exactly what the protocol was designed to enable: catching problems that the automated layer surfaced but couldn't fix on its own.


The Reusable Rule

If you're building structured workflows with LLMs — extraction pipelines, multi-step analysis, anything that needs consistency across outputs — the constraint document is doing more work than the model.

The diagnostic: when your LLM produces inconsistent output across a multi-step process, check whether the process has governance documents at all. A protocol that defines pass structure, output format, and quality gates does more to stabilize output than prompt iteration alone. The model wasn't failing to produce good work. It was producing good work for an underspecified brief.

Quality control that contradicts itself is still quality control. A coverage audit with bookkeeping errors that catches the project's missing ontological foundation is more valuable than no audit at all. A validation script with twenty-three false positives that finds two genuine issues is more valuable than no script. What I'd do again: build the quality layer. Calibrate it. Don't wait for it to be perfect before trusting it — it will find real problems on day one, and you can fix its own problems as they surface.

The protocol's value isn't in being correct. It's in making errors visible early enough that a human can catch them. The model writes the audit. The human reads the grade. That's the division of labor — and it's the system working as designed.
