Debug Log #2 — The Off-By-One That Didn’t Crash (It Just Lied)

#invariants #dataengineering #testing #debugging

I built a local pipeline to take long chat transcripts saved as PDFs and turn them into something structured, cleaned output where every conversational turn is rewritten into paired labels:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

That pairing is the contract. It’s what makes the transcript auditable instead of just scrollable.

The Symptom

When doing a last integrity pass, I opened the cleaned PDF to confirm the labeling holds from start to finish. But right at the beginning the artifact was telling me a different story:

INPUT 2 / OUTPUT 1
INPUT 3 / OUTPUT 2

The system was still alternating input/output, the output existed, the pipeline completed. But the numbering was shifted from the first turn. The system runs, the output exists, and the output is quietly lying by one. That lie ripples into every downstream count, integrity check, and assumption built on top of it.

The real question became: where is the first place the system starts lying?

Initial Confusion

At first I kept framing it as a counting issue, maybe something in the missing-input/missing-output analysis, maybe a reporting mismatch, maybe the integrity summary was slightly off. I didn’t want to rerun the entire dataset just to test a small correctness problem, so I tried to do it the right way: make a small sample input, isolate the stage, validate expected versus actual.

That immediately raised practical questions I couldn’t dodge. Where do I even inject a sample? If my entrypoint starts at PDFs, how do I test a mid-stage without breaking the whole flow? If I create a CSV, which CSV does the stage actually expect?

The framing itself was the problem. I was treating it like a reporting bug when it was actually a contract bug.

What the Bug Really Was

The system was never meant to count like:

INPUT 1, OUTPUT 2, INPUT 3, OUTPUT 4...

It was meant to preserve paired conversational turns:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

So if the cleaned PDF starts at INPUT 2 / OUTPUT 1, the core failure isn’t in downstream analysis. The numbering contract is being violated somewhere upstream, and everything else is just inheriting the damage. Reframing it that way collapsed the search space immediately. Stop looking at reporting, trace back to wherever the labels get written in the first place.

The Trap I Almost Fell Into

Before that reframe landed, I tried to build a debug input using raw “you said / chatgpt said” style text, because that’s what I visually associate with the PDF source. But a test fixture only helps if it matches the contract of the stage you’re actually testing. Some stages in the pipeline don’t consume raw conversational text, they consume already-columnized CSV data. Feed the wrong-shaped input into the wrong layer and you’re not debugging the system anymore. You’re debugging a mismatch you created.

That was one of the real lessons of this log: if your mental model of the pipeline layers is even slightly off, you can do a lot of work that produces zero signal.

Tracing Back to the First Lie

The way it became solvable was tracing backward from the artifact I trusted until I found the first divergence.

Start with the cleaned PDF, numbering is wrong at the first turn. Work backward through the pipeline outputs and stage boundaries. At each boundary ask: is the numbering still correct here, or did it break here? The moment a layer is confirmed correct, stop blaming it and move earlier.

That tracing forced a clear outcome. The numbering wasn’t being broken by the analysis layer. It wasn’t something happening at the end. It was being introduced in the ingestion and cleaning step, the part of the system that writes the labels in the first place.

Root Cause

The offset wasn’t random drift. It was a systematic base shift baked in from the start.

I remembered why: sometimes when copying a thread, the first “You said:” label doesn’t exist the way the parser expects, so I had added logic to bootstrap the first input anyway. The intention was correct, recover from messy real-world formatting. But the implementation created a permanent misalignment. Input and output were being advanced out of sync at the very beginning, so everything after stayed consistently off by one.

The bug didn’t need to crash to be real. It just needed to violate the contract once.

The Fix

The fix was structural, not a patch. Instead of two separate counters drifting against each other, the labeling logic was rebuilt around a single turn counter that increments only when an INPUT is encountered or injected, labels OUTPUT using that same turn number, and ensures the edge-case injection doesn’t double-increment the first real turn. The goal was to make it structurally impossible for OUTPUT numbering to drift away from INPUT numbering, regardless of what the source formatting looks like.

Proof

I didn’t jump straight into a full run. I validated the fix in isolation first, a small harness that calls the labeling function directly against three cases: normal format, missing first label, and continuation from a higher turn number. Only after the harness proved the contract held did I rerun the full pipeline and spot-check the cleaned PDF from beginning to end.

The output stayed aligned. The labeling read sharper because it was finally consistent.

That’s what closed the loop: not “it seems fixed,” but the invariant proven in isolation, then proven again end-to-end.

What This Log Is Really About

This was a quiet failure mode, a system that runs fine, produces output, and misleads you the whole time.

The takeaway is simple: if an artifact looks slightly wrong, don’t argue with it and don’t patch randomly. Trace backward until you find the first layer where the contract breaks. Fix the smallest layer that owns the contract. Prove it in isolation. Then reintegrate.

That’s how you stop a system from merely running and start making it trustworthy.