Debug Log #1 — The Pipeline That Looked Broken

#python #debugging #etl #instrumentation

I had been building a local ETL pipeline designed to process long conversational PDFs into structured datasets. The system extracted dialogue, cleaned it, generated QA artifacts, and loaded the results into SQLite for downstream analysis.

By the time this debugging process started, the core extract-transform-load flow already worked. Data could move end-to-end through the system successfully.

The problems started showing up once I added the QA and diagnostics stages around it. During development, parts of those systems seemed to work. But when I came back and reran the full pipeline, execution would appear to stop somewhere around diagnostics. Long stretches of silence. No artifacts I could confirm. No database state I fully trusted.

At that point I knew something was wrong, but I didn’t yet have enough experience to understand what kind of wrong it was. I tried a few patches based on the first advice I got, which pointed at possible path issues or output mismatches. None of them solved it. So the project sat for a while. I had to spend time away from the build learning how to read my own codebase, trace execution, and navigate the system well enough to come back and debug it properly.

Initial Understanding

When I came back, I started with a simpler question: was the pipeline actually broken, or did it only look broken because I couldn’t see what was happening?

Part of what triggered that question was staring at the terminal during long runs and realizing the process was still alive even though nothing visible was happening. At the time I was still learning basic operational ideas like what a “hang” even meant in practice. I had been treating long silence like proof that the system was dead, when in reality some pipeline states are just slow, blocked, waiting, or stuck behind expensive work.

That reframe changed the investigation immediately. Instead of treating it like one giant broken object, I started seeing it as a chain of expectations between stages. One script writes outputs. Another script expects those outputs somewhere specific. One stage assumes a schema already exists. Another assumes a naming format already matches. If those assumptions drift even slightly, the whole pipeline can look broken from the outside even when parts of it are still functioning correctly.

So the debugging process became: isolate the stopping point, check what was actually produced, compare it against what the next stage expected, then narrow the mismatch before changing anything.
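The "check what was actually produced, compare it against what the next stage expected" step can be made mechanical. This is a minimal sketch rather than the pipeline's actual code; the function name and the idea of passing a list of expected file names are assumptions made for illustration.

```python
from pathlib import Path


def find_missing_artifacts(output_dir, expected_names):
    """Return the expected file names that are absent or empty in output_dir."""
    base = Path(output_dir)
    missing = []
    for name in expected_names:
        path = base / name
        # An empty file counts as missing: the stage ran but wrote nothing usable.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Running a check like this between stages turns "the next stage seems stuck" into a concrete list of artifacts that never arrived.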

Runtime Visibility

During earlier debugging attempts I had added a short timeout, but I eventually realized the timeout logic only surfaced a warning. It did not terminate the process. The run would hit QA and diagnostics, I’d see the timeout message, and everything after that seemed to disappear.
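To illustrate the difference between a timeout that warns and one that terminates, here is a sketch of the warning-only kind. It is not the original timeout code, which is not shown in this log; `run_with_soft_timeout` and its parameters are hypothetical.

```python
import threading


def run_with_soft_timeout(func, seconds, warn=print):
    """Run func and emit a warning if it is still running after `seconds`.

    The timer only reports. It never interrupts func, so the work continues.
    """
    timer = threading.Timer(
        seconds, warn, args=(f"[WARN] still running after {seconds}s",)
    )
    timer.daemon = True  # do not keep the interpreter alive just for the timer
    timer.start()
    try:
        return func()
    finally:
        timer.cancel()  # harmless if the warning already fired
```

With this shape, a "timeout" message in the terminal says nothing about whether the work stopped. The wrapped function keeps running until it returns or raises on its own.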

Instead of patching again, I started watching the runtime more carefully. I checked CPU and RAM usage to see whether the process was actually dead or just slow under load. I watched where execution appeared to stall. Then I did the thing I usually avoid during long runs: I waited long enough for the system to reveal more information on its own.

That changed the picture completely. The timeout message was not the same thing as “the pipeline stops.” It was just one event earlier in the run. Once I let the process continue, the pipeline moved into later stages successfully. The question stopped being “why does the pipeline die at QA?” and became “what is actually happening after QA finishes?”

Eventually the runtime progressed far enough for the real failure to surface:

sqlite3.OperationalError: no such column: missing_input

That changed the debugging process again. Now the issue was no longer vague runtime ambiguity. There was a concrete failure tied to a specific schema mismatch much later in execution. The pipeline was running farther than it looked. The runtime visibility had just been too weak to make that obvious earlier.
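An error like this can be caught before the load stage by comparing the columns a query expects against what the table actually has. The sketch below uses SQLite's `PRAGMA table_info`; the table name `diagnostics` and the column names other than `missing_input` are made up for the example.

```python
import sqlite3


def missing_columns(conn, table, expected_columns):
    """Return the expected column names that the table does not have."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    # The table name is interpolated, so it must come from trusted code.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    actual = {row[1] for row in rows}
    return [col for col in expected_columns if col not in actual]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnostics (input_id TEXT, missing_output INTEGER)")
print(missing_columns(conn, "diagnostics", ["input_id", "missing_input"]))
# prints ['missing_input']
```

Run at startup, a check like this reports the schema mismatch in seconds instead of after the slow stages have finished.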

Instrumentation and Tracing

Once I understood the pipeline was continuing much farther than I originally thought, I stopped treating it like a mysterious crash and started measuring directly.

I added timing instrumentation at the stage boundaries. The ambiguity disappeared immediately. Diagnostics wasn’t “a little slow.” It was taking around fifty minutes:

[TIMING] diagnostics stage completed in 3036.21s
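Timing at a stage boundary does not need much code. A small context manager is one way to produce a log line in that format; this is a sketch of the idea, and `timed_stage` is an illustrative name, not necessarily what the pipeline uses.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name, log=print):
    """Log how long the wrapped block ran, distinguishing success from failure."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        log(f"[TIMING] {name} stage failed after {time.perf_counter() - start:.2f}s")
        raise
    else:
        log(f"[TIMING] {name} stage completed in {time.perf_counter() - start:.2f}s")


with timed_stage("diagnostics"):
    time.sleep(0.01)  # stand-in for the real stage
```

`time.perf_counter()` is used because it is monotonic, so the measurement is not affected by system clock adjustments during a long run.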

At that point the problem stopped feeling random. One part of the system was repeatedly doing expensive work. So I followed the runtime cost from stage to module to function to loop, until the slowdown had a specific address.

That narrowing led to write_missing_outputs_csv, which accounted for nearly the entire diagnostics runtime.

The Bottleneck

Tracing deeper into that function showed repeated calls to extract_pdf_context(...) inside the row-level loop. Following the call chain into quality_utils.py confirmed what was happening: the function was calling pdfplumber.open(pdf_path), iterating through every page, and rebuilding the extracted text from scratch, then doing the exact same thing on the next row.

Before changing anything, I added a counter to verify the scale of it directly:

extract_pdf_context called: 82

Then the math:

~3081 seconds total / 82 calls ≈ 37.6 seconds per call
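A call counter of this kind can be a throwaway decorator. The sketch below is one way to write it; `count_calls` and the stand-in function `slow_lookup` are hypothetical names, and the real target would be the function under suspicion.

```python
import functools


def count_calls(func):
    """Wrap func so that wrapper.calls tracks how many times it ran."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def slow_lookup(row_id):
    # Stand-in for an expensive per-row function.
    return f"context for {row_id}"


for row_id in range(3):
    slow_lookup(row_id)

print(f"{slow_lookup.__name__} called: {slow_lookup.calls}")
# prints: slow_lookup called: 3
```

Dividing the stage's measured runtime by the final count gives the per-call cost, which is the number that shows whether the loop or the function itself is the problem.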

At that point the issue stopped being a hypothesis. I knew exactly how to reason about a fix.

The Structural Refactor

The fix was about changing the shape of the system so the work couldn’t repeat itself.

The workload changed from:

rows × (open PDF + parse all pages)

to:

(open PDF + parse all pages once) + rows × (string search)

In practice: a new function extract_full_pdf_text() opens and parses the PDF one time before the loop and returns the full text as a string. A second function extract_pdf_context_from_text() takes that cached string and does a lightweight search against it, with no file I/O and no page iteration. Inside the loop, only the search runs. I applied the same refactor pattern across both the QA and diagnostics stages, then removed the temporary profiling scaffolding and kept the durable timing instrumentation that was still useful operationally.
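The shape of that refactor can be sketched as follows. The two function names come from the description above, but their signatures, the `needle` and `window` parameters, and the `contexts_for_rows` helper are assumptions made for illustration, not the actual code in quality_utils.py.

```python
def extract_full_pdf_text(pdf_path):
    """Open and parse the PDF once, returning all page text as one string."""
    import pdfplumber  # third-party; imported lazily so the search helper has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page with no text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pdf_context_from_text(full_text, needle, window=80):
    """Return the text around the first match of needle, or None if absent."""
    index = full_text.find(needle)
    if index == -1:
        return None
    start = max(0, index - window)
    end = min(len(full_text), index + len(needle) + window)
    return full_text[start:end]


def contexts_for_rows(pdf_path, needles):
    """Hypothetical driver: parse once, then search once per row."""
    full_text = extract_full_pdf_text(pdf_path)  # expensive, runs one time
    return [extract_pdf_context_from_text(full_text, n) for n in needles]  # cheap, per row
```

The key property is that the expensive call sits outside the loop, so adding rows only adds string searches.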

First Clean Full Run

Up until this point, one of the hardest parts of debugging the pipeline was not being able to tell whether it was actually finishing at all. Long stretches of silence made the runtime feel ambiguous. It looked dead, stalled, or half-working. I couldn’t fully trust what I was seeing.

Then I finally got a clean full run. The important part wasn’t just that the pipeline completed. It was that the system explained itself clearly at the end:

811 total inputs
770 outputs
0 missing inputs
41 missing outputs
94.94% coverage

The SQLite load completed with 811 rows. The database-side missing outputs count matched what the diagnostics and CSV-side reporting were showing. Earlier in the investigation, different artifacts often contradicted each other and created more ambiguity. Now the outputs were reinforcing each other.
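A summary like the one above can be derived from two sets of IDs, which also makes it easy to recompute from the database side and compare. This sketch assumes "missing outputs" means inputs with no matching output and "missing inputs" means outputs with no matching input; the function name and the synthetic IDs are illustrative only.

```python
def coverage_summary(input_ids, output_ids):
    """Compare expected inputs against produced outputs by ID."""
    inputs, outputs = set(input_ids), set(output_ids)
    missing_outputs = inputs - outputs  # inputs that never produced an output
    missing_inputs = outputs - inputs   # outputs with no matching input
    covered = len(inputs & outputs)
    coverage = 100.0 * covered / len(inputs) if inputs else 0.0
    return {
        "total_inputs": len(inputs),
        "outputs": len(outputs),
        "missing_inputs": len(missing_inputs),
        "missing_outputs": len(missing_outputs),
        "coverage_pct": round(coverage, 2),
    }


# Synthetic IDs chosen only to match the counts reported in the run above.
print(coverage_summary(range(811), range(770)))
```

With 811 inputs and 770 outputs this yields 41 missing outputs and 94.94% coverage, consistent with the figures from the clean run.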

The full run took around 470 seconds, just under eight minutes. That reframed a lot of the earlier fear around the pipeline hanging. Now I had a runtime I could measure and reason about.

The pipeline was no longer a system that seemed to break somewhere in the middle of execution. It had become a system that could complete end-to-end, validate itself, load the database successfully, and expose the remaining problems as specific issues instead of vague uncertainty.

What Changed

Before this investigation, debugging felt mostly reactive. Change something, rerun it, hope the behavior improves. This process introduced a different sequence: wait long enough for the system to reveal itself, instrument at the boundaries, follow the runtime cost layer by layer, verify assumptions with concrete measurements, then refactor the structure rather than the logic.

The lack of observability wasn’t a side issue. It was half the problem. Once timing existed at stage boundaries, the pipeline stopped feeling like a black box and started feeling like something I could reason about directly.