DEV Community: Jovann Thompson

Trace Log #2 — Closing the Loop on OpenBB (Part 2 of the OpenBB Trace Logs)

Jovann Thompson — Fri, 17 Jul 2026 03:33:09 +0000

The first trace stopped at the provider pipeline.

I had confirmed the inner pipeline. fetch_data orchestrated transform_query, aextract_data, and transform_data, and one real Apple balance sheet made it all the way through Financial Modeling Prep and back as a typed FMPBalanceSheetData object. But I knew what I hadn't verified. The QueryExecutor had never run in front of me. I had read the code that handled registry resolution and provider selection. I hadn't watched either happen.

This trace closes that gap, following one request from the moment it enters OpenBB to the pipeline I already understood. The first trace showed how data moved through the provider layer. This one traces how execution reaches it.

Reopening the investigation

About a month had passed. The first thing I knew to do was reopen the environment and try the same query again. It failed immediately.

The traceback pointed into query_executor.py: credentials were being filtered before execution continued, and the FMP key hadn't persisted. Working through what could actually cause it, with AI helping me interpret the traceback, brought me back to something the first trace had already taught me: environment issues are part of the investigation.

Checking which files were actually running turned up something I wasn't expecting.

Three copies of openbb_core existed on the machine: the .venv site-packages copy, which queries actually use; a global copy that pytest was pulling from instead; and the repo source itself, which wasn't running at all. The .venv install had silently changed from editable to a plain static copy sometime over the month. Edit the source now, and nothing happens. I never pinned down what actually triggered that change.
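A check like the one described above can be done with the standard library alone. This is a minimal sketch, not the commands I actually ran: locate_module and describe_install are hypothetical helper names, and the "site-packages" test is a rough heuristic rather than a reliable way to detect an editable install.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def locate_module(name: str) -> Optional[Path]:
    """Return the file this interpreter would load for `name`, or None."""
    spec = importlib.util.find_spec(name)
    # Built-in and namespace packages have no file location to report.
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin).resolve()


def describe_install(name: str) -> str:
    """Summarize where `name` resolves (hypothetical helper for illustration)."""
    path = locate_module(name)
    if path is None:
        return f"{name}: not importable as a file from this interpreter"
    # Heuristic only: a path under site-packages suggests an installed copy,
    # anything else suggests a source tree or the standard library.
    kind = "installed copy" if "site-packages" in path.parts else "source tree or stdlib"
    return f"{name}: {path} ({kind})"


if __name__ == "__main__":
    # Run this with the same interpreter the queries use, then again with
    # the interpreter pytest uses, and compare the two paths.
    print(describe_install("openbb_core"))
```

Running it under each interpreter in turn is what matters. The same module name can resolve to a different file depending on which Python is doing the importing.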

Once the key was restored and I knew which copy was real, the same Apple balance sheet came back. I had a working baseline before adding instrumentation.

Reading the orchestration layer

The first trace started by opening files and reading them top to bottom, and that was the mistake. This time I didn't repeat it. I started from a point I already trusted: fetch_data was already confirmed as the pipeline's entry point. What I needed to trace next was what had to happen before fetch_data could run.

The trail led into command_runner.py, in the application layer.

This code wasn't changing data. It was deciding what should run and handing work off, and that's what I followed: who does this hand work to next?

The chain resolved cleanly on paper. run() receives the request and resolves the command from its route. _execute_func() builds and validates the parameters next. _command() is small: its only job is to execute the resolved function and stamp the answering provider onto the result. Inside that call, maybe_coroutine runs the command function, and execution enters the three-stage pipeline I had already confirmed during the first trace.

None of that was confirmed yet. It still had to run.

Watching the chain execute

Using Cursor, I added DEBUG TRACE probes at run, _execute_func, _command, and around maybe_coroutine, placed above the existing pipeline probes. This time I knew exactly where they needed to live: in the .venv site-packages copies, the only ones that actually execute.

Then I ran one real request for Apple's balance sheet.

The trace fired in order. run logged the route. _execute_func validated the request. _command logged the provider it resolved: fmp, the piece I'd only inferred a month earlier. maybe_coroutine ran the command function. Then the pipeline from the first trace ran: transform_query built the request, aextract_data hit Financial Modeling Prep and returned one record, transform_data turned it into a typed FMPBalanceSheetData object. _command logged the provider stamped onto the result. run returned the completed OBBject.
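The order of that hand-off can be modeled with a toy version of the chain. This is not OpenBB's code: the function names are borrowed from the trace for readability, but every body here, the probe decorator, and the "toy/balance" route are stand-ins I made up to show how entry probes record call order.

```python
import asyncio
import functools
import inspect

TRACE = []  # names of probed functions, in the order they were entered


def probe(func):
    """Record entry into an async function, like a DEBUG TRACE print would."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        TRACE.append(func.__name__)
        return await func(*args, **kwargs)
    return wrapper


@probe
async def fetch_data(symbol):
    # Stand-in for the provider pipeline confirmed in the first trace.
    return {"symbol": symbol, "provider": None}


@probe
async def maybe_coroutine(func, *args):
    # Toy version: call the function and await the result only if needed.
    result = func(*args)
    return await result if inspect.isawaitable(result) else result


@probe
async def _command(func, symbol, provider):
    result = await maybe_coroutine(func, symbol)
    result["provider"] = provider  # stamp the answering provider
    return result


@probe
async def _execute_func(route, symbol):
    commands = {"toy/balance": fetch_data}  # hypothetical route table
    return await _command(commands[route], symbol, provider="fmp")


@probe
async def run(route, symbol):
    return await _execute_func(route, symbol)


if __name__ == "__main__":
    print(asyncio.run(run("toy/balance", "AAPL")))
    print(" -> ".join(TRACE))
```

The useful property is that TRACE is filled by execution, not by reading. If the recorded order ever disagrees with the order on paper, the paper is wrong.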

The map, now complete

The status diagram from the first trace still had a large gray section above the provider pipeline.

After this run, it didn't.

The architecture I had inferred a month ago was now confirmed through live execution, against the Financial Modeling Prep balance sheet pipeline.

I confirmed the architecture-level execution flow through one provider implementation. I did not trace every provider OpenBB supports. This also reflects the version installed in that .venv at the time. OpenBB was mid-migration during this investigation.

Checking the trace against the documentation

After the trace was finished, I read OpenBB's architecture documentation. The execution flow was already there: the command runner, the query executor, provider resolution, and the transform-extract-transform pattern.

I'd built my own model first, from the code and from watching it run. Reading the documentation first would have meant confirming someone else's understanding instead of building my own. Doing the trace first meant I had something real to compare against theirs, and they matched.

What this trace produced

The OpenBB architecture was the subject. Learning when to follow data and when to follow execution was the result.

Trace Log #1 — Tracing OpenBB’s Provider Pipeline (Part 1 of the OpenBB Trace Logs)

Jovann Thompson — Thu, 18 Jun 2026 03:14:10 +0000

The original goal was simple: figure out what OpenBB actually does and how it works internally. From the code.

My first approach was to start tracing individual files. I found the provider subsystem, the abstract fetcher layer, query parameter models, and provider implementations and began reading directly. This produced confusion faster than understanding. I was collecting files without knowing how they connected. The mistake was trying to trace execution before establishing any orientation at all.

The first real lesson of this investigation came from that failure: orient before you trace. Understand the platform promise, identify the major components, build a flow hypothesis, and only then start following execution. Without orientation a repository is just disconnected files. With orientation the files become parts of a system.

Orienting to the System

The orientation phase started with OpenBB’s stated purpose: connect once, consume everywhere. That framing revealed something useful immediately. OpenBB is not primarily an analysis or visualization tool. Its core responsibility is connecting to multiple external financial data providers and presenting their outputs through a single consistent interface. The analysis layer sits on top of that. The integration layer is what makes it work.

At a high level the platform has four major areas: Core, the Provider Layer, an API Layer, and an Application Layer. The Provider subsystem became the investigation focus because it sits closest to the platform promise. My working hypothesis going into the trace was straightforward. OpenBB connects to external providers, retrieves provider-specific data, normalizes it into standard models, and returns consistent outputs regardless of which source answered the request.

This orientation phase also introduced a distinction that made everything else easier to reason about. Some components in the subsystem are passive. They define contracts, schemas, and models. They describe the system. Others are active. They perform work at runtime. They move data through the system. The Registry, the Registry Map, the Abstract Contracts, and the Standard Models are passive. The QueryExecutor, Provider, and Fetcher are active. Keeping that separation clear prevented a lot of confusion during the trace.

The Static Trace

With a system map in place I began tracing execution without running anything.

The most important realization during this phase had nothing to do with OpenBB specifically. Execution order is not file order. Repositories are organized for humans. Runtime execution is organized around calls. Reading files sequentially produces a misleading picture of how a system actually behaves at runtime. I shifted my focus away from what file comes next and toward what function gets called next. That change made the static trace possible.

Following the call chain through the provider subsystem produced a working hypothesis. A user request enters the QueryExecutor. The executor consults the Registry to resolve the provider, consults the Provider to resolve the fetcher, validates credentials, and hands off to the fetcher. The fetcher runs a three-stage pipeline: transform the query, extract the data, transform the output. The result comes back as a standard model wrapped in an annotated result.
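The three-stage shape of that hypothesis is easier to hold onto as code. The sketch below is a toy, assuming nothing about OpenBB's real classes: ToyFetcher, ToyQuery, and ToyBalanceSheet are hypothetical names, the "extract" step returns a hard-coded placeholder record instead of calling any provider, and the numbers are not real financial data.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyQuery:
    symbol: str
    limit: int = 1


@dataclass
class ToyBalanceSheet:
    symbol: str
    total_assets: float


class ToyFetcher:
    """Toy model of the transform-extract-transform pattern."""

    @staticmethod
    def transform_query(params: dict) -> ToyQuery:
        # Stage 1: validate and normalize the caller's raw parameters.
        return ToyQuery(symbol=str(params["symbol"]).upper(),
                        limit=int(params.get("limit", 1)))

    @staticmethod
    async def aextract_data(query: ToyQuery) -> list:
        # Stage 2: stand-in for the network call; placeholder values only.
        raw = [{"symbol": query.symbol, "totalAssets": "100.0"}]
        return raw[: query.limit]

    @staticmethod
    def transform_data(raw: list) -> list:
        # Stage 3: turn provider-shaped dicts into typed objects.
        return [ToyBalanceSheet(symbol=r["symbol"],
                                total_assets=float(r["totalAssets"]))
                for r in raw]

    @classmethod
    async def fetch_data(cls, params: dict) -> list:
        query = cls.transform_query(params)
        raw = await cls.aextract_data(query)
        return cls.transform_data(raw)


if __name__ == "__main__":
    print(asyncio.run(ToyFetcher.fetch_data({"symbol": "aapl"})))
```

Writing the prediction in this form makes it testable: each stage has one input type and one output type, so a runtime trace either matches the sequence or it does not.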

At this stage the trace existed entirely on paper. The goal was not to prove behavior but to predict it precisely enough that running the code would either confirm or contradict the model.

Setting Up the Environment

Before runtime verification could happen, a working local environment was needed. This involved installing OpenBB locally, resolving dependencies, verifying package imports, running provider tests, and confirming the codebase could execute at all.

Several of the most significant discoveries came from environment and installation questions rather than application logic questions. Getting the environment right was not a setup step I moved past. It kept surfacing as part of the investigation itself.

Adding Instrumentation

To validate runtime behavior I added trace statements to the execution path. The assumption going in was that modifying the repository source would immediately produce output. That turned out to be wrong.

The first instrumentation attempt produced nothing visible in the terminal. The issue wasn’t the instrumentation itself. The assumption about where to place it had never been verified. I had directed the tooling without first confirming the exact file and layer the runtime was actually executing. The lesson that came out of that was about drilling down on placement assumptions before trusting any instrumentation output. If you’re not certain where the code runs, you’re not certain what your traces mean.

After confirming the instrumentation was in the right file, output still failed to appear during test runs. Further investigation revealed that pytest captures stdout by default. The trace statements were executing, but pytest was intercepting the output before it reached the terminal. Running tests with the -s flag exposed everything and confirmed the instrumentation had been working correctly the entire time.
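The effect is easy to reproduce without pytest. The sketch below is only an analogy: pytest's own capture works at a lower level than contextlib.redirect_stdout, and traced_step and run_captured are hypothetical names. It shows the one thing that mattered here, which is that a print can execute and still never reach the terminal.

```python
import contextlib
import io


def traced_step() -> int:
    """A function with a trace statement, like the probes in the fetcher."""
    print("DEBUG TRACE: traced_step ran")
    return 42


def run_captured(func):
    """Run `func` while diverting stdout into a buffer, as a test runner might."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func()
    # The trace text exists, but only inside the buffer.
    return result, buffer.getvalue()


if __name__ == "__main__":
    result, captured = run_captured(traced_step)
    print(f"result={result}, captured={captured!r}")
```

With pytest itself, the -s flag (shorthand for --capture=no) turns capturing off so the trace lines reach the terminal directly.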

That second discovery was its own kind of lesson. Debugging infrastructure can become part of the investigation. The tooling around the system is as important to understand as the system itself.

Running and Verifying

With instrumentation active and the environment confirmed, runtime verification began.

The initial execution produced a 401 Unauthorized response from Financial Modeling Prep.

That failure was informative rather than discouraging. It proved that execution had successfully traveled all the way through the QueryExecutor, through the provider resolution, through the fetcher, and reached the external provider boundary. The architecture had worked. Authentication was the only thing that stopped it.

After obtaining a valid FMP API key and configuring credentials, the trace ran again. This time it completed.

The output showed every stage firing in sequence: fetch_data orchestrated the pipeline, transform_query ran and validated the input, aextract_data hit FMP’s servers and returned data, transform_data normalized the raw JSON into a typed FMPBalanceSheetData object. One real Apple balance sheet record came back.

The runtime flow matched the static model almost exactly. The prediction held.

What the Trace Confirmed and What Remains

The investigation validated the provider execution pipeline end to end. Query transformation ran. External API retrieval worked. Data normalization produced a standard model. Watching real AAPL balance sheet data move through the pipeline confirmed what the architecture was supposed to do.

Several areas remain outside the scope of this trace. The QueryExecutor was not verified live. Registry resolution mechanics, provider selection logic, the AnnotatedResult lifecycle, and the API and Application layers were not traced. Those become the next investigation targets.

What stayed with me after this wasn’t only the OpenBB architecture. It was what the process showed me. The static trace only meant something because there was a prediction to test when the code ran. The runtime output only confirmed something because the model existed to compare it against. One without the other would have been either guessing or observing without context. Together they closed a loop I didn’t know I was trying to close when I started.

That cycle (read the code, form a prediction, run the code, compare reality to the prediction) is the process this investigation produced. The OpenBB provider subsystem was the subject. Learning to trace unfamiliar systems was the result.

Systems Primitives: A Practical Software Systems Reading Framework

Jovann Thompson — Wed, 27 May 2026 01:09:45 +0000

While debugging my first software system, I kept running into the same problem: I could see failures happening, but I couldn’t consistently explain them. Sometimes the pipeline stalled. Sometimes artifacts looked correct while downstream stages failed anyway. Sometimes local fixes worked briefly, then broke again under slightly different conditions.

At first, every issue felt like a separate bug. Over time I realized the deeper problem was that I didn’t yet have a stable way to read the system itself. I needed a way to answer questions like: what is this stage actually responsible for? Where does this behavior originate? What assumptions exist between components? What kind of failure is this?

So during the process of learning how to navigate my own codebase more seriously, I started organizing a small set of recurring concepts that helped reduce ambiguity while debugging. It’s a practical reading framework, a set of primitives that helped me reason through real software behavior more coherently.

The Core Problem

One of the hardest parts of early debugging was that everything collapsed together. A timeout looked like a crash, a schema mismatch looked like a database failure, a slow stage looked like a dead process. Without structure, every symptom felt disconnected, which led to reactive debugging and making patches without fully understanding if the fix was right.

The problem with that approach is that it treats symptoms independently instead of locating the actual responsibility layer. What finally started helping was separating the system into smaller reasoning categories to reduce confusion.

The Primitives

Promise

A system has to be understood in terms of what it’s trying to accomplish. The promise defines the intended result. Without a clear promise, it’s difficult to classify failures because there’s no stable definition of correct behavior.

In the ETL pipeline, the promise was simple: transform raw PDF conversations into structured, traceable data. That immediately separates extraction failures from transformation failures from storage failures from reporting failures.

It also clarified why the off-by-one labeling bug mattered so much later. The system was still producing output, but once INPUT/OUTPUT numbering drifted, the conversation became harder to trace reliably across stages. The pipeline was operationally running while violating part of its core promise: preserving coherent conversational structure.

Boundaries

Boundaries define ownership: what the system controls versus what the system depends on. This became important once the pipeline started interacting with external libraries, PDFs, SQLite, filesystem paths, and downstream visualization scripts.

Without boundaries, debugging turns into blame diffusion. Every failure feels like it could belong to any layer. A concrete example of this came when I added graceful degradation and tried to rerun the pipeline against a different PDF. The run failed, but the failure wasn’t in my system. The PDF hadn’t been uploaded correctly and pdfplumber couldn’t parse the structure. Without a clear boundary in my head, I could have spent hours assuming my pipeline was broken. Once I understood where my system’s responsibility ended and the external dependency began, the real issue became obvious and I could think clearly about fallback logic instead of chasing a problem that wasn’t mine.

Flow

Flow describes how work moves through the system: ordering, branching, transformations, retries, stage progression. This became critical once the pipeline started looking dead. The runtime would reach diagnostics, go quiet, and appear frozen.

What made flow traceable was following execution through the orchestrator. The orchestrator was the spine of the pipeline, the place where every stage connected. By tracing the execution path through it, I could see which stages had actually run, which ones were still in progress, and where the handoff between them was breaking down. That turned a frozen-looking runtime into something I could follow step by step.

Contracts

Contracts are the assumptions shared between stages. One component produces something. Another component expects something. Those assumptions can involve schema, naming, ordering, file paths, formatting, or runtime behavior.

A major shift in my debugging happened once I stopped treating failures as isolated bugs and started treating them as broken contracts between stages. A downstream script expecting a column that upstream processing never created isn’t random failure. It’s a contract mismatch. That framing made debugging much more precise.
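A contract like "this CSV has these columns" can be checked before the downstream stage runs. This is a minimal sketch under assumptions: missing_columns is a hypothetical helper, and the column names in the demo are illustrative, not my pipeline's real schema.

```python
import csv
import io


def missing_columns(csv_text: str, required: set) -> set:
    """Return the required column names that the CSV header does not contain."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return set(required) - present


if __name__ == "__main__":
    upstream_output = "turn,input_text\n1,hello\n"        # what the producer wrote
    downstream_needs = {"turn", "input_text", "missing_input"}  # what the consumer expects
    gap = missing_columns(upstream_output, downstream_needs)
    if gap:
        print(f"contract mismatch, missing columns: {sorted(gap)}")
```

An empty result means the header satisfies the contract. It says nothing about whether the values in those columns are right.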

State

State answers: what is true right now? This became important because the pipeline often lacked durable runtime visibility. A stage might partially finish, silently fail, repeat work, or leave artifacts behind that looked valid even when execution was incomplete.

What helped was learning to check what each stage actually produced before moving on. Once I could see where a stage stopped and what it left behind, the picture clarified immediately. I could see everything the pipeline had generated up to a certain point, and then one specific artifact was missing or incomplete. That narrowed the problem from “something is wrong somewhere” to “this particular stage didn’t finish what it promised.” Without that visibility, I kept confusing “currently running” with “successfully completed.”
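Checking what each stage left behind can be as simple as looking at its output files. This is a sketch with made-up stage and file names. A "present" result only means the file exists and is non-empty, which is weaker than "the stage completed successfully".

```python
import tempfile
from pathlib import Path


def stage_status(artifacts: dict) -> dict:
    """Map each stage name to 'missing', 'empty', or 'present' for its artifact."""
    status = {}
    for stage, path in artifacts.items():
        if not path.exists():
            status[stage] = "missing"
        elif path.stat().st_size == 0:
            status[stage] = "empty"
        else:
            status[stage] = "present"  # exists and non-empty, not proof of completeness
    return status


def demo() -> dict:
    """Build a throwaway directory where one stage finished and two did not."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "extract.csv").write_text("turn,text\n1,hello\n")
        (root / "clean.csv").write_text("")
        return stage_status({
            "extract": root / "extract.csv",
            "clean": root / "clean.csv",
            "diagnostics": root / "diagnostics.csv",  # never written
        })


if __name__ == "__main__":
    for stage, state in demo().items():
        print(f"{stage}: {state}")
```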

Invariants

Invariants are conditions that must remain true for the system to stay correct. One invariant in the pipeline was conversational turn alignment: INPUT 1 / OUTPUT 1, INPUT 2 / OUTPUT 2. When the cleaned output started producing INPUT 2 / OUTPUT 1, the pipeline still ran. Nothing crashed. But the invariant was broken.

That distinction exposed a different category of failure: quiet correctness drift. The system was operationally functional while structurally incorrect.
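Because nothing crashes when this invariant breaks, it has to be checked explicitly. The sketch below assumes labels arrive as plain strings such as "INPUT 1" in strict INPUT/OUTPUT alternation. first_misaligned_label is a hypothetical helper, not a function from my pipeline.

```python
import re

LABEL = re.compile(r"^(INPUT|OUTPUT) (\d+)$")


def first_misaligned_label(labels, start_turn=1):
    """Return the index of the first label that breaks INPUT n / OUTPUT n pairing.

    Returns None when every label matches the expected sequence.
    """
    expected_kind, expected_turn = "INPUT", start_turn
    for index, label in enumerate(labels):
        match = LABEL.match(label)
        if (match is None
                or match.group(1) != expected_kind
                or int(match.group(2)) != expected_turn):
            return index
        if expected_kind == "INPUT":
            expected_kind = "OUTPUT"       # same turn, now expect its output
        else:
            expected_kind = "INPUT"        # turn is complete, move to the next
            expected_turn += 1
    return None


if __name__ == "__main__":
    print(first_misaligned_label(["INPUT 1", "OUTPUT 1", "INPUT 2", "OUTPUT 2"]))
    print(first_misaligned_label(["INPUT 2", "OUTPUT 1", "INPUT 3", "OUTPUT 2"]))
```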

Constraints

Constraints are the limits the system must operate inside: runtime, memory, file variability, dependency behavior, data quality. One major debugging moment came after realizing diagnostics was taking nearly fifty minutes because PDFs were being reparsed repeatedly inside row-level loops. The issue wasn’t mysterious instability. The workload itself violated practical runtime constraints. Once the constraint became visible, the fix became much easier to reason about.

Failure Modes

Failure modes classify recurring break patterns. Instead of “something weird happened again,” the question became “what category of failure is this?” The categories I kept seeing were contract mismatch, silent runtime drift, invalid state, partial extraction, repeated expensive work, schema divergence, and hidden branching behavior. Naming the category made debugging cumulative instead of repetitive. The same patterns started reappearing in recognizable forms.

Guarantees

Guarantees define what the system can reliably provide under stated conditions: not ideal behavior, but actual dependable behavior. In my pipeline that distinction became real fast.

The clearest example was labeling. The system was supposed to guarantee properly paired INPUT/OUTPUT labels from start to finish. But when I checked the cleaned output manually, the numbering was off from the very first turn. The pipeline implied it was producing correct structure. It wasn’t. Being explicit about what the system actually guarantees versus what it appears to guarantee forces realism and clarifies what downstream stages are actually allowed to trust.

One Real Failure Walkthrough

One of the clearest examples of these primitives working together happened during diagnostics debugging. The runtime appeared to freeze during QA and diagnostics processing. At first the symptom looked like a crash. Using the primitives changed the investigation entirely.

The promise said diagnostics should complete and produce visibility artifacts. Tracing flow through the orchestrator showed execution continued farther than expected. Examining state revealed that weak runtime visibility was making slow execution appear dead. Checking contracts showed downstream stages expected artifacts that hadn’t been fully validated yet. The constraint was the one that finally broke it open: repeated PDF parsing was creating severe runtime overhead, reopening and reparsing full PDFs inside row-level loops across 82 calls at roughly 37 seconds each. That is about 3,034 seconds in total, or roughly fifty minutes.

The fix was structural. Parse once, cache the text, reuse lightweight searches. But the important part wasn’t the optimization itself. It was that the primitives reduced ambiguity enough to locate the real responsibility layer. Without that structure, the investigation would have kept bouncing between symptoms.
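The parse-once idea can be sketched with the standard library's functools.lru_cache. This is not my actual fix: expensive_parse is a stand-in that returns placeholder text instead of opening a PDF, and cached_text and find_context are hypothetical names.

```python
from functools import lru_cache

CALLS = {"parse": 0}  # counts how often the expensive step really runs


def expensive_parse(pdf_path: str) -> str:
    """Stand-in for opening a PDF and extracting every page (placeholder text)."""
    CALLS["parse"] += 1
    return f"full text of {pdf_path} containing needle-7 and needle-9"


@lru_cache(maxsize=None)
def cached_text(pdf_path: str) -> str:
    """Parse each distinct path once; later calls reuse the stored text."""
    return expensive_parse(pdf_path)


def find_context(pdf_path: str, needle: str, width: int = 10) -> str:
    """Cheap string search over the cached text instead of reparsing the file."""
    text = cached_text(pdf_path)
    position = text.find(needle)
    if position == -1:
        return ""
    return text[max(0, position - width): position + len(needle) + width]


if __name__ == "__main__":
    for _ in range(82):  # one lookup per row, as in the slow loop
        find_context("sample.pdf", "needle-7")
    print(f"parses performed: {CALLS['parse']}")
```

The row loop still runs 82 times, but the expensive step runs once per distinct file. One caveat with this kind of cache is that it will not notice if the file changes on disk during the run.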

Limits of the Framework

This framework has real limits worth naming. The concepts overlap. Contracts often exist at boundaries, state transitions occur through flow, guarantees depend on constraints and invariants. They’re more like perspectives than isolated primitives.

It’s also strongest for engineered systems. It becomes weaker in environments dominated by incentives, politics, social dynamics, or human behavior that doesn’t follow a spec. And it isn’t predictive in any rigorous scientific sense.

What Changed

The biggest shift this framework created was moving debugging from reactive behavior toward structured reasoning. Before this, failures felt random. Afterward, systems became easier to decompose: define the promise, identify the boundaries, trace the flow, verify the state, locate the broken contract, identify the constraint, classify the failure mode, then fix the smallest responsible layer.

That sequence didn’t eliminate complexity. It made the complexity legible.

And honestly, that was the real transition. Not learning how to write software, but learning how to read systems well enough that failures stopped feeling like chaos.

The primitives in this framework came directly from building and documenting a real local ETL pipeline. system-envelope.md is the architecture doc where this thinking first took shape: github.com/Jt-Thompson

Debug Log #2 — The Off-By-One That Didn’t Crash (It Just Lied)

Jovann Thompson — Tue, 26 May 2026 03:57:20 +0000

I built a local pipeline to take long chat transcripts saved as PDFs and turn them into structured, cleaned output in which every conversational turn is rewritten into paired labels:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

That pairing is the contract. It’s what makes the transcript auditable instead of just scrollable.

The Symptom

During a last integrity pass, I opened the cleaned PDF to confirm the labeling held from start to finish. But right at the beginning the artifact was telling me a different story:

INPUT 2 / OUTPUT 1
INPUT 3 / OUTPUT 2

The system was still alternating input/output, the output existed, the pipeline completed. But the numbering was shifted from the first turn. The system runs, the output exists, and the output is quietly lying by one. That lie ripples into every downstream count, integrity check, and assumption built on top of it.

The real question became: where is the first place the system starts lying?

Initial Confusion

At first I kept framing it as a counting issue: maybe something in the missing-input/missing-output analysis, maybe a reporting mismatch, maybe the integrity summary was slightly off. I didn’t want to rerun the entire dataset just to test a small correctness problem, so I tried to do it the right way: make a small sample input, isolate the stage, validate expected versus actual.

That immediately raised practical questions I couldn’t dodge. Where do I even inject a sample? If my entrypoint starts at PDFs, how do I test a mid-stage without breaking the whole flow? If I create a CSV, which CSV does the stage actually expect?

The framing itself was the problem. I was treating it like a reporting bug when it was actually a contract bug.

What the Bug Really Was

The system was never meant to count like:

INPUT 1, OUTPUT 2, INPUT 3, OUTPUT 4...

It was meant to preserve paired conversational turns:

INPUT 1 / OUTPUT 1
INPUT 2 / OUTPUT 2

So if the cleaned PDF starts at INPUT 2 / OUTPUT 1, the core failure isn’t in downstream analysis. The numbering contract is being violated somewhere upstream, and everything else is just inheriting the damage. Reframing it that way collapsed the search space immediately: stop looking at reporting and trace back to wherever the labels get written in the first place.

The Trap I Almost Fell Into

Before that reframe landed, I tried to build a debug input using raw “you said / chatgpt said” style text, because that’s what I visually associate with the PDF source. But a test fixture only helps if it matches the contract of the stage you’re actually testing. Some stages in the pipeline don’t consume raw conversational text; they consume already-columnized CSV data. Feed the wrong-shaped input into the wrong layer and you’re not debugging the system anymore. You’re debugging a mismatch you created.

That was one of the real lessons of this log: if your mental model of the pipeline layers is even slightly off, you can do a lot of work that produces zero signal.

Tracing Back to the First Lie

The way it became solvable was tracing backward from the artifact I trusted until I found the first divergence.

Start with the cleaned PDF, where the numbering is wrong at the first turn. Work backward through the pipeline outputs and stage boundaries. At each boundary ask: is the numbering still correct here, or did it break here? The moment a layer is confirmed correct, stop blaming it and move earlier.

That tracing forced a clear outcome. The numbering wasn’t being broken by the analysis layer. It wasn’t something happening at the end. It was being introduced in the ingestion and cleaning step, the part of the system that writes the labels in the first place.
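The backward walk can be written down as a small search. This is a sketch under assumptions: first_broken_stage is a hypothetical helper, the stage names are illustrative, and each stage's output is reduced to a list of turn numbers so that "valid" can mean "starts at turn 1".

```python
def first_broken_stage(stage_outputs, is_valid):
    """Walk backward from the final artifact and return the earliest bad stage.

    `stage_outputs` is an ordered list of (stage_name, output) pairs, earliest
    stage first. Returns None when the final output is already valid.
    """
    culprit = None
    for name, output in reversed(stage_outputs):
        if is_valid(output):
            break              # this layer is confirmed correct: stop blaming it
        culprit = name         # still wrong here, so the break is here or earlier
    return culprit


if __name__ == "__main__":
    # Illustrative data only: each output is the list of turn numbers a stage wrote.
    outputs = [
        ("extract", [1, 2, 3]),
        ("clean", [2, 3, 4]),
        ("analyze", [2, 3, 4]),
        ("report", [2, 3, 4]),
    ]
    print(first_broken_stage(outputs, lambda turns: turns[0] == 1))
```

The search stops at the first layer that checks out, which is the same rule as in the prose: once a layer is confirmed correct, nothing earlier needs to be examined for this symptom.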

Root Cause

The offset wasn’t random drift. It was a systematic base shift baked in from the start.

I remembered why: sometimes when copying a thread, the first “You said:” label doesn’t exist the way the parser expects, so I had added logic to bootstrap the first input anyway. The intention was correct: recover from messy real-world formatting. But the implementation created a permanent misalignment. Input and output were being advanced out of sync at the very beginning, so everything after stayed consistently off by one.

The bug didn’t need to crash to be real. It just needed to violate the contract once.

The Fix

The fix was structural, not a patch. Instead of two separate counters drifting against each other, the labeling logic was rebuilt around a single turn counter that increments only when an INPUT is encountered or injected, labels OUTPUT using that same turn number, and ensures the edge-case injection doesn’t double-increment the first real turn. The goal was to make it structurally impossible for OUTPUT numbering to drift away from INPUT numbering, regardless of what the source formatting looks like.
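A single-counter labeler along those lines might look like the sketch below. It is a reconstruction of the idea, not the code from my repository: label_turns is a hypothetical name, segments are assumed to arrive as ("input" or "output", text) pairs, and an injected input is given empty text.

```python
def label_turns(segments, start_turn=1):
    """Label segments with paired INPUT n / OUTPUT n using one turn counter.

    `segments` is a list of (kind, text) pairs where kind is "input" or
    "output". The counter advances only when an INPUT is seen or injected.
    """
    labeled = []
    turn = start_turn - 1  # no turn has been opened yet
    for kind, text in segments:
        if kind == "input":
            turn += 1
            labeled.append((f"INPUT {turn}", text))
        elif kind == "output":
            if turn < start_turn:
                # Output arrived before any input: inject the missing INPUT
                # so the first real turn is numbered exactly once.
                turn += 1
                labeled.append((f"INPUT {turn}", ""))
            labeled.append((f"OUTPUT {turn}", text))  # reuse the same counter
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return labeled


if __name__ == "__main__":
    missing_first_label = [("output", "hi"), ("input", "next"), ("output", "ok")]
    for label, _ in label_turns(missing_first_label):
        print(label)
```

Because OUTPUT never has a counter of its own, there is no second number that could drift. The three cases worth exercising are the normal format, a missing first label, and a continuation that starts from a higher turn number.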

Proof

I didn’t jump straight into a full run. I validated the fix in isolation first, with a small harness that calls the labeling function directly against three cases: normal format, missing first label, and continuation from a higher turn number. Only after the harness proved the contract held did I rerun the full pipeline and spot-check the cleaned PDF from beginning to end.

The output stayed aligned. The labeling read sharper because it was finally consistent.

That’s what closed the loop: not “it seems fixed,” but the invariant proven in isolation, then proven again end-to-end.

What This Log Is Really About

This was a quiet failure mode: a system that runs fine, produces output, and misleads you the whole time.

The takeaway is simple: if an artifact looks slightly wrong, don’t argue with it and don’t patch randomly. Trace backward until you find the first layer where the contract breaks. Fix the smallest layer that owns the contract. Prove it in isolation. Then reintegrate.

That’s how you stop a system from merely running and start making it trustworthy.

Project

GitHub Repository:
https://github.com/Jt-Thompson

Debug Log #1 — The Pipeline That Looked Broken

Jovann Thompson — Tue, 26 May 2026 02:40:24 +0000

I had been building a local ETL pipeline designed to process long conversational PDFs into structured datasets. The system extracted dialogue, cleaned it, generated QA artifacts, and loaded the results into SQLite for downstream analysis.

By the time this debugging process started, the core extract-transform-load flow already worked. Data could move end-to-end through the system successfully.

The problems started showing up once I added the QA and diagnostics stages around it. During development, parts of those systems seemed to work. But when I came back and reran the full pipeline, execution would appear to stop somewhere around diagnostics. Long stretches of silence. No artifacts I could confirm. No database state I fully trusted.

At that point I knew something was wrong, but I didn’t yet have enough experience to understand what kind of wrong it was. I tried a few patches based on the first advice I got, aimed at potential path issues or output mismatches. None of them solved it. So the project sat for a while. I had to spend time away from the build learning how to read my own codebase, trace execution, and navigate the system well enough to come back and debug it properly.

Initial Understanding

When I came back, I started with a simpler question: was the pipeline actually broken, or did it only look broken because I couldn’t see what was happening?

Part of what triggered that question was staring at the terminal during long runs and realizing the process was still alive even though nothing visible was happening. At the time I was still learning basic operational ideas like what a “hang” even meant in practice. I had been treating long silence like proof that the system was dead, when in reality some pipeline states are just slow, blocked, waiting, or stuck behind expensive work.

That reframe changed the investigation immediately. Instead of treating it like one giant broken object, I started seeing it as a chain of expectations between stages. One script writes outputs. Another script expects those outputs somewhere specific. One stage assumes a schema already exists. Another assumes a naming format already matches. If those assumptions drift even slightly, the whole pipeline can look broken from the outside even when parts of it are still functioning correctly.

So the debugging process became: isolate the stopping point, check what was actually produced, compare it against what the next stage expected, then narrow the mismatch before changing anything.

Runtime Visibility

Earlier I had added a short timeout during debugging attempts, but I eventually realized the timeout logic was only surfacing a warning state, not actually terminating the process itself. The run would hit QA and diagnostics, I’d see the timeout, and everything after that seemed to disappear.

Instead of patching again, I started watching the runtime more carefully. I checked CPU and RAM usage to see whether the process was actually dead or just slow under load. I watched where execution appeared to stall. Then I did the thing I usually avoid during long runs: I waited long enough for the system to reveal more information on its own.

That changed the picture completely. The timeout message was not the same thing as “the pipeline stops.” It was just one event earlier in the run. Once I let the process continue, the pipeline moved into later stages successfully. The question stopped being “why does the pipeline die at QA?” and became “what is actually happening after QA finishes?”

Eventually the runtime progressed far enough for the real failure to surface:

sqlite3.OperationalError: no such column: missing_input

That changed the debugging process again. Now the issue was no longer vague runtime ambiguity. There was a concrete failure tied to a specific schema mismatch much later in execution. The pipeline was running farther than it looked. The runtime visibility had just been too weak to make that obvious earlier.
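A schema mismatch like this can be checked before the load runs by asking SQLite which columns a table actually has. The sketch below uses only the standard library; the table name coverage and the column list are made up for illustration, since the post does not show the real schema.

```python
import sqlite3


def missing_columns(conn: sqlite3.Connection, table: str, required: set[str]) -> set[str]:
    """Return the required column names that the table does not have."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # The table name is interpolated directly, so it must come from trusted code.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    existing = {row[1] for row in rows}
    return required - existing


# Hypothetical schema that lacks the column the later query refers to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coverage (input_id TEXT, missing_output INTEGER)")

print(missing_columns(conn, "coverage", {"input_id", "missing_input", "missing_output"}))
# prints {'missing_input'}
```

Running a check like this at the start of the load stage would surface the mismatch in seconds instead of at the end of a long run.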

Instrumentation and Tracing

Once I understood the pipeline was continuing much farther than I originally thought, I stopped treating it like a mysterious crash and started measuring directly.

I added timing instrumentation at the stage boundaries. The ambiguity disappeared immediately. Diagnostics wasn’t “a little slow.” It was taking around fifty minutes:

[TIMING] diagnostics stage completed in 3036.21s

At that point the problem stopped feeling random. One part of the system was repeatedly doing expensive work. So I followed the runtime cost from stage to module to function to loop, until the slowdown had a specific address.
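Timing at stage boundaries can be as small as a context manager wrapped around each stage call. This is a sketch of the idea rather than the instrumentation used in the project: the name timed_stage and the sink parameter are assumptions, and only the log format is copied from the line above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name: str, sink=print):
    """Report how long a stage took, whether it completed or raised."""
    start = time.perf_counter()
    status = "failed"
    try:
        yield
        status = "completed"  # only reached if the stage body did not raise
    finally:
        elapsed = time.perf_counter() - start
        sink(f"[TIMING] {name} stage {status} in {elapsed:.2f}s")


# Usage: wrap each stage call at its boundary.
with timed_stage("diagnostics"):
    time.sleep(0.01)  # stand-in for the real stage
```

Because the report is emitted in the finally block, a stage that raises still leaves a timing line behind, which helps when the failure shows up long after the slow part.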

That narrowing led to write_missing_outputs_csv, which accounted for nearly the entire diagnostics runtime.

The Bottleneck

Tracing deeper into that function showed repeated calls to extract_pdf_context(...) inside the row-level loop. Following the call chain into quality_utils.py confirmed what was happening: on every row, the function called pdfplumber.open(pdf_path), iterated through every page, and rebuilt the extracted text from scratch. Then it did the exact same thing on the next row.

Before changing anything, I added a counter to verify the scale of it directly:

extract_pdf_context called: 82

Then the math:

~3081 seconds total / 82 calls ≈ 37.6 seconds per call

At that point the issue stopped being a hypothesis. I knew exactly how to reason about a fix.
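A call counter like the one described can be added without touching the function body by wrapping it in a decorator. The sketch below is an assumption about how such a counter might look; the stand-in extract_pdf_context and its parameters are placeholders, not the real function from quality_utils.py.

```python
import functools
from collections import Counter

# Shared tally of calls per function name.
call_counts = Counter()


def count_calls(func):
    """Increment a per-function counter on every call, then run the function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] += 1
        return func(*args, **kwargs)
    return wrapper


@count_calls
def extract_pdf_context(pdf_path, needle):
    # Stand-in body; the real function parses the PDF on every call.
    return None


# Three hypothetical rows, each triggering one call.
for row in ["row-1", "row-2", "row-3"]:
    extract_pdf_context("report.pdf", row)

print(f"extract_pdf_context called: {call_counts['extract_pdf_context']}")
# prints extract_pdf_context called: 3
```

Dividing the stage's measured runtime by the count gives the per-call cost, which is the number that turns a suspicion into a measurement.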

The Structural Refactor

The fix was about changing the shape of the system so the work couldn’t repeat itself.

The workload changed from:

rows × (open PDF + parse all pages)

to:

(open PDF + parse all pages once) + rows × (string search)

In practice, a new function extract_full_pdf_text() opens and parses the PDF once, before the loop, and returns the full text as a string. A second function, extract_pdf_context_from_text(), takes that cached string and runs a lightweight search against it, with no file I/O and no page iteration. Inside the loop, only the search runs. I applied the same refactor pattern across both the QA and diagnostics stages, then removed the temporary profiling scaffolding and kept the durable timing instrumentation that was still useful operationally.
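The shape of that refactor can be sketched as follows. The two function names come from the description above, but their signatures, the window parameter, and the build_contexts wrapper are assumptions made for illustration; the real search logic may look for something more specific than a substring.

```python
from typing import Optional


def extract_full_pdf_text(pdf_path: str) -> str:
    """Open and parse the PDF once, returning all page text as one string."""
    import pdfplumber  # third-party; imported here so the search helper stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None for a page with no extractable text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_pdf_context_from_text(full_text: str, needle: str, window: int = 80) -> Optional[str]:
    """Return the text surrounding the first match of needle, or None if absent."""
    idx = full_text.find(needle)
    if idx == -1:
        return None
    start = max(0, idx - window)
    end = min(len(full_text), idx + len(needle) + window)
    return full_text[start:end]


def build_contexts(pdf_path: str, needles: list) -> dict:
    """Hypothetical loop wrapper: parse once, then search once per row."""
    full_text = extract_full_pdf_text(pdf_path)  # the expensive work happens here, once
    return {n: extract_pdf_context_from_text(full_text, n) for n in needles}
```

Splitting the parse from the search also makes the search testable on plain strings, with no PDF on disk.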

First Clean Full Run

Up until this point, one of the hardest parts of debugging the pipeline was not being able to tell whether it was actually finishing at all. Long stretches of silence made the runtime feel ambiguous. It looked dead, stalled, or half-working. I couldn’t fully trust what I was seeing.

Then I finally got a clean full run. The important part wasn’t just that the pipeline completed. It was that the system explained itself clearly at the end:

811 total inputs
770 outputs
0 missing inputs
41 missing outputs
94.94% coverage

The SQLite load completed with 811 rows. The database-side missing outputs count matched what the diagnostics and CSV-side reporting were showing. Earlier in the investigation, different artifacts often contradicted each other and created more ambiguity. Now the outputs were reinforcing each other.
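The summary numbers can be cross-checked against each other with a few lines of arithmetic. This sketch only uses the figures reported above; the variable names are illustrative.

```python
# Figures taken from the run summary above.
total_inputs = 811
outputs = 770

# Every input either produced an output or is counted as a missing output.
missing_outputs = total_inputs - outputs
coverage = outputs / total_inputs * 100

print(missing_outputs)       # prints 41
print(f"{coverage:.2f}%")    # prints 94.94%
```

When the derived values match what the CSV report and the database both show, the artifacts are agreeing rather than contradicting each other.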

The full run took around 470 seconds, just under eight minutes. That reframed a lot of the earlier fear around the pipeline hanging. Now I had a runtime I could measure and reason about.

The pipeline was no longer a broken system somewhere in the middle of execution. It had become a system that could complete end-to-end, validate itself, load the database successfully, and expose the remaining problems as specific issues instead of vague uncertainty.

What Changed

Before this investigation, debugging felt mostly reactive. Change something, rerun it, hope the behavior improves. This process introduced a different sequence: wait long enough for the system to reveal itself, instrument at the boundaries, follow the runtime cost by layers, verify assumptions with concrete measurements, then refactor the structure rather than the logic.

The lack of observability wasn’t a side issue. It was half the problem. Once timing existed at stage boundaries, the pipeline stopped feeling like a black box and started feeling like something I could reason about directly.

Project

GitHub Repository:
https://github.com/Jt-Thompson