Brent Fowler

Posted on Jun 1

The AI Context Efficiency Experiment: Why Architecture Beat Context Size

#ai #architecture #devops #productivity

Figure 1: The experiment's central turn: context, compaction, locality, governance, recovery, and conclusion.

The Question Was About Context

I thought I was running a context experiment.

I wasn't.

I just didn't know it yet.

The experiment started with a question that seemed straightforward.

It turned out to be an incomplete question.

What actually drives AI-assisted software development efficiency?

That is the question most engineers eventually ask after using AI coding tools for serious work. At first, the obvious variables all live inside the session. How much context is available? How quickly does it burn? What happens after compaction? Does the agent remember enough to keep working? Can a long-running session survive handoffs, summaries, and interruptions?

Those were the first things worth watching. The work was happening in two real repositories, not in a synthetic benchmark. Financial Portfolio Agent (FPA) and Market Intelligence Lab (MIL) had different responsibilities, real validation requirements, and established repository boundaries. FPA handled household execution realism, portfolio intelligence, survivability positioning, and deterministic financial decision support. MIL handled systemic context, macro intelligence, evidence quality, and source lineage governance.

The early mental model was simple: context is fuel. If the session burns too much of it, efficiency drops. If compaction preserves enough state, efficiency survives.

That model was not wrong. It was just incomplete.

The experiment eventually produced a recovered dataset with 47 master observations: 14 context observations, 13 compaction events, 6 feature batch observations, 7 model observations, and 12 recovery observations. That made the story evidence-backed early. It was not just a feeling that the work had changed shape. There was enough telemetry to trace where the explanation started to move.

Figure 2: Recovered observation counts across context, compaction, feature batch, model, and recovery logs.

By the end of the experiment, the strongest preserved signal was not context size. It was architectural locality. The surprising result was not that compaction worked. The surprising result was that the system became easier for AI to work in when the work had a clear architectural home.

The experiment started as a context study. It ended as an architecture study.

The First Clue Was The Burn Pattern

The first useful evidence was numerical. Context values were preserved across the experiment and later reconstructed into a dataset.

FPA had recovered context observations including 26K, 66K, 100K, 133K, 161K, 194K, and 219K. MIL had recovered observations including 38K, 50K, 125K, 153K, 166K, 179K, 192K, 205K, 217K, 228K, and 240K. MIL also preserved a cleaner Generation 2 rebuild sequence that ran from 42.0K through 205.0K.

At a glance, those numbers look like the whole story. The context grows. The session gets heavy. Compaction resets it. Work continues.

But the data had to be handled carefully. The recovered FPA values were observed markers, not a complete time series. Some MIL values were ordered, but not all of them were. Exact timestamps were not preserved for every point. The dataset correctly keeps unknown values as UNKNOWN instead of pretending that a partial transcript is a complete measurement system.

That distinction matters. Clean charts can lie when the source data is incomplete. The right move was to preserve the observations as observations: enough to tell a technical story, not enough to claim a complete benchmark.
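One way to hold that line in a dataset is to give every missing field an explicit unknown state instead of a default. The sketch below is hypothetical: the `Observation` type, its field names, and the `UNKNOWN` sentinel are illustrative assumptions, not the schema of the recovered dataset. Only the context values come from the record above.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "UNKNOWN"  # assumed sentinel; the real dataset's encoding is not shown here


@dataclass(frozen=True)
class Observation:
    repo: str
    context_k: float                 # context size in K, as reported in the transcript
    sequence: Optional[int] = None   # relative order, only when the order was preserved
    timestamp: Optional[str] = None  # stays None when the transcript did not keep it

    def field_or_unknown(self, name: str) -> object:
        """Return the stored value, or UNKNOWN rather than a guessed default."""
        value = getattr(self, name)
        return UNKNOWN if value is None else value


def ordered_series(observations):
    """Keep only observations whose order is known, sorted by that order."""
    known = [o for o in observations if o.sequence is not None]
    return sorted(known, key=lambda o: o.sequence)


# FPA values are observed markers, so no order is claimed for them.
# The two MIL points are the endpoints of the ordered Generation 2 sequence;
# the sequence numbers are placeholders that only encode relative order.
observations = [
    Observation("FPA", 26.0),
    Observation("FPA", 219.0),
    Observation("MIL", 205.0, sequence=1),
    Observation("MIL", 42.0, sequence=0),
]

plottable = ordered_series(observations)
print([o.context_k for o in plottable])
print(observations[0].field_or_unknown("timestamp"))
```

Only the ordered MIL points end up in `plottable`, so a line chart cannot quietly connect FPA markers whose order was never recorded.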

Figure 3: Recovered context observations. The MIL Generation 2 sequence is ordered; FPA values are preserved as observed markers.

The duration data had the same character. The transcript preserved observed work times from 1m 00s through 6m 27s, but not a full model-to-duration map for every feature. Again, that made the data useful telemetry, not a controlled comparison.

Then The Dataset Had To Be Rebuilt

One of the most important editorial discoveries came before the main technical one: the archive had preserved findings, but not all of the observations behind them.

That forced a reconstruction pass. The transcript became a primary source. The goal was not to summarize it. The goal was to recover telemetry: context values, compaction events, model observations, feature batch observations, duration observations, and recovery events.

The scorecard above is the result of that pass. It moved the work from memory to a defensible dataset.

These are recovered observations, not complete experiment totals. That boundary is the reason the dataset is useful: it says what was preserved and what was not.

Some of the most valuable values were transcript-only. FPA preserved a compaction transition from 219K to 26.7K. MIL preserved one from 240K to 38K. MIL also preserved a Generation 2 rebuild table from 42.0K through 205.0K. The archive preserved the broader finding that compaction supported continuity, but the transcript gave the story numbers.

That reconstruction changed the tone of the experiment. It made the next turn harder to dismiss.

Compaction Worked

Compaction passed an important test.

FPA moved from 219K to 26.7K. MIL moved from 240K to 38K. MIL also preserved a related compaction shrink observation of 239K -> 42K, followed by a rebuild span of 42K -> 205K.

That is a real continuity story. A long-running AI development process can shed a large amount of context and still keep going if enough operational state lives outside the chat. Repository artifacts, bootstrap notes, governance records, validation commands, and durable documentation all matter.
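The size of those resets can be worked out directly from the preserved values. This is plain arithmetic on three observed transitions, not an average compaction ratio; the function name is an illustrative choice.

```python
def reduction_pct(before_k: float, after_k: float) -> float:
    """Percentage of context shed by one compaction, rounded to one decimal."""
    return round((before_k - after_k) / before_k * 100, 1)


# Transitions as preserved in the transcript (values in K).
transitions = {
    "FPA 219K -> 26.7K": (219.0, 26.7),
    "MIL 240K -> 38K": (240.0, 38.0),
    "MIL 239K -> 42K": (239.0, 42.0),
}

for label, (before, after) in transitions.items():
    print(f"{label}: {reduction_pct(before, after)}% shed")
```

Each observed compaction shed roughly 82% to 88% of the context (87.8%, 84.2%, and 82.4% respectively).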

Figure 5: Compaction visibly reset context size in both repositories, while exact compaction counts remain unknown.

But this is where the experiment started to turn.

If compaction were the whole explanation, then the main lesson would be operational: summarize better, compact cleanly, restart carefully. Those are good practices, but they did not explain the strongest throughput signal in the archive.

Compaction helped the work survive. It was not preserved as the dominant throughput driver.

That distinction prevented the wrong lesson. The experiment was not saying, “The bigger the context window, the faster the work.” It was not saying, “Compaction is the answer.” Compaction created room to continue, but it did not explain why some work kept feeling cheaper to understand.

The work stayed efficient when it stayed local.

The Turn: Locality Became The Signal

The word locality became the center of the experiment because it explained what context size alone could not.

The turning point was subtle. There was no single feature where the experiment suddenly announced a new theory. Instead, the same pattern kept showing up in the work. Adjacent features were easier to complete when they stayed near previous features. Review was easier when the files, tests, and concepts were already part of the active working set. Validation stayed narrower when the feature did not cross into portfolio logic, report contracts, or runtime behavior.

That pattern was more interesting than the raw context number. A session at a high context count could still move well if the next task belonged to the same architectural neighborhood. A freshly compacted session could still struggle if the next task required reloading too many unrelated domains. The amount of context available mattered, but the shape of the work mattered more.

When work stayed inside a coherent architectural area, the AI agent had less to rediscover. The relevant files were close together. The concepts were reused. The tests were predictable. The risk surface was bounded. The reviewer did not have to reload the entire system to understand whether the change belonged.

In FPA, the clearest example was Pipeline Metadata. Repeated work clustered around a small set of files:

app/pipeline/registry.py
app/pipeline/planner.py
app/pipeline/runner.py
tests/test_pipeline_registry.py
tests/test_pipeline_planner.py
tests/test_pipeline_runner.py

That cluster supported features around discoverability, ownership visibility, dependency visibility, output visibility, lineage visibility, category inspection, and boundary inspection. Those features were useful, but they did not require report redesign, CSV schema changes, portfolio logic changes, or runtime execution redesign.
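A rough way to make "stayed inside the cluster" checkable is to ask what fraction of a change's files belong to it. The sketch below is an assumption-laden proxy, not a metric the experiment recorded: `locality_ratio` is a hypothetical helper, and the two non-pipeline paths in the second example are invented for illustration.

```python
# File cluster named above for FPA Pipeline Metadata.
PIPELINE_METADATA_CLUSTER = frozenset({
    "app/pipeline/registry.py",
    "app/pipeline/planner.py",
    "app/pipeline/runner.py",
    "tests/test_pipeline_registry.py",
    "tests/test_pipeline_planner.py",
    "tests/test_pipeline_runner.py",
})


def locality_ratio(changed_files, cluster=PIPELINE_METADATA_CLUSTER) -> float:
    """Fraction of changed files inside the cluster (0.0 for an empty change)."""
    changed = set(changed_files)
    if not changed:
        return 0.0
    return len(changed & cluster) / len(changed)


# A change that stays in the neighborhood.
local_change = ["app/pipeline/registry.py", "tests/test_pipeline_registry.py"]

# A change that crosses boundaries; the last two paths are hypothetical.
crossing_change = [
    "app/pipeline/planner.py",
    "app/reports/summary.py",
    "app/portfolio/logic.py",
]

print(locality_ratio(local_change))     # 1.0
print(locality_ratio(crossing_change))  # one file in three is inside the cluster
```

A ratio like this says nothing about token burn on its own. It only flags, before work starts, whether a feature is likely to pull unrelated domains into the session.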

This is the architectural lesson: an AI agent is not only spending context on code. It is spending context on uncertainty.

When ownership is unclear, context burn rises. When the validation path is unknown, context burn rises. When a feature crosses multiple conceptual boundaries, context burn rises. When the work sits inside a mature locality cluster, the agent can reuse the same mental map.

The archive does not prove a numeric locality multiplier. It does not show that locality reduced token burn by a specific percentage. The claim is more careful: within the preserved experiment record, architectural locality appeared to be the dominant efficiency multiplier.

Figure 6: The context question turned into an architectural locality finding.

That was the point where the experiment stopped being mostly about context windows.

It became about codebase shape.

Bounded Contexts Made The Work Cheaper To Think About

Once locality was visible, the next question was obvious: what made some areas local enough to keep producing leverage?

The answer was bounded contexts.

Pipeline Metadata was the clearest FPA case. It had coherent ownership, deterministic read-only query behavior, stable adjacency across feature batches, and a clear separation from pipeline execution. It made the system easier to inspect without redesigning how the system ran.

That difference is subtle but important. A bounded context is not just a directory. It is a place where related questions can be answered without pulling the whole system into view.

Pipeline Metadata answered questions like:

What owns this output?
What depends on this category?
Which modules participate in this pipeline area?
Which outputs are shared?
What downstream categories are affected?

Each new inspection feature made later inspection features easier. The work compounded because it stayed near its own prior work.
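To make "deterministic read-only query behavior" concrete, here is a minimal sketch of how questions like those could be answered from static metadata without touching execution. Everything in it is hypothetical: the `ModuleInfo` shape, the function names, and the sample modules are assumptions for illustration, not the contents of FPA's `registry.py`.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ModuleInfo:
    name: str
    category: str
    outputs: tuple = ()
    depends_on: tuple = ()  # categories this module reads from


def build_registry(modules):
    """Return a read-only mapping so queries cannot mutate the metadata."""
    return MappingProxyType({m.name: m for m in modules})


def owner_of(registry, output):
    """What owns this output?"""
    for module in registry.values():
        if output in module.outputs:
            return module.name
    return None


def dependents_of(registry, category):
    """What depends on this category?"""
    return sorted(m.name for m in registry.values() if category in m.depends_on)


def downstream_categories(registry, category):
    """What downstream categories are affected, directly or transitively?"""
    affected, frontier = set(), [category]
    while frontier:
        current = frontier.pop()
        for m in registry.values():
            if (current in m.depends_on
                    and m.category != category
                    and m.category not in affected):
                affected.add(m.category)
                frontier.append(m.category)
    return sorted(affected)


# Hypothetical sample data.
registry = build_registry([
    ModuleInfo("ingest", "sources", outputs=("raw.csv",)),
    ModuleInfo("normalize", "transform", outputs=("clean.csv",), depends_on=("sources",)),
    ModuleInfo("report", "reporting", depends_on=("transform",)),
])

print(owner_of(registry, "clean.csv"))
print(dependents_of(registry, "sources"))
print(downstream_categories(registry, "sources"))
```

Each query is a pure function over the same frozen data, which is the property that lets a new inspection feature reuse the map built by earlier ones.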

The experiment record also identifies Export Quality Hardening as a bounded context in MIL. The FPA archive does not contain enough MIL internal evidence to analyze that workstream at the same level of detail, so that boundary needs to stay explicit. The experiment record can say Export Quality Hardening emerged as a bounded context. FPA-local evidence cannot prove MIL internal structure.

The bigger lesson is that bounded contexts changed the economics of AI-assisted development. The work became easier to start, easier to validate, easier to review, and easier to resume after interruption.

That is a different kind of efficiency than raw speed. It is structural efficiency.

Governance Was Not Paperwork

Once the work had a shape, the repository needed a way to preserve that shape.

That is where governance entered the story.

FPA formalized Feature Tracking Governance. Features were no longer just completed and forgotten. They were classified by category, complexity, burn, leverage, locality, architectural impact, and reassessment value. Work was organized into batches of four completed features before reassessment.

That sounds procedural until a long-running AI session is involved. Then it becomes memory infrastructure.

The governance artifacts made the experiment less dependent on conversational memory. They gave future sessions a way to recover what had happened, why it mattered, and where the next work should stay localized. They also helped preserve the difference between confirmed findings, strong evidence, and open questions.
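A feature record of that kind can be sketched as a small data structure. The field list and the batch size of four come from the description above; the type name, the rating vocabulary ("low", "medium", "high"), and the sample ratings are assumptions for illustration, not FPA's actual tracking format.

```python
from dataclasses import dataclass

BATCH_SIZE = 4  # four completed features per batch before reassessment


@dataclass(frozen=True)
class FeatureRecord:
    name: str
    category: str
    complexity: str
    burn: str
    leverage: str
    locality: str
    architectural_impact: str
    reassessment_value: str


def completed_batches(records, batch_size=BATCH_SIZE):
    """Group records into full batches, leaving any incomplete remainder out."""
    full = len(records) - len(records) % batch_size
    return [records[i:i + batch_size] for i in range(0, full, batch_size)]


def reassessment_due(records, batch_size=BATCH_SIZE):
    """True exactly when the latest feature closes a batch."""
    return len(records) > 0 and len(records) % batch_size == 0


# Feature names echo the Pipeline Metadata work; ratings are placeholder values.
names = [
    "discoverability",
    "ownership visibility",
    "dependency visibility",
    "output visibility",
    "lineage visibility",
]
records = [
    FeatureRecord(n, "pipeline-metadata", "low", "low", "high", "high", "low", "medium")
    for n in names
]

print(len(completed_batches(records)))  # one full batch, one feature pending
print(reassessment_due(records[:4]))    # the fourth feature closes the batch
```

Because the record lives in the repository rather than the chat, a freshly compacted session can read it back instead of reconstructing it.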

The smcpp lifecycle served a similar role. Stage, commit, push, PR, merge, and prune became a named operational convention. In ordinary solo development, that might look like hygiene. In this experiment, it became part of survivability.

The confirmed finding was that governance and discoverability improvements compounded future leverage. That is exactly what happened when Pipeline Metadata kept accumulating inspection surfaces. The system became more legible as work progressed.

Figure 7: The evidence tiers are part of the result. Confirmed findings, strong evidence, and open questions stay separate.

The Model Result Was Interesting For A Different Reason

The experiment observed GPT-5.5 Reasoning Medium and GPT-5.4 Reasoning Low.

The tempting conclusion would be to turn that into a model horse race. That would be the wrong reading.

The preserved record does not prove that GPT-5.4 Low was more efficient than GPT-5.5 Medium. That specific claim was explicitly rejected as unproven. There was no benchmark-grade dataset with complete token usage, latency, defect rate, review burden, and quality measurements across both models.

The useful observation is narrower and more architectural: lower reasoning settings performed better than expected inside mature bounded contexts.

That is worth paying attention to. If a repository is shaped well enough, the model may need less reasoning power to make useful progress. Not because the task is trivial, but because the environment has fewer unresolved questions. Ownership is clearer. Tests are closer. Patterns are established. The blast radius is smaller.

This does not prove model equivalence. It suggests a better research question:

What repository conditions make lower reasoning settings viable?

That question is more useful to engineering leaders than a generic model ranking. It points back to architecture.

Then The Repository Broke

The experiment also included a power-loss Git corruption event during active development.

By then, the experiment had already moved from context to architecture and governance. The outage tested whether that shift mattered under pressure.

The recovery narrative preserved the sequence of steps: non-destructive repair, branch/ref recovery, working-tree preservation, validation rerun, PR merge, and cleanup. The MIL recovery status recorded six checks: repository healthy, expected branch, remote aligned, working tree clean, validation passed, and API surface restored. The broader experiment record preserves successful recovery of both repositories.
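Four of those status checks map onto standard Git commands. The sketch below is one possible way to script them, not the procedure used during the actual recovery; the function names and the returned keys are assumptions. "Validation passed" and "API surface restored" are project-specific and are left out.

```python
import subprocess


def run_git(args, cwd="."):
    """Run one git command; return (exit_code, stripped stdout)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True
    )
    return result.returncode, result.stdout.strip()


def recovery_status(expected_branch, run=run_git):
    """Collect pass/fail results for four post-recovery checks.

    `run` is injectable so the logic can be exercised without a real repository.
    """
    fsck_code, _ = run(["fsck", "--full"])
    branch_code, branch = run(["rev-parse", "--abbrev-ref", "HEAD"])
    status_code, changes = run(["status", "--porcelain"])
    local_code, local = run(["rev-parse", "HEAD"])
    # Compares against the last fetched upstream ref; run `git fetch` first.
    remote_code, remote = run(["rev-parse", "@{u}"])
    return {
        "repository_healthy": fsck_code == 0,
        "expected_branch": branch_code == 0 and branch == expected_branch,
        "working_tree_clean": status_code == 0 and changes == "",
        "remote_aligned": local_code == 0 and remote_code == 0 and local == remote,
    }
```

Calling `recovery_status("main")` inside a repository returns a dictionary of booleans that can be written to a recovery note. Every command here only reads state, which fits the non-destructive posture described above.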

Figure 8: The power-loss event turned process discipline into a real recovery test.

That matters because governance is easiest to dismiss before something goes wrong. After corruption, the value of recoverable state, validation discipline, branch hygiene, and bounded work becomes much less theoretical.

The archive does not preserve exact elapsed recovery time. It also does not prove which governance element mattered most. Branch policy, validation policy, bootstrap artifacts, lifecycle discipline, and bounded-context locality may all have contributed.

The confirmed finding is narrower: strong process discipline improved disaster recovery, and recovery success depended heavily on repository governance and validation discipline.

That is enough.

What The Experiment Actually Supports

The most important thing about this experiment is not the most dramatic story. It is the evidence discipline.

Confirmed findings:

Architectural locality appeared to be the dominant efficiency multiplier.
Governance and discoverability improvements compounded future leverage.
Feature Tracking Governance became necessary.
Pipeline Metadata emerged as a bounded context.
Strong process discipline improved disaster recovery.
Recovery success depended heavily on repository governance and validation discipline.

Strong evidence:

Low-to-medium burn work is more likely inside mature locality clusters.
Repeated Pipeline Metadata features produced compounding leverage.
Compaction was more valuable for continuity than throughput.
Lower reasoning settings became more viable inside mature bounded contexts.

Open questions:

How much gain came from locality versus familiarity?
What exact differences existed between GPT-5.5 Reasoning Medium and GPT-5.4 Reasoning Low?
When should Pipeline Metadata be extracted?
How structurally similar was MIL Export Quality Hardening to FPA Pipeline Metadata?
What exact context-window sizes, timestamps, and model mappings were not preserved?
Which governance element mattered most during recovery?

Figure 9: The final claims are strongest when the article keeps evidence tiers visible.

This separation keeps the story honest. It lets the confirmed findings stay strong because they are not carrying unsupported claims.
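The same separation can be enforced mechanically by attaching a tier to every claim. This is a minimal sketch under assumed names (`Tier`, `CLAIMS`, `by_tier`); the three sample claims are taken from the lists above.

```python
from enum import Enum


class Tier(Enum):
    CONFIRMED = "confirmed finding"
    STRONG = "strong evidence"
    OPEN = "open question"


# One sample claim per tier, quoted from the lists above.
CLAIMS = {
    "Pipeline Metadata emerged as a bounded context.": Tier.CONFIRMED,
    "Compaction was more valuable for continuity than throughput.": Tier.STRONG,
    "Which governance element mattered most during recovery?": Tier.OPEN,
}


def by_tier(claims, tier):
    """Return the claims recorded at exactly this tier, in sorted order."""
    return sorted(text for text, t in claims.items() if t is tier)


for tier in Tier:
    print(f"{tier.value}: {by_tier(CLAIMS, tier)}")
```

A claim cannot appear in a summary without its tier, so promoting strong evidence to a confirmed finding has to be an explicit edit rather than a drift in wording.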

The Lesson For Engineering Leaders

The practical lesson is not “buy more context.” It is not “compact harder.” It is not “use the biggest model for everything.”

The lesson is that AI-assisted development efficiency depends heavily on the shape of the system around the model.

A repository that is local, discoverable, governed, and validated gives the agent less ambiguity to resolve. It turns memory into artifacts. It makes compaction survivable because operational truth is not trapped in the chat. It makes lower reasoning settings more plausible because the problem surface is better bounded. It makes recovery more realistic because the work has structure outside the session.

For senior engineers and architects, the implication is direct: if you want better AI-assisted development, do not only tune prompts. Tune the architecture.

Create bounded contexts. Keep ownership visible. Make dependencies inspectable. Preserve validation paths. Record feature batches. Separate confirmed findings from strong evidence. Treat governance as part of the engineering system, not as administrative residue.

The most interesting outcome wasn't discovering that context mattered.

Most engineers already suspected that.

The interesting outcome was discovering that architecture appeared to matter more.

The repositories that became easier to understand, validate, inspect, and recover also became easier for AI to work within.

That observation may ultimately be more valuable than any individual context-window measurement.

The experiment began with context growth, context burn, compaction, and survivability. Those were real concerns. They still matter.

But the deeper result was that context efficiency appeared to be downstream of architectural clarity.

The experiment started as a context study.

It ended as an architecture study.

If you're experimenting with AI-assisted development workflows, I'd be interested in hearing what you've observed.
