Let’s be honest: reading code is not always as straightforward as we would like. Even experienced developers know that some codebases take more effort to navigate than others. And now AI agents face the same reality.
Turns out, when an AI agent walks through a messy codebase, it does not get tired. It gets expensive. Not in time, but in tokens. The more tangled the logic, the more it costs to figure out what is going on. Same confusion, different billing model.
That is where this tool comes in. Instead of letting packages pile up like an overambitious Jenga tower, it restructures them into a more balanced, layered system. The goal is simple: make the codebase easier to navigate, not just for developers, but for AI agents too.
Whether you are human or silicon, nobody enjoys digging through chaos. And if we can make code more readable for both, that is not just optimization. It is survival.
Source: https://bobcats-coding.notion.site/ai-field-notes-by-bobcats-coding
Context awareness for node packages
Goal: Structure node packages so AI agents read less and understand more.
Specifically: measure how TypeScript monorepo structure affects context window consumption, and build a tool that quantifies the waste and fixes it.
Repository: markkovari/context-pnpm
Before/After Highlights
When I work on different parts of a codebase with AI assistants, the context window fills up fast. Every file the assistant reads to understand a dependency is loaded in full, including implementation details it will never touch. For a busy utility module, that's thousands of tokens of waste, on every session, across every file that imports it. I kept hitting conversation compacting earlier than expected, and it was slowing me down.
My theory was that the shape of your modules, how many packages you have, how big they are, how nested, directly influences how many tokens get burned just loading context. But I didn't have numbers. I didn't know the threshold where splitting a module actually pays off versus adding maintenance overhead for no gain.
So I built a tool to find out.
The approach
I wanted to answer a simple question: given a TypeScript codebase, which files are costing you the most tokens per AI session, and is it worth restructuring them?
The core insight is that file size alone doesn't predict waste. What matters is how much of a file is implementation versus exported API, multiplied by how many files import it. A 10,000-token type declaration file with 98% exports barely registers. A 700-token utility module with a large implementation body, imported by 18 files, costs more than almost anything else.
I landed on this scoring formula:
score = (total_tokens − surface_tokens) × importer_count
| Term | Definition |
|---|---|
| `total_tokens` | Full file token count (tiktoken `cl100k_base`) |
| `surface_tokens` | Only the exported declarations |
| `importer_count` | Number of files that import this one |
💡 If the score is above 60 (the overhead of a `package.json` + `index.ts` boilerplate), extraction into a separate workspace package is worth it. Below that, leave it alone.
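The scoring rule and threshold above can be sketched in a few lines of TypeScript. This is my own illustration of the formula, not the tool's actual API; the type and function names are invented:

```typescript
// Per-file statistics as produced by an analyzer pass (names are illustrative).
interface FileStats {
  path: string;
  totalTokens: number;   // full file token count
  surfaceTokens: number; // tokens in exported declarations only
  importerCount: number; // number of files that import this one
}

// Overhead of a package.json + index.ts boilerplate, per the threshold above.
const EXTRACTION_OVERHEAD = 60;

// score = (total_tokens − surface_tokens) × importer_count
function score(stats: FileStats): number {
  const hiddenTokens = stats.totalTokens - stats.surfaceTokens;
  return hiddenTokens * stats.importerCount;
}

function isExtractionCandidate(stats: FileStats): boolean {
  return score(stats) > EXTRACTION_OVERHEAD;
}

// Example: a 700-token utility with a 150-token surface and 18 importers.
const utils: FileStats = { path: "utils.ts", totalTokens: 700, surfaceTokens: 150, importerCount: 18 };
console.log(score(utils)); // (700 - 150) * 18 = 9900
console.log(isExtractionCandidate(utils)); // true
```

Note how the formula makes a huge, export-only type file score near zero: its hidden-token term is tiny regardless of how many files import it.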
The toolchain
External packages
| Package | Purpose |
|---|---|
| tiktoken (OpenAI) | Accurate token counting with cl100k_base encoding |
| typescript-estree (typescript-eslint) | ESTree-compatible AST parser to distinguish exported surface from implementation body |
Internal packages
| Package | Role |
|---|---|
| `analyzer` | Reads folders via glob pattern, returns total tokens, surface tokens, and importer counts |
| `estimator` | Projects token savings per AI session from analyzer output |
| `cli` | User-facing tool: `analyze`, `estimate`, `scaffold`, `verify`, `rebalance`. Dry-run by default; nothing written without `--apply` |
| `scaffolder` | Rewires imports/exports, registers new pnpm workspace packages, generates minimal `index.ts` re-export surfaces |
The process: traverse the module tree, tokenize each file, separate surface from implementation via AST analysis, count importers, score everything, and rank by extraction value. The CLI can then scaffold the actual package extraction if the numbers justify it.
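The accounting step of that process can be sketched in miniature. This is an assumption-laden toy, not the analyzer's real code: a whitespace regex stands in for tiktoken's `cl100k_base` encoder, and a pre-parsed declaration list stands in for the typescript-estree AST walk, so only the surface-vs-implementation split is shown:

```typescript
// Toy stand-ins for the real pipeline (tiktoken + typescript-estree).
interface Declaration { exported: boolean; source: string; }
interface ModuleInfo { path: string; declarations: Declaration[]; importers: string[]; }

// Naive "tokenizer": counts whitespace-separated chunks. Real token counts
// from cl100k_base will differ; only the relative accounting matters here.
const countTokens = (text: string): number => (text.match(/\S+/g) ?? []).length;

// Separate exported surface from hidden implementation, then score.
function analyze(mod: ModuleInfo) {
  const total = mod.declarations.reduce((n, d) => n + countTokens(d.source), 0);
  const surface = mod.declarations
    .filter((d) => d.exported)
    .reduce((n, d) => n + countTokens(d.source), 0);
  const hidden = total - surface;
  return { path: mod.path, total, surface, hidden, score: hidden * mod.importers.length };
}

const utilsModule: ModuleInfo = {
  path: "utils.ts",
  declarations: [
    { exported: true, source: "export function slug(s: string): string" },
    { exported: false, source: "const cache = new Map<string, string>()" },
  ],
  importers: ["a.ts", "b.ts", "c.ts"],
};
console.log(analyze(utilsModule)); // hidden implementation tokens x 3 importers
```

In the real tool, `declarations` would come from walking the ESTree AST and classifying each top-level node by whether it sits under an `ExportNamedDeclaration` or `ExportDefaultDeclaration`.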
Benchmarks
During development, I spoon-fed the tool its own internal packages as test cases and added synthetic fixtures for both extremes: a "symmetric" already-optimized codebase and an "asymmetric" monolith with classic shared-utility anti-patterns.
But the interesting part was running it against real-world open-source monorepos.
External package benchmarks
I ran dry-run estimations against three popular TypeScript repositories:
| Codebase | Files | Candidates | % | Tokens saved / session |
|---|---|---|---|---|
| tRPC `packages/server/src` | 89 | 56 | 63% | 68,572 |
| TanStack Query `packages/query-core/src` | 31 | 20 | 65% | 34,155 |
| Radix UI `packages/` | 131 | 5 | 4% | 1,591 |
💡 Pricing reference: Claude Sonnet input at $3/1M tokens. The tRPC result means ~$0.21 in unnecessary tokens per session, which adds up across a team over weeks.
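The pricing arithmetic is a single multiplication; here it is as a sketch, using the $3/1M input-token rate quoted above:

```typescript
// Dollar cost of wasted tokens per session at a given input price.
const costPerSession = (tokensSaved: number, usdPerMillionTokens = 3): number =>
  (tokensSaved / 1_000_000) * usdPerMillionTokens;

console.log(costPerSession(68_572).toFixed(2)); // "0.21" for the tRPC result
console.log(costPerSession(1_591).toFixed(2));  // "0.00" for Radix UI: nothing worth fixing
```

Per-session numbers look small; the multiplier is sessions per day times developers on the team.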
tRPC: the "deep internals" anti-pattern
tRPC's unstable-core-do-not-import/ is a textbook case. 56 files fan out to 2-18 consumers each. Every adapter file that an AI session reads drags in the full internals of the procedure builder, router, and streaming infrastructure, even when it only needs one or two types. The top offender, procedureBuilder.ts, scores 8,260: 4,386 tokens of implementation consumed by 5 importers. After extraction, each consumer would read only a ~200-token surface.
TanStack Query: tight coupling in a small graph
31 files, 20 heavily cross-imported. utils.ts is imported by 17 files and queryClient.ts by 13. The interesting finding here: types.ts is the largest file (10,521 tokens) but scores only fifth because 98% of it is surface. utils.ts scores second despite being half the size, because its implementation body is large relative to what callers use. File size is a bad proxy for waste.
Radix UI: the correct negative result
Only 5 candidates from 131 files, all with a single importer. Radix is already decomposed into ~30 packages with 1-5 files each and minimal internal coupling. The tool correctly says "nothing to do." This was an important validation: I needed to confirm it doesn't generate false positives on well-structured code.
Synthetic fixtures
| Fixture | Files | Candidates | Tokens saved | Purpose |
|---|---|---|---|---|
| `monolith-service` | 10 | 3 | 9,936 | `db.ts`, `logger.ts`, `config.ts` each imported by every other module. Most common anti-pattern. |
| `decomposed-app` | 6 | 0 | 0 | Small focused files, 1-2 consumers each. Correct negative. |
What surprised me
💡 The biggest finding: file size doesn't predict waste (R² ≈ 0.15). Importer count alone is equally weak. The strongest predictor is hidden tokens (implementation body), but the score is multiplicative (`hidden × importers`), so both dimensions matter.
This means you can't eyeball your way to the answer. A module that looks "big" might be mostly type exports and perfectly fine. A module that looks "small" might be silently burning thousands of tokens because it's imported everywhere and its public API is tiny compared to its internals.
What worked
The Claude Code hook integration turned out to be the most practical outcome. Wire estimate as a SessionStart hook and it automatically surfaces context bloat whenever you open a session:
```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "npx context-pnpm estimate . 2>/dev/null | grep -E 'Total|No extraction'"
          }
        ]
      }
    ]
  }
}
```
If the codebase is clean, you see No extraction candidates. If it has drifted, you see the token savings waiting to be unlocked, before you've written a single line of code. This feedback loop keeps the team aware of structural drift without adding process overhead.
What didn't work (yet)
The scaffolder, while functional, is the least mature piece. It handles straightforward cases well: generating workspace packages with minimal re-export surfaces and rewriting import paths. But in codebases with circular dependencies or complex re-export chains, the rewiring logic still needs manual intervention. I'm treating this as a "preview" feature while I iterate on edge cases.
I also initially assumed I could use a simpler heuristic (just file size times importer count) and skip the AST-based surface detection entirely. The Radix UI and TanStack types.ts results proved that assumption wrong. Without distinguishing surface from implementation, the scoring would have flagged types.ts as one of the top offenders when it's actually fine.
Current status and next steps
The tool is open source and usable today for the read-only commands (`analyze`, `estimate`). The mutation commands (`scaffold`, `rebalance`) work but should be used with review.
Next steps I'm considering:
- Adding support for JavaScript/JSX alongside TypeScript (partially done)
- Making the scoring engine language-agnostic via Tree-sitter. The core formula is language-independent; I'd only need per-language definitions of "what counts as surface" (Python: `__all__`; Go: capitalized identifiers; Rust: `pub` items). The tree-sitter-language-pack bundles 248+ grammars with Rust/Node.js/Python bindings, so the plumbing is there.
- A `rebalance` command that identifies merge/split/inline opportunities on existing workspace packages, not just extraction from monoliths
- Better heuristics for the "when to extract" decision, incorporating churn rate from git history alongside the static score
- Integration with CI pipelines so teams get warned when a PR pushes a module past the extraction threshold
I don't think AI coding assistants will solve the context window problem on their own. Models will get bigger windows, but tokens are never free, and the cost curve is multiplicative with team size. Structuring your code so that AI can read less and understand more is a lever that keeps compounding.
Writing code for two audiences
For decades, "clean code" meant code that humans can read and maintain. That's still true, but AI agents are now a second consumer of your codebase. They read your modules, trace your imports, and parse your exports on every session, from scratch, burning tokens the whole time.
The practices that help humans (small functions, clear separation of concerns) mostly overlap with what helps agents, but not entirely. An agent doesn't care about naming aesthetics. It cares about how many tokens it has to ingest before it can do useful work. A module with a 50-line public API and 2,000 lines of implementation behind it is perfectly clean by human standards, but it's wasteful for an agent that only needs the API.
context-pnpm is built around treating AI readability as a first-class design constraint alongside human readability. The two rarely conflict: narrow interfaces, minimal public surface, and well-decomposed modules are good for both. The difference is that now there's a measurable cost when you get it wrong: tokens per session, dollars per month. I think this will quietly become a standard part of how teams think about code architecture, not as a buzzword, but as a practical recognition that your codebase has two kinds of readers.
Old principles, new payoff: why SOLID matters for AI readability
Most of what makes code AI-readable isn't new. The Interface Segregation Principle (the "I" in SOLID) says no consumer should depend on methods it doesn't use - that's literally what the scoring formula measures. The Dependency Inversion Principle says depend on abstractions, not implementations - that's what extraction into a minimal re-export surface achieves. IDD formalizes this into "design the interface before the implementation." The difference now is that these principles have a measurable second payoff: every unnecessary token you hide behind an interface is a token the agent doesn't burn.
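As a concrete sketch of what hiding implementation tokens behind a narrow surface looks like, consider a hypothetical utility module (all names invented for illustration):

```typescript
// Hypothetical module: small public API, larger hidden implementation.
// After extraction into a workspace package, consumers import through a
// one-line index.ts re-export surface, e.g.:
//   export { slugify } from "./src/slugify";
// so an agent reading a consumer never ingests the internals below.

const cache = new Map<string, string>(); // implementation detail, not exported

function normalize(s: string): string {  // implementation detail, not exported
  return s.trim().toLowerCase().replace(/\s+/g, "-");
}

// The entire public surface: one small exported function.
export function slugify(s: string): string {
  const cached = cache.get(s);
  if (cached !== undefined) return cached;
  const result = normalize(s);
  cache.set(s, result);
  return result;
}

console.log(slugify("Hello  World")); // "hello-world"
```

By human standards this module was already clean; the token accounting just makes the Interface Segregation argument quantitative: consumers pay only for the `slugify` signature, never for the cache or normalizer.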
Automatic rebalancing
Extraction is a one-time event, but codebases drift. The rebalance command (in preview) treats the module tree like a self-balancing tree: merge, split, inline, or extract packages as import patterns change. The missing signal is git churn rate, which I'm exploring to avoid suggesting extraction on modules being actively rewritten.
Alternatives and similar tools
| Tool | Language | What it does | Difference from context-pnpm |
|---|---|---|---|
| Tach (Gauge) | Python | Module boundaries, dependency enforcement, strict public interfaces. Written in Rust. | No token-based scoring |
| Codebase-Memory | 66 languages | Tree-sitter knowledge graph, 10x fewer tokens via MCP | Optimizes retrieval, not structure |
| Depends | Java, C/C++, Ruby | Language-agnostic dependency extraction | Raw data, no scoring or restructuring |
References
| Category | Link |
|---|---|
| Interface-Driven Development | IDD overview (Milanovic, 2022) |
| Spec-Driven Development | Spec Driven Development (InfoQ, 2026) |
| Interface-based programming | Wikipedia |
| Dependency graphs at scale | Building a Dependency Graph (HRT, 2025) |
| Dependency graph management | Managing dependency graph in a large codebase (Tweag, 2025) |
| Context engineering | Context Engineering for Coding Agents (Fowler, 2026) |
| Token optimization research | Codebase-Memory (arXiv, 2026) |
| Context strategies | Context Engineering for Developers (Faros AI, 2025) |