How We Cut Claude Code Session Overhead with Lazy-Loaded Personas

#claudecode #ai #devtools #performance

If you use Claude Code with a heavily customized CLAUDE.md, every message you send carries that full file as context. Not just once at session start — on every turn.

That matters more than most people realize.

The Problem: Eager-Loading Everything

The naive approach to building a multi-persona system in Claude Code is to define all your personas directly in CLAUDE.md. It feels clean — everything in one place, always available.

The cost: if you have 23 specialist personas, each defined in 150-200 lines, that is roughly 3,450-4,600 lines of persona definitions (well over 10,000 tokens even at a conservative three tokens per line) loaded on every single message, regardless of whether the current task has anything to do with a UX designer or a financial analyst.

Claude Code's CLAUDE.md is not a one-time setup file. It is loaded at session start and sent again as part of the context with every request, so you pay for it on every turn. The larger it is, the more tokens you burn before you type a word.
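To see why per-turn inclusion matters, the arithmetic fits in a few lines of Python. The persona count and the 150-200 lines per persona come from the example above; the tokens-per-line ratio and the 40-turn session are assumptions, deliberately on the low side, that you should replace with measurements from your own setup.

```python
# Back-of-the-envelope cost of persona definitions kept in CLAUDE.md.
# ASSUMPTIONS: tokens_per_line and the 40-turn session are illustrative
# placeholders, not measurements; substitute numbers from your own setup.

def eager_persona_tokens(personas: int, lines_per_persona: int,
                         tokens_per_line: float) -> int:
    """Tokens of persona definitions present in context on every turn."""
    return round(personas * lines_per_persona * tokens_per_line)


def session_tokens(per_turn_tokens: int, turns: int) -> int:
    """CLAUDE.md is part of every request, so its tokens recur once per turn."""
    return per_turn_tokens * turns


if __name__ == "__main__":
    for lines in (150, 200):
        per_turn = eager_persona_tokens(23, lines, tokens_per_line=3.0)
        total = session_tokens(per_turn, turns=40)
        print(f"{lines} lines/persona: {per_turn:,} tokens/turn, {total:,} over 40 turns")
```

The point of `session_tokens` is the multiplication: whatever the per-turn figure turns out to be, it recurs on every request in the session. Prompt caching lowers the price of those repeated tokens, but it does not remove them from the context.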

The Pattern: Route First, Load on Demand

The fix is the same pattern software engineers have used for decades: don't load what you don't need until you need it.

Instead of embedding persona definitions in CLAUDE.md, you define a lightweight routing engine that reads signal words from the user's message and loads the relevant persona file on demand.

Eager approach (expensive):

```markdown
# Personas

## Mary (Business Analyst)
Mary is a meticulous analyst who investigates existing state...
[150 more lines]

## Amelia (Developer Agent)
Amelia is an execution-focused developer who builds and edits files...
[150 more lines]

## Winston (Architect)
Winston designs systems, data flows, and infrastructure...
[150 more lines]

# ... 20 more persona blocks
```

Lazy approach (efficient):

```markdown
## Persona Routing
Read routing-engine.md on every session.
Load personas on demand from ~/.claude/prism/ when triggered by signal words.
Only the active persona's file is in context.
```

With this structure, CLAUDE.md stays lean. The routing engine (routing-engine.md) is a single file that maps signal words to persona file paths. When a message contains "architecture" or "schema," Claude reads persona-architect-winston.md. When it contains "brainstorm" or "ideate," it reads persona-brainstorm-coach-carson.md. Everything else stays off-context.
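In this setup the routing is plain-language instructions that Claude interprets, not code. As a mental model only, the sketch below expresses the same lookup in Python: the four signal words and two filenames come from the paragraph above, while `ROUTES`, `personas_to_load`, and the exact-word matching rule are hypothetical.

```python
# Hypothetical model of signal-word routing. In the setup described here the
# routing lives in routing-engine.md and Claude interprets it; this code only
# mirrors that lookup so the logic is easy to reason about and test.
import re

# Signal words -> persona file (taken from the examples in the text).
ROUTES: dict[str, str] = {
    "architecture": "persona-architect-winston.md",
    "schema": "persona-architect-winston.md",
    "brainstorm": "persona-brainstorm-coach-carson.md",
    "ideate": "persona-brainstorm-coach-carson.md",
}


def personas_to_load(message: str, routes: dict[str, str] = ROUTES) -> list[str]:
    """Return persona files whose signal words appear in the message, first match first."""
    words = re.findall(r"[a-z]+", message.lower())
    selected: list[str] = []
    for word in words:
        path = routes.get(word)
        if path is not None and path not in selected:
            selected.append(path)
    return selected


if __name__ == "__main__":
    print(personas_to_load("Can you review the schema and overall architecture?"))
    print(personas_to_load("Fix the typo in README"))  # no signal words, nothing loaded
```

Exact-word matching is a deliberate simplification: it will not match "schemas" or "ideation", whereas Claude reading a routing file can generalize. The property it does capture is the one that saves tokens: a message with no signal words loads nothing.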

Why This Matters Right Now

In April 2026, Claude Code users started reporting session costs 10-20x higher than expected. The reported root cause: a caching bug in which context that should be served from cache was being re-tokenized and re-charged on every turn.

Eager-loading large CLAUDE.md files makes this worse. The bigger your baseline context, the higher your exposure when the cache misses. A 5,000-token persona block that costs a fraction of a cent per turn when it is read from cache is billed at the full input rate when the cache misses, roughly ten times as much, on every message in the session.
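The gap between a cache hit and a cache miss can be sketched per turn. This assumes a placeholder base input price of $3 per million tokens and a cache-read rate of 10% of the base input price, in line with Anthropic's published prompt-caching pricing at the time of writing; check current pricing for your model before relying on the numbers. The 5,000-token block size is the one used above.

```python
# Per-turn cost of a fixed context block, cache hit vs. cache miss.
# ASSUMPTIONS: the base price is a hypothetical placeholder; the 0.1
# cache-read multiplier should be checked against current pricing.
# Cache-write surcharges are ignored.

BASE_INPUT_USD_PER_MTOK = 3.00   # hypothetical placeholder price
CACHE_READ_MULTIPLIER = 0.1      # cache reads billed at 10% of base input


def per_turn_cost(tokens: int, cache_hit: bool,
                  base_usd_per_mtok: float = BASE_INPUT_USD_PER_MTOK) -> float:
    """Dollar cost of sending `tokens` of unchanged context on one turn."""
    rate = base_usd_per_mtok * (CACHE_READ_MULTIPLIER if cache_hit else 1.0)
    return tokens * rate / 1_000_000


if __name__ == "__main__":
    block = 5_000  # persona block size used in the text
    hit, miss = per_turn_cost(block, True), per_turn_cost(block, False)
    print(f"cache hit: ${hit:.4f}/turn, cache miss: ${miss:.4f}/turn, ratio {miss / hit:.0f}x")
```

Under these assumptions a miss costs ten times a hit for the same block, every turn, so the size of the block sets the size of the penalty.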

Lazy-loading is not a fix for the cache bug. It is a structural hedge. Smaller baseline context means less blast radius when something goes wrong with token accounting — and it means lower costs even when everything works correctly.
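The "smaller blast radius" argument comes down to comparing baseline sizes. The sketch below compares an eager CLAUDE.md with a lazy one made of a core file, the router, and whichever personas have been read; every token count in it is a hypothetical placeholder, not a measurement of PRISM Forge.

```python
# Baseline context per turn: eager CLAUDE.md vs. a lazy setup.
# ASSUMPTION: every token count below is a hypothetical placeholder.

def eager_baseline(core_tokens: int, persona_tokens: list[int]) -> int:
    """Core instructions plus every persona definition, on every turn."""
    return core_tokens + sum(persona_tokens)


def lazy_baseline(core_tokens: int, router_tokens: int, loaded: list[int]) -> int:
    """Core instructions, the router, and the personas read so far.

    A file Claude has read stays in the conversation, so `loaded` should list
    every persona read this session, not only the one currently active.
    """
    return core_tokens + router_tokens + sum(loaded)


if __name__ == "__main__":
    personas = [500] * 23      # hypothetical: 23 personas at 500 tokens each
    core, router = 800, 400    # hypothetical core CLAUDE.md and router sizes
    print("eager:", eager_baseline(core, personas))
    print("lazy, one persona read:", lazy_baseline(core, router, personas[:1]))
    print("lazy, three personas read:", lazy_baseline(core, router, personas[:3]))
```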

How to Apply This Pattern

You don't need a 23-persona routing system to benefit from this. Three steps work for any Claude Code setup:

1. Audit your CLAUDE.md token weight.

Count its tokens with Anthropic's token-counting API (or tiktoken, OpenAI's tokenizer, as a rough proxy), or run wc -w CLAUDE.md for a fast estimate. If you're over 1,000 words, you have room to trim.
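For a quick offline check before reaching for a real tokenizer, a small script can report lines, words, and a rough token estimate. The ~1.3 tokens-per-word ratio is a common rule of thumb, not Claude's tokenizer, and `audit` is a hypothetical helper; the 1,000-word threshold is the one suggested above.

```python
# Rough, offline size audit for a CLAUDE.md file.
# ASSUMPTION: ~1.3 tokens per English word is a rule of thumb, not Claude's
# tokenizer; use the estimate to decide whether to look closer, nothing more.
from pathlib import Path

TOKENS_PER_WORD = 1.3   # heuristic ratio (assumption)
WORD_BUDGET = 1_000     # threshold suggested in the text


def audit(text: str) -> dict[str, object]:
    words = len(text.split())  # whitespace-separated words, roughly what `wc -w` counts
    return {
        "lines": len(text.splitlines()),
        "words": words,
        "est_tokens": round(words * TOKENS_PER_WORD),
        "over_budget": words > WORD_BUDGET,
    }


if __name__ == "__main__":
    claude_md = Path("CLAUDE.md")
    if claude_md.exists():
        print(audit(claude_md.read_text(encoding="utf-8")))
    else:
        print("No CLAUDE.md in the current directory")
```

Treat `est_tokens` as a signal to look closer, not a bill; for exact figures, count with the real tokenizer through Anthropic's token-counting API.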

2. Move reference content to separate files.

Anything that isn't needed on every turn belongs in its own file. Coding style guides, persona definitions, workflow references, architecture docs: pull them out of CLAUDE.md and into named files under .claude/ (or ~/.claude/ for files you want available in every project, as in the example below).
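If your CLAUDE.md is organized under `## ` headings, a short script can split it into candidate files so you can decide section by section what to move. This is a sketch assuming level-2 headings mark the sections; `split_sections`, `slugify`, and the slug-based filenames are hypothetical, and the script only reports what it found, leaving the decision of what to move to you.

```python
# Sketch: split a CLAUDE.md into sections at "## " headings so each one can be
# reviewed and, if it is not needed on every turn, moved to its own file.
# ASSUMPTION: sections start with level-2 "## " headings. Duplicate headings
# map to the same slug, so a later section replaces an earlier one.
import re
from pathlib import Path


def slugify(heading: str) -> str:
    """'Deploy / CI' -> 'deploy-ci' (hypothetical filename scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")


def split_sections(text: str) -> dict[str, str]:
    """Map a slug per '## ' heading to its section text; leading text is 'preamble'."""
    sections: dict[str, str] = {}
    current, buf = "preamble", []
    for line in text.splitlines(keepends=True):
        if line.startswith("## "):
            if "".join(buf).strip():
                sections[current] = "".join(buf)
            current, buf = slugify(line[3:]), []
        buf.append(line)
    if "".join(buf).strip():
        sections[current] = "".join(buf)
    return sections


if __name__ == "__main__":
    claude_md = Path("CLAUDE.md")
    if claude_md.exists():
        for slug, body in split_sections(claude_md.read_text(encoding="utf-8")).items():
            print(f"{slug}.md: {len(body.split())} words")
```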

3. Add a routing section that tells Claude what to load and when.

```markdown
## Reference Files (load on demand)
- Coding standards: Read ~/.claude/reference/coding-style.md when writing or reviewing code
- Architecture patterns: Read ~/.claude/reference/architecture.md when designing systems
- Deployment guide: Read ~/.claude/reference/deployment.md when working on CI/CD
```

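Claude generally follows these instructions, so a file only enters context when the task calls for it. Once read, though, it stays in the conversation for the rest of the session (until the context is compacted or cleared), so the savings are largest when a session touches only a few reference files.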
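Lazy loading fails quietly if a path in the routing section is wrong: the read fails and the guidance never loads. A small check, assuming your routing lines follow the `Read <path> when ...` shape shown in the reference-files example, can list referenced paths that do not exist. The regex and `missing_reference_files` are hypothetical, and relative paths are resolved against the directory you run it from.

```python
# Sketch: list paths named in "Read <path>" routing instructions that do not
# exist on disk. ASSUMPTION: routing lines use the "Read <path> when ..."
# wording from the example; other uses of "Read" may produce false positives.
import os
import re
from pathlib import Path

# Captures the first whitespace-free token after the word "Read".
READ_PATH = re.compile(r"\bRead\s+(\S+)")


def missing_reference_files(routing_text: str) -> list[str]:
    """Return referenced paths (as written) that do not exist after ~ expansion."""
    missing: list[str] = []
    for raw in READ_PATH.findall(routing_text):
        path = raw.rstrip(".,;:")  # drop sentence punctuation after the path
        if not os.path.exists(os.path.expanduser(path)):
            missing.append(path)
    return missing


if __name__ == "__main__":
    claude_md = Path("CLAUDE.md")
    if claude_md.exists():
        for path in missing_reference_files(claude_md.read_text(encoding="utf-8")):
            print("missing:", path)
```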

PRISM Forge

This pattern is the foundation of PRISM Forge, an open-source Claude Code persona routing system with 23 specialist personas that load on-demand via signal-word routing. The full implementation is at github.com/prism-forge/prism.

The token savings are real. The architecture is simple. And the pattern applies to any Claude Code setup — no persona system required.

If you're building autonomous Claude Code workflows and want this architecture set up for your team, reach out on LinkedIn.