Chudi Nnorukam

Posted on • Edited on • Originally published at chudi.dev

How I Cut AI Token Usage by 60% With Progressive Context Loading


I was burning through $50/month in Claude API costs before I realized the problem. Every session loaded the same 8,000 tokens of context--rules, skills, examples--whether I needed them or not.

Progressive disclosure in AI context management means loading information in tiers: metadata first for routing, schemas on demand for understanding, and full content only when actively using a feature. The result? 60% fewer tokens consumed with better output quality, because the AI isn't parsing through irrelevant context to find what matters. Claude's context window documentation outlines the token limits that make this kind of optimization essential for longer sessions.

Why Was I Wasting So Many Tokens?

My original CLAUDE.md file was 2,400 lines. Every skill definition. Every constraint. Every example. It loaded completely on every single session.

That specific dread of seeing "context limit reached" mid-task--when you're 80% done and suddenly the AI forgets everything--became a weekly occurrence.

I thought more context was always better. If the AI knows everything, it can handle anything. Right? Anthropic's documentation actually recommends against bloated system prompts for exactly this reason--focused context produces more reliable outputs.

Well, it's more like... drowning in information isn't the same as understanding what matters.

How Does the 3-Tier System Work?

Progressive disclosure splits context into three tiers, each loading only when needed.

Tier 1: Metadata (~200 tokens)

The routing layer. Just enough to know what skills exist and when to activate them.

```markdown
## Skill: sveltekit_architect
**Triggers:** routes, layouts, prerender
**Dependencies:** None
**Priority:** High for SvelteKit projects
```

This is all that loads initially. Skill names, trigger patterns, quick reference. The AI knows the skill exists without knowing everything about it.

Tier 2: Schema (~400 tokens)

The contract layer. Input/output types, constraints, quality gates.

```markdown
### Input Schema
- operation: "analyze_routes" | "create_layout" | "optimize_prerender"
- targetPath?: string

### Output Schema
- success: boolean
- summary: string
- nextActions: string[]

### Constraints
- Svelte 5 runes only
- Prerender for static pages
- Bundle under 300KB
```

This loads when the skill activates--when you're actually working with routes. Not before.

Tier 3: Full Content (~1200 tokens)

The implementation layer. Complete handler logic, examples, edge cases.

```markdown
### Handler: analyze_routes
1. Glob all `src/routes/**/+page.svelte`
2. Check each for data loader presence
3. Verify prerender config matches content type
4. Return recommendations

### Example Output
{
  "success": true,
  "summary": "Found 12 routes, 3 missing data loaders",
  "nextActions": ["Add +page.server.ts to /blog/[slug]"]
}
```

This only loads when you're deep into a complex routing task. Most sessions never need it.

What Are the Actual Token Savings?

Here's the meta-orchestration skill as a real example:

| Tier | Lines | Tokens | When loaded |
|--------|-------|--------|---------------------|
| Tier 1 | 278 | ~200 | Every session |
| Tier 2 | 816 | ~600 | On skill activation |
| Tier 3 | 3,302 | ~2,400 | Complex tasks only |

Without progressive disclosure: 2,400 tokens loaded every session.
With progressive disclosure: 200 tokens for most sessions, 600 for active skill use.

Savings: 60-92% depending on task complexity.

I load less to get more. The paradox makes sense once you see it in practice.

How Does Smart Mode Auto-Detect Verbosity?

The system doesn't just have three tiers--it layers four verbosity levels on top that auto-adjust per query.

Smart mode analyzes your query and picks the level: simple questions get minimal context, architecture questions get everything.
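A minimal sketch of what that query-driven selection could look like. The level names, keyword lists, and thresholds here are my own illustrative assumptions, not the system's actual heuristics:

```javascript
// Hypothetical sketch of smart-mode verbosity detection. Level names,
// keyword lists, and score thresholds are illustrative assumptions.
const LEVELS = ["minimal", "standard", "detailed", "full"];

function detectVerbosity(query) {
  const q = query.toLowerCase();
  let score = 1; // default: "standard"
  // Architecture and design language pushes toward fuller context.
  if (/\b(architecture|refactor|design|migrate)\b/.test(q)) score += 2;
  // Cross-cutting scope ("all routes", "entire app") needs more context.
  if (/\b(all|every|across|entire)\b/.test(q)) score += 1;
  // Short questions stay minimal.
  if (q.trim().split(/\s+/).length <= 10) score -= 1;
  return LEVELS[Math.max(0, Math.min(3, score))];
}
```

The point is that the decision is cheap--a few string checks--while the thing it gates (loading thousands of tokens of skill content) is expensive.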

What Is the Lazy Module Loader Pattern?

Beyond skill tiers, the system uses lazy loading for expensive modules:

```javascript
// Instead of: eagerly importing everything, every session
import { allSkills } from './skills'; // 8,000 tokens

// Use: a dynamic import that resolves only when the skill is referenced
const getSkill = (name) => import(`./skills/${name}.md`);
```

Features

  • Dynamic imports: Load modules only when referenced
  • 10-minute TTL: Cache loaded modules, expire unused ones
  • Deduplication: Never load the same module twice per session

This pattern works for skill files, reference docs, and example repositories. Anything that doesn't need to exist in context until explicitly requested.
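The three features above can be sketched together as a small cache wrapper. The 10-minute TTL comes from the feature list; `loadFn` stands in for the real dynamic import, and the API shape is an assumption:

```javascript
// Hypothetical sketch of the lazy loader: a TTL cache keyed by module name.
// loadFn stands in for the real dynamic import; the interface is assumed.
function createLazyLoader(loadFn, ttlMs = 10 * 60 * 1000, now = Date.now) {
  const cache = new Map(); // module name -> { value, loadedAt }
  let loads = 0;

  return {
    get(name) {
      const hit = cache.get(name);
      // Deduplication: a fresh cache entry means no second load this session.
      if (hit && now() - hit.loadedAt < ttlMs) return hit.value;
      const value = loadFn(name); // the expensive load happens only on miss/expiry
      loads += 1;
      cache.set(name, { value, loadedAt: now() });
      return value;
    },
    loadCount: () => loads, // exposed for inspection
  };
}
```

Injecting `now` as a parameter is just to make the expiry behavior testable; in practice `Date.now` is the default.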

How Do Skill Bundles Reduce Redundant Loading?

Related skills often load together. Instead of loading them individually:

```json
{
  "frontend-bundle": {
    "skills": ["react-patterns", "tailwind-stylist", "component-testing"],
    "tokens": 4500,
    "triggers": ["*.tsx", "*.css", "component"]
  }
}
```

When you're working on frontend code, the bundle activates as a unit. No loading three separate skills with overlapping context--one bundle with deduplicated content.

Current Bundles

  • frontend-bundle: React, UI/UX, web standards (4,500 tokens)
  • backend-bundle: API, database, patterns (4,200 tokens)
  • debugging-bundle: Error resolution, testing (2,500 tokens)
  • workflow-bundle: Git, CI/CD, deployment (3,200 tokens)

Each bundle is optimized to eliminate redundancy between related skills.
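Bundle activation itself can be as simple as matching triggers against the file or query at hand. The bundle names and triggers below mirror the examples above; the matching logic is my own assumption:

```javascript
// Hypothetical sketch of bundle activation by trigger matching.
// Extension globs ("*.tsx") match file suffixes; plain triggers are keywords.
const BUNDLES = {
  "frontend-bundle": { triggers: ["*.tsx", "*.css", "component"] },
  "backend-bundle": { triggers: ["*.sql", "api", "database"] },
};

function matchBundle(context) {
  for (const [name, { triggers }] of Object.entries(BUNDLES)) {
    for (const t of triggers) {
      const hit = t.startsWith("*.")
        ? context.endsWith(t.slice(1)) // "*.tsx" -> endsWith(".tsx")
        : context.includes(t);         // plain keyword trigger
      if (hit) return name;            // the whole bundle loads as one unit
    }
  }
  return null; // no bundle needed: stay at Tier 1 metadata
}
```

One match loads one deduplicated bundle, instead of three skills with overlapping content.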

What's the Token Budget System?

The context governor enforces hard limits:

  • 75% max budget: Never exceed this regardless of task
  • 60% warning threshold: Start aggressive tier reduction
  • 20% reserve: Always keep space for AI responses

When approaching limits:

  1. Phase unloading: Release completed phase context
  2. Tier reduction: Drop to lower tiers for inactive skills
  3. Auto-compact: Summarize and compress old context
  4. Graceful degradation: Warn before hitting hard limits

That anxiety of context overflow--the system makes it manageable by budgeting proactively.
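A sketch of how those thresholds might translate into code. Only the 75%/60% cutoffs come from the list above; the action names are my own shorthand:

```javascript
// Hypothetical budget check: map context usage to a mitigation action.
// Thresholds (75% hard cap, 60% warning) come from the governor's limits.
function budgetAction(usedTokens, windowTokens) {
  const ratio = usedTokens / windowTokens;
  if (ratio >= 0.75) return "unload"; // hard cap: release completed-phase context now
  if (ratio >= 0.6) return "reduce";  // warning: drop inactive skills to lower tiers
  return "ok";                        // within budget; 20%+ stays reserved for responses
}
```

Checking this before every expensive load is what makes the degradation graceful rather than a surprise.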

FAQ: Token Optimization for AI Tools

What is progressive disclosure in AI context management?
Progressive disclosure loads AI context in tiers: metadata first (~200 tokens), schemas on demand (~400 tokens), full content only when needed (~1200 tokens). This prevents loading thousands of unused tokens upfront.

How much can progressive disclosure save on AI costs?
In practice, progressive disclosure saves 40-60% of tokens per session. A skill whose full content spans 3,302 lines (~2,400 tokens) loads only its 278-line, ~200-token Tier 1 metadata--unless you actually need the deeper content.

Does loading less context hurt AI performance?
Counter-intuitively, no. Focused context with relevant information outperforms bloated context with everything. The AI processes fewer tokens to find what matters, leading to more accurate responses.

What are the three tiers in progressive disclosure?
Tier 1 is metadata (name, triggers, dependencies). Tier 2 is schema (input/output types, constraints). Tier 3 is full content (handler logic, examples). Each tier loads only when needed based on task complexity.

How do I implement progressive disclosure for Claude Code?
Split your CLAUDE.md into a router file (~500 lines) and reference files (loaded on demand). Use skill activation scores to determine which tier to load. Start with metadata, escalate to full content only for complex tasks.
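The "activation score" idea in that answer can be sketched as counting trigger hits and mapping the count to a tier. The cutoffs and trigger list below are illustrative assumptions, not values from the guide:

```javascript
// Hypothetical activation scoring: how many of a skill's triggers
// appear in the query decides which tier to load. Cutoffs are assumed.
function tierForSkill(query, triggers) {
  const q = query.toLowerCase();
  const hits = triggers.filter((t) => q.includes(t)).length;
  if (hits === 0) return 1; // metadata only: the skill stays dormant
  if (hits === 1) return 2; // schema: the skill is in play
  return 3;                 // full content: the task is squarely in-domain
}
```

Escalation stays one-directional: a session starts every skill at Tier 1 and only climbs when the query demands it.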


Real Token Savings: What the Numbers Actually Look Like

Theory is one thing. Here's what three weeks of session data showed in practice.

I tracked token usage across 47 sessions--ranging from quick bug fixes to multi-hour feature implementations. Before implementing progressive disclosure, every session loaded my full CLAUDE.md: 2,400 lines, approximately 8,200 tokens consumed before I'd typed a single character.

After the 3-tier system:

| Session type | Old usage | New usage | Savings |
|---------------------------|-----------|-----------|---------|
| Quick fix (under 20 min) | 8,200 | 320 | 96% |
| Standard feature | 8,200 | 1,840 | 78% |
| Complex architecture | 8,200 | 4,100 | 50% |
| Deep debugging session | 8,200 | 6,800 | 17% |

The quick fixes were the biggest surprise. I was burning 8,200 tokens to answer "what's the right Svelte 5 syntax for this component?"--when the answer needed maybe 200 tokens of context to be accurate.

The 60% average savings held across all session types. Complex architecture sessions saved less because they genuinely need the full skill content. But those sessions are maybe 15% of my total Claude usage. The other 85% are standard features and quick fixes where I was wildly overloading context.

One concrete data point: in the first month after switching (April to May), my Claude API bill dropped from $51 to $19. Same work output. Same code quality. Just the difference between loading everything and loading only what's needed.

The counterintuitive part: output quality stayed the same or improved. On the quick fixes especially, removing irrelevant context meant Claude wasn't filtering through 8,000 tokens of routing rules and schema definitions to answer a simple syntax question. The focused context got more focused answers.

The $32 monthly savings was a side effect. The main effect was fewer mid-session context limits hitting right when I needed them least. That quality-of-life improvement mattered more than the cost reduction.

I thought the solution to context limits was bigger context windows. Well, it's more like... the solution was loading less, more intentionally.

Maybe the goal isn't maximum context. Maybe it's minimum necessary context--and systems that know the difference. For retrieval-based approaches to the same problem, the RAG paper by Lewis et al. pioneered how on-demand retrieval can substitute for loading all knowledge upfront.


Related Reading

This is part of the Complete Claude Code Guide.
