Chudi Nnorukam

Posted on • Edited on • Originally published at chudi.dev

How I Cut AI Token Usage by 60% With Progressive Context Loading


I was burning through $50/month in Claude API costs before I realized the problem. Every session loaded the same 8,000 tokens of context--rules, skills, examples--whether I needed them or not.

Progressive disclosure in AI context management means loading information in tiers: metadata first for routing, schemas on demand for understanding, and full content only when actively using a feature. The result? 60% fewer tokens consumed with better output quality, because the AI isn't parsing through irrelevant context to find what matters. Claude's context window documentation outlines the token limits that make this kind of optimization essential for longer sessions.

Why Was I Wasting So Many Tokens?

My original CLAUDE.md file was 2,400 lines. Every skill definition. Every constraint. Every example. It loaded completely on every single session.

That specific dread of seeing "context limit reached" mid-task--when you're 80% done and suddenly the AI forgets everything--became a weekly occurrence.

I thought more context was always better. If the AI knows everything, it can handle anything. Right? Anthropic's documentation actually recommends against bloated system prompts for exactly this reason--focused context produces more reliable outputs.

Well, it's more like... drowning in information isn't the same as understanding what matters.

How Does the 3-Tier System Work?

Progressive disclosure splits context into three tiers, each loading only when needed.

Tier 1: Metadata (~200 tokens)

The routing layer. Just enough to know what skills exist and when to activate them.

```markdown
## Skill: sveltekit_architect
**Triggers:** routes, layouts, prerender
**Dependencies:** None
**Priority:** High for SvelteKit projects
```

This is all that loads initially. Skill names, trigger patterns, quick reference. The AI knows the skill exists without knowing everything about it.

Tier 2: Schema (~400 tokens)

The contract layer. Input/output types, constraints, quality gates.

```markdown
### Input Schema
- operation: "analyze_routes" | "create_layout" | "optimize_prerender"
- targetPath?: string

### Output Schema
- success: boolean
- summary: string
- nextActions: string[]

### Constraints
- Svelte 5 runes only
- Prerender for static pages
- Bundle under 300KB
```

This loads when the skill activates--when you're actually working with routes. Not before.

Tier 3: Full Content (~1200 tokens)

The implementation layer. Complete handler logic, examples, edge cases.

```markdown
### Handler: analyze_routes
1. Glob all `src/routes/**/+page.svelte`
2. Check each for data loader presence
3. Verify prerender config matches content type
4. Return recommendations

### Example Output
{
  "success": true,
  "summary": "Found 12 routes, 3 missing data loaders",
  "nextActions": ["Add +page.server.ts to /blog/[slug]"]
}
```

This only loads when you're deep into a complex routing task. Most sessions never need it.

What Are the Actual Token Savings?

Here's the meta-orchestration skill as a real example:

| Tier | Lines | Tokens | When loaded |
|--------|-------|--------|---------------------|
| Tier 1 | 278 | ~200 | Every session |
| Tier 2 | 816 | ~600 | On skill activation |
| Tier 3 | 3,302 | ~2,400 | Complex tasks only |

Without progressive disclosure: 2,400 tokens loaded every session.
With progressive disclosure: 200 tokens for most sessions, 600 for active skill use.

Savings: 60-92% depending on task complexity.

I load less to get more. The paradox makes sense once you see it in practice.

How Does Smart Mode Auto-Detect Verbosity?

The system doesn't just have three tiers--it layers four verbosity levels on top that auto-adjust per query.

Smart mode analyzes your query and picks the level: simple questions get minimal context, architecture questions get everything.
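A minimal sketch of what that query-driven selection could look like. The level names, keyword lists, and thresholds here are my own illustrative assumptions, not the system's actual heuristics:

```javascript
// Hypothetical sketch of smart-mode verbosity detection. Level names,
// keyword lists, and score thresholds are illustrative assumptions.
const LEVELS = ["minimal", "standard", "detailed", "full"];

function detectVerbosity(query) {
  const q = query.toLowerCase();
  let score = 1; // default: "standard"
  // Architecture and design language pushes toward fuller context.
  if (/\b(architecture|refactor|design|migrate)\b/.test(q)) score += 2;
  // Cross-cutting scope ("all routes", "entire app") needs more context.
  if (/\b(all|every|across|entire)\b/.test(q)) score += 1;
  // Short questions stay minimal.
  if (q.trim().split(/\s+/).length <= 10) score -= 1;
  return LEVELS[Math.max(0, Math.min(3, score))];
}
```

The point is that the decision is cheap--a few string checks--while the thing it gates (loading thousands of tokens of skill content) is expensive.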

What Is the Lazy Module Loader Pattern?

Beyond skill tiers, the system uses lazy loading for expensive modules:

```javascript
// Instead of: eagerly importing everything, every session
import { allSkills } from './skills'; // 8,000 tokens

// Use: a dynamic import that resolves only when the skill is referenced
const getSkill = (name) => import(`./skills/${name}.md`);
```

Features

  • Dynamic imports: Load modules only when referenced
  • 10-minute TTL: Cache loaded modules, expire unused ones
  • Deduplication: Never load the same module twice per session

This pattern works for skill files, reference docs, and example repositories. Anything that doesn't need to exist in context until explicitly requested.
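The three features above can be sketched together as a small cache wrapper. The 10-minute TTL comes from the feature list; `loadFn` stands in for the real dynamic import, and the API shape is an assumption:

```javascript
// Hypothetical sketch of the lazy loader: a TTL cache keyed by module name.
// loadFn stands in for the real dynamic import; the interface is assumed.
function createLazyLoader(loadFn, ttlMs = 10 * 60 * 1000, now = Date.now) {
  const cache = new Map(); // module name -> { value, loadedAt }
  let loads = 0;

  return {
    get(name) {
      const hit = cache.get(name);
      // Deduplication: a fresh cache entry means no second load this session.
      if (hit && now() - hit.loadedAt < ttlMs) return hit.value;
      const value = loadFn(name); // the expensive load happens only on miss/expiry
      loads += 1;
      cache.set(name, { value, loadedAt: now() });
      return value;
    },
    loadCount: () => loads, // exposed for inspection
  };
}
```

Injecting `now` as a parameter is just to make the expiry behavior testable; in practice `Date.now` is the default.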

How Do Skill Bundles Reduce Redundant Loading?

Related skills often load together. Instead of loading them individually:

```json
{
  "frontend-bundle": {
    "skills": ["react-patterns", "tailwind-stylist", "component-testing"],
    "tokens": 4500,
    "triggers": ["*.tsx", "*.css", "component"]
  }
}
```

When you're working on frontend code, the bundle activates as a unit. No loading three separate skills with overlapping context--one bundle with deduplicated content.

Current Bundles

  • frontend-bundle: React, UI/UX, web standards (4,500 tokens)
  • backend-bundle: API, database, patterns (4,200 tokens)
  • debugging-bundle: Error resolution, testing (2,500 tokens)
  • workflow-bundle: Git, CI/CD, deployment (3,200 tokens)

Each bundle is optimized to eliminate redundancy between related skills.
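Bundle activation itself can be as simple as matching triggers against the file or query at hand. The bundle names and triggers below mirror the examples above; the matching logic is my own assumption:

```javascript
// Hypothetical sketch of bundle activation by trigger matching.
// Extension globs ("*.tsx") match file suffixes; plain triggers are keywords.
const BUNDLES = {
  "frontend-bundle": { triggers: ["*.tsx", "*.css", "component"] },
  "backend-bundle": { triggers: ["*.sql", "api", "database"] },
};

function matchBundle(context) {
  for (const [name, { triggers }] of Object.entries(BUNDLES)) {
    for (const t of triggers) {
      const hit = t.startsWith("*.")
        ? context.endsWith(t.slice(1)) // "*.tsx" -> endsWith(".tsx")
        : context.includes(t);         // plain keyword trigger
      if (hit) return name;            // the whole bundle loads as one unit
    }
  }
  return null; // no bundle needed: stay at Tier 1 metadata
}
```

One match loads one deduplicated bundle, instead of three skills with overlapping content.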

What's the Token Budget System?

The context governor enforces hard limits:

  • 75% max budget: Never exceed this regardless of task
  • 60% warning threshold: Start aggressive tier reduction
  • 20% reserve: Always keep space for AI responses

When approaching limits:

  1. Phase unloading: Release completed phase context
  2. Tier reduction: Drop to lower tiers for inactive skills
  3. Auto-compact: Summarize and compress old context
  4. Graceful degradation: Warn before hitting hard limits

That anxiety of context overflow--the system makes it manageable by budgeting proactively.
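A sketch of how those thresholds might translate into code. Only the 75%/60% cutoffs come from the list above; the action names are my own shorthand:

```javascript
// Hypothetical budget check: map context usage to a mitigation action.
// Thresholds (75% hard cap, 60% warning) come from the governor's limits.
function budgetAction(usedTokens, windowTokens) {
  const ratio = usedTokens / windowTokens;
  if (ratio >= 0.75) return "unload"; // hard cap: release completed-phase context now
  if (ratio >= 0.6) return "reduce";  // warning: drop inactive skills to lower tiers
  return "ok";                        // within budget; 20%+ stays reserved for responses
}
```

Checking this before every expensive load is what makes the degradation graceful rather than a surprise.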

FAQ: Token Optimization for AI Tools

What is progressive disclosure in AI context management?
Progressive disclosure loads AI context in tiers: metadata first (~200 tokens), schemas on demand (~400 tokens), full content only when needed (~1200 tokens). This prevents loading thousands of unused tokens upfront.

How much can progressive disclosure save on AI costs?
In practice, progressive disclosure saves 40-60% of tokens per session. A skill whose full content spans 3,302 lines (~2,400 tokens) loads only its 278-line, ~200-token Tier 1 metadata--unless you actually need the deeper content.

Does loading less context hurt AI performance?
Counter-intuitively, no. Focused context with relevant information outperforms bloated context with everything. The AI processes fewer tokens to find what matters, leading to more accurate responses.

What are the three tiers in progressive disclosure?
Tier 1 is metadata (name, triggers, dependencies). Tier 2 is schema (input/output types, constraints). Tier 3 is full content (handler logic, examples). Each tier loads only when needed based on task complexity.

How do I implement progressive disclosure for Claude Code?
Split your CLAUDE.md into a router file (~500 lines) and reference files (loaded on demand). Use skill activation scores to determine which tier to load. Start with metadata, escalate to full content only for complex tasks.
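The "activation score" idea in that answer can be sketched as counting trigger hits and mapping the count to a tier. The cutoffs and trigger list below are illustrative assumptions, not values from the guide:

```javascript
// Hypothetical activation scoring: how many of a skill's triggers
// appear in the query decides which tier to load. Cutoffs are assumed.
function tierForSkill(query, triggers) {
  const q = query.toLowerCase();
  const hits = triggers.filter((t) => q.includes(t)).length;
  if (hits === 0) return 1; // metadata only: the skill stays dormant
  if (hits === 1) return 2; // schema: the skill is in play
  return 3;                 // full content: the task is squarely in-domain
}
```

Escalation stays one-directional: a session starts every skill at Tier 1 and only climbs when the query demands it.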


Real Token Savings: What the Numbers Actually Look Like

Theory is one thing. Here's what three weeks of session data showed in practice.

I tracked token usage across 47 sessions--ranging from quick bug fixes to multi-hour feature implementations. Before implementing progressive disclosure, every session loaded my full CLAUDE.md: 2,400 lines, approximately 8,200 tokens consumed before I'd typed a single character.

After the 3-tier system:

| Session type | Old usage | New usage | Savings |
|---------------------------|-----------|-----------|---------|
| Quick fix (under 20 min) | 8,200 | 320 | 96% |
| Standard feature | 8,200 | 1,840 | 78% |
| Complex architecture | 8,200 | 4,100 | 50% |
| Deep debugging session | 8,200 | 6,800 | 17% |

The quick fixes were the biggest surprise. I was burning 8,200 tokens to answer "what's the right Svelte 5 syntax for this component?"--when the answer needed maybe 200 tokens of context to be accurate.

The 60% average savings held across all session types. Complex architecture sessions saved less because they genuinely need the full skill content. But those sessions are maybe 15% of my total Claude usage. The other 85% are standard features and quick fixes where I was wildly overloading context.

One concrete data point: in the first month after switching (April to May), my Claude API bill dropped from $51 to $19. Same work output. Same code quality. Just the difference between loading everything and loading only what's needed.

The counterintuitive part: output quality stayed the same or improved. On the quick fixes especially, removing irrelevant context meant Claude wasn't filtering through 8,000 tokens of routing rules and schema definitions to answer a simple syntax question. The focused context got more focused answers.

The $32 monthly savings was a side effect. The main effect was fewer mid-session context limits hitting right when I needed them least. That quality-of-life improvement mattered more than the cost reduction.

I thought the solution to context limits was bigger context windows. Well, it's more like... the solution was loading less, more intentionally.

Maybe the goal isn't maximum context. Maybe it's minimum necessary context--and systems that know the difference. For retrieval-based approaches to the same problem, the RAG paper by Lewis et al. pioneered how on-demand retrieval can substitute for loading all knowledge upfront.


Related Reading

This is part of the Complete Claude Code Guide.
