I Found an LLM Weakness. Fixing It Cut My Token Usage 50%.
I hit this wall while building a personalized AI assistant I named Nathaniel. The prompt grew to over 1,000 lines: behavioral frameworks, crisis protocols, specialized knowledge, and operational procedures. It looked like it was working beautifully, until I began to notice that particular instructions and directives were not being followed, even though the prompt consumed ~3,000 tokens before the conversation even started. After some digging and testing, I discovered the issue: for very long prompts, the model wasn't reliably processing content later in the document. Whether due to attention limitations, chunked file reading, or premature response generation, the effect was the same: instructions buried deep in the prompt were being missed.
The fix: put a Table of Contents at the top. When the model hits the TOC first, it has a reference map before it starts processing. It knows what sections exist and where to find them. Instead of hoping the model reaches the relevant instructions before responding, the TOC tells it exactly where to look. Loading only the relevant sections means those instructions are no longer buried deep; they're front and center. It also lets me point the model directly at a particular section by name: "Use your writing protocol to enhance the following draft."
Other practitioners handle large prompts by splitting them into multiple documents with routing logic, or by building multi-prompt orchestration layers. Those work, but they add deployment complexity. I wanted one document the model could actually navigate: a way to keep it from responding prematurely after the first chunk or from deprioritizing instructions buried later in the context.
This approach works in consumer interfaces like ChatGPT and Claude, but shows the most promise in agents and IDE-based assistants, environments where you control the system prompt and can apply context engineering practices with attached data files.
The traditional solutions all felt like compromises:
- Split into multiple documents: Dependency management and deployment complexity
- Load everything every time: Inefficient and slow
- Create a simplified version: Lose the comprehensive capability that made it valuable
The Mechanics
LLMs work within a finite window when reading and parsing files. For very long one-shot prompts, that means chunking: the model reads the first chunk, processes it, and moves on to the next. The problem: the model often hits the first chunk and tries to respond immediately, missing critical directives laid out later.
The TOC changes this dynamic. With the TOC up front, the model has a reference map before it starts processing. Instead of guessing what to use from three massive chunks (and probably getting it wrong), it can load the entire context set that actually matters for the request.
The benefits cascade:
- Smarter search: The model knows where to look before it starts looking
- Token savings: Load only relevant sections instead of everything
- Full context: Room for the complete context that matters, not truncated chunks
- Speed: Faster processing by eliminating unnecessary content
I called it TOC-Based Dynamic Loading.
The initial implementation used line numbers to identify section boundaries. That broke immediately. Any edit to the document shifted all downstream line numbers. For a living prompt document that evolves constantly, line numbers were a maintenance nightmare.
The solution: HTML comment markers as anchors. Insert <!-- #section-name --> at each section boundary. The markers never show up in rendered output or responses, but they give the model stable reference points: edit the content within a section and the anchors don't move. The routing table maps keywords to anchor names, not line numbers. The pattern became maintainable.
💡 Key insight: Anchors survive content edits. Line numbers don't. For living documents, this is everything.
After weeks of refinement, production validation confirmed consistent results:
| Metric | Result |
|---|---|
| Token reduction | 44-63% per interaction |
| Processing speed | 30-40% faster |
| Targeted loading rate | 82% of requests |
| Fallback rate | 18% (under 20% target) |
Note: Fallback rate depends on keyword mapping granularity. Tighter mappings yield lower fallback rates.
How It Works for Personalizing Agents
The pattern has four components:
1. Core Context (Always Loaded)
The foundation every request needs: identity, behavioral constraints, quick reference. Think of it as the "personality kernel" that maintains assistant consistency.
Sizing guideline: Keep core context under 25% of your total document (if including it in a one-shot doc). If it's bigger, re-evaluate what's truly "always needed."
Another pattern: put the core context (the personality kernel) directly into the agent instructions, with a pointer to an attached reference doc containing the full behavioral instructions laid out in sections.
2. Keyword Mapping Table
A table that maps user request keywords to document sections:
| Keywords | Target Section | Priority |
|----------|---------------|----------|
| crisis, emergency, urgent | Crisis Management | HIGH |
| planning, strategy, roadmap | Strategic Frameworks | HIGH |
| code, build, deploy | Development Protocols | MEDIUM |
3. Targeted Loading
Based on keyword analysis, load 1-2 relevant sections instead of the entire document. Most requests need only a fraction of a comprehensive prompt.
4. Fallback Protocol
When requests are ambiguous or span multiple areas, load the full document. This ensures no capability loss. The pattern optimizes common cases without sacrificing edge cases.
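To make components 2-4 concrete, here's a minimal Python sketch of the routing decision. The keyword map and section names are illustrative (they mirror the table above), not the exact ones from my prompt, and in practice the routing is done by the model itself reading the TOC; this just spells the logic out.

```python
# Minimal sketch of components 2-4: keyword mapping, targeted loading, fallback.
# The keywords and section names below are illustrative.

KEYWORD_MAP = {
    "Crisis Management": {"crisis", "emergency", "urgent"},
    "Strategic Frameworks": {"planning", "strategy", "roadmap"},
    "Development Protocols": {"code", "build", "deploy"},
}

FULL_DOCUMENT = "FULL_DOCUMENT"  # fallback sentinel


def route(request: str, max_sections: int = 2) -> list[str]:
    """Return the 1-2 sections to load, or fall back to the full document."""
    words = set(request.lower().split())
    matched = [name for name, keys in KEYWORD_MAP.items() if words & keys]

    # Ambiguous (no match) or sprawling (too many areas) requests load everything,
    # so the pattern never loses capability.
    if not matched or len(matched) > max_sections:
        return [FULL_DOCUMENT]
    return matched


print(route("help me deploy this build"))   # ['Development Protocols']
print(route("tell me about your weekend"))  # ['FULL_DOCUMENT']
```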
The Critical Decision: Section Boundaries
This is where most implementations fail. How do you reliably identify where sections start and end?
I tested three approaches:
| Approach | Mechanism | Maintainability | Drift Risk | Verdict |
|---|---|---|---|---|
| Line Numbers | Map keywords to line ranges | Low | High (breaks on any edit) | ❌ Avoid |
| Split Files | Separate docs + index | Medium | None | ⚠️ Situational |
| Anchor-Based | Map keywords to section anchors | High | Low (survives edits) | ✅ Winner |
Why Anchors Win
Line Numbers broke immediately. Any edit shifted all downstream line numbers. Maintenance nightmare.
Split Files worked, but added deployment complexity and made the system harder to reason about.
Anchor-Based Routing is the winner. Use HTML comment markers as anchors at section boundaries:
```
<!-- #crisis-management -->
## Crisis Management Protocol
[... 200 lines of crisis content ...]
<!-- #development -->
## Development Protocols
[... 350 lines of dev content ...]
```
Because these markers (<!-- #section-name -->) are HTML comments, they never show up in responses, yet they give the model stable anchor points that don't move when you edit the content inside a section. For living prompt documents that evolve constantly, that stability is critical.
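If you want to see (or script) how those anchors resolve to sections, here's a small Python sketch that splits a prompt document on its anchor comments. The regex and helper name are my own, purely for illustration; nothing like this is required for the pattern to work inside a prompt.

```python
import re

# Split a prompt document on <!-- #anchor-name --> markers.
# Everything before the first anchor is treated as core context.
ANCHOR_RE = re.compile(r"<!--\s*#([\w-]+)\s*-->")


def split_sections(doc: str) -> dict[str, str]:
    parts = ANCHOR_RE.split(doc)
    sections = {"core": parts[0].strip()}
    # re.split interleaves captured anchor names with the text that follows each one.
    for name, body in zip(parts[1::2], parts[2::2]):
        sections[name] = body.strip()
    return sections


doc = """You are a helpful assistant.
<!-- #crisis-management -->
## Crisis Management Protocol
Stay calm, triage, escalate.
<!-- #development -->
## Development Protocols
Write tests first."""

print(list(split_sections(doc)))  # ['core', 'crisis-management', 'development']
```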
Your routing header maps keywords to anchor names, not line numbers. Here's the simple version:
```
## TOC - Dynamic Section Loading
**Load only the section(s) matching task keywords. Fallback to full load if ambiguous.**
**Table of Contents:**
| Keywords | Anchor | Description |
|----------|--------|-------------|
| crisis, emergency | #crisis-management | Production issues |
| code, build, test | #development | Development work |
```
Anchors survive content edits within sections. For living prompt documents that you're constantly refining, this is the difference between a maintainable system and a fragile one.
A Real Example
Here's what this looks like in practice. My AI assistant prompt transformation:
Before:
- 1,089 lines loaded every interaction
- ~3,000+ tokens consumed before the conversation started
- Noticeable latency on complex requests
After:
- Core context: ~200 lines (always loaded, or at least loaded first)
- Targeted sections: 200-400 lines (based on request)
- Total per interaction: 400-600 lines (~1,200-1,800 tokens)
- Fallback rate: <20% of interactions
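For back-of-the-envelope planning, the figures in this example line up with a rough three tokens per line of prompt text. That ratio is my own estimate for this particular document; measure with your model's tokenizer for real numbers.

```python
TOKENS_PER_LINE = 3  # rough average for this prompt; verify with your tokenizer


def estimate_tokens(lines: int) -> int:
    return lines * TOKENS_PER_LINE


print(estimate_tokens(1089))                       # ~3,267 -> the "before" ~3,000+
print(estimate_tokens(400), estimate_tokens(600))  # 1200 1800 -> the "after" range
```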
Sample Routing Header (Robust Version)
For complex one-shot prompts with core context requirements:
```
## TOC - Dynamic Section Loading
### STEP 1: LOAD CORE CONTEXT (ALWAYS)
Identity & Behavioral Framework (~200 lines)
### STEP 2: ANALYZE USER REQUEST
Match keywords from the Table of Contents below.
### STEP 3: LOAD TARGETED SECTIONS
**Table of Contents:**
| Keywords | Anchor | Lines |
|----------|--------|-------|
| customer, account, support | #customer-protocols | ~250 |
| code, build, architecture | #development | ~300 |
| crisis, outage, emergency | #crisis-mgmt | ~150 |
| planning, strategy | #strategic | ~200 |
| team, mentor, feedback | #team | ~180 |
| write, document, blog | #writing | ~120 |
### STEP 4: FALLBACK
If no keywords match OR request spans 3+ areas → load full document
**Multi-section triggers**:
- "deploy to production" → #development + #crisis-mgmt
- "customer escalation" → #customer-protocols + #crisis-mgmt
```
When to Use This Pattern
✅ Apply when:
- Document > 500 lines (~4,000 tokens): Smaller docs don't need optimization
- 3+ distinct, independently-useful sections: Content must be modular
- Typical task uses < 50% of content: If most content is needed anyway, skip this
- Sections have clear boundaries: Can be anchored reliably
❌ Skip when:
- Sections are highly interdependent (need most content anyway)
- Content is consumed sequentially (not random access)
- Document changes frequently (keyword mapping overhead)
Use Cases Beyond AI Assistants
The efficiency gains matter, but what excites me most is what this pattern enables:
Portable Personal Context: Carry your AI assistant's personality, preferences, and learned patterns across different models and platforms. A well-structured prompt with dynamic loading can move from Claude to GPT to Gemini while maintaining consistent behavior.
Product Requirements Documents (PRDs): Feature specs with distinct sections (overview, requirements, constraints, acceptance criteria).
Technical Specifications: Architecture docs, API specs, system design documents with modular sections.
Behavioral Systems: AI agent instructions, persona definitions, role-specific operational knowledge.
Reference Documentation: Large reference materials, training content, compliance procedures accessed by topic.
Try It Yourself
To get started with large prompts:
- Audit your prompt: Identify distinct, independently-useful sections
- Define core context: What's needed for ANY request? Keep it under 25%
- Add anchors: HTML comments at section boundaries
- Build the routing table: Map user keywords to anchors
- Add fallback logic: Ensure no capability loss
Start simple. Refine keyword mappings as you observe how users interact with your assistant.
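One habit that helps while refining: a consistency check you can run whenever the document changes. The sketch below (my own helper, not part of any tool or framework) confirms that every anchor referenced in the routing table exists in the document and that the core context stays under the 25% guideline.

```python
import re

ANCHOR_RE = re.compile(r"<!--\s*#([\w-]+)\s*-->")
# Anchor column of routing table rows, e.g. "| crisis, emergency | #crisis-mgmt | ... |"
ROUTE_RE = re.compile(r"\|[^|\n]+\|\s*#([\w-]+)\s*\|")


def check_document(doc: str, core_budget: float = 0.25) -> list[str]:
    problems = []
    anchors = set(ANCHOR_RE.findall(doc))
    routed = set(ROUTE_RE.findall(doc))

    # Every anchor the routing table points at must exist in the document.
    for name in sorted(routed - anchors):
        problems.append(f"routing table points to missing anchor: #{name}")

    # Core context = everything before the first anchor; keep it under the budget.
    lines = doc.splitlines()
    first = next((i for i, line in enumerate(lines) if ANCHOR_RE.search(line)), len(lines))
    if lines and first / len(lines) > core_budget:
        problems.append(f"core context is {first}/{len(lines)} lines (> {core_budget:.0%})")

    return problems
```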
Let's Discuss
I'm curious what patterns others use for prompt optimization:
- Are you splitting documents?
- Using different routing mechanisms?
- Running into similar challenges?
Drop a comment. I'd love to hear what's working (or not working) for you.
Why I'm Sharing This
I spent weeks refining this pattern through trial and error. After validating it, I wanted to give back to the prompt engineering community. Context engineering is becoming a critical skill as agents mature and prompts grow more sophisticated. Patterns like this will matter more as we push models to handle complex, persistent behaviors.
This article covers the pattern itself. The full case study on building Nathaniel, my AI assistant with 1,000+ lines of behavioral protocols, learning systems, and adaptive personality, is coming soon. If you're interested in the complete architecture and lessons learned, follow along.
This full pattern documentation is available as a GitHub Gist if you want to fork and adapt it.