I Found an LLM Weakness. Fixing It Cut My Token Usage 50%.
I hit this wall while building a personalized AI assistant I named Nathaniel. The prompt grew to over 1,000 lines: behavioral frameworks, crisis protocols, specialized knowledge, and operational procedures. It looked like it was working beautifully, until I began to notice that particular instructions and directives were not being followed, even though the prompt consumed ~3,000 tokens before the conversation even started. After some digging and testing, I discovered the issue: for very long prompts, the model wasn't reliably processing content later in the document. Whether due to attention limitations, chunked file reading, or premature response generation, the effect was the same: instructions buried deep in the prompt were being missed.
The fix: put a Table of Contents at the top. When the model hits the TOC first, it has a reference map before it starts processing. It knows what sections exist and where to find them. Instead of hoping the model reaches the relevant instructions before responding, the TOC tells it exactly where to look. Loading only the relevant sections means those instructions are no longer buried deep; they're front and center. It also lets me point the model directly at a particular section by name: "Use your writing protocol to enhance the following draft."
Other practitioners handle large prompts by splitting them into multiple documents with routing logic, or by building multi-prompt orchestration layers. Those work, but they add deployment complexity. I wanted one document the model could actually navigate: a way to keep it from responding prematurely after the first chunk or from deprioritizing instructions buried later in the context.
This approach works in consumer interfaces like ChatGPT and Claude, but shows the most promise in agents and IDE-based assistants, environments where you control the system prompt and can apply context engineering practices with attached data files.
The traditional solutions all felt like compromises:
- Split into multiple documents: Dependency management and deployment complexity
- Load everything every time: Inefficient and slow
- Create a simplified version: Lose the comprehensive capability that made it valuable
The Mechanics
LLMs work within a finite window when reading and parsing files. For very long one-shot prompts, that means chunking: the model reads the first chunk, processes it, and moves on to the next. The problem: the model often hits the first chunk and tries to respond immediately, missing critical directives laid out later.
The TOC changes this dynamic. With the TOC up front, the model has a reference map before it starts processing. Instead of guessing what to use from three massive chunks (and probably getting it wrong), it can load the entire context set that actually matters for the request.
The benefits cascade:
- Smarter search: The model knows where to look before it starts looking
- Token savings: Load only relevant sections instead of everything
- Full context: Room for the complete context that matters, not truncated chunks
- Speed: Faster processing by eliminating unnecessary content
I called it TOC-Based Dynamic Loading.
The initial implementation used line numbers to identify section boundaries. That broke immediately. Any edit to the document shifted all downstream line numbers. For a living prompt document that evolves constantly, line numbers were a maintenance nightmare.
The solution: HTML comment markers as anchors. Insert <!-- #section-name --> at each section boundary. The markers never show up in rendered output or responses, but they give the model stable reference points: edit the content within a section and the anchors don't move. The routing table maps keywords to anchor names, not line numbers. The pattern became maintainable.
💡 Key insight: Anchors survive content edits. Line numbers don't. For living documents, this is everything.
After weeks of refinement, production validation confirmed consistent results:
| Metric | Result |
|---|---|
| Token reduction | 44-63% per interaction |
| Processing speed | 30-40% faster |
| Targeted loading rate | 82% of requests |
| Fallback rate | 18% (under 20% target) |
Note: Fallback rate depends on keyword mapping granularity. Tighter mappings yield lower fallback rates.
How It Works for Personalizing Agents
The pattern has four components:
1. Core Context (Always Loaded)
The foundation every request needs: identity, behavioral constraints, quick reference. Think of it as the "personality kernel" that maintains assistant consistency.
Sizing guideline: Keep core context under 25% of your total document (if including it in a one-shot doc). If it's bigger, re-evaluate what's truly "always needed."
Another pattern: put the core context (the personality kernel) directly into the agent instructions, with a pointer to an attached reference doc containing the full behavioral instructions laid out in sections.
2. Keyword Mapping Table
A table that maps user request keywords to document sections:
| Keywords | Target Section | Priority |
|----------|---------------|----------|
| crisis, emergency, urgent | Crisis Management | HIGH |
| planning, strategy, roadmap | Strategic Frameworks | HIGH |
| code, build, deploy | Development Protocols | MEDIUM |
3. Targeted Loading
Based on keyword analysis, load 1-2 relevant sections instead of the entire document. Most requests need only a fraction of a comprehensive prompt.
4. Fallback Protocol
When requests are ambiguous or span multiple areas, load the full document. This ensures no capability loss. The pattern optimizes common cases without sacrificing edge cases.
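To make components 2-4 concrete, here's a minimal Python sketch of the routing decision. The keyword map and section names are illustrative (they mirror the table above), not the exact ones from my prompt, and in practice the routing is done by the model itself reading the TOC; this just spells the logic out.

```python
# Minimal sketch of components 2-4: keyword mapping, targeted loading, fallback.
# The keywords and section names below are illustrative.

KEYWORD_MAP = {
    "Crisis Management": {"crisis", "emergency", "urgent"},
    "Strategic Frameworks": {"planning", "strategy", "roadmap"},
    "Development Protocols": {"code", "build", "deploy"},
}

FULL_DOCUMENT = "FULL_DOCUMENT"  # fallback sentinel


def route(request: str, max_sections: int = 2) -> list[str]:
    """Return the 1-2 sections to load, or fall back to the full document."""
    words = set(request.lower().split())
    matched = [name for name, keys in KEYWORD_MAP.items() if words & keys]

    # Ambiguous (no match) or sprawling (too many areas) requests load everything,
    # so the pattern never loses capability.
    if not matched or len(matched) > max_sections:
        return [FULL_DOCUMENT]
    return matched


print(route("help me deploy this build"))   # ['Development Protocols']
print(route("tell me about your weekend"))  # ['FULL_DOCUMENT']
```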
The Critical Decision: Section Boundaries
This is where most implementations fail. How do you reliably identify where sections start and end?
I tested three approaches:
| Approach | Mechanism | Maintainability | Drift Risk | Verdict |
|---|---|---|---|---|
| Line Numbers | Map keywords to line ranges | Low | High (breaks on any edit) | ❌ Avoid |
| Split Files | Separate docs + index | Medium | None | ⚠️ Situational |
| Anchor-Based | Map keywords to section anchors | High | Low (survives edits) | ✅ Winner |
Why Anchors Win
Line Numbers broke immediately. Any edit shifted all downstream line numbers. Maintenance nightmare.
Split Files worked, but added deployment complexity and made the system harder to reason about.
Anchor-Based Routing is the winner. Use HTML comment markers as anchors at section boundaries:
```
<!-- #crisis-management -->
## Crisis Management Protocol
[... 200 lines of crisis content ...]
<!-- #development -->
## Development Protocols
[... 350 lines of dev content ...]
```
Because these markers (<!-- #section-name -->) are HTML comments, they never show up in responses, yet they give the model stable anchor points that don't move when you edit the content inside a section. For living prompt documents that evolve constantly, that stability is critical.
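If you want to see (or script) how those anchors resolve to sections, here's a small Python sketch that splits a prompt document on its anchor comments. The regex and helper name are my own, purely for illustration; nothing like this is required for the pattern to work inside a prompt.

```python
import re

# Split a prompt document on <!-- #anchor-name --> markers.
# Everything before the first anchor is treated as core context.
ANCHOR_RE = re.compile(r"<!--\s*#([\w-]+)\s*-->")


def split_sections(doc: str) -> dict[str, str]:
    parts = ANCHOR_RE.split(doc)
    sections = {"core": parts[0].strip()}
    # re.split interleaves captured anchor names with the text that follows each one.
    for name, body in zip(parts[1::2], parts[2::2]):
        sections[name] = body.strip()
    return sections


doc = """You are a helpful assistant.
<!-- #crisis-management -->
## Crisis Management Protocol
Stay calm, triage, escalate.
<!-- #development -->
## Development Protocols
Write tests first."""

print(list(split_sections(doc)))  # ['core', 'crisis-management', 'development']
```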
Your routing header maps keywords to anchor names, not line numbers. Here's the simple version:
```
## TOC - Dynamic Section Loading
**Load only the section(s) matching task keywords. Fallback to full load if ambiguous.**
**Table of Contents:**
| Keywords | Anchor | Description |
|----------|--------|-------------|
| crisis, emergency | #crisis-management | Production issues |
| code, build, test | #development | Development work |
```
Anchors survive content edits within sections. For living prompt documents that you're constantly refining, this is the difference between a maintainable system and a fragile one.
A Real Example
Here's what this looks like in practice. My AI assistant prompt transformation:
Before:
- 1,089 lines loaded every interaction
- ~3,000+ tokens consumed before the conversation started
- Noticeable latency on complex requests
After:
- Core context: ~200 lines (always loaded, or at least loaded first)
- Targeted sections: 200-400 lines (based on request)
- Total per interaction: 400-600 lines (~1,200-1,800 tokens)
- Fallback rate: <20% of interactions
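For back-of-the-envelope planning, the figures in this example line up with a rough three tokens per line of prompt text. That ratio is my own estimate for this particular document; measure with your model's tokenizer for real numbers.

```python
TOKENS_PER_LINE = 3  # rough average for this prompt; verify with your tokenizer


def estimate_tokens(lines: int) -> int:
    return lines * TOKENS_PER_LINE


print(estimate_tokens(1089))                       # ~3,267 -> the "before" ~3,000+
print(estimate_tokens(400), estimate_tokens(600))  # 1200 1800 -> the "after" range
```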
Sample Routing Header (Robust Version)
For complex one-shot prompts with core context requirements:
```
## TOC - Dynamic Section Loading
### STEP 1: LOAD CORE CONTEXT (ALWAYS)
Identity & Behavioral Framework (~200 lines)
### STEP 2: ANALYZE USER REQUEST
Match keywords from the Table of Contents below.
### STEP 3: LOAD TARGETED SECTIONS
**Table of Contents:**
| Keywords | Anchor | Lines |
|----------|--------|-------|
| customer, account, support | #customer-protocols | ~250 |
| code, build, architecture | #development | ~300 |
| crisis, outage, emergency | #crisis-mgmt | ~150 |
| planning, strategy | #strategic | ~200 |
| team, mentor, feedback | #team | ~180 |
| write, document, blog | #writing | ~120 |
### STEP 4: FALLBACK
If no keywords match OR request spans 3+ areas → load full document
**Multi-section triggers**:
- "deploy to production" → #development + #crisis-mgmt
- "customer escalation" → #customer-protocols + #crisis-mgmt
```
When to Use This Pattern
✅ Apply when:
- Document > 500 lines (~4,000 tokens): Smaller docs don't need optimization
- 3+ distinct, independently-useful sections: Content must be modular
- Typical task uses < 50% of content: If most content is needed anyway, skip this
- Sections have clear boundaries: Can be anchored reliably
❌ Skip when:
- Sections are highly interdependent (need most content anyway)
- Content is consumed sequentially (not random access)
- Document changes frequently (keyword mapping overhead)
Use Cases Beyond AI Assistants
The efficiency gains matter, but what excites me most is what this pattern enables:
Portable Personal Context: Carry your AI assistant's personality, preferences, and learned patterns across different models and platforms. A well-structured prompt with dynamic loading can move from Claude to GPT to Gemini while maintaining consistent behavior.
Product Requirements Documents (PRDs): Feature specs with distinct sections (overview, requirements, constraints, acceptance criteria).
Technical Specifications: Architecture docs, API specs, system design documents with modular sections.
Behavioral Systems: AI agent instructions, persona definitions, role-specific operational knowledge.
Reference Documentation: Large reference materials, training content, compliance procedures accessed by topic.
Try It Yourself
To get started with large prompts:
- Audit your prompt: Identify distinct, independently-useful sections
- Define core context: What's needed for ANY request? Keep it under 25%
- Add anchors: HTML comments at section boundaries
- Build the routing table: Map user keywords to anchors
- Add fallback logic: Ensure no capability loss
Start simple. Refine keyword mappings as you observe how users interact with your assistant.
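One habit that helps while refining: a consistency check you can run whenever the document changes. The sketch below (my own helper, not part of any tool or framework) confirms that every anchor referenced in the routing table exists in the document and that the core context stays under the 25% guideline.

```python
import re

ANCHOR_RE = re.compile(r"<!--\s*#([\w-]+)\s*-->")
# Anchor column of routing table rows, e.g. "| crisis, emergency | #crisis-mgmt | ... |"
ROUTE_RE = re.compile(r"\|[^|\n]+\|\s*#([\w-]+)\s*\|")


def check_document(doc: str, core_budget: float = 0.25) -> list[str]:
    problems = []
    anchors = set(ANCHOR_RE.findall(doc))
    routed = set(ROUTE_RE.findall(doc))

    # Every anchor the routing table points at must exist in the document.
    for name in sorted(routed - anchors):
        problems.append(f"routing table points to missing anchor: #{name}")

    # Core context = everything before the first anchor; keep it under the budget.
    lines = doc.splitlines()
    first = next((i for i, line in enumerate(lines) if ANCHOR_RE.search(line)), len(lines))
    if lines and first / len(lines) > core_budget:
        problems.append(f"core context is {first}/{len(lines)} lines (> {core_budget:.0%})")

    return problems
```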
Let's Discuss
I'm curious what patterns others use for prompt optimization:
- Are you splitting documents?
- Using different routing mechanisms?
- Running into similar challenges?
Drop a comment. I'd love to hear what's working (or not working) for you.
Why I'm Sharing This
I spent weeks refining this pattern through trial and error. After validating it, I wanted to give back to the prompt engineering community. Context engineering is becoming a critical skill as agents mature and prompts grow more sophisticated. Patterns like this will matter more as we push models to handle complex, persistent behaviors.
This article covers the pattern itself. The full case study on building Nathaniel, my AI assistant with 1,000+ lines of behavioral protocols, learning systems, and adaptive personality, is coming soon. If you're interested in the complete architecture and lessons learned, follow along.
This full pattern documentation is available as a GitHub Gist if you want to fork and adapt it.