Your AI agent just got its 53rd skill installed. Image generation, video creation, social media posting, calendar management — the works.
There's just one problem: every single request now carries 25KB of skill descriptions in the system prompt, whether the user needs them or not. That's ~6,200 tokens of overhead before a single word of actual conversation.
This post walks through how we found this problem, the four approaches we tried (and why three of them failed), and the architecture we landed on.
The Problem: More Skills = Worse Performance
We run an AI agent platform where users install "skills" — essentially instruction modules that tell the agent how to use specific tools. Think of them like plugins, but implemented as structured markdown files that get injected into the system prompt.
The mechanism is simple:
Install skill → SKILL.md stored locally
→ name + description injected into every request's system prefix
→ Agent sees full skill list → matches → reads SKILL.md → executes
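The injection step above can be sketched roughly as follows. This is a hypothetical reconstruction, not our actual code: the directory layout, the frontmatter format, and the function names are all illustrative.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface SkillMeta {
  name: string;
  description: string;
}

// Pull the `name:` and `description:` lines out of a SKILL.md
// (assumes simple key: value frontmatter; illustrative only).
function parseSkillMeta(skillMd: string): SkillMeta {
  const name = /name:\s*(.+)/.exec(skillMd)?.[1]?.trim() ?? "unknown";
  const description = /description:\s*(.+)/.exec(skillMd)?.[1]?.trim() ?? "";
  return { name, description };
}

// Build the skills section of the system prefix from every installed skill.
// Every request pays for this entire block, whether the skills are used or not.
function buildSkillsPrefix(skillsDir: string): string {
  const entries = fs
    .readdirSync(skillsDir)
    .map((dir) =>
      parseSkillMeta(fs.readFileSync(path.join(skillsDir, dir, "SKILL.md"), "utf8"))
    );
  return entries.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}
```

The key point is that `buildSkillsPrefix` runs over the full install set on every request — there is no per-request filtering at this stage.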
When we audited our system prefix, here's what we found:
| Component | Size | Share |
|---|---|---|
| Tool schemas | 29.6 KB | 31.2% |
| User workspace files | 30.8 KB | 32.5% |
| Skills list | 24.9 KB | 26.2% |
| Framework instructions | 9.5 KB | 10.0% |
| Total | 94.8 KB | — |
The skills list was eating over a quarter of our context budget. And usage data showed 45% of installed skills had never been triggered — they were just dead weight on every request.
At 53 skills this was annoying but survivable. At 500? The system would collapse.
The Core Tension
The business needs breadth — the more skills available, the more capable the agent. But the runtime needs precision — each request should only carry what's relevant.
We also had a hard constraint: LLM prefix caching. The cache matches tokens from the start of the sequence. If you change anything in the system prefix, everything after that point becomes a cache miss. Skills sit near the front of the prefix, before all conversation history. Touching them means rewriting the cache for the entire conversation — exactly the opposite of what we want.
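A toy cost model makes the asymmetry concrete. The rates below are hypothetical (cached tokens are typically billed at a steep discount to uncached ones); the point is the shape of the math, not the exact numbers.

```typescript
// Illustrative per-turn cost in relative units. Rates are hypothetical:
// cached prefix tokens are assumed 10x cheaper than uncached ones.
function turnCost(
  prefixTokens: number,
  historyTokens: number,
  prefixStable: boolean,
  uncachedPerTok = 1.0,
  cachedPerTok = 0.1
): number {
  // A stable prefix keeps prefix AND history in cache; touching the
  // prefix invalidates everything after it, so both are billed uncached.
  const rate = prefixStable ? cachedPerTok : uncachedPerTok;
  return (prefixTokens + historyTokens) * rate;
}

// A larger stable prefix beats a smaller prefix that changes every turn.
const stable = turnCost(23000, 20000, true);    // ~4,300 units
const unstable = turnCost(18000, 20000, false); // ~38,000 units
```

Under these assumptions, shaving 5K tokens off the prefix while breaking cache stability makes each turn roughly 9x more expensive.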
Approach 1: Two-Layer Architecture (Pinned + Dynamic Discovery)
Idea: Split skills into a "pinned" tier (10-15 high-frequency ones, always injected) and an "ecosystem" tier (hundreds, discovered on demand). Add a new built-in tool for skill discovery.
Why it failed: This required modifying the agent framework's source code — its configuration format, adding a new built-in tool, changing the prompt assembly pipeline.
The framework we use ships updates almost every other day. Maintaining a fork against that velocity is a long-term tax: even with zero feature work on our end, we'd be constantly rebasing against upstream changes.
Decision: No approach that requires forking or modifying the core framework.
Approach 2: Use a Skill to Manage Skills
Idea: Completely non-invasive. Move low-frequency skills out of the scan directory (so they're not injected), and create a "skill-router" skill that searches through archived skills when needed.
High-frequency skills → standard directory (injected)
Low-frequency skills → archive directory (not injected)
skill-router → searches archive via grep when agent can't handle a request
This was elegant — zero code changes, just filesystem operations plus one regular skill.
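The search step the router skill instructed the agent to perform can be sketched as a keyword match over archived descriptions. This in-memory version is an illustrative stand-in — the real router had the agent grep SKILL.md files on disk.

```typescript
interface ArchivedSkill {
  name: string;
  description: string;
}

// Naive keyword search over archived skill descriptions: a skill matches
// if any word of the query appears in its description (case-insensitive).
function searchArchivedSkills(
  archive: ArchivedSkill[],
  query: string
): ArchivedSkill[] {
  const keywords = query.toLowerCase().split(/\s+/);
  return archive.filter((s) =>
    keywords.some((k) => s.description.toLowerCase().includes(k))
  );
}
```

The search itself is trivial and reliable; the unreliable part, as we found out, was getting the agent to decide to run it.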
Why it failed: We tracked trigger reliability across our production data and found:
- Skill trigger rate based on description matching alone: < 30%
- With cross-references from other skills: 70-80%
- Even our best-documented knowledge-base skill (strong description + referenced by multiple other skills) was missed ~25% of the time
The root cause: agents are probabilistic. Building a critical path on "the agent realizes it needs to search for help" has a reliability ceiling that's too low for production.
Decision: Critical routing can't depend on the agent's probabilistic judgment.
Approach 3: Dynamic Injection via Plugin Hook
Idea: Use the framework's plugin system (specifically a context assembly hook) to dynamically choose which skills to inject based on the user's message. Instead of a static skill list, compute the relevant subset each time.
This felt right — deterministic code picks the skills, not the agent's judgment.
Why it failed: Remember the cache constraint? The skills list sits in the system prefix, before all conversation history. Dynamically changing it means the prefix is different on every request, which cascades into a full cache miss for all historical messages.
We ran the numbers: saving 24.9 KB of skill space but causing 50-100 KB of cache rewrites on every turn. Net negative.
Decision: The system prefix must remain 100% stable. No dynamic modifications to anything before the conversation history.
Approach 4: Append to End (The Solution) ✅
The breakthrough was reframing the problem. Instead of replacing part of the prefix, we append to the end of the message sequence — after all conversation history.
[Fixed prefix: tools + pinned skills + user files + instructions] → NEVER CHANGES (cache hit)
[Conversation history] → cache hit
[Additional Skills: dynamically matched this turn] → small new addition
Here's why this works:
- Prefix stays 100% stable — full cache hit on every turn
- Dynamic content is append-only — minimal cache write cost
- Deterministic matching — code picks the skills, not the agent
- Scales indefinitely — ecosystem can have thousands of skills, but each request only carries 2-3 relevant ones
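The assembly step can be sketched like this. Message shape and function names are illustrative; the invariant is that the prefix and history objects pass through byte-for-byte untouched, and matched skills only ever land in a final appended message.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface MatchedSkill {
  name: string;
  description: string;
  location: string;
}

// Append-only assembly: prefix and history are never modified, so the
// cache stays warm; matched skills ride in one extra trailing message.
function assembleRequest(
  fixedPrefix: Message,   // never changes -> full cache hit
  history: Message[],     // unchanged -> cache hit
  matched: MatchedSkill[] // small append-only addition this turn
): Message[] {
  if (matched.length === 0) return [fixedPrefix, ...history];
  const block =
    "Here are some additional skills that may be relevant to this request:\n" +
    matched
      .map((s) => `- ${s.name}: ${s.description} (see ${s.location})`)
      .join("\n");
  return [fixedPrefix, ...history, { role: "system", content: block }];
}
```

Note the early return: on turns with no matches, the request is identical to what the framework would have sent anyway.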
The Matching Layer
We use embedding similarity to match the user's message against pre-computed skill description vectors:
```javascript
// In the assembly hook
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: userMessage.text,
});
const queryVector = response.data[0].embedding;
const matchedSkills = cosineSimilaritySearch(queryVector, skillIndex, topK);
```
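The `cosineSimilaritySearch` helper isn't shown above; a plausible implementation is a brute-force scan, which is fine at this scale (a few hundred vectors per request).

```typescript
interface SkillEntry {
  name: string;
  location: string;
  embedding: number[];
}

// Brute-force top-K cosine similarity over the pre-computed skill index.
// At ~500 entries this is fast enough that no ANN index is needed.
function cosineSimilaritySearch(
  query: number[],
  index: SkillEntry[],
  topK: number
): SkillEntry[] {
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  const qNorm = norm(query);
  return index
    .map((entry) => {
      const dot = entry.embedding.reduce((s, x, i) => s + x * query[i], 0);
      return { entry, score: dot / (qNorm * norm(entry.embedding) || 1) };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.entry);
}
```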
The skill index is pre-computed at install time:
```json
[
  {
    "name": "xhs-note-creator",
    "description": "Create Xiaohongshu note content...",
    "location": "~/.agent/skills-archive/xhs-note-creator/SKILL.md",
    "embedding": [0.012, -0.034, 0.056, ...]
  }
]
```
- Index size: 500 skills × 1536-dim float32 ≈ 3 MB (totally manageable)
- Matching latency: ~100-200ms per request
- Cost: ~$0.02 per million tokens (negligible)
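The install-time indexing step is straightforward: embed each description once and persist the result. In this sketch, `embedText` is a synchronous stand-in for the (normally async) embeddings API call.

```typescript
interface IndexedSkill {
  name: string;
  description: string;
  location: string;
  embedding: number[];
}

// Install-time indexing: one embedding call per skill, amortized across
// every future request. `embedText` stands in for the embeddings API.
function buildSkillIndex(
  skills: { name: string; description: string; location: string }[],
  embedText: (text: string) => number[]
): IndexedSkill[] {
  return skills.map((s) => ({ ...s, embedding: embedText(s.description) }));
}
```

Because the index only changes on install/uninstall, it never touches the request path — matching reads it, nothing per-request writes it.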
Why Append Works for the Agent
If you've done RAG, this pattern is familiar. The agent sees:
"Here are some additional skills that may be relevant to this request: [skill descriptions]"
It then reads the corresponding SKILL.md files and executes normally. From the agent's perspective, it's just extra context — no behavioral changes needed.
What We Got Wrong Along the Way
A few pitfalls worth noting:
We overestimated agent self-awareness. We assumed the agent would reliably recognize "I don't know how to do this, let me search for a skill." In practice, it either hallucinated an answer or just apologized — searching was the last resort, not the first.
We underestimated cache sensitivity. Our initial mental model was "save tokens in the prefix → save money." But prefix caching means the stability of the prefix matters more than its size. A 90 KB stable prefix is cheaper than a 70 KB prefix that changes every turn.
We almost built a fork. The two-layer architecture was technically clean, but maintaining a fork of a fast-moving open source project is a long-term tax that compounds. Using the official plugin system — even if it's less flexible — was the right call.
Rollout Plan
We're being deliberate about timing:
- Now (< 60 skills): No changes needed. The overhead is acceptable and we're collecting usage data.
- 100+ skills: Deploy the routing extension. Move low-frequency skills to archive. Validate matching accuracy.
- 500+ skills: Automate index management. Add user-profile-based pinning. Connect to the skill registry for remote discovery.
Key Takeaways
Injection cost is the hidden tax of plugin ecosystems. Every plugin/skill/tool added to an AI agent's context has a per-request cost, even when unused.
Cache-friendliness is a first-class architectural constraint. For LLM-based systems, prefix stability matters more than prefix size.
Don't build critical paths on probabilistic behavior. If your system relies on the agent "deciding" to do the right thing, measure the actual trigger rate before shipping.
Append > Replace for dynamic context. When you need to add context without breaking caches, treat it like RAG — add to the end, not the middle.
Resist the fork. Plugin/extension systems exist for a reason. The flexibility tax of a fork almost always exceeds the flexibility gain.
This architecture now powers part of how we think about skill scaling at www.imaclaw.ai, where we build AI creative agents with 50+ multimodal skills. The pattern should generalize to any LLM agent system dealing with growing plugin ecosystems.