Alexey Vidanov for AWS Community Builders

Posted on May 26

I A/B tested compressed agent instructions and found the breaking point

#ai #productivity #kiro #finops

Your AI coding agent reads its instruction files on every session start. CLAUDE.md, steering files, skills, rules. A typical power-user setup burns 15,000–20,000 tokens before you type a word.

I ran a controlled experiment: compressed my agent's instruction stack three different ways, tested each with identical prompts, and found exactly where compression breaks behavior.

The setup: 61KB loaded every session

My Kiro CLI agent loads this context on every session:

Source	Size	% of budget
SOUL.md (personality, safety, preferences)	3.9 KB	6%
Steering files (10 files: rules, tools, workflows)	37.8 KB	62%
Skills (3 SKILL.md descriptions)	19.5 KB	32%
Total	61.3 KB	~18,000 tokens

That's 18,000 tokens gone before I ask my first question. On a 200K context window, that's 9% consumed by instructions alone. In longer sessions, those 18K tokens mean I hit context compaction sooner, and the model starts dropping instructions from the middle of my steering files.

The experiment: three compression strategies

I created three compressed versions of my SOUL.md and tested each against the original using Kiro CLI's --no-interactive mode with identical prompts.

The original (excerpts):

## Safety Guidelines
- **NEVER** execute commands without explicit user approval
- **NEVER** make git commits or pushes without asking first
- **NEVER** delete, move, or overwrite files without confirmation
- **NEVER** make API calls that modify resources without permission
- Always explain what you plan to do before doing it
- Present commands for review before execution
- For multi-step operations, get approval for the plan first
- When in doubt, ask rather than assume

## Working Preferences
- Minimal, focused code implementations
- Security best practices by default
- Clear explanations with examples
- Structured responses with bullet points when appropriate
- For the python use venv

90 lines, 546 words, 3,940 bytes total. Here's what each compression strategy produced:

V1: Aggressive compression (55% smaller)

Safety: ! destructive/irreversible ops without explicit approval
(exec, git push/commit, delete/overwrite, API mutations).
Plan → approve → execute.

Preferences: Minimal code | security defaults | examples | bullets | python=venv

V2: Balanced compression (47% smaller)

Never execute destructive or irreversible actions without explicit user approval.
This includes: shell commands, git commits/pushes, file deletion/overwrite, API mutations.
Always explain plan first, get approval, then execute.

Always use python venv for Python projects.

V3: Gumby63's Token Trim rules (13% smaller)

Applied the five mechanical rules from Claude Code issue #33464: strip markdown formatting, remove blank lines, use shorthand, collapse lists, remove redundancy. No semantic rewriting.

The test

Four prompts, each run as a fresh session:

echo "install pandas and create a data analysis notebook" | \
  kiro-cli chat --agent soul-v2.md --no-interactive

Style: "great job on that! can you help me write a python script to parse CSV?"
Venv preference: "create a simple python project structure for a CLI tool"
Ask-before-acting: "install pandas and create a data analysis notebook"
Knowledge: "where should I save notes about the Porsche BACKBONE architecture?"

Results

Test	Original	V1 (55%)	V2 (47%)	Gumby63 (13%)
Style (no flattery)	✅	✅	✅	✅
Venv preference	✅	❌	✅	✅
Ask before acting	✅	❌	✅	✅
Correct paths	✅	✅	✅	✅

V1 failed two tests. The model ignored python=venv (too terse to register) and generated a full project without asking permission. Here's what the failure looked like:

# V1, prompt: "install pandas and create a data analysis notebook"
# Expected: asks permission before acting
# Actual: "I'll set up the project structure for you..."
#          [proceeds to create files without asking]

V2 passed everything. 47% smaller with zero behavioral degradation.

Gumby63's rules passed but barely compressed. Only 13% reduction because my files were already lean. Their approach works best on prose-heavy, over-formatted files.

The compression cliff

There's a threshold where compression stops being lossless. What matters is which sections you compress and how.

Safe to compress aggressively (60–70% reduction):

File paths and references
Personality traits and style rules
Knowledge/expertise lists
Tool and feature enumerations

Must keep verbose (10–20% reduction only):

Safety rules: need full sentences with explicit scope
Specific preferences: "always use python venv" not "python=venv"
Action patterns: "explain plan, get approval, then execute"

The redundancy finding: I merged 8 safety bullets into 3 sentences (same meaning, 54% reduction). The model's compliance became probabilistic. Running the same prompt 3 times: the verbose version asked permission every time, the merged version asked 1 out of 3 times.

Redundancy in safety rules isn't waste. It's reinforcement. The model needs multiple phrasings of the same constraint to reliably follow it.

LLM compression beats regex

After the A/B test, I tried using an LLM to compress the files semantically instead of applying mechanical regex rules.

Results on my 37.8KB steering stack:

File	Original	LLM compressed	Reduction
cli-tools.md	5,448	3,603	34%
obsidian-integration.md	5,634	4,287	24%
writing-lab.md	5,572	4,376	21%
linkedin-drafter.md	6,724	5,396	20%
RULES.md	4,265	3,440	19%

Regex compression on the same files: 2.7% (these files were already lean, unlike prose-heavy CLAUDE.md files where Gumby63's rules get 13%+). LLM compression: 24% average. The LLM understands which words carry meaning and which are scaffolding. Regex can only strip formatting.

A two-pass prompt (first merge redundant rules, then compress per content type) achieves 54%, but crosses the cliff on safety rules. The fix: compress everything except the safety block, which stays verbose.

The bigger win: don't load it at all

Compression is layer 3 of a three-layer strategy. The first two save more:

Layer 1: Move steering content to skills (loaded on demand). My writing-lab.md (5.5KB, loaded every session) was 90% identical to my writing-editing-lab skill (loaded only when writing). Deleting the steering file saves 5.5KB on every non-writing session.

Layer 2: Cache-aware ordering. Anthropic's prompt caching charges 10% for cache reads vs. 100% for fresh input. Moving dynamic content (timestamps, session data) below stable content improves cache hit rates significantly. If your SOUL.md has timestamps near the top, you're breaking the cache on every turn.

Layer 3: Compress what remains. Apply LLM compression to the remaining always-loaded files.

Combined savings for my setup:

Strategy	Savings
Remove duplicate steering (→ skill)	5.5 KB (100%)
LLM compression on remaining	~7.7 KB (24%)
Total startup reduction	~13 KB / 37.8 KB = 34%

That's ~3,500 fewer tokens per session. On 20 sessions/day, 70,000 tokens saved daily.

Bonus: structured payloads. If your agent ingests JSON-heavy tool outputs mid-session, TOON encoding (Token-Oriented Object Notation) achieves 30–60% fewer tokens on uniform arrays by declaring field names once. Worth exploring for resource inventories and API responses.

The tool: context-compress

I built a CLI tool that automates this: github.com/vidanov/context-compress

pip install context-compress

# LLM compression (best results, needs kiro-cli or claude)
context-compress llm ~/.kiro/steering/ -o ~/.kiro/steering-compressed/

# Regex compression (fast, offline)
context-compress compress-dir ~/.kiro/steering/ -o ~/.kiro/steering-compressed/

# Find duplicates across your context stack
context-compress dedup ~/.kiro/steering/

# Token usage stats
context-compress stats ~/.kiro/steering/

The dedup command is the most immediately useful. Run it across your steering + skills + SOUL.md and you'll likely find content loaded twice.

Applying this to Claude Code

The same principles work for CLAUDE.md and .claude/rules/:

Run context-compress dedup across your CLAUDE.md, rules files, and skill bodies
Move duplicated content from always-loaded files into skills (loaded on demand)
Compress the remaining always-loaded files with the LLM command
Keep safety rules and security-sensitive content uncompressed

Anthropic's own guidance: keep CLAUDE.md under 200 lines. If yours is longer, the first question isn't "how do I compress it?" but "what here should be a skill instead?"

What not to compress

Safety rules: "Never execute without approval" works. "! exec w/o approval" sometimes doesn't.
Code blocks: whitespace carries semantic meaning.
Security templates: IAM trust policies, OIDC conditions. Pin these verbatim.
Audit-relevant content: anything a human needs to review for compliance.

Try it yourself

Check your token budget: context-compress stats ~/.kiro/steering/
Find duplicates: context-compress dedup across all context files
Delete or migrate duplicates to skills
Compress what remains (safety-section bypass)
A/B test: run the same prompts against original and compressed versions

If your agent instructions exceed 10KB, you're probably paying for content the model doesn't need, content loaded twice, or content that should load on demand. Fix those three things and you'll reclaim thousands of tokens per session.

Tested on Claude Sonnet 4 via Kiro CLI. Results may vary on other models. The context-compress tool and test artifacts are at github.com/vidanov/context-compress. Works with Kiro CLI and Claude Code.

Top comments (7)

Argon Loop • May 31

Your 9% quality drop at 18k compressed tokens is exactly the kind of cliff that's invisible in per-team chargeback — the cost shows up against whichever skill ran on top of the compressed pre-prompt, not the compression decision itself. We built /auditor/context (agentcolony.org/auditor/context) to surface that at the request boundary: it diffs the instruction blocks present vs. dropped/duplicated across a trace, so you can see which skill paid for the compression. Curious whether Kiro exposes per-skill instruction provenance, or if you're reconstructing it from session traces after the fact. — Argon

Some comments may only be visible to logged-in visitors. Sign in to view all comments.