DEV Community

Xaden
My AI Agent Ate 178,000 Tokens in 30 Minutes — Here's Why (And How to Prevent It)

I was 33 minutes into an active work session with my AI agent when I checked the diagnostics:

```
📚 Context: 178.4k / 200k (87%)
```

87% of the context window. Consumed. In half an hour.

At the rate I was going — 90 tokens per second of burn — I had roughly 4 minutes before hitting the ceiling and triggering an automatic compaction that would erase most of my working context.

This isn't a theoretical problem. If you're building autonomous AI agents that use tools, spawn subagents, read files, and maintain memory — you need to understand where your tokens are going. Because they go fast.
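
For reference, the time-to-ceiling estimate is just the remaining budget divided by the burn rate:

```python
# Minutes until the context ceiling, given an observed burn rate.
def minutes_to_ceiling(used: int, window: int, burn_rate: float) -> float:
    """burn_rate is tokens consumed per second."""
    remaining = window - used
    return remaining / burn_rate / 60

# The session above: 178.4k of a 200k window used, burning 90 tokens/sec.
print(round(minutes_to_ceiling(178_400, 200_000, 90)))  # → 4
```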

The Setup

I'm running an autonomous AI agent (Claude Opus as the orchestrator) that:

  • Loads workspace files at session start (identity, memory, config)
  • Spawns subagents on local Ollama models for research and coding
  • Reads and edits files on disk
  • Manages cron jobs and background processes
  • Maintains a daily journal and long-term memory

It's essentially an AI-powered DevOps engineer that runs 24/7 on my MacBook.

The Autopsy: Where Did 178k Tokens Go?

I broke down every token source. Here's the full accounting (figures are rounded, so the categories add up to slightly more than the 178.4k total):

1. Bootstrap Context: 31,000 tokens (17%)

Every session starts by loading workspace files:

| File | Tokens | Purpose |
| --- | --- | --- |
| MEMORY.md | 8,000 | Long-term memory |
| SOUL.md | 2,000 | Agent identity/personality |
| AGENTS.md | 3,000 | Workspace guidelines |
| USER.md | 1,000 | User profile |
| HEARTBEAT.md | 2,000 | Standing orders |
| TOOLS.md | 1,500 | Hardware/config notes |
| Daily journal | 5,000 | Today's log |
| Compaction summary | 7,750 | Context from prior session |
| **Total** | **~31,000** | |

This is the floor. Before a single conversation turn, 31k tokens are already consumed. That's 15% of a 200k window gone before "hello."

2. Subagent Result Injections: 60,000 tokens (34%)

This was the #1 killer.

When a subagent completes a task, its full output gets injected back into the main session's context. I spawned several research subagents to evaluate local LLM options. Each one returned detailed comparison matrices, benchmarks, and recommendations.

One comparison subagent alone injected 30,000 tokens of analysis. Another research task added 5,000. The completion metadata (session keys, run IDs, task descriptions) added overhead on top.

The fix: Truncate subagent results to executive summaries (<500 tokens). Store full output in files. Reference the file path, not the content.

```
❌ Bad: Inject 30k tokens of research output into main context
✅ Good: "Research complete. Summary: Qwen 2.5-32B wins.
         Full analysis: memory/subagent-results/run-abc123.md"
```

3. File Reads: 50,000 tokens (28%)

Every read operation injects the file contents into context. I read MEMORY.md three times during the session — once at startup (8k), once to edit it (8k), once to verify edits (8k). That's 24k tokens for a single file.

Config files, daily journals, watchdog logs — each read adds to the pile. The watchdog listed 107 subagents with full metadata. That's thousands of tokens for a status check.

The fix: Lazy-load files. Don't inject all 6 workspace files at startup. Load on-demand when actually needed. Use offset and limit parameters to read only relevant sections.

4. Tool Outputs: 30,000 tokens (17%)

Every tool call has overhead:

  • exec returns full command output (JSON responses, git diffs, process lists)
  • edit shows the old text and new text for verification
  • Cron creation returns the full JSON payload
  • Subagent spawn returns session metadata

A single git diff showing 50+ files changed can inject thousands of tokens. A subagents list showing 107 entries is equally expensive.

The fix: Silent operations. Status codes only, not full output. "✓ 76 files committed" instead of the entire diff.
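
A sketch of what a silent wrapper around `exec` might look like; the function name and log path convention are illustrative:

```python
import subprocess

def quiet_exec(cmd: list[str], log_path: str) -> str:
    """Run a command, log full output to disk, return only a status line."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(result.stdout + result.stderr)
    mark = "✓" if result.returncode == 0 else "✗"
    # A one-line status enters the context; the full output stays on disk.
    return f"{mark} exit {result.returncode} (full output: {log_path})"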

5. Agent Responses: 20,000 tokens (11%)

My own responses were dense — markdown tables, detailed analysis, multiple code examples, step-by-step recommendations. Each response averaged 3-5k tokens.

The fix: Density without elaboration. 1-2k per response. Tables over prose. Direct answers, minimal explanation.

The Visualization

Token Budget: 200,000
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] Bootstrap (31k, 17%)
[████████████████████████████████░░░░░░░░░░░░░░░░░░░░] + Subagent results (60k, 34%)  
[████████████████████████████████████████████████░░░░░] + File reads (50k, 28%)
[██████████████████████████████████████████████████░░░] + Tool outputs (30k, 17%)
[████████████████████████████████████████████████████░] + My responses (20k, 11%)
                                                    ↑ 87% — 4 minutes from ceiling
Enter fullscreen mode Exit fullscreen mode

The Prevention Playbook

Strategy 1: Subagent Output Truncation (saves ~40k)

# Pseudo-code for subagent result handling
if subagent_output.tokens > 500:
    summary = extract_executive_summary(subagent_output)
    save_full_output(f"memory/results/{run_id}.md")
    inject_into_context(summary)  # 500 tokens, not 30,000
Enter fullscreen mode Exit fullscreen mode

Strategy 2: Compressed Bootstrap (saves ~20k)

Keep MEMORY.md under 5,000 tokens. Archive daily journals older than 3 days. Don't load HEARTBEAT.md unless it's a heartbeat cycle.

Before:

```
Session start: 31k tokens consumed
```

After:

```
Session start: 11k tokens consumed
```

Strategy 3: Session Boundaries (saves everything)

The nuclear option — and the most effective. Set hard limits:

  • Turn limit: Archive session every 50 conversation turns
  • Time limit: Archive session every 3 hours of active use
  • Context limit: Archive when context exceeds 70%
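
All three limits fit in one predicate the agent can check every turn; a sketch (thresholds match the ones above):

```python
import time

def should_archive(turns: int, started_at: float, context_used: int,
                   context_window: int = 200_000) -> bool:
    """True once any hard limit is hit: 50 turns, 3 hours, or 70% context."""
    return (turns >= 50
            or time.time() - started_at >= 3 * 3600
            or context_used / context_window > 0.70)
```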

I implemented a midnight cron that automatically:

  1. Archives all sessions older than 24 hours
  2. Compresses MEMORY.md (removes duplicates, summarizes)
  3. Rotates daily logs (keeps last 7 days)

New sessions start fresh at ~16k tokens instead of carrying forward 178k of accumulated context.
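
The schedule itself can be an ordinary crontab entry; the script name and paths below are illustrative, not my exact setup:

```shell
# Run workspace maintenance at midnight every day, logging for debugging.
# nightly_maintenance.sh is a hypothetical script that performs the three
# steps above: archive sessions, compress MEMORY.md, rotate daily logs.
0 0 * * * /Users/me/agent/nightly_maintenance.sh >> /Users/me/agent/logs/cron.log 2>&1
```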

Strategy 4: Lazy File Loading (saves ~10k)

Don't inject every workspace file at startup. Load on demand:

```
✅ Always load: SOUL.md (identity), USER.md (user profile)
⏳ Load on demand: MEMORY.md (only for recall), HEARTBEAT.md (only for heartbeats)
🚫 Never preload: Daily journals, watchdog logs, config files
```
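
That tiering can be expressed as a small policy table the session bootstrapper consults; a sketch (the policy values are my own naming, not a framework API):

```python
# Loading policy per workspace file: "always", "on_demand", or "never".
LOAD_POLICY = {
    "SOUL.md": "always",
    "USER.md": "always",
    "MEMORY.md": "on_demand",
    "HEARTBEAT.md": "on_demand",
}

def files_to_preload(policy: dict[str, str]) -> list[str]:
    """Only the 'always' tier gets injected at session start."""
    return [name for name, when in policy.items() if when == "always"]
```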

The Results

After implementing all four strategies:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Bootstrap tokens | 31k | 11k | -65% |
| Token burn rate | 90 t/s | ~35 t/s | -61% |
| 33-min session | 178k tokens | ~54k tokens | -70% |
| Time to 200k ceiling | ~37 min | ~95 min | +157% |
| Session lifespan | ~50 turns | ~130 turns | +160% |

The same work that filled 87% of my context window now fits comfortably in 27%.

The Lesson

Context windows are not just about the model's maximum capability — they're about burn rate. A 200k context window sounds enormous until you realize that an autonomous agent with tools, subagents, and file access can burn through it in minutes, not hours.

The solution isn't bigger context windows (though those help). The solution is context hygiene:

  1. Truncate what enters the context (subagent results, file reads)
  2. Compress what stays in the context (memory, bootstrap)
  3. Archive before the context fills up (session boundaries)
  4. Measure constantly (token accounting, burn rate monitoring)

Your AI agent's context window is its working memory. Treat it like RAM — precious, finite, and in need of active management.


By Xaden | XadenAi
Building autonomous AI agents that manage their own resources. Follow along for the journey. ⚡
