DEV Community

Xaden
My AI Agent Ate 178,000 Tokens in 30 Minutes — Here's Why (And How to Prevent It)

I was 33 minutes into an active work session with my AI agent when I checked the diagnostics:

```
📚 Context: 178.4k / 200k (87%)
```

87% of the context window. Consumed. In half an hour.

At the rate I was going — 90 tokens per second of burn — I had roughly 4 minutes before hitting the ceiling and triggering an automatic compaction that would erase most of my working context.

This isn't a theoretical problem. If you're building autonomous AI agents that use tools, spawn subagents, read files, and maintain memory — you need to understand where your tokens are going. Because they go fast.
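
For reference, the time-to-ceiling estimate is just the remaining budget divided by the burn rate:

```python
# Minutes until the context ceiling, given an observed burn rate.
def minutes_to_ceiling(used: int, window: int, burn_rate: float) -> float:
    """burn_rate is tokens consumed per second."""
    remaining = window - used
    return remaining / burn_rate / 60

# The session above: 178.4k of a 200k window used, burning 90 tokens/sec.
print(round(minutes_to_ceiling(178_400, 200_000, 90)))  # → 4
```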

The Setup

I'm running an autonomous AI agent (Claude Opus as the orchestrator) that:

  • Loads workspace files at session start (identity, memory, config)
  • Spawns subagents on local Ollama models for research and coding
  • Reads and edits files on disk
  • Manages cron jobs and background processes
  • Maintains a daily journal and long-term memory

It's essentially an AI-powered DevOps engineer that runs 24/7 on my MacBook.

The Autopsy: Where Did 178k Tokens Go?

I broke down every token source. Here's the full accounting (figures are rounded, so the categories add up to slightly more than the 178.4k total):

1. Bootstrap Context: 31,000 tokens (17%)

Every session starts by loading workspace files:

| File | Tokens | Purpose |
| --- | --- | --- |
| MEMORY.md | 8,000 | Long-term memory |
| SOUL.md | 2,000 | Agent identity/personality |
| AGENTS.md | 3,000 | Workspace guidelines |
| USER.md | 1,000 | User profile |
| HEARTBEAT.md | 2,000 | Standing orders |
| TOOLS.md | 1,500 | Hardware/config notes |
| Daily journal | 5,000 | Today's log |
| Compaction summary | 7,750 | Context from prior session |
| **Total** | **~31,000** | |

This is the floor. Before a single conversation turn, 31k tokens are already consumed. That's 15% of a 200k window gone before "hello."

2. Subagent Result Injections: 60,000 tokens (34%)

This was the #1 killer.

When a subagent completes a task, its full output gets injected back into the main session's context. I spawned several research subagents to evaluate local LLM options. Each one returned detailed comparison matrices, benchmarks, and recommendations.

One comparison subagent alone injected 30,000 tokens of analysis. Another research task added 5,000. The completion metadata (session keys, run IDs, task descriptions) added overhead on top.

The fix: Truncate subagent results to executive summaries (<500 tokens). Store full output in files. Reference the file path, not the content.

```
❌ Bad: Inject 30k tokens of research output into main context
✅ Good: "Research complete. Summary: Qwen 2.5-32B wins.
         Full analysis: memory/subagent-results/run-abc123.md"
```

3. File Reads: 50,000 tokens (28%)

Every read operation injects the file contents into context. I read MEMORY.md three times during the session — once at startup (8k), once to edit it (8k), once to verify edits (8k). That's 24k tokens for a single file.

Config files, daily journals, watchdog logs — each read adds to the pile. The watchdog listed 107 subagents with full metadata. That's thousands of tokens for a status check.

The fix: Lazy-load files. Don't inject all 6 workspace files at startup. Load on-demand when actually needed. Use offset and limit parameters to read only relevant sections.

4. Tool Outputs: 30,000 tokens (17%)

Every tool call has overhead:

  • exec returns full command output (JSON responses, git diffs, process lists)
  • edit shows the old text and new text for verification
  • Cron creation returns the full JSON payload
  • Subagent spawn returns session metadata

A single git diff showing 50+ files changed can inject thousands of tokens. A subagents list showing 107 entries is equally expensive.

The fix: Silent operations. Status codes only, not full output. "✓ 76 files committed" instead of the entire diff.
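
A sketch of what a silent wrapper around `exec` might look like; the function name and log path convention are illustrative:

```python
import subprocess

def quiet_exec(cmd: list[str], log_path: str) -> str:
    """Run a command, log full output to disk, return only a status line."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "w", encoding="utf-8") as f:
        f.write(result.stdout + result.stderr)
    mark = "✓" if result.returncode == 0 else "✗"
    # A one-line status enters the context; the full output stays on disk.
    return f"{mark} exit {result.returncode} (full output: {log_path})"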

5. Agent Responses: 20,000 tokens (11%)

My own responses were dense — markdown tables, detailed analysis, multiple code examples, step-by-step recommendations. Each response averaged 3-5k tokens.

The fix: Density without elaboration. 1-2k per response. Tables over prose. Direct answers, minimal explanation.

The Visualization

Token Budget: 200,000
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] Bootstrap (31k, 17%)
[████████████████████████████████░░░░░░░░░░░░░░░░░░░░] + Subagent results (60k, 34%)  
[████████████████████████████████████████████████░░░░░] + File reads (50k, 28%)
[██████████████████████████████████████████████████░░░] + Tool outputs (30k, 17%)
[████████████████████████████████████████████████████░] + My responses (20k, 11%)
                                                    ↑ 87% — 4 minutes from ceiling
Enter fullscreen mode Exit fullscreen mode

The Prevention Playbook

Strategy 1: Subagent Output Truncation (saves ~40k)

# Pseudo-code for subagent result handling
if subagent_output.tokens > 500:
    summary = extract_executive_summary(subagent_output)
    save_full_output(f"memory/results/{run_id}.md")
    inject_into_context(summary)  # 500 tokens, not 30,000
Enter fullscreen mode Exit fullscreen mode

Strategy 2: Compressed Bootstrap (saves ~20k)

Keep MEMORY.md under 5,000 tokens. Archive daily journals older than 3 days. Don't load HEARTBEAT.md unless it's a heartbeat cycle.

Before:

```
Session start: 31k tokens consumed
```

After:

```
Session start: 11k tokens consumed
```

Strategy 3: Session Boundaries (saves everything)

The nuclear option — and the most effective. Set hard limits:

  • Turn limit: Archive session every 50 conversation turns
  • Time limit: Archive session every 3 hours of active use
  • Context limit: Archive when context exceeds 70%
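
All three limits fit in one predicate the agent can check every turn; a sketch (thresholds match the ones above):

```python
import time

def should_archive(turns: int, started_at: float, context_used: int,
                   context_window: int = 200_000) -> bool:
    """True once any hard limit is hit: 50 turns, 3 hours, or 70% context."""
    return (turns >= 50
            or time.time() - started_at >= 3 * 3600
            or context_used / context_window > 0.70)
```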

I implemented a midnight cron that automatically:

  1. Archives all sessions older than 24 hours
  2. Compresses MEMORY.md (removes duplicates, summarizes)
  3. Rotates daily logs (keeps last 7 days)

New sessions start fresh at ~16k tokens instead of carrying forward 178k of accumulated context.
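
The schedule itself can be an ordinary crontab entry; the script name and paths below are illustrative, not my exact setup:

```shell
# Run workspace maintenance at midnight every day, logging for debugging.
# nightly_maintenance.sh is a hypothetical script that performs the three
# steps above: archive sessions, compress MEMORY.md, rotate daily logs.
0 0 * * * /Users/me/agent/nightly_maintenance.sh >> /Users/me/agent/logs/cron.log 2>&1
```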

Strategy 4: Lazy File Loading (saves ~10k)

Don't inject every workspace file at startup. Load on demand:

```
✅ Always load: SOUL.md (identity), USER.md (user profile)
⏳ Load on demand: MEMORY.md (only for recall), HEARTBEAT.md (only for heartbeats)
🚫 Never preload: Daily journals, watchdog logs, config files
```
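
That tiering can be expressed as a small policy table the session bootstrapper consults; a sketch (the policy values are my own naming, not a framework API):

```python
# Loading policy per workspace file: "always", "on_demand", or "never".
LOAD_POLICY = {
    "SOUL.md": "always",
    "USER.md": "always",
    "MEMORY.md": "on_demand",
    "HEARTBEAT.md": "on_demand",
}

def files_to_preload(policy: dict[str, str]) -> list[str]:
    """Only the 'always' tier gets injected at session start."""
    return [name for name, when in policy.items() if when == "always"]
```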

The Results

After implementing all four strategies:

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Bootstrap tokens | 31k | 11k | -65% |
| Token burn rate | 90 t/s | ~35 t/s | -61% |
| 33-min session | 178k tokens | ~54k tokens | -70% |
| Time to 200k ceiling | ~37 min | ~95 min | +157% |
| Session lifespan | ~50 turns | ~130 turns | +160% |

The same work that filled 87% of my context window now fits comfortably in 27%.

The Lesson

Context windows are not just about the model's maximum capability — they're about burn rate. A 200k context window sounds enormous until you realize that an autonomous agent with tools, subagents, and file access can burn through it in minutes, not hours.

The solution isn't bigger context windows (though those help). The solution is context hygiene:

  1. Truncate what enters the context (subagent results, file reads)
  2. Compress what stays in the context (memory, bootstrap)
  3. Archive before the context fills up (session boundaries)
  4. Measure constantly (token accounting, burn rate monitoring)

Your AI agent's context window is its working memory. Treat it like RAM — precious, finite, and in need of active management.


By Xaden | XadenAi
Building autonomous AI agents that manage their own resources. Follow along for the journey. ⚡
