Your AI Agent Is Burning Tokens. Do You Know How Many?

#agents #ai #monitoring #python

I didn't. For weeks, I ran Claude Code sessions that cost 30K to 100K tokens without checking. Some were deep architectural work that justified every token. Others were "rename this file" requests that loaded my entire 1,200-line personality config before doing 15 seconds of work.

The problem isn't just that AI agents cost money. It's that we don't know when to act on cost data — and we don't even have the data to make that call.

So I built a metabolic layer. Not a dashboard. Not an enterprise observability platform. Just 15 lines of Python embedded in the Stop hook that already runs at the end of every session.

The harder part wasn't the code. It was figuring out when the data was actually telling me something.

When does data become a decision?

Five sessions? That's noise. One outlier where you debugged a production issue for 3 hours skews everything. You can't tell the difference between a pattern and a one-off.

Fifteen sessions? Now you've got something. You can compare session types. You can spot real outliers. But you're still guessing about long-term trends.

Thirty sessions? Decision territory. Enough history to project monthly costs, evaluate config changes, and know whether that optimization you made three weeks ago actually helped.

The framework I landed on:

L0 (1-4 sessions):   Collect only. No analysis, no conclusions.
                     Build the tracking habit first.

L1 (5-14 sessions):  Flag extremes. Sessions burning >80 edits get
                     highlighted. No automated decisions yet —
                     you're just learning what "normal" looks like.

L2 (15-29 sessions): Compare types. Complex vs. simple sessions.
                     The system surfaces the largest cost driver
                     across your session categories.

L3 (30+ sessions):   Decision-ready. Historical data supports
                     projecting savings from config changes.
                     Recalibrates every 30 sessions.

The thresholds aren't magic numbers. The principle is: match decision risk to data volume. At 5 sessions you can flag outliers using a fixed threshold that doesn't depend on a shaky average. At 15 sessions your baseline is stable enough to compare categories. At 30 sessions you've got enough history to project forward.
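The layer boundaries above reduce to a small lookup. The following is a minimal sketch, not the author's script: the names `layer_for`, `flag_extremes`, and `OUTLIER_EDITS` are hypothetical, and treating zero recorded sessions as L0 is an assumption.

```python
# Fixed outlier threshold from the L1 description (more than 80 edits).
OUTLIER_EDITS = 80


def layer_for(session_count: int) -> str:
    """Map the number of recorded sessions to a layer label (L0 to L3)."""
    if session_count >= 30:
        return "L3"  # decision-ready
    if session_count >= 15:
        return "L2"  # compare session types
    if session_count >= 5:
        return "L1"  # flag extremes only
    return "L0"      # collect only (assumption: 0 sessions also counts as L0)


def flag_extremes(edit_counts: list) -> list:
    """Return the edit counts above the fixed threshold.

    A fixed cutoff is used so the check does not depend on an average
    computed from very few sessions.
    """
    return [n for n in edit_counts if n > OUTLIER_EDITS]
```

Checking the count from the highest boundary down means each layer needs only one comparison, and the thresholds can be swapped without touching the rest of the logic.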

What I actually track

Claude Code's Stop hooks don't expose real token counts directly. So I use a proxy: edit count — every Edit or Write call in the session transcript, extracted via regex. A complex session might hit 50–120 edits. A simple one hits 0–2.

Is it perfect? No. Across 10 sessions where I manually cross-checked, edit count tracked within about 30% of actual token consumption. Good enough to tell "I rewrote half the codebase" from "I renamed a variable." And it costs nothing — the same Stop hook that enforces my learning capture habit now also appends a one-line JSON record.
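A rough sketch of the regex-based count might look like this. The transcript shape is an assumption: it supposes tool calls show up in the raw text as `"name": "Edit"` or `"name": "Write"`, which may not match the real transcript format. The names `TOOL_CALL` and `count_edits` are hypothetical.

```python
import re

# Assumed shape of a tool call inside the transcript text:
#   "name": "Edit"   or   "name": "Write"
# Adjust the pattern if the actual transcript is structured differently.
TOOL_CALL = re.compile(r'"name"\s*:\s*"(?:Edit|Write)"')


def count_edits(transcript_text: str) -> int:
    """Count Edit and Write tool calls in raw transcript text."""
    return len(TOOL_CALL.findall(transcript_text))


# Made-up sample lines, for illustration only.
sample = (
    '{"type": "tool_use", "name": "Edit", "input": {}}\n'
    '{"type": "tool_use", "name": "Read", "input": {}}\n'
    '{"type": "tool_use", "name":"Write", "input": {}}\n'
)
print(count_edits(sample))  # prints 2
```

Matching the quoted tool name keeps other tools such as Read out of the count, which is the point of the proxy: only calls that change files are tallied.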

The data lives in ~/.claude/session-data/cost-log.jsonl. Not in project memory. Not mixed with qualitative learning logs. Operational metrics and knowledge artifacts serve different purposes — they belong in different places.

What layered tracking surfaces that raw numbers don't

Raw tracking: "Session #47 used 85K tokens."

Layered tracking: "Over your last 20 sessions, complex sessions average 3x the cost of simple ones. Your cost-per-useful-output is 40% higher on sessions where you load the full personality config for simple tasks. That's about 2,000 tokens/month of avoidable overhead."

The first is data. The second is a decision prompt.

This isn't about money — at current API prices, 2,000 tokens/month is pocket change. It's about the methodology catching inefficiencies that raw monitoring would miss. The value isn't the dollar amount. It's knowing that your system can detect its own waste without you staring at dashboards.
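The complex-versus-simple comparison can be sketched from the log records alone. This is an illustration under assumptions: it uses the `edits` and `complex` fields as the only inputs, measures cost in edits (the proxy) rather than tokens, and the function names are hypothetical.

```python
import json
from statistics import mean
from typing import Optional


def load_records(path: str) -> list:
    """Read a JSONL file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def complex_to_simple_ratio(records: list) -> Optional[float]:
    """Average edits of complex sessions divided by that of simple ones.

    Returns None when either group is empty or when simple sessions
    average zero edits, since the ratio is undefined in those cases.
    """
    complex_edits = [r["edits"] for r in records if r["complex"]]
    simple_edits = [r["edits"] for r in records if not r["complex"]]
    if not complex_edits or not simple_edits:
        return None
    simple_avg = mean(simple_edits)
    if simple_avg == 0:
        return None
    return mean(complex_edits) / simple_avg
```

Returning None instead of a number for the undefined cases keeps a report from showing a misleading ratio when one session type has not been observed yet.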

The code

The recording logic is embedded in my existing quality-gate hook. It runs after all checks pass, at session end. Fifteen lines, stdlib only:

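import datetime
import json
import os

# edit_count and is_complex are assumed to be set earlier in the hook.
cost_log = os.path.expanduser('~/.claude/session-data/cost-log.jsonl')
os.makedirs(os.path.dirname(cost_log), exist_ok=True)
now = datetime.datetime.now()  # one timestamp so 'ts' and 'date' always agree
record = {
    'ts': now.isoformat(),
    'date': now.strftime('%Y-%m-%d'),
    'edits': edit_count,
    'complex': is_complex,
}
# Append one JSON object per line (JSONL).
with open(cost_log, 'a', encoding='utf-8') as f:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')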

The full script — with layer computation, cumulative stats, weekly/monthly breakdowns, and auto-generated reports at L3 — is about 170 lines. No external dependencies. No API keys. No services to configure.

At session start, a three-line summary shows where you stand:

[L1] 12 sessions · avg 34 edits · 2 outliers flagged
  This week: 4 sessions · 142 edits | This month: 12 sessions · 408 edits
  Last: 2026-06-29 15:58 · 120 edits · complex
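The first two lines of that summary could be produced along these lines. This is a sketch, not the 170-line script: `status_lines` and `layer_for` are hypothetical names, "this week" is assumed to mean the same ISO week as today, and rounding the average to a whole number is also an assumption.

```python
import datetime

OUTLIER_EDITS = 80  # fixed threshold from the L1 description


def layer_for(n: int) -> str:
    """Map the number of recorded sessions to a layer label."""
    if n >= 30:
        return "L3"
    if n >= 15:
        return "L2"
    if n >= 5:
        return "L1"
    return "L0"


def status_lines(records: list, today: datetime.date) -> str:
    """Build the first two summary lines from cost-log records."""
    n = len(records)
    edits = [r["edits"] for r in records]
    avg = round(sum(edits) / n) if n else 0
    outliers = sum(1 for e in edits if e > OUTLIER_EDITS)

    # Assumption: "this week" means the same ISO year and week as `today`.
    this_week = today.isocalendar()[:2]
    week = [
        r for r in records
        if datetime.date.fromisoformat(r["date"]).isocalendar()[:2] == this_week
    ]
    # The 'date' field is 'YYYY-MM-DD', so its first 7 characters are the month.
    month = [r for r in records if r["date"][:7] == today.strftime("%Y-%m")]

    line1 = f"[{layer_for(n)}] {n} sessions · avg {avg} edits · {outliers} outliers flagged"
    line2 = (
        f"  This week: {len(week)} sessions · {sum(r['edits'] for r in week)} edits"
        f" | This month: {len(month)} sessions · {sum(r['edits'] for r in month)} edits"
    )
    return line1 + "\n" + line2
```

Passing `today` in as a parameter, rather than calling `datetime.date.today()` inside the function, makes the weekly and monthly totals reproducible when testing against a saved log.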

The bigger pattern

Any system that self-optimizes faces the same question: when is my data sufficient to justify a change? The answer isn't a fixed number. It's a layered threshold that matches what you can decide to how much you've observed.

If you're tracking anything about your AI usage — cost, quality, speed, accuracy — you need the layers. Not my specific L0-L3 numbers. But the discipline of separating "I'm collecting" from "I'm observing" from "I'm deciding."

Start tracking. Five sessions in, you're at L1.

*This article is part of the **Engineering Trustworthy AI Output** series. Tools and methodology: delivery-gate · checkgrow.*

Chinese version: Juejin/YuhaoLin2005yhl · Code on GitHub