
Juan Torchia

Originally published at juanchi.dev

CodeBurn and the Problem I Didn't Know I Had: Tokens Per Real Task

Back in 2005, when the internet café was packed on a Friday night, I had one very clear metric: minutes until the connection came back. Every minute was money walking out the door — not mine, the owner's, but I felt the weight of it. I learned fast to tell the difference between problems worth attacking with trial and error and ones that needed precise diagnosis first. Spending five minutes swapping cables before looking at the logs was a luxury I couldn't afford.

Today I have a new metric that gives me the same kind of tension: tokens per real task in Claude Code. And I learned it the same way — the hard way, when CodeBurn showed up on Hacker News this morning and forced me to actually sit down and calculate my own numbers for the first time.

Claude Code Token Usage Analysis: What CodeBurn Does That the Dashboard Doesn't

CodeBurn is a CLI tool that parses Claude Code logs and gives you a breakdown by session, by task, by operation type. It's not magic — Claude Code already logs everything locally in ~/.claude/projects/. What CodeBurn does is turn that JSON into something a human can actually read.

Installation:

# Global install with npm
npm install -g codeburn

# Or if you'd rather not install it globally
npx codeburn analyze

The command that mattered most to me:

# Analyze the current project with per-session breakdown
codeburn analyze --project . --breakdown session

# See estimated cost in USD (uses Anthropic pricing)
codeburn analyze --project . --cost

# The one that changed how I think: tokens per completed task
codeburn analyze --project . --per-task

That --per-task flag requires your commits to have descriptive messages, or that you've been using Claude Code's task feature. If you're working with atomic commits (which should honestly be mandatory), it works pretty well.
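If you're not already in the atomic-commit habit, this is the shape that gives per-task analysis something to map onto. A throwaway demo (the repo path, file, and message below are illustrative, not from my project):

```shell
# One logical change per commit, descriptive subject line.
# Atomic commits like this give --per-task clean task boundaries.
repo=/tmp/codeburn-demo-repo
rm -rf "$repo" && mkdir -p "$repo" && cd "$repo"
git init -q
echo 'export {}' > middleware.ts
git add middleware.ts
git -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "feat(auth): add JWT validation middleware"
git log --oneline -1
```

One commit, one task, one token bucket. Squash-merging five unrelated changes into a single commit erases exactly the boundaries this analysis needs.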

The Anthropic dashboard shows you total tokens per billing period. Useful for billing, useless for diagnosis. It's the difference between seeing your total electricity bill and having a meter in every room.

The Numbers That Made Me Uncomfortable

I took three types of real tasks from the past week and measured them:

Task 1: Adding JWT Authentication to an Existing Endpoint

Input tokens:  ~12,400
Output tokens: ~3,200  
Total:         ~15,600
Time:          ~22 minutes
Commits:       3

Approximate breakdown:
- Initial context reading:         4,100 tokens
- Code generation:                 2,800 tokens
- TypeScript type corrections:     5,200 tokens  ← here's the problem
- Tests:                           3,500 tokens

That type correction block was three back-and-forths where I gave it wrong types in the initial context. That wasn't the agent's fault — it was mine. I didn't provide the existing interface. I assumed it would infer it.

Task 2: PostgreSQL Schema Migration with Railway

Input tokens:  ~31,800
Output tokens: ~8,900
Total:         ~40,700
Time:          ~45 minutes
Commits:       2

Approximate breakdown:
- Current schema context:          8,200 tokens
- Migration plan:                  3,100 tokens  
- Foreign key debugging:          19,400 tokens  ← this is the problem
- Final validation:                1,000 tokens

Almost half the tokens in that session went to debugging an operation-ordering problem with foreign keys. The agent proposed four different solutions, three failed, the fourth worked. Was it avoidable? Probably — if I had described the dependency order in the initial prompt instead of leaving it to inference.
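That "almost half" isn't a feeling, it's right there in the breakdown above:

```shell
# Share of Task 2 burned on foreign-key debugging (numbers from above)
debug_tokens=19400
total_tokens=40700
echo "Debug share: $(( debug_tokens * 100 / total_tokens ))%"
# → Debug share: 47%
```

Integer division is plenty of precision here; the point is the order of magnitude, not the decimals.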

Task 3: React Component with a Validated Form

Input tokens:  ~8,900
Output tokens: ~4,100
Total:         ~13,000
Time:          ~18 minutes
Commits:       4

Approximate breakdown:
- Context and specs:               2,200 tokens
- Component generation:            3,800 tokens
- Minor UX adjustments:            4,100 tokens
- Final refinement:                2,900 tokens

This one was the cleanest. No long debug loops. The adjustments were functional, not error corrections.

The pattern that emerged: error correction iterations cost three times more than functional refinement iterations. And most of my errors came from the context I provided, not from the agent.

I've written before about the opacity of token usage in AI tools — but that was about tools that don't tell you what they're spending. This is different: Claude Code does log everything, I just wasn't looking.

The Design Errors That Tokens Reveal

This is where it stops being a cost note and becomes something more interesting.

When you measure tokens per task and break them down, you're indirectly measuring the quality of your initial specification. A high ratio of correction tokens vs. generation tokens is a signal that something in your workflow is broken.

The patterns I found in my own sessions:

Pattern 1: Insufficient Context at the Start

The agent needs to read additional files that I should have given it upfront. That's thousands of reading tokens that could be avoided with a well-maintained CLAUDE.md or with explicit @file references in the initial prompt.

# Instead of: "fix the bug in the auth component"
# Do this:

# First check what files are relevant
cat CLAUDE.md  # if you have one

# Then include the context explicitly
# "fix the bug in @src/auth/AuthProvider.tsx
#  considering the types in @types/auth.d.ts
#  and the existing tests in @__tests__/auth.test.ts"

Pattern 2: Poorly Defined Tasks That Generate Iterations

"Improve the component's performance" generates five clarifying questions or five different attempts. "Eliminate unnecessary re-renders in UserList using React.memo where the prop is a stable object" generates one answer.

Pattern 3: The Symptomatic Debug Loop

When the agent enters a loop of more than two corrections of the same error type, there's usually something it can't know because I didn't tell it. The signal isn't "the agent is bad" — the signal is "there's missing context."

This connects to something I mentioned in the post about things you over-engineer in your AI agent — sometimes the problem isn't the tool, it's how you're using it.

A Simple Script to Start Measuring Without CodeBurn

If you don't want to install another tool yet, the logs are in ~/.claude/projects/[project-hash]/. They're JSONL files (one JSON object per line), so you can parse them yourself:

#!/bin/bash
# Basic script to see tokens from the last session
# Save it as ~/bin/claude-tokens

PROJECT_DIR="$HOME/.claude/projects"

# Find the most recent project
LATEST=$(ls -t "$PROJECT_DIR" | head -1)

if [ -z "$LATEST" ]; then
  echo "No Claude Code projects found"
  exit 1
fi

echo "Project: $LATEST"
echo "---"

# Parse the latest session file with jq
LATEST_SESSION=$(ls -t "$PROJECT_DIR/$LATEST"/*.jsonl 2>/dev/null | head -1)

if [ -z "$LATEST_SESSION" ]; then
  echo "No sessions found"
  exit 1
fi

# Sum input and output tokens across assistant turns.
# Depending on the Claude Code version, usage sits at .usage or
# .message.usage, so try both and skip turns that have neither.
jq -s '
  map(select(.type == "assistant"))
  | map(.usage // .message.usage)
  | map(select(. != null))
  | {
      input_tokens: (map(.input_tokens // 0) | add // 0),
      output_tokens: (map(.output_tokens // 0) | add // 0),
      total_turns: length
    }
' "$LATEST_SESSION"

This is basic — CodeBurn does a lot more. But it gets you the numbers in 30 seconds without installing anything extra.

Note: the exact log structure may vary depending on your Claude Code version. If the script doesn't work, inspect the structure with head -5 [file].jsonl | jq '.'.

Common Mistakes When You Start Measuring This

Mistake 1: Optimizing for Tokens Instead of Clarity

I saw this on Twitter right after CodeBurn dropped — people starting to write ultra-short prompts to spend fewer tokens. Counterproductive. A 50-token prompt that generates three correction iterations is more expensive than a well-specified 300-token prompt. You optimize the ratio, not the total input.
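With made-up but realistic round-trip costs, the arithmetic is stark. The 4,000-tokens-per-correction figure below is an assumption for illustration, not a measurement:

```shell
# Each correction round-trip re-sends context and regenerates code;
# assume ~4,000 tokens per round trip (illustrative number)
iteration=4000
terse=$(( 50 + 3 * iteration ))       # terse prompt, three corrections
specified=$(( 300 + 1 * iteration ))  # specified prompt, one clean pass
echo "terse: $terse tokens vs specified: $specified tokens"
# → terse: 12050 tokens vs specified: 4300 tokens
```

The 250 extra tokens of specification pay for themselves the moment they prevent a single correction loop.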

Mistake 2: Interpreting High Spend as a Sign of Complexity

Sometimes that's true — a complex migration will cost more. But high spend on simple tasks is the signal that matters. If adding a field to a form costs you 20k tokens, something is broken in your workflow.

Mistake 3: Not Separating Sessions by Task

If you open a Claude Code session and solve four different problems without closing it, the numbers are useless for diagnosis. One session, one task. This also improves response quality because the context doesn't get contaminated.

Mistake 4: Ignoring the Cost of Tool Calls

Every time the agent reads a file, runs a command, searches the codebase — that's tokens. Not many individually, but they add up in long sessions. An agent that reads 15 files to solve something that needed 3 isn't efficient — and that's generally a problem with how you organized your project or your CLAUDE.md.
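You can get a rough read on this from the same local logs. This sketch assumes tool calls show up as content blocks with type "tool_use" — verify against your own logs first, since the field layout varies by Claude Code version. The sample log below is fabricated to show the shape; point jq at a real ~/.claude/projects/ session file instead:

```shell
# Fabricated three-turn session, just to demonstrate the shape
cat > /tmp/sample-session.jsonl <<'EOF'
{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read"},{"type":"text","text":"done"}]}}
{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Read"}]}}
{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Bash"}]}}
EOF

# Count tool calls per tool name across the session
jq -s '
  [ .[]
    | select(.type == "assistant")
    | (.message.content // .content // [])
    | .[]?
    | select((.type? // "") == "tool_use") ]
  | group_by(.name)
  | map({tool: .[0].name, calls: length})
' /tmp/sample-session.jsonl
```

If the file-read count for a simple task looks absurd, that's your CLAUDE.md or project layout talking, not the model.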

This visibility topic also comes up in my post about security as proof of work — opacity isn't neutral, it has real consequences.

FAQ: Claude Code Token Usage and Cost Analysis

Is CodeBurn official from Anthropic?

No. It's a third-party tool that parses the local logs Claude Code generates by default. Anthropic doesn't maintain it. The logs themselves are official and they're on your machine — CodeBurn just makes them readable.

How many tokens does a typical Claude Code development session cost?

It depends enormously on the task type and how well-specified your context is. In my experience, simple tasks (one component, one endpoint) land between 10k and 20k total tokens. Complex tasks with migrations or big refactors can go from 40k to 100k+. The number alone means nothing — what matters is the ratio of correction tokens vs. generation tokens.

Do Claude Code logs contain sensitive information?

Yes, potentially a lot. The logs include the code you showed the agent, the full prompts, the responses. If you're working with proprietary code or sensitive data, it's important to know that all of it sits in ~/.claude/. I've written about the legal risks of what gets recorded in AI conversations — it applies here too.

What's a good ratio of correction tokens vs. generation tokens?

I don't have an official benchmark, but from my experience: if more than 40% of your tokens are going to error correction (not functional refinement), there's something improvable in how you're contextualizing tasks. Functional refinement is healthy — "make this more accessible", "add error handling" — that's normal iteration. Type corrections, misinterpreted interfaces, dependencies the agent didn't know about — those are the ones you can prevent.

Is it worth using Claude Code if complex tasks cost 40k+ tokens?

Depends on the value generated, not the absolute cost. A schema migration that would take me 3 hours and that the agent resolves in 45 minutes with 40k tokens has an obvious ROI. What doesn't have ROI is using the agent for tasks where the overhead of contextualizing it exceeds the time you save. For 5-minute things, sometimes it's faster to just write it yourself.
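For scale: even my most expensive session above is pocket change at API list prices. The $3/$15-per-million rates below are assumptions for illustration — check Anthropic's current pricing:

```shell
# Task 2's tokens at assumed rates of $3/M input, $15/M output.
# tokens * dollars-per-million = cost in millionths of a dollar.
input=31800; output=8900
micro=$(( input * 3 + output * 15 ))
printf '~$%d.%02d\n' $(( micro / 1000000 )) $(( micro % 1000000 / 10000 ))
# → ~$0.22
```

At those rates the token cost is noise next to the hours saved; the real cost of a bad session is your time in the correction loop, not the bill.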

How does this integrate with Claude Code's Max plan?

If you're on the plan with usage limits (not by tokens but by time or requests), the relevant metric changes. But the qualitative analysis still holds: if you're burning half your daily requests on avoidable correction loops, you're still leaving value on the table. The scarce resource changes, the principle doesn't.

The Real Insight: Tokens Are a Proxy for Clarity

It took me a couple of hours with CodeBurn to realize I wasn't looking at a cost problem. I was looking at a mirror of my own thought process.

When I give the agent incomplete context, correction tokens spike. When the task is poorly defined, I enter loops. When the codebase doesn't have an updated CLAUDE.md, the agent reads more than it needs to infer what I should have just told it.

None of that is new as a principle — it's the same thing that happens when you delegate work to a person without giving them enough information. The difference is that with a person, the cost is invisible and deferred. With Claude Code, CodeBurn puts it in numbers right in your face.

I didn't start using Claude Code thinking about costs. I started because it genuinely accelerates my workflow. But now that I can measure, tokens have become the metric that tells me when my initial specification was solid and when I was being lazy.

Same logic as the internet café: I wasn't tracking time to optimize my hourly rate. I tracked it because it was the most honest signal of whether I'd actually understood the problem before I started moving.

If you're using Claude Code regularly, install CodeBurn or run the simple script. Not to cut costs — to see what kind of developer you are when you delegate work to a machine.

The numbers don't lie, even when they sting.
