I work as a carbon consultant for large companies. I also code every day with Claude Code. At some point, these two parts of my life started to feel contradictory, and I needed a number.
So I built claude-carbon - a bash + SQLite tool that hooks into Claude Code's session lifecycle and estimates CO2 emissions per session, live in the status line.
Here's what 4 months of data looks like.
The numbers
- 367 sessions analyzed
- 215 kg CO2e measured over 4 months
- Projection: 0.9 to 1.5 tonnes/year (roughly a one-way Paris-New York flight)
- For context: the average French person emits around 9 tonnes/year total
These numbers cover inference only - the energy consumed when the model processes my prompts and generates responses. Training, hardware manufacturing, cooling, and datacenter construction are not included. The real lifecycle footprint is higher.
Inference alone adds about 10% to my individual carbon budget. Just from coding sessions.
How it works
The tool hooks into Claude Code's Stop event. When a session ends, Claude Code writes a JSONL transcript to ~/.claude/projects/. The hook reads that file, counts tokens by model, applies emission factors, and persists the result to a local SQLite database.
The core formula:
session_co2_g = (input_tokens * input_factor + output_tokens * output_factor) / 1_000_000
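As a sketch, here's that formula applied to a hypothetical Sonnet session (the token counts are invented for illustration; the factors are the Sonnet ones from the Emission factors section):

```shell
# Hypothetical Sonnet session: 1,000,000 input tokens, 200,000 output tokens.
input_tokens=1000000
output_tokens=200000
input_factor=190     # gCO2e per million input tokens (Sonnet)
output_factor=1140   # gCO2e per million output tokens (Sonnet)

# Same formula as above, computed with awk since shell arithmetic is integer-only.
session_co2_g=$(awk -v i="$input_tokens" -v o="$output_tokens" \
                    -v fi="$input_factor" -v fo="$output_factor" \
  'BEGIN { printf "%.1f", (i * fi + o * fo) / 1000000 }')
echo "${session_co2_g} g CO2e"   # prints: 418.0 g CO2e
```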
The status line shows the running total live:
🟢 Sonnet 4.6 ░░░░ 12% | $1.80 | 38g CO₂ | my-project
Zero npm, zero Python, zero cloud. Just jq and sqlite3, both shipped with recent macOS.
Emission factors
The methodology comes from Jegham et al. 2025 (arXiv 2505.09598), a study measuring inference energy on AWS infrastructure.
Factors (gCO2e/Mtok):
- Opus: input 500, output 3000 (extrapolated 3x Sonnet)
- Sonnet: input 190, output 1140 (measured)
- Haiku: input 95, output 570 (extrapolated 0.5x Sonnet)
Infrastructure: AWS PUE 1.14, grid average 0.287 kgCO2e/kWh.
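Those factors bundle energy and grid intensity together. As a rough consistency check (my own arithmetic, not a figure from the paper), you can back out the energy per million output tokens that the Sonnet factor implies:

```shell
output_factor=1140   # gCO2e per million output tokens (Sonnet)
grid=287             # gCO2e per kWh (0.287 kgCO2e/kWh)
pue=1.14             # AWS power usage effectiveness

awk -v f="$output_factor" -v g="$grid" -v p="$pue" 'BEGIN {
  total_kwh = f / g            # datacenter-level energy, PUE included
  it_kwh    = total_kwh / p    # energy at the server (IT) level
  printf "%.2f kWh total, %.2f kWh IT-level, per Mtok output\n", total_kwh, it_kwh
}'
# prints: 3.97 kWh total, 3.48 kWh IT-level, per Mtok output
```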
Output tokens cost about 6x as much as input tokens. During prefill, the model processes all input tokens in parallel; during decoding, each output token requires a full sequential forward pass. That asymmetry is why long, verbose responses are expensive.
What this is not
These numbers are estimates, not measurements. Several things make them imprecise:
Anthropic publishes no per-model energy data. Sonnet factors come from the Jegham paper's direct measurements. Opus and Haiku are extrapolated by scaling (3x and 0.5x Sonnet). Actual values depend on hardware configuration and batching strategies we don't have access to.
Prompt cache hits are overestimated. Cache reads consume less energy than fresh compute. The tool counts them at full rate, so numbers skew high.
Static grid average. The carbon intensity factor is a US grid average. It varies by region, datacenter, and time of day. Anthropic's actual energy mix is unknown.
Inference only. Training costs, hardware manufacturing, and cooling water are not included. The true lifecycle number is higher.
Treat these as order-of-magnitude estimates. They're good enough to identify patterns and test reduction levers. They're not suitable for regulatory reporting.
Architecture
Install:
git clone https://github.com/gwittebolle/claude-carbon.git ~/code/claude-carbon
bash ~/code/claude-carbon/scripts/setup.sh
Setup creates the SQLite database, backfills all existing ~/.claude transcripts, and prints your total. Then you add two entries to ~/.claude/settings.json: one for the status line, one for the Stop hook.
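For orientation, the two entries look roughly like this. The script paths are illustrative, not the repo's actual file names - use whatever the setup instructions give you:

```json
{
  "statusLine": {
    "type": "command",
    "command": "bash ~/code/claude-carbon/scripts/statusline.sh"
  },
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash ~/code/claude-carbon/scripts/on-stop.sh"
          }
        ]
      }
    ]
  }
}
```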
The backfill script parses historical JSONL transcripts by walking ~/.claude/projects/. Each transcript contains the full token counts per message, including model name. That's how I got 4 months of data retroactively.
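The per-transcript aggregation can be sketched with jq alone. The field names below are my assumption about the JSONL layout (assistant messages carrying a `message.usage` object and a model name); the real schema may differ:

```shell
# Fabricated sample transcript, standing in for a file under ~/.claude/projects/.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"type":"assistant","message":{"model":"claude-sonnet-4","usage":{"input_tokens":1200,"output_tokens":300}}}
{"type":"assistant","message":{"model":"claude-sonnet-4","usage":{"input_tokens":800,"output_tokens":200}}}
EOF

# Slurp all lines, keep assistant messages, sum tokens per model.
jq -r -s '
  map(select(.type == "assistant") | .message)
  | group_by(.model)[]
  | "\(.[0].model) \(map(.usage.input_tokens) | add) \(map(.usage.output_tokens) | add)"
' "$sample"
# prints: claude-sonnet-4 2000 500
```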
Reducing your footprint
Measuring revealed which levers actually matter.
Right model for the task is the highest-impact lever. Opus costs 3x more per token than Sonnet. Using Haiku for subagents (file exploration, grep, code review) instead of Sonnet cuts those tasks by 80%.
RTK (Rust Token Killer) filters verbose CLI output before it hits the context window. Progress bars, passing test logs, verbose build output - all stripped. 60-90% token reduction on CLI commands, no quality loss.
Cap thinking tokens. Extended thinking can use up to 32k hidden tokens per message. Capping at 10k cuts thinking token usage by ~70% on routine tasks.
Compact earlier. Default autocompact triggers at 95% context usage. Setting it to 50% keeps sessions leaner.
Combined, these reduce total emissions by 50-70%. My projection goes from ~1.2 tonnes/year down to 0.4-0.6 tonnes if I apply all of them consistently.
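With everything in SQLite, checking where the grams actually go is one query away. A sketch, assuming a `sessions` table with `model` and `co2_g` columns and a database at `~/.claude/carbon.db` - both names are my guess, the real schema may differ:

```shell
# Per-model emissions breakdown, biggest emitter first.
sqlite3 ~/.claude/carbon.db <<'SQL'
SELECT model,
       ROUND(SUM(co2_g), 1) AS total_g,
       COUNT(*)             AS sessions
FROM sessions
GROUP BY model
ORDER BY total_g DESC;
SQL
```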
Why I built this instead of just estimating
Two reasons.
First, the number in your head is always wrong. Before measuring, I assumed my usage was around 50-100 kg/year. The measured projection came out roughly an order of magnitude higher. The discrepancy came from a few long architecture sessions I'd forgotten about, and from not accounting for how expensive Opus is.
Second, I work in carbon. I help large organizations build their emissions inventories. The methodology challenge here - uncertain emission factors, no primary data from the provider, static grid averages - is exactly what we deal with at scale. Working through it on a small personal dataset was useful practice.
What I'd like from Anthropic
One thing would make this significantly better: actual per-model energy data from Anthropic. Right now the Opus and Haiku factors are extrapolations. A single published figure - even a range - would let developers build more accurate tools.
Amazon, Google, and Microsoft publish at least some data center energy metrics. Anthropic publishes none. That's a gap worth closing.
Further reading
- Jegham et al. - How Hungry is AI? (arxiv.org/abs/2505.09598)
- IEA - Energy and AI 2025 (iea.org/reports/energy-and-ai/)
- UCL/UNESCO - 90% AI energy reduction study
The tool is open source, MIT licensed.
