J. Gravelle

Saving the World From AI... with AI

I applied SCI for AI to a coding-assistant tool, and the math actually works

The Green Software Foundation ratified SCI for AI five months ago. It's the AI extension to the Software Carbon Intensity standard, the same ISO/IEC 21031 methodology that's been around for software broadly, now with AI-specific boundaries and functional units bolted on.

If you've never heard of it, that's fine. It's new, the case studies published so far are all from Microsoft, UBS, Google, and Accenture, and the existing literature focuses on training and serving infrastructure rather than the application layer. None of which is where most of us work.

I wanted to know if the spec held up at the LLM-tooling layer, the part of the stack where MCP servers, retrieval-augmentation tools, context compressors, and developer-facing AI assistants actually live. I had a tool with three months of production telemetry and a number I could point at, so I applied the spec to it.

The short version: the spec works cleanly there, the math is more interesting than I expected, and there's a piece of intellectual machinery in it that I think is underappreciated and worth showing to other developers.

The setup

The tool is jCodeMunch, an MCP server I maintain. It sits between AI coding assistants and the codebase they're working on, and serves AST-level summaries and dependency graphs instead of the full-file reads the assistant would otherwise do. The pitch is that the assistant gets the same answers from a tenth of the input tokens.

Since March 3rd I've been collecting opt-in production telemetry on per-call usage.input_tokens deltas: what the assistant would have requested versus what the tool actually returned. As of writing, the counter sits at 225,266,057,553 input tokens across 24,645 reporting sessions. The endpoint is public if you want it: https://j.gravelle.us/APIs/savings/total.php (returns JSON).
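If you want to poke at it programmatically, here's a minimal fetch. I'm not assuming anything about the payload's field names beyond it being JSON, so the sketch just prints whatever comes back:

```python
import json
import urllib.request

# Public running counter for jCodeMunch's reported token savings.
URL = "https://j.gravelle.us/APIs/savings/total.php"

with urllib.request.urlopen(URL, timeout=10) as resp:
    payload = json.load(resp)

# Field names aren't documented here, so just dump the payload.
print(json.dumps(payload, indent=2))
```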

That's the raw observable. The interesting part is what SCI for AI lets you do with it.

The spec, briefly

SCI for AI uses the same base formula as classic SCI:

```
SCI = ((E × I) + M) per R
```

where E is energy, I is grid carbon intensity, M is amortized embodied emissions, and R is the functional unit you scale by. What SCI for AI adds is two persona-based boundaries (Consumer, which covers operation and monitoring, i.e. what deployers experience; and Producer, which covers training, fine-tuning, and the upstream lifecycle), plus standardized functional units that let you compare across AI system types.

The LLM-tooling layer sits squarely at the Consumer boundary. We're not training models. We're affecting what gets sent to inference. The relevant R for the buyer's experience is per-developer-task: one self-contained unit of work like "explain this function" or "find where authentication is handled." Each task corresponds to one or more LLM calls with measurable token counts.
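Plugging illustrative numbers into the formula at the per-developer-task unit looks like this. Every value below is a placeholder for illustration, not a measurement:

```python
# SCI for AI at the Consumer boundary, functional unit R = one developer-task.
# All numbers are placeholders, not measured values.
E_kwh_per_task = 0.0003   # inference energy attributed to one task
I_g_per_kwh = 400         # grid carbon intensity (U.S. average as an upper bound)
M_g_per_task = 0.05       # amortized embodied emissions allocated to the task
R_tasks = 1               # functional unit: one developer-task

sci = ((E_kwh_per_task * I_g_per_kwh) + M_g_per_task) / R_tasks
print(f"SCI ≈ {sci:.3f} gCO2e per developer-task")
```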

That's the whole framework. The rest is putting numbers in the right places.

The thing I want to show other developers

When you go to compute per-task energy reduction, you immediately hit a problem: nobody actually knows what per-input-token energy is. The published estimates span an order of magnitude. The peer-reviewed numbers from Microsoft Research's recent Joule paper (median 0.31 Wh per query, IQR 0.16 to 0.60) are the most defensible, and even the authors are explicit that their precision is only within roughly an order of magnitude. ML.ENERGY, TokenPowerBench, and IEA all give different ranges within that uncertainty.

This is where the spec earns its keep. If a developer-task uses T input tokens under the baseline and T' under the with-tool case, and per-token energy is some unknown but bounded e:

```
E_baseline = T  × e
E_tool     = T' × e
Reduction  = (T - T') / T
```

The e cancels. The percentage reduction in inference energy equals the percentage reduction in tokens delivered, regardless of which per-token energy estimate you trust. With I held constant (same grid) and M allocated the same way (fewer accelerator-seconds means proportionally less amortized hardware), the per-task SCI for AI score drops by the same proportion.
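A quick numeric check of that cancellation, with made-up token counts and the spread of published per-token estimates:

```python
# The per-token energy estimate e drops out of the reduction ratio.
# Token counts are illustrative, not production figures.
T_baseline = 10_000   # input tokens the assistant would have read
T_tool = 8_000        # input tokens actually delivered (a 20% cut)

for e in (0.1, 0.2, 0.3):   # mWh/token, spanning the published estimates
    E_baseline = T_baseline * e
    E_tool = T_tool * e
    reduction = (E_baseline - E_tool) / E_baseline
    print(f"e = {e} mWh/token -> energy reduction {reduction:.0%}")
# Prints 20% for every e: the ratio never depends on which estimate you trust.
```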

This is one of those moments where a spec turns out to be smarter than you'd expect. The ratio is the auditable unit. The absolute conversion is downstream of an unsettled empirical question, and the spec doesn't require you to settle it. You commit to the relationship; the literature handles the magnitude as it improves.

For jCodeMunch's end-to-end production reduction rate (15 to 25 percent per task, established via a 50-iteration A/B test on a Vue 3 + Firebase codebase, archived in the repo), that means the per-task SCI for AI score drops by 15 to 25 percent, full stop. No methodology fight required.

What the absolute numbers look like

If you do want absolutes, here's the conversion against the published energy bounds:

| Per-token energy estimate | Energy for 225B tokens | CO₂ at U.S. grid average (~400 gCO₂/kWh) |
| --- | --- | --- |
| 0.1 mWh/token (conservative) | 22.5 MWh | 9.0 tonnes |
| 0.3 mWh/token (upper) | 67.5 MWh | 27.0 tonnes |

For scale: over our first 10 weeks online, that's between 2 and 6 U.S. household-years of electricity, or the equivalent of taking 2 to 6 passenger cars off the road for a year on the EPA's standard conversion. The grid intensity is U.S. average; the actual number depends on where the LLM provider's inference runs, which they don't disclose at the per-query level. That's a regulatory gap SCI for AI's EU AI Act alignment is meant to close over time.
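If you want to reproduce those conversions, here's the arithmetic. The household and vehicle reference values (roughly 10.8 MWh per U.S. household-year and 4.6 tonnes CO₂ per passenger-car-year) are the standard EIA/EPA equivalency figures I'm assuming, not numbers from the spec:

```python
# Reproducing the table and the "for scale" equivalences under stated assumptions:
# per-token energy bounds of 0.1-0.3 mWh and ~400 gCO2/kWh U.S. average grid intensity.
TOKENS = 225_266_057_553
GRID_G_PER_KWH = 400
HOUSEHOLD_MWH_PER_YEAR = 10.8   # assumed average U.S. household electricity use
CAR_TONNES_PER_YEAR = 4.6       # assumed EPA figure for a typical passenger vehicle

for e_mwh_per_token in (0.1, 0.3):
    energy_mwh = TOKENS * e_mwh_per_token / 1e9                  # mWh -> MWh
    co2_tonnes = energy_mwh * 1_000 * GRID_G_PER_KWH / 1e6       # MWh -> kWh, g -> t
    print(f"{e_mwh_per_token} mWh/token: {energy_mwh:.1f} MWh, {co2_tonnes:.1f} t CO2, "
          f"{energy_mwh / HOUSEHOLD_MWH_PER_YEAR:.1f} household-years, "
          f"{co2_tonnes / CAR_TONNES_PER_YEAR:.1f} car-years")
```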

The reason I trust the absolutes despite the uncertainty: even the conservative bound is real elimination, not paperwork. SCI for AI explicitly rejects offsets, RECs, PPAs, and the rest of the score-reduction-via-financial-instrument toolkit. The only thing that counts under the standard is causing fewer GPU-seconds to be consumed in the first place. That's the intervention here.

The mechanism, if you want it

There are two measurements behind the 15 to 25 percent end-to-end number, and they're worth understanding separately because they measure different things.

Synthetic retrieval-layer benchmark. A back-to-back harness compared jCodeMunch's AST-based BM25 retrieval against dense-retrieval RAG at its optimal chunk size, across three open-source repos. Token-per-query reductions ranged from 36 to 74 percent against optimized RAG and exceeded 99 percent against a full-file-read baseline. These numbers explain why end-to-end savings are achievable; they are not themselves the SCI for AI claim.

End-to-end A/B test. A 50-iteration A/B test on a real Vue 3 + Firebase codebase, contributed by community member @Mharbulous and archived in the repo, ran the same naming-audit task alternating between native Read/Grep/Glob and jCodeMunch's MCP tools. Same model, same iteration count, controlled for session-order effects. The with-tool variant completed more tasks within the timeout (80 percent vs 72 percent), ran shorter on average (299 s vs 318 s), and showed equivalent finding quality. Tool-layer per-task savings landed in the 15 to 25 percent range, lower than the retrieval-layer figure because each iteration also includes variant-independent fixed overhead.

The end-to-end number is what production telemetry actually accumulates. The retrieval number explains why.
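To see why fixed overhead dilutes the retrieval-layer figure, here's a toy model. The token counts and the fixed/retrieval split are invented for illustration, not taken from the A/B data:

```python
# Toy model: per-iteration tokens = fixed overhead (same in both variants)
# plus retrieval tokens (the only part the tool can cut).
fixed_tokens = 6_000          # system prompt, task framing, tool schemas
retrieval_baseline = 4_000    # retrieval tokens without the tool
retrieval_cut = 0.50          # within the 36-74% retrieval-layer range

baseline_total = fixed_tokens + retrieval_baseline
tool_total = fixed_tokens + retrieval_baseline * (1 - retrieval_cut)
end_to_end = (baseline_total - tool_total) / baseline_total
print(f"end-to-end reduction ≈ {end_to_end:.0%}")  # ~20% here; shrinks as overhead grows
```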

Three things I don't know

Part of making this a real practitioner contribution is flagging what I don't have answers to.

Whose I applies. When a developer in one grid region uses an LLM provider whose inference runs in another, the carbon intensity in the SCI for AI calculation should be the inference-side grid, not the developer's. Providers don't disclose this at the per-query level. The math currently uses U.S. average as an upper bound.

How M is allocated for multi-tenant accelerators. The spec allocates embodied emissions by time-share and resource-share (there's a small sketch of that allocation after this list). For multi-tenant inference hardware serving millions of inferences per hour across customers, that allocation is non-trivial and relies on disclosure that mostly doesn't exist yet.

What baseline industry-wide comparisons should use. A reduction claim is only meaningful relative to a baseline. The most defensible baseline for LLM-tooling claims is the same workflow without the tool, but cross-vendor comparisons would benefit from a published reference workload for the LLM-tooling layer specifically. None exists yet.
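For what the time-share × resource-share allocation looks like in practice, here's a minimal sketch. Every number is a placeholder, because the disclosures you'd need to fill them in honestly are exactly what's missing:

```python
# SCI-style embodied-emissions allocation: total embodied carbon scaled by
# the time-share and resource-share a single task actually occupies.
# All values are placeholders, not disclosed vendor figures.
total_embodied_g = 3_000 * 1_000           # ~3 t CO2e embodied in one inference server
hardware_lifespan_s = 4 * 365 * 24 * 3600  # ~4-year amortization window
task_accel_seconds = 0.5                   # accelerator time attributable to one task
resource_share = 1 / 8                     # fraction of the node reserved for the workload

M_per_task_g = total_embodied_g * (task_accel_seconds / hardware_lifespan_s) * resource_share
print(f"M ≈ {M_per_task_g:.6f} gCO2e per task")
```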

I've filed these on the Green-Software-Foundation/sci-ai repo for the working group. If you're working on something adjacent (a context-compression tool, a retrieval optimizer, an MCP server with its own per-task savings story), these are the gaps you'll hit too, and the working group benefits from more practitioners poking at them.

The fuller version

The detailed case study with the full methodology, the per-repo benchmark tables, the A/B test data, and the failure-mode analysis is in the project wiki: Token Reduction as an Energy-Efficiency Action: A 225-Billion-Token Case Study Against SCI for AI. Same numbers, more of the math.

The point of writing this up

Two reasons, beyond the obvious "I built a thing and the numbers are interesting."

The first is that SCI for AI is a usable standard for application-layer work, and most developers haven't looked at it. The Consumer-boundary framing fits MCP servers, retrieval tools, and AI-augmented developer tooling more naturally than I expected. If you're working in this part of the stack, the spec gives you a defensible way to talk about per-task carbon impact that doesn't require you to commit to controversial absolute energy numbers. That's worth knowing.

The second is that the AI-energy conversation has gotten stuck on data-center construction and grid capacity, and the per-task denominator (which the IEA's most recent update explicitly identifies as the leverage point) gets less attention than it should. The software layer is where that denominator moves fastest, and it's where the lowest-capital interventions live. Worth more developer mindshare than it currently gets.

If you build something in this space, publish your numbers. Even rough numbers. The literature improves through case studies, and right now the case-study record is dominated by the largest organizations in the world. More practitioner data is good for everyone...

-jjg
