Osama Alghanmi
We Need an Emission Test for AI

We test cars for emissions before they're allowed on the road. We rate appliances for energy efficiency. We slap labels on buildings telling you how much power they consume per square meter.

AI agents get none of this. No one asks how many tokens a system burned to answer a yes-or-no question.

Invisible Waste

Every token an LLM generates costs energy. Real electricity, real cooling, real hardware depreciation. A model that generates 2,000 tokens of preamble, caveats, and filler to deliver 40 tokens of actual information is producing waste. Physical, measurable, environmental waste.

Nobody's measuring it. We're in the "leaded gasoline" era of AI. The technology works, people love it, and the externalities are completely unpriced.

What Would This Look Like

A standardized benchmark for efficiency, not accuracy. Given a set of tasks with known correct answers, how many tokens does the system consume to get there?

Four metrics:

1. Token Efficiency Ratio (TER)

```
TER = useful_output_tokens / total_tokens_generated
```

A system that generates 500 tokens but only 80 carry actual information has a TER of 0.16. That's an F rating.
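A minimal sketch of how TER could be scored. The letter-grade cutoffs here are hypothetical, invented for illustration, not part of any existing standard:

```python
def token_efficiency_ratio(useful_tokens: int, total_tokens: int) -> float:
    """TER = useful output tokens / total tokens generated."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return useful_tokens / total_tokens

def letter_grade(ter: float) -> str:
    """Map a TER to a rating. Thresholds are illustrative only."""
    for cutoff, grade in [(0.8, "A"), (0.65, "B"), (0.5, "C"), (0.35, "D")]:
        if ter >= cutoff:
            return grade
    return "F"

ter = token_efficiency_ratio(80, 500)  # the example above
print(ter, letter_grade(ter))          # 0.16 F
```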

2. Task Completion Cost (TCC)

How many tokens (input + output) does the agent consume to complete a well-defined task? Summarize this document. Fix this bug. Answer this question. Two systems that both produce the correct answer are not equal if one uses 10x as many tokens.
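One way TCC could be scored, assuming each benchmark run records its token counts and whether the answer was correct (the names and numbers here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    input_tokens: int
    output_tokens: int
    correct: bool

def task_completion_cost(runs: list[TaskRun]) -> float:
    """Mean total tokens across runs that produced a correct answer."""
    costs = [r.input_tokens + r.output_tokens for r in runs if r.correct]
    return sum(costs) / len(costs) if costs else float("inf")

# Two systems, both correct on both runs -- but not equally cheap:
system_a = [TaskRun(1_000, 120, True), TaskRun(900, 150, True)]
system_b = [TaskRun(1_000, 1_500, True), TaskRun(900, 1_300, True)]
print(task_completion_cost(system_a))  # 1085.0
print(task_completion_cost(system_b))  # 2350.0
```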

3. Retry and Exploration Overhead

Agentic systems are the worst offenders. An agent that tries 5 wrong approaches before stumbling on the right one might "work," but it consumes six times the resources of one that reasons correctly on the first attempt.
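A sketch of how that overhead could be quantified, assuming the harness logs the token cost of each attempt:

```python
def exploration_overhead(attempt_tokens: list[int]) -> float:
    """Total tokens spent across all attempts divided by the tokens of
    the final, successful attempt. 1.0 means no wasted exploration."""
    if not attempt_tokens:
        raise ValueError("need at least one attempt")
    return sum(attempt_tokens) / attempt_tokens[-1]

# Five wrong 400-token approaches before a 400-token fix:
print(exploration_overhead([400, 400, 400, 400, 400, 400]))  # 6.0
```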

4. Conversation Waste Index

In multi-turn interactions, how much of the conversation is the AI repeating itself, restating the question, or generating text that the user already knows? The equivalent of an engine idling in traffic. Burning fuel, going nowhere.

The Numbers

ChatGPT alone has 900 million weekly active users. Add Gemini, Claude, Copilot, and the rest, and the real number is well over a billion. If each interaction costs an average of 500 unnecessary tokens, that's 500 billion tokens wasted per week. At roughly 0.001 kWh per 1,000 tokens (a conservative estimate for inference), that's 500,000 kWh per week in pure waste. Enough to power roughly 2,500 average US homes around the clock.
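The back-of-envelope math, with every input an assumption. The homes figure hinges on an average US household drawing roughly 200 kWh per week:

```python
weekly_interactions = 1_000_000_000  # assumed: >1B across all providers
wasted_tokens_each = 500             # assumed average waste per interaction
kwh_per_1k_tokens = 0.001            # rough inference estimate from above
home_kwh_per_week = 200              # ~average US household consumption

wasted_tokens = weekly_interactions * wasted_tokens_each
wasted_kwh = wasted_tokens / 1_000 * kwh_per_1k_tokens
homes_powered = wasted_kwh / home_kwh_per_week

print(f"{wasted_tokens:.1e} tokens, {wasted_kwh:,.0f} kWh, ~{homes_powered:,.0f} homes")
# 5.0e+11 tokens, 500,000 kWh, ~2,500 homes
```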

Agentic AI will multiply this by orders of magnitude. Systems that run autonomously, calling tools, spawning sub-agents, looping through retries. An agent that runs for 10 minutes, burning tokens in a loop, wastes more than your money. It wastes shared atmosphere.

Where the Analogy Holds and Where It Breaks

Cars produce CO2 as a byproduct of moving you from A to B. AI tokens produce CO2 as a byproduct of answering your question. In both cases:

  • The useful work can be done with vastly different amounts of waste
  • Consumers can't see the waste happening
  • Market incentives alone won't fix it (bigger models are "better," just like bigger engines were "better")
  • Regulation and labeling changed behavior (CAFE standards, Energy Star ratings)

Here's where it gets interesting: with cars, you can't make the engine think harder about whether it needs to burn that fuel. With AI, you can. The system can reason about whether a 2,000-token response is warranted or whether 50 tokens would do. The waste is in the software, not the physics, which actually makes this a more solvable problem than automotive emissions ever were.

What Would Change

Efficiency becomes a competitive axis: Right now, benchmarks reward accuracy. An emission benchmark would reward getting the same accuracy with fewer tokens. The model that scores 95% on 200 tokens is rated higher than the one that scores 96% on 2,000 tokens.

Agent frameworks get pressure to optimize: Today's agent architectures are shockingly wasteful. Retry loops, full-context re-reads, redundant tool calls. An emissions rating would push developers toward smarter planning, better caching, and more efficient use of tools.

Users get a basis for choosing: People pick AI tools based on vibes and marketing. An emission label, like the kWh sticker on your fridge, lets them factor in efficiency. "This agent is A-rated: it completes coding tasks with 3x fewer tokens than average."

Pricing reflects reality: Token pricing today is a race to the bottom. If we internalize the environmental cost, wasteful systems become expensive, and efficient ones become cheap.

Who Builds This

No single company should own this, especially not the ones selling the tokens. It needs:

  • An independent body (like the EPA, or the EU's energy labeling system) that defines the benchmark tasks and scoring
  • Standardized test suites: 100-500 diverse tasks with known optimal token budgets across coding, writing, reasoning, and agentic workflows
  • Transparent reporting: providers publish emission scores alongside capability benchmarks
  • Tiered ratings: A through F, stars, whatever makes it readable to non-technical users

The EU is already moving in this direction with the AI Act's sustainability provisions. But "report your energy consumption" is too vague. We need a per-task efficiency metric that lets you compare systems directly.

Two Agents, One Task

Same task: "Read this 200-line file and tell me if there's a SQL injection vulnerability."

Agent A reads the file and responds in 120 tokens:

"Yes. Line 47 passes user_input directly into an f-string SQL query without parameterization. Use parameterized queries instead."

Agent B reads the file, re-reads it, then generates 1,500 tokens: a summary of what SQL injection is, a history of the OWASP Top 10, three remediation strategies with code examples, notes about ORM interactions, and a disclaimer about completeness.

Both correct. Agent B might even score higher on current benchmarks for being "thorough." But it burned 12x the tokens, and 90% of its output was unrequested padding.

Agent A passes the emission test. Agent B is a gas guzzler.

So

We don't let cars on the road without testing their emissions. We shouldn't let AI agents into production without testing theirs.

As we scale these systems to billions of users and autonomous operation, we should probably figure out if we're building the computational equivalent of a 1970s muscle car: impressive, powerful, and catastrophically wasteful.

The token is the new gallon.


I'd like to hear from anyone working on AI sustainability, green computing, or model optimization. How would you design the benchmark?
