DEV Community

I formalized the human in the AI energy equation

Felipe Cardoso, April 2026


Over the past few months I've been digging into something that's bothered me since I started running LLMs locally on my PC. The question was simple: why does a 3-billion-parameter model fail so often on its own, yet when I sit next to it and guide it through the task (breaking things down, checking each output, rewording what failed), the results improve dramatically?

Anyone who's used Copilot, Cursor, or any local LLM has noticed this. But I wanted to go beyond just "noticing it." I wanted to know if you could measure it. And if you could put it in an equation.

You can. And I wrote a paper about it.


The problem nobody connected

The academic literature on LLMs treats inference as an autonomous process. The model receives input, generates output, someone measures joules per token. If it got it wrong, it regenerates. That's the cost.

When researchers study "human-in-the-loop", they focus on quality. The human as a corrector that improves accuracy. When they study task decomposition, they focus on performance. Breaking the task up makes smaller models perform better. And when they measure energy, they assume the model runs on its own.

What I realized is that nobody had connected all three. Nobody had asked: if the human decomposes the task, validates each step and reformulates what failed, what's the impact on energy consumption? Does the human reduce waste? And if so, how do you formalize that?


What I did

I created HAIL, Human-Augmented Inference for Lightweight Models. It's a mathematical framework that places the human programmer as an explicit variable in the energy cost equation of LLM inference. Not as an external observer, but as part of the system.

The core idea is an error decay function:

δ(H) = (1 − H)^γ

Where H ∈ [0,1] is the level of human intervention (0 = model running alone, 1 = full orchestration) and γ captures how effective that orchestration is, meaning how good the human is at steering the model.

When γ > 1, the first human interventions already have a disproportionate effect. You don't need to control everything, you just need to intervene at the right points. It's pair programming, not micromanagement.
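The decay function is simple enough to sketch directly. A minimal implementation (the formula is from the paper; the example values of γ and H below are mine, chosen to illustrate the γ > 1 regime):

```python
def error_decay(h: float, gamma: float) -> float:
    """Residual error fraction: delta(H) = (1 - H) ** gamma.

    h     -- human intervention level in [0, 1]
             (0 = autonomous, 1 = full orchestration)
    gamma -- orchestration effectiveness; gamma > 1 means
             early interventions have an outsized effect.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("H must lie in [0, 1]")
    return (1.0 - h) ** gamma

# With gamma = 2, intervening at just H = 0.5 already cuts
# residual error to a quarter of the autonomous baseline:
print(error_decay(0.0, 2.0))  # 1.0   (model alone)
print(error_decay(0.5, 2.0))  # 0.25
print(error_decay(1.0, 2.0))  # 0.0   (full orchestration)
```

Note the concavity: going from H = 0 to H = 0.5 removes 75% of the residual error, while going from 0.5 to 1.0 only removes the remaining 25%. That's the "intervene at the right points" claim in miniature.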

The paper also introduces QDH (Quality-per-Dollar-Hour), a metric that measures output quality per dollar of hardware per hour. Because if you're running a model on an RTX 4070 instead of a datacenter A100, talking about "absolute quality" without considering cost tells you nothing useful.
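The paper's exact QDH formula isn't reproduced in this post, but one plausible reading, a quality score divided by the dollars of amortized hardware actually spent during the run, looks like this (the amortization scheme, lifetimes, and prices below are my assumptions, not the paper's numbers):

```python
def qdh(quality: float, hw_price_usd: float,
        lifetime_hours: float, runtime_hours: float) -> float:
    """Quality-per-Dollar-Hour, one plausible reading.

    Amortizes the hardware purchase price over its expected
    lifetime, then divides quality by the dollars spent
    during the run itself.
    """
    hourly_cost = hw_price_usd / lifetime_hours
    return quality / (hourly_cost * runtime_hours)

# Same quality score (0.8) for a 2-hour run on a ~$600
# RTX 4070 vs a ~$15,000 A100, both amortized over 3 years:
consumer = qdh(0.8, 600, 26_280, 2.0)
datacenter = qdh(0.8, 15_000, 26_280, 2.0)
print(consumer / datacenter)  # at equal quality, 25x per dollar-hour
```

Under this reading, at equal output quality the ratio collapses to the inverse ratio of hardware prices, which is exactly the point: the consumer GPU only loses on QDH if its quality drops by more than that price factor.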


What I found in my tests

I ran experiments on my desktop (RTX 4070, 12GB VRAM, Ollama) with models like GLM-4, Qwen2.5-Coder and their variants.

Decomposition helps, but you can't rely on it blindly. For medium-complexity tasks like generating a React component or setting up a simple API, breaking things into subtasks improves results a lot. But for trivial tasks, decomposition turned out to be a negative factor because the model loses context between steps. And for very complex tasks, decomposition alone just doesn't cut it.

There's a crossover point I call C* in the paper. Below a certain complexity, letting the model run on its own is more efficient. Above that point, human intervention makes up for the coordination overhead. The framework formalizes where that threshold is.
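The cost curves below are purely illustrative (the paper derives C* from its own cost model, which I'm not reproducing here); they just show how a crossover falls out whenever autonomous cost grows faster in complexity than assisted cost:

```python
def autonomous_cost(c: float) -> float:
    # Expected regenerations compound with task complexity
    # (illustrative quadratic growth, not the paper's model).
    return (1.0 + c) ** 2

def assisted_cost(c: float, overhead: float = 3.0) -> float:
    # Fixed human-coordination overhead plus near-linear work.
    return overhead + (1.0 + c)

# Scan complexity until assistance becomes cheaper: that's C*.
c = 0.0
while autonomous_cost(c) <= assisted_cost(c):
    c += 0.01
print(f"C* ~ {c:.2f}")
```

Below C*, the coordination overhead dominates and you should let the model run alone; above it, the quadratic regeneration cost dominates and intervention pays for itself. The shape, not the specific constants, is what the framework formalizes.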

Another thing that became clear in the tests: the human's value isn't just in decomposing the task. Simulating decomposition automatically, with no real human, produced results that were indistinguishable from autonomous mode on hard tasks. The real difference shows up when someone validates each step, carries context between subtasks, and adjusts direction as the model responds.

And the most common type of error was semantic. The 3B model generates code that compiles, passes the linter, looks correct. But it ignores half the instructions. You ask for an in-memory dictionary, it imports SQLAlchemy. You ask for error handling, it just doesn't do it. That kind of thing doesn't get fixed by regenerating. It gets fixed by someone who reads the output, understands what the model misinterpreted, and rewords the request.


What the Claude Code leak has to do with this

If you've been following the space, you saw that last week the full source code of Claude Code leaked. Half a million lines of TypeScript that make up the "harness," the engineering layer that sits between the model and the user.

What became obvious to everyone is that the product isn't the model. The product is the harness. It's 19 tools with granular permissions, a three-layer memory architecture, five context compaction strategies, a sub-agent system with isolation. The model is pluggable. The intelligence of the system lives in the orchestration around it.

And this connects directly to what HAIL is trying to formalize, just from the human side. Claude Code solves the problem with heavy software engineering around the model. HAIL proposes that the human can do something similar in a dynamic and adaptive way, without needing 500,000 lines of code. The question is: how much of that can a human do "by hand" with a small model? And what's the energy cost of that compared to the heavy harness approach running in a datacenter?

These are complementary questions, not competing ones. And I think after this leak it became clearer why it makes sense to formalize the human's role in the equation.


I need people to test this

The paper makes six falsifiable predictions. The experimental protocol is designed to run on consumer hardware. But I'm one person. For this to have real validity, I need more people running the same experiments under different conditions.

If you want to collaborate, here's what I'm looking for:

If you're early in programming (CS student, bootcamp, self-taught): you're the ideal profile for testing the framework under real conditions of someone who's still learning. The protocol has tasks of varying complexity, from simple to heavy. One of the things I need to find out is whether γ changes based on the operator's experience level. Minimum requirements: know how to run Ollama and have a GPU with at least 6GB of VRAM, or willingness to run a quantized model on CPU.

If you've been coding for a few years (mid/senior dev, freelancer, working on product): I need data from someone who already has fluency with LLMs and can steer the model more efficiently. The paper predicts γ > 1 for experienced operators, but that needs to be tested in practice. Requirements: ability to read the experimental protocol, run the benchmarks with energy measurement via RAPL (Linux) or HWiNFO (Windows), and report data in a format I can aggregate.
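For the Linux/RAPL side, the energy counters live under `/sys/class/powercap`. A minimal reading sketch; the package-0 path and the 32-bit wrap range below are common defaults, but check the `max_energy_range_uj` file on your own machine:

```python
RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj(path: str = RAPL_PATH) -> int:
    """Read the cumulative package energy counter (microjoules)."""
    with open(path) as f:
        return int(f.read().strip())

def delta_joules(before_uj: int, after_uj: int,
                 max_range_uj: int = 2**32) -> float:
    """Energy used between two readings, handling counter wrap."""
    d = after_uj - before_uj
    if d < 0:  # counter wrapped around during the run
        d += max_range_uj
    return d / 1e6

# Usage on a Linux box with RAPL exposed (may need root):
#   before = read_energy_uj()
#   ...run the inference...
#   after = read_energy_uj()
#   print(delta_joules(before, after), "J")
```

One gotcha worth flagging: the counter wraps silently, so any long benchmark that only reads it twice can under-report energy unless the wrap is handled, which is what `delta_joules` does above.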

If you're a researcher or grad student in any related area (AI, HCI, green computing, software engineering): it would be great to have someone who can review the formal side of the paper. The protocol uses Latin square design with 15 replications per condition, and if you have experience with experimental design, I can send you the paper and we can discuss methodology.

In any case, what I need is someone willing to follow the protocol and send me the data. It doesn't need to be anything super formal, a spreadsheet works.


The paper

The working paper (v2.1) is published as a preprint:

HAIL: Human-Augmented Inference for Lightweight Models

DOI: 10.5281/zenodo.19446269

Available in English and Portuguese. If you want to collaborate or just talk about the topic, reach out.


Felipe Cardoso is an independent researcher and Systems Analysis & Development student (IBMR) in Rio de Janeiro, Brazil.
