DEV Community

I formalized the human in the AI energy equation

Felipe Cardoso, April 2026


Over the past few months I've been digging into something that's bothered me since I started running LLMs locally on my PC. The question was simple: why does a 3-billion-parameter model fail so often on its own, yet when I sit next to it and guide it through the task (breaking things down, checking each output, rewording what failed), the results improve dramatically?

Anyone who's used Copilot, Cursor, or any local LLM has noticed this. But I wanted to go beyond just "noticing it." I wanted to know if you could measure it. And if you could put it in an equation.

You can. And I wrote a paper about it.


The problem nobody connected

The academic literature on LLMs treats inference as an autonomous process. The model receives input, generates output, someone measures joules per token. If it got it wrong, it regenerates. That's the cost.

When researchers study "human-in-the-loop", they focus on quality. The human as a corrector that improves accuracy. When they study task decomposition, they focus on performance. Breaking the task up makes smaller models perform better. And when they measure energy, they assume the model runs on its own.

What I realized is that nobody had connected all three. Nobody had asked: if the human decomposes the task, validates each step and reformulates what failed, what's the impact on energy consumption? Does the human reduce waste? And if so, how do you formalize that?


What I did

I created HAIL, Human-Augmented Inference for Lightweight Models. It's a mathematical framework that places the human programmer as an explicit variable in the energy cost equation of LLM inference. Not as an external observer, but as part of the system.

The core idea is an error decay function:

δ(H) = (1 − H)^γ

Where H ∈ [0,1] is the level of human intervention (0 = model running alone, 1 = full orchestration) and γ captures how effective that orchestration is, meaning how good the human is at steering the model.

When γ > 1, the first human interventions already have a disproportionate effect. You don't need to control everything, you just need to intervene at the right points. It's pair programming, not micromanagement.
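The decay function is simple enough to sketch directly. A minimal implementation (the formula is from the paper; the example values of γ and H below are mine, chosen to illustrate the γ > 1 regime):

```python
def error_decay(h: float, gamma: float) -> float:
    """Residual error fraction: delta(H) = (1 - H) ** gamma.

    h     -- human intervention level in [0, 1]
             (0 = autonomous, 1 = full orchestration)
    gamma -- orchestration effectiveness; gamma > 1 means
             early interventions have an outsized effect.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("H must lie in [0, 1]")
    return (1.0 - h) ** gamma

# With gamma = 2, intervening at just H = 0.5 already cuts
# residual error to a quarter of the autonomous baseline:
print(error_decay(0.0, 2.0))  # 1.0   (model alone)
print(error_decay(0.5, 2.0))  # 0.25
print(error_decay(1.0, 2.0))  # 0.0   (full orchestration)
```

Note the concavity: going from H = 0 to H = 0.5 removes 75% of the residual error, while going from 0.5 to 1.0 only removes the remaining 25%. That's the "intervene at the right points" claim in miniature.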

The paper also introduces QDH (Quality-per-Dollar-Hour), a metric that measures output quality per dollar of hardware per hour. Because if you're running a model on an RTX 4070 instead of a datacenter A100, talking about "absolute quality" without considering cost tells you nothing useful.
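The paper's exact QDH formula isn't reproduced in this post, but one plausible reading, a quality score divided by the dollars of amortized hardware actually spent during the run, looks like this (the amortization scheme, lifetimes, and prices below are my assumptions, not the paper's numbers):

```python
def qdh(quality: float, hw_price_usd: float,
        lifetime_hours: float, runtime_hours: float) -> float:
    """Quality-per-Dollar-Hour, one plausible reading.

    Amortizes the hardware purchase price over its expected
    lifetime, then divides quality by the dollars spent
    during the run itself.
    """
    hourly_cost = hw_price_usd / lifetime_hours
    return quality / (hourly_cost * runtime_hours)

# Same quality score (0.8) for a 2-hour run on a ~$600
# RTX 4070 vs a ~$15,000 A100, both amortized over 3 years:
consumer = qdh(0.8, 600, 26_280, 2.0)
datacenter = qdh(0.8, 15_000, 26_280, 2.0)
print(consumer / datacenter)  # at equal quality, 25x per dollar-hour
```

Under this reading, at equal output quality the ratio collapses to the inverse ratio of hardware prices, which is exactly the point: the consumer GPU only loses on QDH if its quality drops by more than that price factor.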


What I found in my tests

I ran experiments on my desktop (RTX 4070, 12GB VRAM, Ollama) with models like GLM-4, Qwen2.5-Coder and their variants.

Decomposition helps, but you can't rely on it blindly. For medium-complexity tasks like generating a React component or setting up a simple API, breaking things into subtasks improves results a lot. But for trivial tasks, decomposition turned out to be a negative factor because the model loses context between steps. And for very complex tasks, decomposition alone just doesn't cut it.

There's a crossover point I call C* in the paper. Below a certain complexity, letting the model run on its own is more efficient. Above that point, human intervention makes up for the coordination overhead. The framework formalizes where that threshold is.
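The cost curves below are purely illustrative (the paper derives C* from its own cost model, which I'm not reproducing here); they just show how a crossover falls out whenever autonomous cost grows faster in complexity than assisted cost:

```python
def autonomous_cost(c: float) -> float:
    # Expected regenerations compound with task complexity
    # (illustrative quadratic growth, not the paper's model).
    return (1.0 + c) ** 2

def assisted_cost(c: float, overhead: float = 3.0) -> float:
    # Fixed human-coordination overhead plus near-linear work.
    return overhead + (1.0 + c)

# Scan complexity until assistance becomes cheaper: that's C*.
c = 0.0
while autonomous_cost(c) <= assisted_cost(c):
    c += 0.01
print(f"C* ~ {c:.2f}")
```

Below C*, the coordination overhead dominates and you should let the model run alone; above it, the quadratic regeneration cost dominates and intervention pays for itself. The shape, not the specific constants, is what the framework formalizes.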

Another thing that became clear in the tests: the human's value isn't just in decomposing the task. Simulating decomposition automatically, with no real human, produced results that were indistinguishable from autonomous mode on hard tasks. The real difference shows up when someone validates each step, carries context between subtasks, and adjusts direction as the model responds.

And the most common type of error was semantic. The 3B model generates code that compiles, passes the linter, looks correct. But it ignores half the instructions. You ask for an in-memory dictionary, it imports SQLAlchemy. You ask for error handling, it just doesn't do it. That kind of thing doesn't get fixed by regenerating. It gets fixed by someone who reads the output, understands what the model misinterpreted, and rewords the request.


What the Claude Code leak has to do with this

If you've been following the space, you saw that last week the full source code of Claude Code leaked. Half a million lines of TypeScript that make up the "harness," the engineering layer that sits between the model and the user.

What became obvious to everyone is that the product isn't the model. The product is the harness. It's 19 tools with granular permissions, a three-layer memory architecture, five context compaction strategies, a sub-agent system with isolation. The model is pluggable. The intelligence of the system lives in the orchestration around it.

And this connects directly to what HAIL is trying to formalize, just from the human side. Claude Code solves the problem with heavy software engineering around the model. HAIL proposes that the human can do something similar in a dynamic and adaptive way, without needing 500,000 lines of code. The question is: how much of that can a human do "by hand" with a small model? And what's the energy cost of that compared to the heavy harness approach running in a datacenter?

These are complementary questions, not competing ones. And I think after this leak it became clearer why it makes sense to formalize the human's role in the equation.


I need people to test this

The paper makes six falsifiable predictions. The experimental protocol is designed to run on consumer hardware. But I'm one person. For this to have real validity, I need more people running the same experiments under different conditions.

If you want to collaborate, here's what I'm looking for:

If you're early in programming (CS student, bootcamp, self-taught): you're the ideal profile for testing the framework under real conditions of someone who's still learning. The protocol has tasks of varying complexity, from simple to heavy. One of the things I need to find out is whether γ changes based on the operator's experience level. Minimum requirements: know how to run Ollama and have a GPU with at least 6GB of VRAM, or willingness to run a quantized model on CPU.

If you've been coding for a few years (mid/senior dev, freelancer, working on product): I need data from someone who already has fluency with LLMs and can steer the model more efficiently. The paper predicts γ > 1 for experienced operators, but that needs to be tested in practice. Requirements: ability to read the experimental protocol, run the benchmarks with energy measurement via RAPL (Linux) or HWiNFO (Windows), and report data in a format I can aggregate.
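For the Linux/RAPL side, the energy counters live under `/sys/class/powercap`. A minimal reading sketch; the package-0 path and the 32-bit wrap range below are common defaults, but check the `max_energy_range_uj` file on your own machine:

```python
RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj(path: str = RAPL_PATH) -> int:
    """Read the cumulative package energy counter (microjoules)."""
    with open(path) as f:
        return int(f.read().strip())

def delta_joules(before_uj: int, after_uj: int,
                 max_range_uj: int = 2**32) -> float:
    """Energy used between two readings, handling counter wrap."""
    d = after_uj - before_uj
    if d < 0:  # counter wrapped around during the run
        d += max_range_uj
    return d / 1e6

# Usage on a Linux box with RAPL exposed (may need root):
#   before = read_energy_uj()
#   ...run the inference...
#   after = read_energy_uj()
#   print(delta_joules(before, after), "J")
```

One gotcha worth flagging: the counter wraps silently, so any long benchmark that only reads it twice can under-report energy unless the wrap is handled, which is what `delta_joules` does above.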

If you're a researcher or grad student in any related area (AI, HCI, green computing, software engineering): it would be great to have someone who can review the formal side of the paper. The protocol uses Latin square design with 15 replications per condition, and if you have experience with experimental design, I can send you the paper and we can discuss methodology.

In any case, what I need is someone willing to follow the protocol and send me the data. It doesn't need to be anything super formal, a spreadsheet works.


The paper

The working paper (v2.1) is published as a preprint:

HAIL: Human-Augmented Inference for Lightweight Models

DOI: 10.5281/zenodo.19446269

Available in English and Portuguese. If you want to collaborate or just talk about the topic, reach out.


Felipe Cardoso is an independent researcher and Systems Analysis & Development student (IBMR) in Rio de Janeiro, Brazil.
