Jeff Geiser

WES: Why Tokens Per Watt Isn't Enough for Edge Inference

Edge inference is still nascent.

I work at Zenlayer helping companies deploy compute in hard-to-reach places. Spin up a VM with Ollama, pull a model, and you're running inference in minutes. The infrastructure is there. The tooling is maturing. But the metrics for understanding what's actually happening on those nodes? Still catching up.

I've also been building Wicklee in my weekend time — a sovereign GPU fleet monitor written in Rust with an embedded React dashboard. Running a mixed fleet of Apple Silicon and AMD CPU nodes, I kept running into the same problem: the standard metrics weren't telling me enough. That frustration is what led to WES.

Everyone in AI talks about efficiency at the data center level. Jensen talks tokens per watt. Google reports Gemini in watt-hours. Microsoft targets 8-20x energy reductions per query. Great work — but these are hyperscaler metrics, built for environments with precision cooling, facilities teams, and controlled everything.

That's not edge inference.

Here's a scenario that'll be familiar if you're running a distributed fleet:

  • tok/s drops slightly
  • board power creeps up slightly
  • thermal state moves from Normal to Fair

You do get a drop in tokens/watt. But now what? Is it a blip? Is it meaningful? What do you chase?

Honest answer: hard to tell. Tokens/watt can't distinguish a legitimate workload increase from thermal throttling. They look identical in the number. One means the node is doing its job. The other means it's quietly degrading. They require completely different responses.

And in practice, a 15% drift on one node in a six-node fleet at 2am looks like noise. You don't chase it. The node keeps running. Keeps burning power. Keeps delivering worse inference. Until something obvious breaks.

The data was there. Nothing put it in front of you.

Worse, tokens/watt can hold perfectly steady while tok/s is falling: if power draw falls along with it, the ratio hides the degradation entirely.

Introducing WES — the Wicklee Efficiency Score

My answer: add thermal state to the equation.

WES — the Wicklee Efficiency Score:
WES = tok/s ÷ (Watts_adjusted × ThermalPenalty)

The ThermalPenalty comes directly from what the device reports — on Apple Silicon that's IOPMCopyCPUPowerStatus via IOKit, on NVIDIA it's the nvmlDeviceGetCurrentClocksThrottleReasons() bitmask. Not temperature guesses. Not externally imposed thresholds. The hardware's own classification of its thermal condition, amplified into the score.
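To make the NVIDIA side concrete, here's a minimal sketch of turning that throttle-reason bitmask into the four thermal states. The bit values mirror the constants in NVML's header; the severity mapping itself is an illustrative assumption, not Wicklee's actual classifier:

```javascript
// Bit values follow NVML's nvmlClocksThrottleReason* constants.
// The mapping to Normal/Fair/Serious/Critical is an assumption for illustration.
const SW_POWER_CAP        = 0x04;
const HW_SLOWDOWN         = 0x08;
const SW_THERMAL_SLOWDOWN = 0x20;
const HW_THERMAL_SLOWDOWN = 0x40;
const HW_POWER_BRAKE      = 0x80;

function thermalStateFromThrottleReasons(mask) {
  if (mask & HW_THERMAL_SLOWDOWN) return "Critical"; // hardware forcing clocks down
  if (mask & HW_SLOWDOWN)         return "Serious";
  if (mask & (SW_THERMAL_SLOWDOWN | HW_POWER_BRAKE | SW_POWER_CAP)) return "Fair";
  return "Normal";
}
```

The point is that the state comes from what the silicon itself reports, not from eyeballing a temperature graph.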

| Thermal State | Penalty |
| ------------- | ------- |
| Normal        | 1.0     |
| Fair          | 1.25    |
| Serious       | 1.75    |
| Critical      | 2.0+    |

When thermals are clean, penalty is 1.0 and WES equals tokens/watt. When throttling starts, the penalty amplifies the drop — turning a subtle drift you'd dismiss as noise into something that screams at you.

Higher WES = better. Miles per gallon for inference.

Why the leaderboard is the real insight

WES on a single node is useful. The Wicklee fleet leaderboard is where it gets interesting.

Stack rank every node by WES. A thermally degraded node doesn't just show a number that drifted. It falls in the ranking. Drops below nodes it was beating yesterday. That positional change is impossible to miss — you don't need to be actively monitoring anything, you just notice your #1 node is now #3.

That's the moment tok/watt never creates.

WES surfaces the signal. The thermal panel explains the cause. Route requests to the top of the leaderboard and you're automatically routing away from degraded nodes without lifting a finger.
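The rank-change detection is simple to sketch. Node names and WES numbers below are hypothetical, as are the two helper functions — this is just the shape of the idea, not Wicklee's code:

```javascript
// Rank nodes by Penalized WES (higher is better).
function rankNodes(nodes) {
  return [...nodes].sort((a, b) => b.wes - a.wes).map(n => n.name);
}

// A node "dropped" if it sits lower in today's ranking than yesterday's.
function detectDrops(previousRank, currentRank) {
  return currentRank.filter((name, i) => previousRank.indexOf(name) < i);
}

const yesterday = rankNodes([
  { name: "node-a", wes: 181.5 },
  { name: "node-b", wes: 120.3 },
  { name: "node-c", wes: 45.0 },
]);
const today = rankNodes([
  { name: "node-a", wes: 83.6 }, // thermally throttled overnight
  { name: "node-b", wes: 120.3 },
  { name: "node-c", wes: 45.0 },
]);

detectDrops(yesterday, today); // → ["node-a"] — fell from #1 to #2
```

Feed the top of `rankNodes` into your request router and the degraded node sheds load automatically.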

Real numbers from my fleet

Running llama3.2:3b via Ollama across hardware:

tok/s makes this look like a 6x gap. WES shows a 1,293x efficiency difference. The Ryzen is fast. It is not efficient.

Now throw the M2 into thermal throttling — WES drops from 181.5 to 83.6. Still #1 on the leaderboard. But the drop is visible. The thermal panel tells you why. WES made you notice. Thermal data gave you the diagnosis. They work together.

Raw WES vs Penalized WES

Wicklee reports WES two ways:

  • Raw WES — ThermalPenalty forced to 1.0. Hardware ceiling under clean conditions. Essentially tok/watt.
  • Penalized WES — live thermal penalty applied. Operational reality.

The gap between them is your Thermal Cost — the efficiency being lost to throttling right now:

Thermal Cost % = (Raw WES − Penalized WES) / Raw WES × 100

A node with Raw WES 181.5 and Penalized WES 83.6 is losing 54% of its potential efficiency to thermals. That's the number that drives action — not a raw temperature reading, not a wattage blip.

Here's the implementation if you want to compute it yourself:

```javascript
const THERMAL_PENALTIES = { Normal: 1.0, Fair: 1.25, Serious: 1.75, Critical: 2.0 };

function computeWESPair(tps, watts, thermalState, pue = 1.0) {
  const empty = { raw: null, penalized: null, thermalCostPct: null };
  if (tps == null) return empty;
  const penalty = THERMAL_PENALTIES[thermalState] ?? 1.0;
  const w = watts * pue;
  if (w <= 0) return empty; // same shape as the missing-tps case, not a bare null
  const raw = Math.round((tps / w) * 10) / 10;
  const penalized = Math.round((tps / (w * penalty)) * 10) / 10;
  const thermalCostPct = Math.round((1 - penalized / raw) * 100);
  return { raw, penalized, thermalCostPct };
}

// Clean node:
computeWESPair(108.9, 0.6, "Normal"); // → { raw: 181.5, penalized: 181.5, thermalCostPct: 0 }

// Throttled node:
computeWESPair(94.1, 0.9, "Fair"); // → { raw: 104.6, penalized: 83.6, thermalCostPct: 20 }
```
And in Rust for the monitoring agent side:

```rust
pub struct WESResult {
    pub raw: Option<f64>,
    pub penalized: Option<f64>,
    pub thermal_cost_pct: Option<f64>,
}

pub fn compute_wes_pair(
    tps: Option<f64>, watts: f64, thermal_penalty: f64, pue: f64,
) -> WESResult {
    // Compute WES for a given penalty, rounded to one decimal place.
    let compute = |p: f64| tps.and_then(|t| {
        let w = watts * pue;
        if w <= 0.0 { return None; }
        Some((t / (w * p) * 10.0).round() / 10.0)
    });
    let raw = compute(1.0);
    let penalized = compute(thermal_penalty);
    let thermal_cost_pct = raw.zip(penalized)
        .map(|(r, p)| ((1.0 - p / r) * 100.0).round());
    WESResult { raw, penalized, thermal_cost_pct }
}
```

WES is derived — compute it at render time from fields you're already collecting. No telemetry layer changes required.
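As a sketch of what "derived at render time" means: the dashboard just maps over telemetry samples it already has. The sample record shape below is a hypothetical, not Wicklee's schema:

```javascript
// Derive WES on the fly from fields the agent already reports.
// The { tps, watts, thermalState } record shape is assumed for illustration.
const PENALTIES = { Normal: 1.0, Fair: 1.25, Serious: 1.75, Critical: 2.0 };

function deriveWES(sample) {
  if (sample.tps == null || sample.watts <= 0) return null;
  const penalty = PENALTIES[sample.thermalState] ?? 1.0;
  return Math.round((sample.tps / (sample.watts * penalty)) * 10) / 10;
}

deriveWES({ tps: 108.9, watts: 0.6, thermalState: "Normal" }); // → 181.5
deriveWES({ tps: 94.1, watts: 0.9, thermalState: "Fair" });    // → 83.6
```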

How WES relates to existing work

Stanford and Together AI published "Intelligence per Watt" (IPW) last year — accuracy divided by power, measured offline against benchmarks. Solid research. It answers "what is this hardware capable of per watt?"

WES answers "what is it delivering right now?"

Raw WES and IPW are the same question from different vantage points — IPW from a benchmark lab, WES from live fleet telemetry. IPW tells you the ceiling. WES tells you how close you're running to it, under real thermal conditions, continuously.

What's coming in Wicklee

The Fleet WES Leaderboard is shipping in Phase 3A — every node ranked by Penalized WES, Raw WES as a secondary column, Thermal Cost % visible at a glance.

After that, a series of benchmark posts:

  • Cross-platform WES benchmarks — Apple Silicon vs AMD CPU vs NVIDIA GPU, same model, same prompt. Raw WES per platform.
  • Thermal stress testing — deliberately inducing throttling and watching Raw vs Penalized WES diverge in real time.
  • Sustained load degradation — how long before each platform throttles, and how fast WES collapses when it does.
  • Edge enclosure testing — WES in a fanless case vs open air. Spoiler: not pretty.

Goal: a reproducible WES dataset across hardware. Not just a formula — empirical data behind it.

If you're running a local inference fleet, try computing your WES from the formula above and drop your numbers in the comments. Curious where different hardware lands.

Miles per gallon for inference. When you need a race car, gas be damned — go for it. But at the edge, efficiency wins.
