<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jeff Geiser</title>
    <description>The latest articles on DEV Community by Jeff Geiser (@jeff_geiser).</description>
    <link>https://dev.to/jeff_geiser</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3714979%2Fcdad1fab-9f27-43bb-b5f0-951a5f8ec8f1.png</url>
      <title>DEV Community: Jeff Geiser</title>
      <link>https://dev.to/jeff_geiser</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jeff_geiser"/>
    <language>en</language>
    <item>
      <title>WES: Why Tokens Per Watt Isn't Enough for Edge Inference</title>
      <dc:creator>Jeff Geiser</dc:creator>
      <pubDate>Wed, 11 Mar 2026 16:33:17 +0000</pubDate>
      <link>https://dev.to/jeff_geiser/wes-why-tokens-per-watt-isnt-enough-for-edge-inference-fl3</link>
      <guid>https://dev.to/jeff_geiser/wes-why-tokens-per-watt-isnt-enough-for-edge-inference-fl3</guid>
      <description>&lt;p&gt;Edge inference is still nascent.&lt;/p&gt;

&lt;p&gt;I work at Zenlayer, helping companies deploy compute in hard-to-reach places. Spin up a VM with Ollama, pull a model, and you're running inference in minutes. The infrastructure is there. The tooling is maturing. But the metrics for understanding what's actually happening on those nodes are still catching up.&lt;/p&gt;

&lt;p&gt;I've also been building Wicklee on weekends — a sovereign GPU fleet monitor written in Rust with an embedded React dashboard. Running a mixed fleet of Apple Silicon and AMD CPU nodes, I kept hitting the same problem: the standard metrics weren't telling me quite enough.&lt;/p&gt;

&lt;p&gt;Everyone in AI talks about efficiency at the data center level. Jensen Huang talks tokens per watt. Google reports Gemini in watt-hours. Microsoft targets 8-20x energy reductions per query. Great work — but these are hyperscaler metrics, built for environments with precision cooling, facilities teams, and controlled everything.&lt;/p&gt;

&lt;p&gt;That's not edge inference.&lt;/p&gt;

&lt;p&gt;Here's a scenario that'll be familiar if you're running a distributed fleet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tok/s drops slightly&lt;/li&gt;
&lt;li&gt;board power creeps up slightly&lt;/li&gt;
&lt;li&gt;thermal state moves from Normal to Fair&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do get a drop in tokens/watt. But now what? Is it a blip? Is it meaningful? What do you chase?&lt;/p&gt;

&lt;p&gt;Honest answer: hard to tell. Tokens/watt can't distinguish a legitimate workload increase from thermal throttling. They look identical in the number. One means the node is doing its job. The other means it's quietly degrading, inserting wait states to avoid reaching a critical state. They require completely different responses.&lt;/p&gt;

&lt;p&gt;And in practice, a 15% drift on one node in a six-node fleet at 2am looks like noise. You don't chase it. The node keeps running. Keeps burning power. Keeps delivering worse inference. Until something obvious breaks.&lt;/p&gt;

&lt;p&gt;The data was there. Nothing put it in front of you.&lt;/p&gt;

&lt;p&gt;In other cases, tokens/watt can stand still even while tok/s is dropping, as long as power draw is dropping too. Efficiency looks stable, but throughput is actually falling.&lt;/p&gt;

&lt;p&gt;So I wanted to add thermal state to the equation.&lt;/p&gt;

&lt;p&gt;WES — the Wicklee Efficiency Score:&lt;br&gt;
WES = tok/s ÷ (Watts_adjusted × ThermalPenalty)&lt;/p&gt;

&lt;p&gt;The ThermalPenalty comes directly from what the device reports — on Apple Silicon that's IOPMCopyCPUPowerStatus via IOKit, on NVIDIA it's the nvmlDeviceGetCurrentClocksThrottleReasons() bitmask. Not temperature guesses. Not externally imposed thresholds. The hardware's own classification of its thermal condition, amplified into the score.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;&lt;tr&gt;&lt;th&gt;Thermal State&lt;/th&gt;&lt;th&gt;Penalty&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;&lt;td&gt;Normal&lt;/td&gt;&lt;td&gt;1.0&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Fair&lt;/td&gt;&lt;td&gt;1.25&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Serious&lt;/td&gt;&lt;td&gt;1.75&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td&gt;Critical&lt;/td&gt;&lt;td&gt;2.0+&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;When thermals are clean, penalty is 1.0 and WES equals tokens/watt. When throttling starts, the penalty amplifies the drop — turning a subtle drift you'd dismiss as noise into something that screams at you.&lt;/p&gt;

&lt;p&gt;Higher WES = better. Miles per gallon for inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why the leaderboard is the real insight&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WES on a single node is useful. The Wicklee fleet leaderboard is where it gets interesting.&lt;/p&gt;

&lt;p&gt;Stack rank every node by WES. A thermally degraded node doesn't just show a number that drifted. It falls in the ranking. Drops below nodes it was beating yesterday. That positional change is impossible to miss — you don't need to be actively monitoring anything, you just notice your #1 node is now #3.&lt;/p&gt;

&lt;p&gt;That's the moment tok/watt never creates.&lt;/p&gt;

&lt;p&gt;WES surfaces the signal. The thermal panel explains the cause. Route requests to the top of the leaderboard and you're automatically routing away from degraded nodes without lifting a finger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real numbers from my fleet&lt;/strong&gt;&lt;br&gt;
Running llama3.2:3b via Ollama across hardware:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjo5rwi6dhfl9ena5pcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjo5rwi6dhfl9ena5pcz.png" alt=" " width="678" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;tok/s makes this look like a 6x gap. WES shows a 1,293x efficiency difference. The Ryzen is fast. It is not efficient.&lt;/p&gt;

&lt;p&gt;Now throw the M2 into thermal throttling — WES drops from 181.5 to 83.6. Still #1 on the leaderboard. But the drop is visible. The thermal panel tells you why. WES made you notice. Thermal data gave you the diagnosis. They work together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw WES vs Penalized WES&lt;/strong&gt;&lt;br&gt;
Wicklee reports WES two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raw WES&lt;/strong&gt; — ThermalPenalty forced to 1.0. The hardware ceiling under clean conditions; essentially tok/watt.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Penalized WES&lt;/strong&gt; — live thermal penalty applied. Operational reality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gap is your Thermal Cost — the efficiency being lost to throttling right now:&lt;/p&gt;

&lt;p&gt;Thermal Cost % = (Raw WES − Penalized WES) / Raw WES × 100&lt;/p&gt;

&lt;p&gt;A node with Raw WES 181.5 and Penalized WES 83.6 is losing 54% of its potential efficiency to thermals. That's the number that drives action — not a raw temperature reading, not a wattage blip.&lt;/p&gt;
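As a quick sanity check, the Thermal Cost formula reproduces that 54% from the numbers above:

```javascript
// Worked check of Thermal Cost using the node above (Raw 181.5, Penalized 83.6).
const rawWES = 181.5;
const penalizedWES = 83.6;
const thermalCostPct = Math.round(((rawWES - penalizedWES) / rawWES) * 100);
// thermalCostPct is 54 — the node is giving up roughly half its potential efficiency to heat.
```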

&lt;p&gt;Here's the implementation if you want to compute it yourself:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-javascript"&gt;const THERMAL_PENALTIES = { Normal: 1.0, Fair: 1.25, Serious: 1.75, Critical: 2.0 };

function computeWESPair(tps, watts, thermalState, pue = 1.0) {
  const empty = { raw: null, penalized: null, thermalCostPct: null };
  if (tps == null) return empty;
  const penalty = THERMAL_PENALTIES[thermalState] ?? 1.0;
  const w = watts * pue;
  if (w &amp;lt;= 0) return empty;
  const raw = Math.round((tps / w) * 10) / 10;
  const penalized = Math.round((tps / (w * penalty)) * 10) / 10;
  const thermalCostPct = Math.round((1 - penalized / raw) * 100);
  return { raw, penalized, thermalCostPct };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code class="language-javascript"&gt;// Clean node:
computeWESPair(108.9, 0.6, "Normal");  // → { raw: 181.5, penalized: 181.5, thermalCostPct: 0 }

// Throttled node:
computeWESPair(94.1, 0.9, "Fair");     // → { raw: 104.6, penalized: 83.6, thermalCostPct: 20 }
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And in Rust for the monitoring agent side:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-rust"&gt;pub struct WESResult {
    pub raw: Option&amp;lt;f64&amp;gt;,
    pub penalized: Option&amp;lt;f64&amp;gt;,
    pub thermal_cost_pct: Option&amp;lt;f64&amp;gt;,
}

pub fn compute_wes_pair(
    tps: Option&amp;lt;f64&amp;gt;,
    watts: f64,
    thermal_penalty: f64,
    pue: f64,
) -&amp;gt; WESResult {
    let compute = |p: f64| {
        tps.and_then(|t| {
            let w = watts * pue;
            if w &amp;lt;= 0.0 { return None; }
            Some((t / (w * p) * 10.0).round() / 10.0)
        })
    };
    let raw = compute(1.0);
    let penalized = compute(thermal_penalty);
    let thermal_cost_pct = raw
        .zip(penalized)
        .map(|(r, p)| ((1.0 - p / r) * 100.0).round());
    WESResult { raw, penalized, thermal_cost_pct }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;WES is derived — compute it at render time from fields you're already collecting. No telemetry layer changes required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How WES relates to existing work&lt;/strong&gt;&lt;br&gt;
Stanford and Together AI published "Intelligence per Watt" (IPW) last year — accuracy divided by power, measured offline against benchmarks. Solid research. It answers "what is this hardware capable of per watt?"&lt;/p&gt;

&lt;p&gt;WES answers "what is it delivering right now?"&lt;/p&gt;

&lt;p&gt;Raw WES and IPW are the same question from different vantage points — IPW from a benchmark lab, WES from live fleet telemetry. IPW tells you the ceiling. WES tells you how close you're running to it, under real thermal conditions, continuously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's coming in Wicklee&lt;/strong&gt;&lt;br&gt;
The Fleet WES Leaderboard is shipping soon — every node ranked by Penalized WES, Raw WES as a secondary column, Thermal Cost % visible at a glance.&lt;/p&gt;

&lt;p&gt;After that, a series of benchmark posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cross-platform WES benchmarks&lt;/strong&gt; — Apple Silicon vs AMD CPU vs NVIDIA GPU, same model, same prompt. Raw WES per platform.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thermal stress testing&lt;/strong&gt; — deliberately inducing throttling and watching Raw vs Penalized WES diverge in real time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sustained load degradation&lt;/strong&gt; — how long before each platform throttles, and how fast WES collapses when it does.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge enclosure testing&lt;/strong&gt; — WES in a fanless case vs open air. Spoiler: not pretty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Goal: a reproducible WES dataset across hardware. Not just a formula — empirical data behind it.&lt;/p&gt;

&lt;p&gt;If you're running a local inference fleet, try computing your WES from the formula above and drop your numbers in the comments. Curious where different hardware lands.&lt;/p&gt;

&lt;p&gt;Miles per gallon for inference. When you need a race car, gas be damned — go for it. But at the edge, efficiency wins.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
    <item>
      <title>Distributed Inference Observability Gaps</title>
      <dc:creator>Jeff Geiser</dc:creator>
      <pubDate>Fri, 16 Jan 2026 19:10:12 +0000</pubDate>
      <link>https://dev.to/jeff_geiser/distributed-inference-observability-gaps-3pn3</link>
      <guid>https://dev.to/jeff_geiser/distributed-inference-observability-gaps-3pn3</guid>
      <description>&lt;p&gt;It seems that distributed inference observability has some gaps.&lt;/p&gt;

&lt;p&gt;To frame this: I'm referring to inference deployments at the edge (or the so-called near edge), PoPs close to end users. Say you're using Ollama for early testing and/or scaling, but vLLM in production.&lt;/p&gt;

&lt;p&gt;Traditional monitoring platforms will report GPU/CPU load, memory usage, network status, and so on.&lt;/p&gt;

&lt;p&gt;However, other stuff is also happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU throttled: 100% utilization, but clock speed dropped 33%&lt;/li&gt;
&lt;li&gt;KV cache saturated, causing queue backlog&lt;/li&gt;
&lt;li&gt;Time to first token spiked 200% from CPU contention&lt;/li&gt;
&lt;li&gt;Another tenant's PCIe traffic impacted inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Call it contextual drift: hardware stress that degrades inference performance in ways that are generally invisible to system metrics.&lt;/p&gt;

&lt;p&gt;Most of the monitoring on the market is built for servers and samples at intervals that may not make sense for inference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;token generation: 20-100 tokens per second&lt;/li&gt;
&lt;li&gt;cache saturation: spikes within seconds&lt;/li&gt;
&lt;li&gt;thermal throttling: effectively instant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional monitoring can make all of this look smooth if it only glances at the server every 30 seconds. But you also can't grab data every 2 seconds, or you may contribute to CPU scheduling pressure yourself.&lt;/p&gt;
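One way to resolve that sampling tension is adaptive backoff: sample fast while metrics are moving, slow down when they settle. A minimal sketch, assuming a 5% change threshold (the threshold, constants, and function name are all illustrative, not part of any existing agent):

```javascript
// Hypothetical interval backoff: fast sampling under volatility,
// exponential backoff toward a slow ceiling when metrics are quiet.
const FAST_MS = 2_000;   // 2 s while metrics are moving
const SLOW_MS = 30_000;  // 30 s ceiling when idle

function nextInterval(currentMs, relativeChange) {
  // relativeChange: |delta| of a watched metric (e.g. tok/s) vs. last sample
  if (relativeChange > 0.05) return FAST_MS;  // volatile → snap back to fast
  return Math.min(currentMs * 2, SLOW_MS);    // quiet → double, capped at slow
}
```

The shape matters more than the numbers: a spike in tok/s or thermal state pulls sampling back to 2 seconds immediately, while steady-state nodes cost almost nothing to watch.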

&lt;p&gt;So if you run both Ollama (for dev/test or smaller loads) and vLLM in production, they have completely different failure modes, but traditional monitoring treats them the same.&lt;/p&gt;

&lt;p&gt;We also have a blind spot around time to first token (TTFT) and time per output token (TPOT). Monitoring might show request latency spiking, but we need to know whether TTFT spiked or TPOT spiked.&lt;/p&gt;
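If per-token timestamps are available from a streaming client, splitting latency into TTFT and TPOT is only a few lines. A sketch (the function name and data shape are hypothetical):

```javascript
// Split end-to-end latency into TTFT and TPOT from per-token arrival times.
// tokenTimestampsMs: arrival time of each generated token, in ms.
function splitLatency(requestStartMs, tokenTimestampsMs) {
  if (tokenTimestampsMs.length === 0) return null;
  const ttftMs = tokenTimestampsMs[0] - requestStartMs;       // time to first token
  const last = tokenTimestampsMs[tokenTimestampsMs.length - 1];
  const tpotMs = tokenTimestampsMs.length > 1
    ? (last - tokenTimestampsMs[0]) / (tokenTimestampsMs.length - 1)  // avg inter-token gap
    : null;
  return { ttftMs, tpotMs };
}

// splitLatency(0, [250, 270, 290, 310]) → { ttftMs: 250, tpotMs: 20 }
```

A TTFT spike with flat TPOT points at queueing or prefill contention; a TPOT spike points at decode-time problems like throttling, which is exactly the distinction a single latency number hides.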

&lt;p&gt;So I am thinking about an open source project: a lightweight observability agent. Large companies will likely solve this by building a giant observability layer on top of their distributed inference solution, but I think a more bottom-up approach that can be deployed anywhere might make sense.&lt;/p&gt;

&lt;p&gt;The observability agent would strive to have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;limited CPU impact/overhead&lt;/li&gt;
&lt;li&gt;2-second sampling with intelligent backoff&lt;/li&gt;
&lt;li&gt;built-in TTFT/TPOT splitting&lt;/li&gt;
&lt;li&gt;contextual drift detection&lt;/li&gt;
&lt;li&gt;support for vLLM/Prometheus and Ollama API stats&lt;/li&gt;
&lt;li&gt;embedded DB storage (DuckDB?), no external dependencies&lt;/li&gt;
&lt;li&gt;the ability to run at the edge, with optional federation&lt;/li&gt;
&lt;/ul&gt;
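On the vLLM/Prometheus point: vLLM exposes metrics in the Prometheus text exposition format over HTTP, so the agent mostly needs a small scraper plus a parser. A sketch of the parsing step (the metric name in the example is made up for illustration; real vLLM metric names vary by version):

```javascript
// Pull the first sample of a metric out of a Prometheus text-format body.
// Handles both bare samples ("name 3") and labeled ones ("name{...} 3").
function parsePromMetric(text, metricName) {
  for (const line of text.split("\n")) {
    if (line.startsWith(metricName + " ") || line.startsWith(metricName + "{")) {
      return Number(line.trim().split(/\s+/).pop()); // last field is the value
    }
  }
  return null; // metric not exposed
}

const body = "# HELP fake_requests_running Fake gauge\nfake_requests_running 3\n";
parsePromMetric(body, "fake_requests_running"); // → 3
```

Fetching the body itself is one HTTP GET against the engine's metrics endpoint; the Ollama side would instead hit its JSON API and skip the text parsing entirely.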

&lt;p&gt;Curious to get feedback on where people are hitting observability gaps. This is a new area for me to spend time on, so all feedback is welcome.&lt;/p&gt;

&lt;p&gt;What are you doing to monitor vLLM and/or other inference engines?&lt;/p&gt;

&lt;p&gt;What metrics do you wish you had?&lt;/p&gt;

&lt;p&gt;Drop the war stories here. Thanks!&lt;/p&gt;

&lt;p&gt;(Apologies for the lack of formatting; maybe I will get better over time.)&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
