<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dilber</title>
    <description>The latest articles on DEV Community by Dilber (@dilberx).</description>
    <link>https://dev.to/dilberx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3886105%2F2545acd7-accc-40b8-92fb-86ed5c055548.jpeg</url>
      <title>DEV Community: Dilber</title>
      <link>https://dev.to/dilberx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dilberx"/>
    <language>en</language>
    <item>
      <title>Why Tokens Per Joule Matters More Than Tokens Per Second</title>
      <dc:creator>Dilber</dc:creator>
      <pubDate>Tue, 21 Apr 2026 13:01:00 +0000</pubDate>
      <link>https://dev.to/dilberx/why-tokens-per-joule-matters-more-than-tokens-per-second-1o06</link>
      <guid>https://dev.to/dilberx/why-tokens-per-joule-matters-more-than-tokens-per-second-1o06</guid>
      <description>&lt;p&gt;Most GPU benchmarks report tokens/sec, but that metric ignores the dominant driver of real-world inference cost: energy. I built a cross-platform telemetry suite to measure Tokens Per Joule (T/J) — tokens/sec ÷ watts — alongside throughput. Think of it as miles per gallon for inference. The reference data across Apple Silicon and NVIDIA challenges some common assumptions about hardware selection.&lt;/p&gt;

&lt;h3&gt;TL;DR&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metric:&lt;/strong&gt; Tokens Per Joule (T/J) = tokens/sec ÷ watts. The inference equivalent of miles per gallon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finding:&lt;/strong&gt; Apple M1 Pro achieves &lt;strong&gt;2.42 T/J&lt;/strong&gt; vs NVIDIA RTX 3080 at &lt;strong&gt;0.90 T/J&lt;/strong&gt; — a 2.7× energy efficiency gap on identical workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surprise:&lt;/strong&gt; A 13.7GB model (Llama-3.1-8B Q8_0 at 8192 context) runs fine on M1 Pro's unified memory but OOMs on the 3080's 10GB VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology:&lt;/strong&gt; 11 GGUF models, 3 context windows, 10 runs per config, 95% confidence intervals, automated WikiText-2 perplexity validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open:&lt;/strong&gt; Pluggable architecture — adding AMD/Intel is ~100 LOC. Hardware ledger accepting community PRs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/dilberx/universal-llm-telemetry-suite" rel="noopener noreferrer"&gt;github.com/dilberx/universal-llm-telemetry-suite&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;Why Tokens Per Joule?&lt;/h2&gt;

&lt;p&gt;Tokens per second tells you how fast a GPU generates text. It doesn't tell you how much that speed costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tokens Per Joule = Tokens/Sec ÷ Watts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what that looks like in practice. At 1M tokens/day inference load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;M1 Pro  (2.42 T/J):  1,000,000 / 2.42 = 413,223 Joules = 0.115 kWh/day
RTX 3080 (0.90 T/J): 1,000,000 / 0.90 = 1,111,111 Joules = 0.309 kWh/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At $0.12/kWh, that's roughly &lt;strong&gt;$0.014/day vs $0.037/day&lt;/strong&gt; — a 2.7× difference. Small on a single machine. But across a fleet of inference nodes running 24/7, or against a tight power budget on edge hardware, the gap compounds fast.&lt;/p&gt;
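&lt;p&gt;The arithmetic above folds into a small helper. This is my own sketch, not code from the repo:&lt;/p&gt;

```python
def tokens_per_joule(tokens_per_sec: float, watts: float) -> float:
    """T/J = throughput divided by average power draw."""
    return tokens_per_sec / watts

def daily_energy_cost(tokens_per_day: float, t_per_j: float,
                      usd_per_kwh: float = 0.12) -> float:
    """Daily energy cost in USD for a given token volume and efficiency."""
    joules = tokens_per_day / t_per_j
    kwh = joules / 3.6e6          # 1 kWh = 3.6 MJ
    return kwh * usd_per_kwh

# M1 Pro vs RTX 3080 at 1M tokens/day
m1  = daily_energy_cost(1_000_000, 2.42)   # ~$0.014/day
rtx = daily_energy_cost(1_000_000, 0.90)   # ~$0.037/day
```

&lt;p&gt;Swap in your own measured throughput and power to price any node.&lt;/p&gt;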

&lt;p&gt;For any team where energy cost or power budget matters — edge devices, on-premise clusters, sustainability-conscious deployments — T/J is the metric that maps directly to operational cost.&lt;/p&gt;




&lt;h2&gt;Power Measurement Methodology&lt;/h2&gt;

&lt;p&gt;Before showing the data, I want to be upfront about how power is measured — because this is the single biggest source of confusion in cross-platform GPU benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA RTX 3080:&lt;/strong&gt; Power is read via &lt;code&gt;pynvml&lt;/code&gt; (&lt;code&gt;nvmlDeviceGetPowerUsage&lt;/code&gt;), which reports &lt;strong&gt;GPU board power (TBP)&lt;/strong&gt; in milliwatts. This does NOT include CPU, system RAM, or PSU losses. Sampled at 500ms intervals via a daemon thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apple M1 Pro:&lt;/strong&gt; Power is read via &lt;code&gt;sudo powermetrics&lt;/code&gt;, which reports &lt;strong&gt;whole-SoC power&lt;/strong&gt; — CPU + GPU + memory controller + IO. This is a broader measurement than NVIDIA's. Sampled at 500ms intervals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What this means:&lt;/strong&gt; The Apple measurement is more inclusive. If anything, this &lt;em&gt;disadvantages&lt;/em&gt; Apple in the comparison — if we added CPU idle power and memory controller power to the NVIDIA measurement, the efficiency gap would likely be wider. We report exactly what each vendor's API provides. No adjustments.&lt;/p&gt;

&lt;p&gt;Temperature, VRAM/memory usage, and clock speeds are logged continuously alongside power into &lt;code&gt;thermal_log.csv&lt;/code&gt; for every run.&lt;/p&gt;
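&lt;p&gt;For illustration, the 500ms daemon-thread sampling loop could look like the sketch below. &lt;code&gt;read_power_watts&lt;/code&gt; is a stand-in for whichever vendor call applies (e.g. &lt;code&gt;nvmlDeviceGetPowerUsage&lt;/code&gt; divided by 1000, or a parsed &lt;code&gt;powermetrics&lt;/code&gt; sample); the suite's actual implementation may differ:&lt;/p&gt;

```python
import threading
import time

class PowerSampler:
    """Samples a vendor power API on a daemon thread every 500ms and
    integrates the readings into total energy (Joules, approximate)."""

    def __init__(self, read_power_watts, interval_s: float = 0.5):
        self._read = read_power_watts   # vendor-specific callable returning watts
        self._interval = interval_s
        self._samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self._samples.append(self._read())
            self._stop.wait(self._interval)  # interruptible sleep

    def start(self):
        self._thread.start()

    def stop(self) -> dict:
        self._stop.set()
        self._thread.join()
        avg_w = sum(self._samples) / max(len(self._samples), 1)
        duration = len(self._samples) * self._interval  # approximation
        return {"avg_watts": avg_w, "joules": avg_w * duration}
```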




&lt;h2&gt;The Reference Data&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA RTX 3080 10GB GDDR6X (Linux, CUDA, latest stable driver)&lt;/li&gt;
&lt;li&gt;Apple M1 Pro 32GB UMA (macOS, Metal via llama.cpp)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt; &lt;a href="https://github.com/ggerganov/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt; (Metal and CUDA builds), models in &lt;a href="https://github.com/ggerganov/ggml" rel="noopener noreferrer"&gt;GGUF format&lt;/a&gt; (a binary format optimized for fast loading and inference with quantized LLMs) from Hugging Face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workload:&lt;/strong&gt; 11 GGUF models (Qwen-2.5-3B, Mistral-7B, Llama-3.1-8B across Q4_K_M, Q5_K_M, and Q8_0 quantizations — where Q8_0 stores 8-bit integer weights with one fp16 scale per 32-weight block and no offset, the highest-fidelity of the common quantizations), 3 context window sizes (512, 2048, 8192 tokens), 10 runs per configuration, and 95% confidence intervals computed from the run distributions. WikiText-2 perplexity is measured alongside throughput to verify that quantization-driven speed gains don't degrade output quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v23iaog3zqhrb8a9jdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5v23iaog3zqhrb8a9jdh.png" alt="The Efficiency Frontier: M1 Pro clusters at 2–3× higher T/J across all model families. Each data point is a 10-run average."&gt;&lt;/a&gt;&lt;em&gt;The Efficiency Frontier: M1 Pro clusters at 2–3× higher T/J across all model families. Each data point is a 10-run average.&lt;/em&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;RTX 3080 (10GB)&lt;/th&gt;
&lt;th&gt;M1 Pro (32GB UMA)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Peak T/J (Qwen-3B Q4_K_M)&lt;/td&gt;
&lt;td&gt;0.90 T/J&lt;/td&gt;
&lt;td&gt;2.42 T/J&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-3.1-8B Q8_0 @ 8K ctx&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;22 t/s @ 35W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thermal throttling&lt;/td&gt;
&lt;td&gt;None (SM ≥ 1440 MHz)&lt;/td&gt;
&lt;td&gt;None (&amp;lt; 65°C)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Power draw&lt;/td&gt;
&lt;td&gt;~198–220W GPU board&lt;/td&gt;
&lt;td&gt;~35W whole-SoC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 3080 wins raw throughput on workloads that fit within its 10GB VRAM. The M1 Pro wins every efficiency metric — and can run workloads the 3080 physically cannot.&lt;/p&gt;




&lt;h2&gt;The 13.7GB VRAM Boundary&lt;/h2&gt;

&lt;p&gt;The most instructive finding wasn't an efficiency measurement. It was an infrastructure failure.&lt;/p&gt;

&lt;p&gt;Llama-3.1-8B at Q8_0 quantization with an 8192-token context window requires approximately 13.7GB of memory (model weights + KV cache). On the RTX 3080, this immediately triggers an Out-of-Memory crash. 10GB of GDDR6X is a hard physical ceiling — the workload cannot start.&lt;/p&gt;
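&lt;p&gt;A back-of-envelope check of that 13.7GB figure (my estimate, using the public Llama-3.1-8B architecture: 32 layers, 8 KV heads via GQA, head dimension 128, fp16 KV cache; the gap to the measured total is llama.cpp compute buffers and framework overhead):&lt;/p&gt;

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    # Two tensors (K and V), each n_layers x n_kv_heads x head_dim x ctx_len
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama-3.1-8B at 8192-token context, fp16 KV cache
kv = kv_cache_bytes(32, 8, 128, 8192)       # 1 GiB exactly
weights_q8_0 = 8.0e9 * 8.5 / 8 / 1e9        # ~8.5 GB at ~8.5 bits/weight
# weights + KV ~ 9.5 GB already exceeds a 10GB card once compute
# buffers and CUDA context overhead are added on top.
```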

&lt;p&gt;On the M1 Pro with 32GB of Unified Memory, the same workload runs at 22 tokens/second while drawing 35W of whole-SoC power, staying below 65°C with zero thermal throttling across 10+ minute sustained loads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs78bw5ksuxxqf29gu17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs78bw5ksuxxqf29gu17.png" alt="The 13.7GB boundary: RTX 3080 OOMs, M1 Pro cruises at 22 t/s and 35W."&gt;&lt;/a&gt;&lt;em&gt;The 13.7GB boundary: RTX 3080 OOMs, M1 Pro cruises at 22 t/s and 35W.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't an Apple-vs-NVIDIA argument. It's a memory architecture observation. When a workload exceeds discrete VRAM capacity, the hardware is out of the game regardless of compute throughput. Apple's UMA — and increasingly, Intel's shared memory architecture on Arc — sidesteps this by treating system RAM and GPU memory as a single pool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical mitigations for the VRAM boundary:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drop to Q4_K_M quantization (roughly halves memory at acceptable perplexity cost — our data shows Q4_K_M is the Pareto sweet spot)&lt;/li&gt;
&lt;li&gt;Reduce context window from 8192 to 2048 (significant KV cache savings)&lt;/li&gt;
&lt;li&gt;Use a larger-VRAM discrete card (RTX 3090 24GB, RTX 4090 24GB)&lt;/li&gt;
&lt;li&gt;For Apple Silicon: UMA makes this a non-issue up to your total system RAM&lt;/li&gt;
&lt;/ul&gt;
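&lt;p&gt;Why Q4_K_M "roughly halves" weight memory: approximate bits-per-weight, derived from the GGUF block layouts (the Q8_0 figure is exact; the K-quant figures are ballpark):&lt;/p&gt;

```python
# Q8_0: 32 int8 weights + one fp16 scale per block -> (32*8 + 16) / 32 = 8.5 bpw
BPW = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8}   # K-quant values approximate

def weight_gb(n_params: float, quant: str) -> float:
    """Approximate on-disk/in-memory size of the weights alone."""
    return n_params * BPW[quant] / 8 / 1e9

llama8b_q8 = weight_gb(8.0e9, "Q8_0")     # ~8.5 GB
llama8b_q4 = weight_gb(8.0e9, "Q4_K_M")   # ~4.8 GB
```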




&lt;h2&gt;The Architecture&lt;/h2&gt;

&lt;p&gt;The telemetry architecture uses a pluggable Abstract Base Class (&lt;code&gt;TelemetryProvider&lt;/code&gt;) with four contract methods: &lt;code&gt;get_hardware_info()&lt;/code&gt;, &lt;code&gt;start()&lt;/code&gt;, &lt;code&gt;stop()&lt;/code&gt;, and &lt;code&gt;get_cli_flags()&lt;/code&gt;. Each hardware vendor gets its own provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NvidiaProvider&lt;/strong&gt; — &lt;code&gt;pynvml&lt;/code&gt; for GPU board power, temperature, VRAM, and SM clock speed at 500ms intervals via a daemon thread.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppleSiliconProvider&lt;/strong&gt; — &lt;code&gt;sudo powermetrics&lt;/code&gt; with plist output for whole-SoC power. &lt;code&gt;psutil&lt;/code&gt; for per-PID RSS tracking (macOS doesn't expose GPU-specific memory for Metal workloads).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROCmProvider / IntelProvider&lt;/strong&gt; — Stub implementations with documented API surfaces (&lt;code&gt;rocm-smi&lt;/code&gt;, &lt;code&gt;xpu-smi&lt;/code&gt;). Adding a new backend is ~100 lines of code with no changes to the core benchmark logic.&lt;/li&gt;
&lt;/ul&gt;
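&lt;p&gt;A minimal sketch of that contract — the method names come from the description above; signatures and bodies are my assumption:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class TelemetryProvider(ABC):
    """Contract every hardware backend implements."""

    @abstractmethod
    def get_hardware_info(self) -> dict: ...

    @abstractmethod
    def start(self) -> None: ...          # begin the 500ms sampling loop

    @abstractmethod
    def stop(self) -> dict: ...           # return aggregated power/thermal stats

    @abstractmethod
    def get_cli_flags(self) -> list: ...  # backend-specific llama-cli flags

class StubProvider(TelemetryProvider):
    """Shape of a new backend (e.g. ROCm via rocm-smi) — ~100 LOC in practice."""
    def get_hardware_info(self): return {"vendor": "stub"}
    def start(self): pass
    def stop(self): return {"avg_watts": 0.0}
    def get_cli_flags(self): return []
```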

&lt;p&gt;The orchestrator spawns inference via &lt;code&gt;llama-cli&lt;/code&gt;, links the process PID to the telemetry provider for accurate memory tracking, and computes 95% confidence intervals from 10-run distributions.&lt;/p&gt;
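&lt;p&gt;The confidence-interval step is standard small-sample statistics. A sketch of how it might be computed (Student's t; the critical value for n=10, 9 degrees of freedom is ~2.262 — the repo's exact implementation may differ):&lt;/p&gt;

```python
from statistics import mean, stdev

def ci95(samples: list) -> tuple:
    """95% confidence interval for the mean of a small sample."""
    t_crit = {9: 2.262}  # extend with other dof -> t values as needed
    n = len(samples)
    half = t_crit[n - 1] * stdev(samples) / n ** 0.5
    m = mean(samples)
    return (m - half, m + half)
```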




&lt;h2&gt;Limitations and Open Questions&lt;/h2&gt;

&lt;p&gt;T/J isn't always the primary metric. For low-latency interactive applications (chatbots, real-time coding assistants), raw tokens/sec directly affects user experience — and the RTX 3080 wins throughput on workloads that fit its VRAM. A 4090 or 5090 would win by even more.&lt;/p&gt;

&lt;p&gt;T/J becomes the dominant metric when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running inference at scale and energy is a line item&lt;/li&gt;
&lt;li&gt;You're on a power-constrained device (laptop, edge, mobile)&lt;/li&gt;
&lt;li&gt;You're choosing hardware for batch/offline inference where latency isn't critical&lt;/li&gt;
&lt;li&gt;Sustainability and carbon footprint factor into procurement decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other limitations to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sample size:&lt;/strong&gt; This is two devices. The efficiency frontier needs dozens of data points to be truly useful — which is why the repo is open for contributions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power measurement asymmetry:&lt;/strong&gt; Apple reports whole-SoC; NVIDIA reports GPU board only. We cannot make them identical without external metering hardware. We chose transparency over normalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Driver and firmware dependency:&lt;/strong&gt; Results may vary across driver versions. The exact versions used are documented in the repo's hardware configuration files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload scope:&lt;/strong&gt; All benchmarks use text generation (autoregressive decoding). Prefill-heavy or batched workloads may shift the efficiency calculus.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Frontier Is Open&lt;/h2&gt;

&lt;p&gt;The M1 Pro vs RTX 3080 data is a reference baseline, not a destination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apple M5 is live.&lt;/strong&gt; Does the M5's bandwidth leap translate to T/J gains, or is the M1 Pro already near the UMA efficiency ceiling?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA Blackwell is shipping.&lt;/strong&gt; Can the RTX 5090 and B200 close the 2.7× T/J gap with their new memory subsystems?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AMD ROCm is maturing.&lt;/strong&gt; Consumer RDNA3+ and data-center MI300X are completely unmapped in open efficiency benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intel Arc is emerging.&lt;/strong&gt; Arc's shared memory architecture offers a third data point between Apple UMA and traditional discrete VRAM.&lt;/p&gt;

&lt;p&gt;If you have access to any of this hardware, the suite takes about five minutes to set up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dilbersha/universal-llm-telemetry-suite
&lt;span class="nb"&gt;cd &lt;/span&gt;universal-llm-telemetry-suite
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements-apple-silicon.txt  &lt;span class="c"&gt;# or requirements.txt for NVIDIA&lt;/span&gt;
python src/download_models.py                  &lt;span class="c"&gt;# ~25GB of GGUF models&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./venv/bin/python src/orchestrator.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Submit a PR with your &lt;code&gt;results/&amp;lt;hardware-slug&amp;gt;/&lt;/code&gt; folder to get featured in the global hardware ledger.&lt;/p&gt;

&lt;p&gt;The efficiency frontier is a community project. So far the map has two data points; it needs many more.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/dilberx/universal-llm-telemetry-suite" rel="noopener noreferrer"&gt;github.com/dilberx/universal-llm-telemetry-suite&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raw data:&lt;/strong&gt; &lt;code&gt;results/master_ledger.csv&lt;/code&gt;, &lt;code&gt;results/m1_pro/production_benchmarks.csv&lt;/code&gt;, &lt;code&gt;results/reference_benchmarks/rtx_3080_baseline/production_benchmarks.csv&lt;/code&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
