Most GPU benchmarks report tokens/sec, but that metric ignores the dominant driver of real-world inference cost: energy. I built a cross-platform telemetry suite to measure Tokens Per Joule (T/J) — tokens/sec ÷ watts — alongside throughput. Think of it as miles per gallon for inference. The reference data across Apple Silicon and NVIDIA challenges some common assumptions about hardware selection.
TL;DR
- Metric: Tokens Per Joule (T/J) = tokens/sec ÷ watts. The inference equivalent of miles per gallon.
- Finding: Apple M1 Pro achieves 2.42 T/J vs NVIDIA RTX 3080 at 0.90 T/J — a 2.7× energy efficiency gap on identical workloads.
- Surprise: A 13.7GB model (Llama-3.1-8B Q8_0 at 8192 context) runs fine on M1 Pro's unified memory but OOMs on the 3080's 10GB VRAM.
- Methodology: 11 GGUF models, 3 context windows, 10 runs per config, 95% confidence intervals, automated WikiText-2 perplexity validation.
- Open: Pluggable architecture — adding AMD/Intel is ~100 LOC. Hardware ledger accepting community PRs.
Repo: github.com/dilbersha/universal-llm-telemetry-suite
Why Tokens Per Joule?
Tokens per second tells you how fast a GPU generates text. It doesn't tell you how much that speed costs.
Tokens Per Joule = Tokens/Sec ÷ Watts.
Here's what that looks like in practice. At 1M tokens/day inference load:
- M1 Pro (2.42 T/J): 1,000,000 ÷ 2.42 ≈ 413,223 J ≈ 0.115 kWh/day
- RTX 3080 (0.90 T/J): 1,000,000 ÷ 0.90 ≈ 1,111,111 J ≈ 0.309 kWh/day
At $0.12/kWh, that's roughly $0.014/day vs $0.037/day — a 2.7× difference. Small on a single machine. But across a fleet of inference nodes running 24/7, or against a tight power budget on edge hardware, the gap compounds fast.
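The arithmetic above is easy to sanity-check in code. A minimal sketch — the helper name and the $0.12/kWh default are mine for illustration, not the repo's API:

```python
# Daily energy and cost from a tokens-per-joule figure.
# T/J values and the electricity price mirror the article's example.

def daily_cost(tokens_per_day: float, tokens_per_joule: float,
               usd_per_kwh: float = 0.12) -> tuple[float, float]:
    """Return (kWh/day, USD/day) for a given efficiency."""
    joules = tokens_per_day / tokens_per_joule
    kwh = joules / 3.6e6          # 1 kWh = 3.6 MJ
    return kwh, kwh * usd_per_kwh

m1_kwh, m1_usd = daily_cost(1_000_000, 2.42)    # ~0.115 kWh, ~$0.014/day
gpu_kwh, gpu_usd = daily_cost(1_000_000, 0.90)  # ~0.309 kWh, ~$0.037/day
```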
For any team where energy cost or power budget matters — edge devices, on-premise clusters, sustainability-conscious deployments — T/J is the metric that maps directly to operational cost.
Power Measurement Methodology
Before showing the data, I want to be upfront about how power is measured — because this is the single biggest source of confusion in cross-platform GPU benchmarks.
NVIDIA RTX 3080: Power is read via pynvml (nvmlDeviceGetPowerUsage), which reports GPU board power (TBP) in milliwatts. This does NOT include CPU, system RAM, or PSU losses. Sampled at 500ms intervals via a daemon thread.
Apple M1 Pro: Power is read via sudo powermetrics, which reports whole-SoC power — CPU + GPU + memory controller + IO. This is a broader measurement than NVIDIA's. Sampled at 500ms intervals.
What this means: The Apple measurement is more inclusive. If anything, this disadvantages Apple in the comparison — if we added CPU idle power and memory controller power to the NVIDIA measurement, the efficiency gap would likely be wider. We report exactly what each vendor's API provides. No adjustments.
Temperature, VRAM/memory usage, and clock speeds are logged continuously alongside power into thermal_log.csv for every run.
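The 500ms sampling loop behind both providers can be sketched as a hardware-neutral daemon thread. `PowerSampler` and `read_watts` are illustrative names, not the suite's actual API; on NVIDIA, `read_watts` would wrap `pynvml.nvmlDeviceGetPowerUsage` (which returns milliwatts), and on Apple it would parse `powermetrics` output:

```python
# Sketch: a daemon thread polls an injectable power-reading callable
# every `interval_s` seconds and accumulates (timestamp, watts) samples.
import threading
import time

class PowerSampler:
    def __init__(self, read_watts, interval_s: float = 0.5):
        self.read_watts = read_watts            # hardware-specific callable
        self.interval_s = interval_s
        self.samples: list[tuple[float, float]] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Sample until stop() is requested; wait() doubles as the sleep.
        while not self._stop.is_set():
            self.samples.append((time.monotonic(), self.read_watts()))
            self._stop.wait(self.interval_s)

    def start(self):
        self._thread.start()

    def stop(self) -> float:
        """Stop sampling and return the mean power in watts."""
        self._stop.set()
        self._thread.join()
        watts = [w for _, w in self.samples]
        return sum(watts) / len(watts) if watts else 0.0
```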
The Reference Data
Hardware:
- NVIDIA RTX 3080 10GB GDDR6X (Linux, CUDA, latest stable driver)
- Apple M1 Pro 32GB UMA (macOS, Metal via llama.cpp)
Software: llama.cpp (Metal and CUDA builds), models in GGUF format (a binary format optimized for fast loading and inference with quantized LLMs) from Hugging Face.
Workload: 11 GGUF models (Qwen-2.5-3B, Mistral-7B, and Llama-3.1-8B across Q4_K_M, Q5_K_M, and Q8_0 quantizations — Q8_0 stores 8-bit integer weights with one scale factor per 32-weight block, the highest-fidelity quantization in the set), 3 context window sizes (512, 2048, 8192 tokens), and 10 runs per configuration, with 95% confidence intervals computed from each run distribution. WikiText-2 perplexity is measured alongside throughput to verify that quantization-driven speed gains don't degrade output quality.
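The 10-run, 95%-confidence-interval aggregation can be sketched as follows; the throughput samples here are invented for illustration, and the repo's exact statistics code may differ:

```python
# Mean and 95% CI half-width over repeated runs, using a Student-t
# critical value (t ~= 2.262 for 9 degrees of freedom, i.e. 10 runs).
import statistics

def ci95(samples: list[float], t_crit: float = 2.262) -> tuple[float, float]:
    """Return (mean, half-width) of the 95% CI for a small sample."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    return mean, t_crit * sem

runs = [22.1, 21.8, 22.4, 22.0, 21.9, 22.3, 22.2, 21.7, 22.5, 22.1]
mean, hw = ci95(runs)
print(f"{mean:.2f} ± {hw:.2f} t/s")  # → 22.10 ± 0.18 t/s
```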
The Efficiency Frontier: M1 Pro clusters at 2–3× higher T/J across all model families. Each data point is a 10-run average.
| Metric | RTX 3080 (10GB) | M1 Pro (32GB UMA) |
|---|---|---|
| Peak T/J (Qwen-3B Q4_K_M) | 0.90 T/J | 2.42 T/J |
| Llama-3.1-8B Q8_0 @ 8K ctx | OOM | 22 t/s @ 35W |
| Thermal throttling | None (SM ≥ 1440 MHz) | None (< 65°C) |
| Power draw | ~198–220W GPU board | ~35W whole-SoC |
The 3080 wins raw throughput on workloads that fit within its 10GB VRAM. The M1 Pro wins every efficiency metric — and can run workloads the 3080 physically cannot.
The 13.7GB VRAM Boundary
The most instructive finding wasn't an efficiency measurement. It was an infrastructure failure.
Llama-3.1-8B at Q8_0 quantization with an 8192-token context window requires approximately 13.7GB of memory (model weights + KV cache). On the RTX 3080, this immediately triggers an Out-of-Memory crash. 10GB of GDDR6X is a hard physical ceiling — the workload cannot start.
On the M1 Pro with 32GB of Unified Memory, the same workload runs at 22 tokens/second while drawing 35W of whole-SoC power, staying below 65°C with zero thermal throttling across 10+ minute sustained loads.
The 13.7GB boundary: RTX 3080 OOMs, M1 Pro cruises at 22 t/s and 35W.
This isn't an Apple-vs-NVIDIA argument. It's a memory architecture observation. When a workload exceeds discrete VRAM capacity, the hardware is out of the game regardless of compute throughput. Apple's UMA — and increasingly, Intel's shared memory architecture on Arc — sidesteps this by treating system RAM and GPU memory as a single pool.
Practical mitigations for the VRAM boundary:
- Drop to Q4_K_M quantization (roughly halves memory at acceptable perplexity cost — our data shows Q4_K_M is the Pareto sweet spot)
- Reduce context window from 8192 to 2048 (significant KV cache savings)
- Use a larger-VRAM discrete card (RTX 3090 24GB, RTX 4090 24GB)
- For Apple Silicon: UMA makes this a non-issue up to your total system RAM
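Where the ~13.7GB comes from can be approximated from Llama-3.1-8B's published hyperparameters (32 layers, 8 KV heads of dimension 128). This is my back-of-the-envelope sketch, and it is a lower bound: llama.cpp also allocates compute buffers on top of weights and KV cache, which is why real usage exceeds it:

```python
# Rough memory estimate: quantized weights + fp16 KV cache.
# Hyperparameters are Llama-3.1-8B's config; the 8.5 GiB weight figure
# is the approximate size of its Q8_0 GGUF file.

def kv_cache_gib(n_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elt: int = 2) -> float:
    """fp16 K and V tensors for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * n_ctx / 2**30

weights_gib = 8.5
print(kv_cache_gib(8192))  # 1.0 GiB of KV cache at 8K context
print(kv_cache_gib(2048))  # 0.25 GiB at 2K — a 4x KV saving
```

Dropping the context window from 8192 to 2048 cuts the KV cache linearly, which is why it is such an effective mitigation.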
The Architecture
The telemetry architecture uses a pluggable Abstract Base Class (`TelemetryProvider`) with four contract methods: `get_hardware_info()`, `start()`, `stop()`, and `get_cli_flags()`. Each hardware vendor gets its own provider:
- `NvidiaProvider` — `pynvml` for GPU board power, temperature, VRAM, and SM clock speed at 500ms intervals via a daemon thread.
- `AppleSiliconProvider` — `sudo powermetrics` with plist output for whole-SoC power; `psutil` for per-PID RSS tracking (macOS doesn't expose GPU-specific memory for Metal workloads).
- `ROCmProvider` / `IntelProvider` — stub implementations with documented API surfaces (`rocm-smi`, `xpu-smi`). Adding a new backend is ~100 lines of code with no changes to the core benchmark logic.
The orchestrator spawns inference via llama-cli, links the process PID to the telemetry provider for accurate memory tracking, and computes 95% confidence intervals from 10-run distributions.
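The four-method contract might look like the following sketch — the method names come from the article, but the bodies, return types, and the stub class are illustrative rather than the repo's actual code:

```python
# Sketch of the pluggable provider contract described above.
from abc import ABC, abstractmethod

class TelemetryProvider(ABC):
    @abstractmethod
    def get_hardware_info(self) -> dict:
        """Static device description (name, memory, driver version)."""

    @abstractmethod
    def start(self) -> None:
        """Begin background power/thermal sampling."""

    @abstractmethod
    def stop(self) -> dict:
        """Stop sampling and return aggregated metrics."""

    @abstractmethod
    def get_cli_flags(self) -> list[str]:
        """Extra llama-cli flags this backend needs (e.g. GPU offload)."""

class StubProvider(TelemetryProvider):
    """Minimal no-op backend, in the spirit of the ROCm/Intel stubs."""
    def get_hardware_info(self) -> dict:
        return {"name": "stub"}
    def start(self) -> None:
        pass
    def stop(self) -> dict:
        return {"mean_watts": 0.0}
    def get_cli_flags(self) -> list[str]:
        return []
```

Because the orchestrator only talks to this interface, a new backend slots in without touching the benchmark logic — which is what keeps the AMD/Intel additions at ~100 LOC.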
Limitations and Open Questions
T/J isn't always the primary metric. For low-latency interactive applications (chatbots, real-time coding assistants), raw tokens/sec directly affects user experience — and the RTX 3080 wins throughput on workloads that fit its VRAM. A 4090 or 5090 would win by even more.
T/J becomes the dominant metric when:
- You're running inference at scale and energy is a line item
- You're on a power-constrained device (laptop, edge, mobile)
- You're choosing hardware for batch/offline inference where latency isn't critical
- Sustainability and carbon footprint factor into procurement decisions
Other limitations to note:
- Sample size: This is two devices. The efficiency frontier needs dozens of data points to be truly useful — which is why the repo is open for contributions.
- Power measurement asymmetry: Apple reports whole-SoC; NVIDIA reports GPU board only. We cannot make them identical without external metering hardware. We chose transparency over normalization.
- Driver and firmware dependency: Results may vary across driver versions. The exact versions used are documented in the repo's hardware configuration files.
- Workload scope: All benchmarks use text generation (autoregressive decoding). Prefill-heavy or batched workloads may shift the efficiency calculus.
The Frontier Is Open
The M1 Pro vs RTX 3080 data is a reference baseline, not a destination.
Apple M5 is live. Does the M5's bandwidth leap translate to T/J gains, or is the M1 Pro already near the UMA efficiency ceiling?
NVIDIA Blackwell is shipping. Can the RTX 5090 and B200 close the 2.7× T/J gap with their new memory subsystems?
AMD ROCm is maturing. Consumer RDNA3+ and data-center MI300X are completely unmapped in open efficiency benchmarks.
Intel Arc is emerging. Arc's shared memory architecture offers a third data point between Apple UMA and traditional discrete VRAM.
If you have access to any of this hardware, the suite takes about five minutes to set up:
```shell
git clone https://github.com/dilbersha/universal-llm-telemetry-suite
cd universal-llm-telemetry-suite
python3 -m venv venv && source venv/bin/activate
pip install -r requirements-apple-silicon.txt  # or requirements.txt for NVIDIA
python src/download_models.py                  # ~25GB of GGUF models
sudo ./venv/bin/python src/orchestrator.py
```
Submit a PR with your results/<hardware-slug>/ folder to get featured in the global hardware ledger.
The efficiency frontier is a community project. So far the map holds two data points; it needs many more.
Repo: github.com/dilbersha/universal-llm-telemetry-suite
Raw data: results/master_ledger.csv, results/m1_pro/production_benchmarks.csv, results/reference_benchmarks/rtx_3080_baseline/production_benchmarks.csv