How I discovered a hidden 146W power draw on NVIDIA A100 GPUs (and built an open‑source fix)
TL;DR: nvidia-smi reported 0% utilization, but the GPU was drawing 146W. Standard telemetry lies. I built an open‑source detector and a new efficiency benchmark (CEI).
The moment I knew something was wrong
I was running a matrix multiplication benchmark on an NVIDIA A100 SXM (RunPod, my own money). After the kernel finished, nvidia-smi said:
- GPU utilization: 0%
- Power draw: 146.66 W
Not a spike. It stayed there for 11+ minutes. The GPU was locked in P0 state, memory clock stuck at 1593 MHz, burning electricity while reporting “idle”.
I tested sampling intervals of 1 s, 100 ms, and even 10 ms; the blind spot persisted at every one.
This is a GHOST anomaly: telemetry that contradicts itself (0% utilization alongside load‑level power draw) and leads to over‑provisioned clusters, wasted energy, and wrong scaling decisions.
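The check described above can be sketched in a few lines. This is a minimal illustration, not the detector from the repo: the margin and minimum duration are hypothetical thresholds, the 67 W idle floor is the figure measured in this post, and the input is assumed to come from `nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader,nounits`.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed thresholds for illustration; tune them per GPU model and fleet.
IDLE_FLOOR_W = 67.0     # A100 idle floor reported in this post
MARGIN_W = 20.0         # hypothetical tolerance above the idle floor
MIN_DURATION_S = 60.0   # hypothetical: how long the state must persist


@dataclass
class Sample:
    t: float         # timestamp in seconds
    util_pct: float  # utilization.gpu as reported by nvidia-smi
    power_w: float   # power.draw in watts


def parse_line(t: float, line: str) -> Sample:
    """Parse one 'utilization, power' CSV line such as '0, 146.66'."""
    util, power = (field.strip() for field in line.split(","))
    return Sample(t=t, util_pct=float(util), power_w=float(power))


def is_ghost(samples: list[Sample]) -> bool:
    """True if any contiguous run of 0%-utilization, high-power samples
    lasts at least MIN_DURATION_S."""
    run_start = None
    for s in samples:
        suspicious = s.util_pct == 0 and s.power_w > IDLE_FLOOR_W + MARGIN_W
        if not suspicious:
            run_start = None  # the run is broken; start over
            continue
        if run_start is None:
            run_start = s.t
        if s.t - run_start >= MIN_DURATION_S:
            return True
    return False
```

Requiring a sustained run rather than a single sample keeps the normal ramp‑down after a kernel finishes from being flagged.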
What I did about it
I ran 35 hardware tests (24 on A100, 11 on H100). What I found:
- A100 idle floor is ~67 W, but ghost power can reach 146 W at 0% utilization.
- H100 showed no ghost power in any of my 11 tests, so the issue appears to be A100‑specific (possibly fixed in Hopper).
- NVIDIA’s own MIG documentation admits: “Profiling of shared GPU resources is not supported.” My tool fills that gap.
I defined Compute Energy Intensity (CEI) as floating‑point operations per joule of energy consumed: CEI = FLOPs / joule. Higher is better.
Reference: A100 sustained FP32 → 5.68 B FLOPs/J (Test 24, 900 s).
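To make the CEI definition concrete, here is a small sketch of how it could be computed from a run. It assumes energy is obtained by trapezoidal integration of sampled power readings and that the workload is a dense matrix multiplication counted as 2·m·n·k FLOPs. The function names are illustrative and are not taken from the repo.

```python
def energy_joules(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(times_s) != len(power_w) or len(times_s) < 2:
        raise ValueError("need at least two matching time/power samples")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def matmul_flops(m, n, k):
    """FLOPs for one (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def cei(total_flops, joules):
    """Compute Energy Intensity in FLOPs per joule."""
    if joules <= 0:
        raise ValueError("energy must be positive")
    return total_flops / joules
```

Usage: multiply `matmul_flops(m, n, k)` by the number of kernel launches, integrate the power samples recorded over the same window, and pass both to `cei`. The power samples should cover the whole timed interval, otherwise the energy is underestimated and CEI is inflated.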
Then I built the AI GPU Energy Optimizer – an open‑source platform that:
- Detects DESYNC/GHOST anomalies in real time.
- Provides CEI benchmarking across 17+ cloud providers (AWS, GCP, Azure, RunPod, etc.).
- Integrates with Kubernetes / Run:ai for auto‑eviction.
- Deploys with a single docker-compose up.
✅ All 40 platform tests pass. Live API: ai-gpu-brain-v3.onrender.com/docs
Why this matters
Cloud providers and AI teams are paying for electricity they can’t see. At 500 GPUs, ghost waste can exceed $150/day in hidden energy + cooling.
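A back‑of‑the‑envelope version of that estimate, as a sketch. The 146 W and 67 W figures come from the measurements above. The electricity price ($0.12/kWh), the PUE of 1.4 used to account for cooling, and the worst‑case assumption that every GPU sits in the ghost state all day are hypothetical inputs, so substitute your own.

```python
def ghost_cost_per_day(num_gpus, ghost_w, idle_w, usd_per_kwh, pue, hours=24.0):
    """Daily cost of power drawn above the idle floor, including facility overhead."""
    excess_kw = num_gpus * (ghost_w - idle_w) / 1000.0  # wasted IT power in kW
    it_kwh = excess_kw * hours                          # wasted IT energy per day
    facility_kwh = it_kwh * pue                         # add cooling and overhead via PUE
    return facility_kwh * usd_per_kwh


# Assumed inputs: $0.12/kWh and PUE 1.4 are illustrative, not measured.
cost = ghost_cost_per_day(500, ghost_w=146.0, idle_w=67.0, usd_per_kwh=0.12, pue=1.4)
print(f"${cost:.2f} per day")
```

Under these assumptions the result is roughly $159 per day. The number scales linearly with the fraction of time GPUs actually spend in the ghost state.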
The tool is open source, but I need sponsored compute (100‑500 GPUs on MIG partitions) to scale validation and prove the ROI. I’m an independent researcher in BC, Canada – all tests so far were at my own expense.
If you run GPU fleets or work at a cloud provider, let’s talk.
Resources
- 📄 Full white paper (detailed methodology, 35 tests, statistical confidence): github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md
- 💻 GitHub repo (open‑source, MIT‑licensed code): github.com/mikebains41-debug/ai-gpu-energy-optimizer-
- 🚀 Live API / Swagger: ai-gpu-brain-v3.onrender.com/docs
Tags: gpu ai opensource observability energyefficiency
– Mike Bains (mikebains41@gmail.com)