DEV Community: mikebains41-debug

From Ghost Power Discovery to Enterprise GPU Optimizer – How I Finished What I Started

mikebains41-debug — Sat, 23 May 2026 16:50:37 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

I built an open‑source GPU Energy Optimizer that detects a previously unknown telemetry anomaly: NVIDIA A100 GPUs draw 146.66W while reporting 0% utilization – sustained for 10+ minutes (Tests 13 & 14). I call this ghost power.

The project validates the anomaly with 35 hardware tests (24 A100 + 11 H100) and defines a new efficiency metric: CEI (Compute Energy Intensity) – FLOPs per joule. It includes a live API, a dashboard, a white paper, and a full enterprise‑scale architecture (TimescaleDB, batching, Prometheus, Kubernetes Helm, Morpheus pipeline).

🔗 GitHub repo (v1.0.1 release): github.com/mikebains41-debug/ai-gpu-energy-optimizer-

📄 White paper: WHITEPAPER.md

🚀 Live API: ai-gpu-brain-v3.onrender.com/docs

📊 Dashboard: ai-gpu-energy-optimizer.vercel.app

Demo

Before – early prototype (v0.1):

SQLite database (single file, no concurrency)
Direct HTTP POST per GPU (no batching)
No Prometheus metrics or dashboards
Manual deployment (docker‑compose only)
35 hardware tests, but no automated platform suite

After – production‑ready (v1.0.1 – final release):

TimescaleDB (PostgreSQL + hypertables, continuous aggregates)
Batched agent (30s windows, Redis queue, Celery workers)
Prometheus exporter + Grafana dashboards (native GHOST metric)
Kubernetes Helm chart + DaemonSet for agent deployment
Morpheus pipeline for real‑time anomaly detection and auto‑alert
75 tests (35 hardware + 40 platform) passing in CI

The Comeback Story

Where the project was before:

I started this as a personal validation on RunPod, running 24 A100 tests from my Samsung phone using Termux. The code was a loose collection of scripts, a single FastAPI instance with SQLite, and no scalability. It proved the anomaly existed – but it wasn’t ready for real fleets.

What I changed, fixed, and added to finish it (v1.0.1):

Over the past month, I rewrote the entire stack:

Database: Migrated from SQLite to TimescaleDB (hypertables, continuous aggregates).
Agent: Added batching, retries, and async sending (30s windows).
Queuing: Integrated Redis + Celery to decouple ingestion from processing.
Observability: Built a Prometheus exporter with GHOST/DESYNC metrics.
Orchestration: Created a Kubernetes Helm chart and DaemonSet for agent deployment.
AI Pipeline: Wrote a Morpheus pipeline that pulls live API data, scores CEI, and auto‑alerts.
Testing: Grew from 35 hardware tests to 75 total (including 40 platform validation tests).
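The batching and retry behavior from the Agent item above can be sketched as follows. This is a minimal, synchronous illustration and not the project's agent: the `BatchingAgent` class, the `send_batch` hook, and the retry count are hypothetical names and values chosen for the example. Only the 30s window comes from the post.

```python
import time
from typing import Callable, Dict, List

Sample = Dict[str, float]


class BatchingAgent:
    """Buffers telemetry samples and flushes them in fixed time windows."""

    def __init__(
        self,
        send_batch: Callable[[List[Sample]], None],  # hypothetical transport hook
        window_s: float = 30.0,                      # 30s window, as in the post
        max_retries: int = 3,                        # assumed retry budget
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_batch = send_batch
        self._window_s = window_s
        self._max_retries = max_retries
        self._clock = clock
        self._buffer: List[Sample] = []
        self._window_start = clock()

    def record(self, sample: Sample) -> None:
        """Add one sample; flush if the current window has elapsed."""
        self._buffer.append(sample)
        if self._clock() - self._window_start >= self._window_s:
            self.flush()

    def flush(self) -> bool:
        """Send the buffered samples as one batch, retrying on failure."""
        self._window_start = self._clock()
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for _ in range(self._max_retries):
            try:
                self._send_batch(batch)
            except Exception:
                continue  # a real agent would back off before retrying
            self._buffer.clear()
            return True
        return False  # samples stay buffered and go out with the next window
```

Injecting the clock keeps the window logic testable without sleeping. In a real deployment, `send_batch` would be whatever pushes the batch to the ingestion queue.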

The finishing moment was running the full enterprise test suite on a simulated 1000‑GPU cluster (using the new Morpheus test harness) and seeing all 30 M1‑M30 tests pass – then tagging the v1.0.1 release on GitHub.

My Experience with GitHub Copilot

I used AI assistance (including GitHub Copilot) throughout the rewrite:

Copilot suggested the TimescaleDB hypertable syntax and the best indexing strategies for time‑partitioned data.
It auto‑completed the batched agent’s async methods – saving hours of debugging asyncio edge cases.
When writing the Helm chart, Copilot generated the correct YAML structure for GPU node tolerations and volume mounts.
For the Morpheus pipeline, it filled in the boilerplate for the GpuTelemetryProcessorStage and the CEI scoring logic.
It also helped refactor the monolithic main.py into modular models.py, prometheus_metrics.py, and morpheus/pipeline.py.

The most valuable part was pair‑debugging: I’d describe an error (e.g., SQLAlchemy connection pool timeouts), and Copilot would suggest the fix (adding pool_pre_ping=True). Without this, finishing the enterprise stack would have taken twice as long.

This project is my proof that a solo developer – even from a phone – can build production‑grade infrastructure. The “finish” isn’t the end; it’s the foundation for scaling to 1000 GPUs and beyond.

AI tools were used in drafting this article and generating code.

Google's 2x Energy Efficiency Claim Is Real — But Here's What They're Not Measuring

mikebains41-debug — Sat, 23 May 2026 16:32:49 +0000

This is a submission for the Google I/O Writing Challenge

What I Found Building a Real Benchmark

My project — the AI GPU Energy Optimizer — measures something the industry largely ignores: what GPUs consume when they're doing nothing. We call it ghost power.

On an NVIDIA A100 SXM running on RunPod infrastructure, I measured:

Idle floor: 67W — the baseline you pay for just having the GPU allocated
Ghost power: up to 146W at 0% compute utilization — power draw with no workload running
FP16 vs FP32 delta: 483W vs 302W — a 60% power spike just from switching precision

That 146W ghost power figure isn't a bug. It's the cost of persistence mode, memory controller activity, and thermal management keeping the chip "ready." On a single GPU it's noise. At a million‑unit scale, it's infrastructure.

The Gap in Google's Story

Google's 2x performance‑per‑watt claim almost certainly measures peak compute throughput under load. That's the right number for training benchmarks. But it doesn't capture:

Idle energy floor — what you pay between inference requests
Ghost power — the overhead of allocation without utilization
Precision‑mode energy delta — the cost of switching between FP8, FP16, FP32
Per‑request energy amortization — especially relevant for real‑time inference at low batch sizes

For batch training at scale, Google's metric is exactly right. But for inference serving (the workload that's actually growing fastest), idle behavior can dominate total cost. A model serving 10 requests per second on a 300W GPU spends most of its energy budget on ghost power, not compute, whenever each request keeps the GPU busy for only a small fraction of each second.
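To make the per-request argument concrete, here is a rough duty-cycle model. It is a sketch under stated assumptions, not a measurement: the 20 ms of busy time per request is assumed for illustration, and the function name is hypothetical. The 300W busy draw and 146W zero-utilization draw are the figures quoted in this post.

```python
from typing import Tuple


def per_request_energy(
    requests_per_s: float,
    busy_s_per_request: float,  # assumed compute time per request
    busy_power_w: float,
    idle_power_w: float,
) -> Tuple[float, float]:
    """Return (joules per request, share of energy spent while idle)."""
    # Fraction of each second the GPU is actually computing, capped at 100%.
    duty = min(1.0, requests_per_s * busy_s_per_request)
    busy_w = duty * busy_power_w
    idle_w = (1.0 - duty) * idle_power_w
    avg_power_w = busy_w + idle_w
    return avg_power_w / requests_per_s, idle_w / avg_power_w


joules, idle_share = per_request_energy(10, 0.020, 300.0, 146.0)
print(f"{joules:.1f} J/request, {idle_share:.0%} spent idle")
# prints "17.7 J/request, 66% spent idle"
```

Under these assumptions the GPU is busy 20% of the time, yet roughly two thirds of the energy goes to the gaps between requests. Raising the request rate or batch size shrinks that share.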

What This Means for Developers

If you're building on Google Cloud GPU infrastructure — or any cloud GPU provider — three things from I/O 2026 matter for your energy costs:

Performance‑per‑watt is now a first‑class metric. Google made it explicit in the keynote. That means cloud providers will start surfacing it, and you should be asking for it in your SLAs.
Batch size is your energy lever. At low utilization, ghost power dominates. The single highest‑impact thing you can do is increase batch size to push utilization above idle thresholds. This is true on TPUs, A100s, and H100s.
Precision choice has a power cost. My benchmarks showed FP16 drawing 60% more power than FP32 on the same hardware. FP8 is even more aggressive. Before you optimize for speed with lower precision, measure whether your infrastructure can absorb the power delta.

The Bigger Picture

Google's I/O 2026 TPU announcement signals that the industry is finally treating energy efficiency as a first‑order constraint, not an afterthought. The move from "faster is better" to "more compute per watt" is the right framing for where AI infrastructure is heading.

But the measurement frameworks haven't caught up. Performance‑per‑watt at peak load is a starting point. What the field needs is a complete picture: idle floor, ghost power, precision‑mode deltas, and per‑request amortization — especially as inference workloads diversify across real‑time and batch use cases.

That's what I've been building toward. And Google I/O 2026 just made the conversation mainstream.

The AI GPU Energy Optimizer is open‑source and available on GitHub. It includes 75 validated tests across A100 and H100 hardware, with the Morpheus test suite covering ghost detection, CEI scoring, multi‑GPU scaling, and production infrastructure validation.

📄 White paper: WHITEPAPER.md

Live API: ai-gpu-brain-v3.onrender.com/docs

AI tools were used in drafting and refining this article.

How I discovered a hidden 146W power draw on NVIDIA A100 GPUs (and built an open‑source fix)

mikebains41-debug — Wed, 20 May 2026 02:17:23 +0000

TL;DR: nvidia-smi reported 0% utilization, but the GPU was drawing 146W. Standard telemetry lies. I built an open‑source detector and a new efficiency benchmark (CEI).

The moment I knew something was wrong

I was running a matrix multiplication benchmark on an NVIDIA A100 SXM (RunPod, my own money). After the kernel finished, nvidia-smi said:

GPU utilization: 0%
Power draw: 146.66 W

Not a spike. It stayed there for 11+ minutes. The GPU was locked in P0 state, memory clock stuck at 1593 MHz, burning electricity while reporting “idle”.

I tested sampling rates of 1 second, 100 milliseconds, and even 10 ms – the blind spot persisted.
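High-rate sampling like this can be driven through nvidia-smi's CSV query mode. The sketch below only parses one output row. The `GpuSample` type and `parse_sample` helper are hypothetical names for illustration, and the example row is assembled from the values reported above.

```python
from typing import NamedTuple

# Example command (100 ms sampling), run on a machine with an NVIDIA driver:
#   nvidia-smi --query-gpu=utilization.gpu,power.draw,pstate,clocks.mem \
#              --format=csv,noheader,nounits -lms 100


class GpuSample(NamedTuple):
    utilization_pct: float
    power_w: float
    pstate: str
    mem_clock_mhz: float


def parse_sample(line: str) -> GpuSample:
    """Parse one CSV row produced by the query above."""
    util, power, pstate, mem_clock = (field.strip() for field in line.split(","))
    return GpuSample(float(util), float(power), pstate, float(mem_clock))


print(parse_sample("0, 146.66, P0, 1593"))
```

Fields that a given GPU or driver does not report may come back as non-numeric placeholders, so a production parser would need to handle those rows instead of calling `float` directly.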

This is a GHOST anomaly: physically impossible telemetry that leads to over‑provisioned clusters, wasted energy, and wrong scaling decisions.
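One way to flag this pattern is to look for sustained runs of samples where utilization reads 0% while power stays well above the idle floor. The following is a minimal sketch, not the project's detector. The function name, the 30W margin, and the 60s minimum duration are assumptions, and the 67W default is the A100 idle floor reported in this post.

```python
from typing import Iterable, List, Tuple

Reading = Tuple[float, float, float]  # (timestamp_s, utilization_pct, power_w)


def ghost_intervals(
    readings: Iterable[Reading],
    idle_floor_w: float = 67.0,    # A100 idle floor reported in this post
    margin_w: float = 30.0,        # assumed tolerance above the idle floor
    min_duration_s: float = 60.0,  # assumed minimum duration worth flagging
) -> List[Tuple[float, float]]:
    """Return (start, end) spans where utilization is 0% but power stays high."""
    spans: List[Tuple[float, float]] = []
    start = last = None
    for ts, util, power in readings:
        if util == 0 and power > idle_floor_w + margin_w:
            if start is None:
                start = ts
            last = ts
            continue
        # The run ended: keep it only if it lasted long enough.
        if start is not None and last - start >= min_duration_s:
            spans.append((start, last))
        start = last = None
    if start is not None and last - start >= min_duration_s:
        spans.append((start, last))
    return spans
```

Requiring a minimum duration separates a brief post-kernel ramp-down from the multi-minute plateau described above.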

What I did about it

I ran 35 hardware tests (24 A100, 11 H100) and validated:

A100 idle floor is ~67 W, but ghost power can reach 146 W at 0% utilization.
H100 showed no ghost power in these tests, so the issue appears to be A100‑specific (possibly fixed in Hopper).
NVIDIA’s own MIG documentation admits: “Profiling of shared GPU resources is not supported.” My tool fills that gap.

I defined Compute Energy Intensity (CEI) = FLOPs / joule.

Reference: A100 sustained FP32 → 5.68 B FLOPs/J (Test 24, 900 s).
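As a rough illustration of how a CEI value can be computed from raw telemetry, the sketch below integrates sampled power into joules and divides a FLOP count by it. The helper names and the example run are hypothetical and are not meant to reproduce Test 24. The FLOP count uses the usual 2·n³ convention for an n×n matrix multiplication.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power over time with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two matching samples")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def cei(total_flops: float, joules: float) -> float:
    """Compute Energy Intensity: FLOPs per joule consumed."""
    return total_flops / joules


# Hypothetical run: 1,000 square matmuls of size 4096 over 10 s at a steady 300 W.
n, repeats = 4096, 1000
flops = 2 * n**3 * repeats  # 2*n^3 operations per n x n matmul
joules = energy_joules([0.0, 5.0, 10.0], [300.0, 300.0, 300.0])
print(f"CEI = {cei(flops, joules) / 1e9:.2f} GFLOPs/J")
```

Integrating over the whole measurement window, rather than only the time kernels are active, means any ghost power inside the window lowers the resulting CEI.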

Then I built the AI GPU Energy Optimizer – an open‑source platform that:

Detects DESYNC/GHOST anomalies in real time.
Provides CEI benchmarking across 17+ cloud providers (AWS, GCP, Azure, RunPod, etc.).
Integrates with Kubernetes / Run:ai for auto‑eviction.
Deploys with a single docker-compose up.

✅ All 40 platform tests pass. Live API: ai-gpu-brain-v3.onrender.com/docs

Why this matters

Cloud providers and AI teams are paying for electricity they can’t see. At 500 GPUs, ghost waste can exceed $150/day in hidden energy + cooling.
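The fleet-level figure can be sanity-checked with a back-of-the-envelope calculation. This is a sketch under loudly stated assumptions: the electricity price, the facility overhead multiplier (PUE), and the share of the day spent in the ghost state are illustrative values, not numbers from the white paper. Only the 146.66W and 67W readings come from the tests described above.

```python
def ghost_cost_per_day(
    gpus: int,
    ghost_power_w: float = 146.66,  # measured ghost draw from this post
    idle_floor_w: float = 67.0,     # measured idle floor from this post
    ghost_fraction: float = 1.0,    # assumed share of the day spent in the ghost state
    pue: float = 1.5,               # assumed cooling/facility overhead multiplier
    usd_per_kwh: float = 0.12,      # assumed electricity price
) -> float:
    """Estimate the daily cost of power drawn above the idle floor."""
    excess_kw = gpus * (ghost_power_w - idle_floor_w) / 1000.0
    kwh_per_day = excess_kw * 24.0 * ghost_fraction * pue
    return kwh_per_day * usd_per_kwh


print(f"${ghost_cost_per_day(500):.2f} per day")
```

With these inputs, 500 GPUs come out at roughly $172 per day. That is an upper bound, because it assumes every GPU sits in the ghost state around the clock. Lowering `ghost_fraction` scales the estimate down linearly.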

The tool is open source, but I need sponsored compute (100‑500 GPUs on MIG partitions) to scale validation and prove the ROI. I’m an independent researcher in BC, Canada – all tests so far were at my own expense.

If you run GPU fleets or work at a cloud provider, let’s talk.

Resources

📄 Full white paper (detailed methodology, 35 tests, statistical confidence): github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md
💻 GitHub repo (open‑source, MIT‑licensed code): github.com/mikebains41-debug/ai-gpu-energy-optimizer-
🚀 Live API / Swagger: ai-gpu-brain-v3.onrender.com/docs

Tags: gpu ai opensource observability energyefficiency

– Mike Bains (mikebains41@gmail.com)

I Built a Production GPU Energy Optimizer in One Day — From My Phone

mikebains41-debug — Sun, 17 May 2026 19:24:31 +0000

Not from a MacBook. Not from a cloud VM. From my Android phone, using Termux.

Here's what shipped by end of day:

Real-time GPU energy dashboard
DESYNC & GHOST power anomaly detection
17 cloud provider support
Per-user API keys
Time-series metrics scaling to 100+ GPUs
18/18 smoke tests passing
60-second Docker install

The Problem

GPU providers lie. Not intentionally — but telemetry desync is real.

Two failure modes kill your energy budget:

DESYNC: GPU drawing 420W but reporting 8% utilization. You're paying full price for a GPU doing nothing useful.

GHOST power: GPU reporting 98% utilization at 40W draw. Physically impossible. Your scheduler is making decisions on fake data.
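Using the two definitions as given in this post, a per-sample check might look like the sketch below. The thresholds (80% and 20% of rated power, 10% and 90% utilization) are illustrative assumptions rather than the stack's actual rules, and the 400W default is the rated power of an A100 SXM, used here only as an example.

```python
def classify(utilization_pct: float, power_w: float, tdp_w: float = 400.0) -> str:
    """Label a sample DESYNC, GHOST, or OK using illustrative thresholds."""
    power_ratio = power_w / tdp_w
    if power_ratio >= 0.8 and utilization_pct <= 10.0:
        return "DESYNC"  # high draw, almost no reported work
    if utilization_pct >= 90.0 and power_ratio <= 0.2:
        return "GHOST"   # near-full reported work at near-idle draw
    return "OK"


print(classify(8, 420))   # the DESYNC example above
print(classify(98, 40))   # the GHOST example above
```

Expressing power as a fraction of rated power keeps the same thresholds usable across GPU models with different power envelopes.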

We found both in the wild across AWS and Vast.ai during testing.

The Solution

An open validation stack that:

Detects DESYNC and GHOST anomalies automatically
Works across 17 GPU cloud providers
Evicts bad workloads via Kubernetes or Run:ai
Alerts via Slack
Stores time-series data for 100+ GPUs

What We Built

Component	Status
CEI Formal Specification	✅
Grafana Dashboard	✅
GPU Agent Script	✅
Per-user API Keys	✅
Time-series DB	✅
17-Provider Validator	✅
Smoke Test 18/18	✅
Docker one-liner	✅

Why Termux

No laptop. No cloud IDE. Just an Android phone with Termux.

This matters because it proves the stack is lightweight enough to run anywhere. If it builds and runs on a phone, it runs on any bare metal server, VPS, or edge node.

60-Second Install


# Install Docker (skip if already installed)
curl -fsSL https://get.docker.com | sh

# Clone and run
git clone https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-
cd ai-gpu-energy-optimizer-
# If only the Compose v2 plugin is installed, run "docker compose up" instead
docker-compose up