<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mikebains41-debug</title>
    <description>The latest articles on DEV Community by mikebains41-debug (@mikebains41debug).</description>
    <link>https://dev.to/mikebains41debug</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935300%2Fe33f16ab-f6fc-4f58-9f66-b356775a2ca6.png</url>
      <title>DEV Community: mikebains41-debug</title>
      <link>https://dev.to/mikebains41debug</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mikebains41debug"/>
    <language>en</language>
    <item>
      <title>From Ghost Power Discovery to Enterprise GPU Optimizer – How I Finished What I Started</title>
      <dc:creator>mikebains41-debug</dc:creator>
      <pubDate>Sat, 23 May 2026 16:50:37 +0000</pubDate>
      <link>https://dev.to/mikebains41debug/from-ghost-power-discovery-to-enterprise-gpu-optimizer-how-i-finished-what-i-started-532e</link>
      <guid>https://dev.to/mikebains41debug/from-ghost-power-discovery-to-enterprise-gpu-optimizer-how-i-finished-what-i-started-532e</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-05-21"&gt;GitHub Finish-Up-A-Thon Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built an &lt;strong&gt;open‑source GPU Energy Optimizer&lt;/strong&gt; that detects a previously unknown telemetry anomaly: NVIDIA A100 GPUs draw &lt;strong&gt;146.66W while reporting 0% utilization&lt;/strong&gt; – sustained for 10+ minutes (Tests 13 &amp;amp; 14). I call this &lt;em&gt;ghost power&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The project validates the anomaly with &lt;strong&gt;35 hardware tests&lt;/strong&gt; (24 A100 + 11 H100) and defines a new efficiency metric: &lt;strong&gt;CEI (Compute Energy Intensity)&lt;/strong&gt; – FLOPs per joule. It includes a live API, a dashboard, a white paper, and a full enterprise‑scale architecture (TimescaleDB, batching, Prometheus, Kubernetes Helm, Morpheus pipeline).&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;GitHub repo (v1.0.1 release):&lt;/strong&gt; &lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-" rel="noopener noreferrer"&gt;github.com/mikebains41-debug/ai-gpu-energy-optimizer-&lt;/a&gt;&lt;br&gt;&lt;br&gt;
📄 &lt;strong&gt;White paper:&lt;/strong&gt; &lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md" rel="noopener noreferrer"&gt;WHITEPAPER.md&lt;/a&gt;&lt;br&gt;&lt;br&gt;
🚀 &lt;strong&gt;Live API:&lt;/strong&gt; &lt;a href="https://ai-gpu-brain-v3.onrender.com/docs" rel="noopener noreferrer"&gt;ai-gpu-brain-v3.onrender.com/docs&lt;/a&gt;&lt;br&gt;&lt;br&gt;
📊 &lt;strong&gt;Dashboard:&lt;/strong&gt; &lt;a href="https://ai-gpu-energy-optimizer.vercel.app" rel="noopener noreferrer"&gt;ai-gpu-energy-optimizer.vercel.app&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before – early prototype (v0.1):&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQLite database (single file, no concurrency)
&lt;/li&gt;
&lt;li&gt;Direct HTTP POST per GPU (no batching)
&lt;/li&gt;
&lt;li&gt;No Prometheus metrics or dashboards
&lt;/li&gt;
&lt;li&gt;Manual deployment (docker‑compose only)
&lt;/li&gt;
&lt;li&gt;35 hardware tests, but no automated platform suite
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After – production‑ready (v1.0.1 – final release):&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TimescaleDB (PostgreSQL + hypertables, continuous aggregates)
&lt;/li&gt;
&lt;li&gt;Batched agent (30s windows, Redis queue, Celery workers)
&lt;/li&gt;
&lt;li&gt;Prometheus exporter + Grafana dashboards (native GHOST metric)
&lt;/li&gt;
&lt;li&gt;Kubernetes Helm chart + DaemonSet for agent deployment
&lt;/li&gt;
&lt;li&gt;Morpheus pipeline for real‑time anomaly detection and auto‑alert
&lt;/li&gt;
&lt;li&gt;75 tests (35 hardware + 40 platform) passing in CI
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Comeback Story
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Where the project was before:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I started this as a personal validation on RunPod, running 24 A100 tests from my Samsung phone using Termux. The code was a loose collection of scripts, a single FastAPI instance with SQLite, and no scalability. It proved the anomaly existed – but it wasn’t ready for real fleets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I changed, fixed, and added to finish it (v1.0.1):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Over the past month, I rewrote the entire stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database:&lt;/strong&gt; Migrated from SQLite to TimescaleDB (hypertables, continuous aggregates).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; Added batching, retries, and async sending (30s windows).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queuing:&lt;/strong&gt; Integrated Redis + Celery to decouple ingestion from processing.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Built a Prometheus exporter with GHOST/DESYNC metrics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; Created a Kubernetes Helm chart and DaemonSet for agent deployment.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Pipeline:&lt;/strong&gt; Wrote a Morpheus pipeline that pulls live API data, scores CEI, and auto‑alerts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing:&lt;/strong&gt; Grew from 35 hardware tests to 75 total (including 40 platform validation tests).
&lt;/li&gt;
&lt;/ul&gt;
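&lt;p&gt;To make the batching change concrete, here is a minimal sketch of a 30-second window batcher. It is an illustration under stated assumptions, not the code in the repo: the &lt;code&gt;WindowBatcher&lt;/code&gt; class, its &lt;code&gt;send&lt;/code&gt; callback, and the sample fields are hypothetical names, and the Redis/Celery hand-off, retries, and async sending are left out.&lt;/p&gt;

```python
import time


class WindowBatcher:
    """Collect samples and hand them to 'send' once per time window."""

    def __init__(self, send, window_s=30.0, clock=time.monotonic):
        self._send = send          # hypothetical callback, e.g. one HTTP POST per batch
        self._window_s = window_s  # 30 s matches the window size described above
        self._clock = clock        # injectable so the demo runs without sleeping
        self._buf = []
        self._opened = clock()

    def add(self, sample):
        """Buffer one sample; flush when the current window has elapsed."""
        self._buf.append(sample)
        if self._clock() - self._opened >= self._window_s:
            self.flush()

    def flush(self):
        """Send whatever is buffered (if anything) and start a new window."""
        if self._buf:
            self._send(self._buf)
        self._buf = []
        self._opened = self._clock()


# Demo with a fake clock instead of real waiting.
now = [0.0]
batches = []
batcher = WindowBatcher(batches.append, clock=lambda: now[0])
for t in (0.0, 10.0, 20.0, 30.0):
    now[0] = t
    batcher.add({"t": t, "power_w": 146.66, "util_pct": 0})
print(len(batches), len(batches[0]))  # 1 batch containing 4 samples
```

&lt;p&gt;The point of the design is that the number of outbound requests depends on the window length rather than on the sampling rate, which is what replaced the one-POST-per-reading behavior of the prototype.&lt;/p&gt;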

&lt;p&gt;The &lt;strong&gt;finishing moment&lt;/strong&gt; was running the full enterprise test suite on a simulated 1000‑GPU cluster (using the new Morpheus test harness) and seeing &lt;strong&gt;all 30 M1‑M30 tests pass&lt;/strong&gt; – then tagging the &lt;strong&gt;v1.0.1 release&lt;/strong&gt; on GitHub.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Experience with GitHub Copilot
&lt;/h2&gt;

&lt;p&gt;I used AI assistance (including GitHub Copilot) throughout the rewrite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Copilot suggested the TimescaleDB hypertable syntax&lt;/strong&gt; and the best indexing strategies for time‑partitioned data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It auto‑completed the batched agent’s async methods&lt;/strong&gt; – saving hours of debugging &lt;code&gt;asyncio&lt;/code&gt; edge cases.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When writing the Helm chart&lt;/strong&gt;, Copilot generated the correct YAML structure for GPU node tolerations and volume mounts.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For the Morpheus pipeline&lt;/strong&gt;, it filled in the boilerplate for the &lt;code&gt;GpuTelemetryProcessorStage&lt;/code&gt; and the CEI scoring logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It also helped refactor the monolithic &lt;code&gt;main.py&lt;/code&gt;&lt;/strong&gt; into modular &lt;code&gt;models.py&lt;/code&gt;, &lt;code&gt;prometheus_metrics.py&lt;/code&gt;, and &lt;code&gt;morpheus/pipeline.py&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most valuable part was &lt;strong&gt;pair‑debugging&lt;/strong&gt;: I’d describe an error (e.g., SQLAlchemy connection pool timeouts), and Copilot would suggest the fix (adding &lt;code&gt;pool_pre_ping=True&lt;/code&gt;). Without this, finishing the enterprise stack would have taken twice as long.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project is my proof that a solo developer – even from a phone – can build production‑grade infrastructure. The “finish” isn’t the end; it’s the foundation for scaling to 1000 GPUs and beyond.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI tools were used in drafting this article and generating code.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
    </item>
    <item>
      <title>Google's 2x Energy Efficiency Claim Is Real — But Here's What They're Not Measuring</title>
      <dc:creator>mikebains41-debug</dc:creator>
      <pubDate>Sat, 23 May 2026 16:32:49 +0000</pubDate>
      <link>https://dev.to/mikebains41debug/googles-2x-energy-efficiency-claim-is-real-but-heres-what-theyre-not-measuring-nik</link>
      <guid>https://dev.to/mikebains41debug/googles-2x-energy-efficiency-claim-is-real-but-heres-what-theyre-not-measuring-nik</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-io-writing-2026-05-19"&gt;Google I/O Writing Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Found Building a Real Benchmark
&lt;/h2&gt;

&lt;p&gt;My project — the &lt;strong&gt;AI GPU Energy Optimizer&lt;/strong&gt; — measures something the industry largely ignores: what GPUs consume when they're doing nothing. We call it &lt;em&gt;ghost power&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;On an NVIDIA A100 SXM running on RunPod infrastructure, I measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idle floor: 67W&lt;/strong&gt; — the baseline you pay for just having the GPU allocated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ghost power: up to 146W at 0% compute utilization&lt;/strong&gt; — power draw with no workload running&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 vs FP32 delta: 483W vs 302W&lt;/strong&gt; — a 60% power spike just from switching precision&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That 146W ghost power figure isn't a bug. It's the cost of persistence mode, memory controller activity, and thermal management keeping the chip "ready." On a single GPU it's noise. At a million‑unit scale, it's infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap in Google's Story
&lt;/h2&gt;

&lt;p&gt;Google's 2x performance‑per‑watt claim almost certainly measures peak compute throughput under load. That's the right number for training benchmarks. But it doesn't capture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idle energy floor&lt;/strong&gt; — what you pay between inference requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ghost power&lt;/strong&gt; — the overhead of allocation without utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision‑mode energy delta&lt;/strong&gt; — the cost of switching between FP8, FP16, FP32&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per‑request energy amortization&lt;/strong&gt; — especially relevant for real‑time inference at low batch sizes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For batch training at scale, Google's metric is exactly right. But for inference serving — the workload that's actually growing fastest — idle behavior dominates total cost. A model serving 10 requests per second on a 300W GPU is spending most of its energy budget on ghost power, not compute.&lt;/p&gt;
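&lt;p&gt;One way to see the amortization effect is a deliberately simple model: the idle floor is paid whether or not requests arrive, so each request carries a share of it equal to floor power divided by request rate. The sketch below assumes a constant floor and ignores the compute energy of the request itself; the function name and the request rates are illustrative, and only the 67W floor comes from my measurements.&lt;/p&gt;

```python
def idle_joules_per_request(idle_floor_w: float, requests_per_s: float) -> float:
    """Idle-floor energy (J) that each request has to carry at a given request rate."""
    if requests_per_s == 0:
        raise ValueError("request rate must be non-zero")
    return idle_floor_w / requests_per_s


# 67 W is the A100 idle floor measured above; the request rates are illustrative.
for rate in (1.0, 10.0, 100.0):
    print(rate, idle_joules_per_request(67.0, rate))
```

&lt;p&gt;Under this model the per-request overhead falls in direct proportion to the request rate, which is why low-traffic real-time serving is where the idle floor hurts most.&lt;/p&gt;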

&lt;h2&gt;
  
  
  What This Means for Developers
&lt;/h2&gt;

&lt;p&gt;If you're building on Google Cloud GPU infrastructure — or any cloud GPU provider — three things from I/O 2026 matter for your energy costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance‑per‑watt is now a first‑class metric.&lt;/strong&gt; Google made it explicit in the keynote. That means cloud providers will start surfacing it, and you should be asking for it in your SLAs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch size is your energy lever.&lt;/strong&gt; At low utilization, ghost power dominates. The single highest‑impact thing you can do is increase batch size to push utilization above idle thresholds. This is true on TPUs, A100s, and H100s.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Precision choice has a power cost.&lt;/strong&gt; My benchmarks showed FP16 drawing 60% more power than FP32 on the same hardware. FP8 is even more aggressive. Before you optimize for speed with lower precision, measure whether your infrastructure can absorb the power delta.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Google's I/O 2026 TPU announcement signals that the industry is finally treating energy efficiency as a first‑order constraint, not an afterthought. The move from "faster is better" to "more compute per watt" is the right framing for where AI infrastructure is heading.&lt;/p&gt;

&lt;p&gt;But the measurement frameworks haven't caught up. Performance‑per‑watt at peak load is a starting point. What the field needs is a complete picture: idle floor, ghost power, precision‑mode deltas, and per‑request amortization — especially as inference workloads diversify across real‑time and batch use cases.&lt;/p&gt;

&lt;p&gt;That's what I've been building toward. And Google I/O 2026 just made the conversation mainstream.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The AI GPU Energy Optimizer is open‑source and available on &lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. It includes 75 validated tests across A100 and H100 hardware, with the Morpheus test suite covering ghost detection, CEI scoring, multi‑GPU scaling, and production infrastructure validation.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;📄 &lt;strong&gt;White paper:&lt;/strong&gt; &lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md" rel="noopener noreferrer"&gt;WHITEPAPER.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Live API: &lt;a href="https://ai-gpu-brain-v3.onrender.com/docs" rel="noopener noreferrer"&gt;ai-gpu-brain-v3.onrender.com/docs&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AI tools were used in drafting and refining this article.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>googleiochallenge</category>
    </item>
    <item>
      <title>How I discovered a hidden 146W power draw on NVIDIA A100 GPUs (and built an open‑source fix)</title>
      <dc:creator>mikebains41-debug</dc:creator>
      <pubDate>Wed, 20 May 2026 02:17:23 +0000</pubDate>
      <link>https://dev.to/mikebains41debug/how-i-discovered-a-hidden-146w-power-draw-on-nvidia-a100-gpus-and-built-an-open-source-fix-1n8h</link>
      <guid>https://dev.to/mikebains41debug/how-i-discovered-a-hidden-146w-power-draw-on-nvidia-a100-gpus-and-built-an-open-source-fix-1n8h</guid>
      <description>&lt;p&gt;&lt;strong&gt;How I discovered a hidden 146W power draw on NVIDIA A100 GPUs (and built an open‑source fix)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;TL;DR: &lt;code&gt;nvidia-smi&lt;/code&gt; reported 0% utilization, but the GPU was drawing 146W. Standard telemetry lies. I built an open‑source detector and a new efficiency benchmark (CEI).&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The moment I knew something was wrong
&lt;/h2&gt;

&lt;p&gt;I was running a matrix multiplication benchmark on an NVIDIA A100 SXM (RunPod, my own money). After the kernel finished, &lt;code&gt;nvidia-smi&lt;/code&gt; said:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU utilization:&lt;/strong&gt; 0%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power draw:&lt;/strong&gt; 146.66 W&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a spike. It stayed there for 11+ minutes. The GPU was &lt;em&gt;locked in P0 state&lt;/em&gt;, memory clock stuck at 1593 MHz, burning electricity while reporting “idle”.&lt;/p&gt;

&lt;p&gt;I tested sampling rates of 1 second, 100 milliseconds, and even 10 ms – the blind spot persisted.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;GHOST anomaly&lt;/strong&gt;: physically impossible telemetry that leads to over‑provisioned clusters, wasted energy, and wrong scaling decisions.&lt;/p&gt;
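&lt;p&gt;A minimal sketch of how such a run could be flagged from a stream of samples is shown below. It is not the detector shipped in the repo: the &lt;code&gt;Sample&lt;/code&gt; type and &lt;code&gt;longest_ghost_run&lt;/code&gt; function are hypothetical, the 67 W default is the A100 idle floor from my tests, and the 20 W margin and 600 s threshold are placeholder assumptions. In practice the samples could come from something like &lt;code&gt;nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv&lt;/code&gt;; here they are synthetic.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float         # seconds since the start of the capture
    util_pct: float  # reported GPU utilization
    power_w: float   # reported power draw


def longest_ghost_run(samples, idle_floor_w=67.0, margin_w=20.0):
    """Longest continuous stretch (s) at 0% utilization with power well above the idle floor."""
    longest = 0.0
    start = None
    for s in samples:
        if s.util_pct == 0 and s.power_w > idle_floor_w + margin_w:
            if start is None:
                start = s.t
            longest = max(longest, s.t - start)
        else:
            start = None  # any non-ghost sample ends the current run
    return longest


# Synthetic capture: 11 minutes at 146.66 W and 0% utilization, sampled once per second.
capture = [Sample(float(t), 0.0, 146.66) for t in range(661)]
print(longest_ghost_run(capture) >= 600)  # True: flagged as a sustained ghost run
```

&lt;p&gt;Requiring a sustained run rather than a single sample is what separates this from an ordinary post-kernel power tail.&lt;/p&gt;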




&lt;h2&gt;
  
  
  What I did about it
&lt;/h2&gt;

&lt;p&gt;I ran &lt;strong&gt;35 hardware tests&lt;/strong&gt; (24 A100, 11 H100) and validated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A100 idle floor is ~67 W, but ghost power can reach &lt;strong&gt;146 W&lt;/strong&gt; at 0% utilization.&lt;/li&gt;
&lt;li&gt;H100 shows no ghost power – the issue is A100‑specific (likely fixed in Hopper).&lt;/li&gt;
&lt;li&gt;NVIDIA’s own MIG documentation admits: &lt;em&gt;“Profiling of shared GPU resources is not supported.”&lt;/em&gt; My tool fills that gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I defined &lt;strong&gt;Compute Energy Intensity (CEI)&lt;/strong&gt; = FLOPs / joule.&lt;br&gt;&lt;br&gt;
Reference: A100 sustained FP32 → &lt;strong&gt;5.68 B FLOPs/J&lt;/strong&gt; (Test 24, 900 s).&lt;/p&gt;
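&lt;p&gt;As a worked illustration of the definition, the sketch below integrates evenly spaced power samples into joules with the trapezoid rule and divides the FLOP count by that energy. The function names and the run itself (a constant 300 W for 10 s, 1.5e13 FLOPs) are made up for the example and are not one of the 35 tests.&lt;/p&gt;

```python
def energy_joules(power_watts: list[float], interval_s: float) -> float:
    """Integrate evenly spaced power samples (W) into energy (J) with the trapezoid rule."""
    pairs = zip(power_watts, power_watts[1:])
    return sum((a + b) / 2.0 * interval_s for a, b in pairs)


def cei(total_flops: float, joules: float) -> float:
    """Compute Energy Intensity: floating-point operations per joule."""
    if joules == 0:
        raise ValueError("energy must be non-zero")
    return total_flops / joules


# Hypothetical run: constant 300 W over 10 s (11 samples, 1 s apart), 1.5e13 FLOPs executed.
samples = [300.0] * 11
joules = energy_joules(samples, 1.0)
print(joules)               # 3000.0 J
print(cei(1.5e13, joules))  # 5e9 FLOPs per joule
```

&lt;p&gt;Because energy is integrated over the whole run, any time spent at elevated power without doing work lowers the score, which is the behavior the metric is meant to expose.&lt;/p&gt;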

&lt;p&gt;Then I built the &lt;strong&gt;AI GPU Energy Optimizer&lt;/strong&gt; – an open‑source platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects DESYNC/GHOST anomalies in real time.&lt;/li&gt;
&lt;li&gt;Provides CEI benchmarking across 17+ cloud providers (AWS, GCP, Azure, RunPod, etc.).&lt;/li&gt;
&lt;li&gt;Integrates with Kubernetes / Run:ai for auto‑eviction.&lt;/li&gt;
&lt;li&gt;Deploys with a single &lt;code&gt;docker-compose up&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✅ All 40 platform tests pass. Live API: &lt;a href="https://ai-gpu-brain-v3.onrender.com/docs" rel="noopener noreferrer"&gt;ai-gpu-brain-v3.onrender.com/docs&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Cloud providers and AI teams are paying for electricity they can’t see. At 500 GPUs, ghost waste can exceed &lt;strong&gt;$150/day&lt;/strong&gt; in hidden energy + cooling.&lt;/p&gt;
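&lt;p&gt;A rough back-of-the-envelope version of that figure is sketched below. The assumptions are mine for illustration and are not the white paper's method: every GPU sits in the ghost state all day, the waste is the 146 W ghost draw minus the 67 W idle floor, electricity costs $0.12 per kWh, and cooling adds 40% on top.&lt;/p&gt;

```python
def daily_ghost_cost_usd(gpus, excess_w, usd_per_kwh, cooling_factor=1.0, hours=24.0):
    """Daily cost of power drawn above the idle floor, scaled by a cooling multiplier."""
    kwh = gpus * excess_w / 1000.0 * hours
    return kwh * cooling_factor * usd_per_kwh


excess_w = 146.0 - 67.0  # ghost draw minus idle floor, both from the tests above
cost = daily_ghost_cost_usd(500, excess_w, usd_per_kwh=0.12, cooling_factor=1.4)
print(round(cost, 2))  # 159.26 under the assumptions stated above
```

&lt;p&gt;The result is sensitive to every input, so treat it as an order-of-magnitude check and substitute your own electricity price, cooling overhead, and ghost duty cycle.&lt;/p&gt;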

&lt;p&gt;The tool is open source, but I need &lt;strong&gt;sponsored compute&lt;/strong&gt; (100‑500 GPUs on MIG partitions) to scale validation and prove the ROI. I’m an independent researcher in BC, Canada – all tests so far were at my own expense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you run GPU fleets or work at a cloud provider, let’s talk.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;📄 &lt;strong&gt;Full white paper&lt;/strong&gt; (detailed methodology, 35 tests, statistical confidence):
&lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md" rel="noopener noreferrer"&gt;github.com/mikebains41-debug/ai-gpu-energy-optimizer-/blob/main/WHITEPAPER.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;GitHub repo&lt;/strong&gt; (open‑source, MIT‑licensed code):
&lt;a href="https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-" rel="noopener noreferrer"&gt;github.com/mikebains41-debug/ai-gpu-energy-optimizer-&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;strong&gt;Live API / Swagger&lt;/strong&gt;:
&lt;a href="https://ai-gpu-brain-v3.onrender.com/docs" rel="noopener noreferrer"&gt;ai-gpu-brain-v3.onrender.com/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;gpu&lt;/code&gt; &lt;code&gt;ai&lt;/code&gt; &lt;code&gt;opensource&lt;/code&gt; &lt;code&gt;observability&lt;/code&gt; &lt;code&gt;energyefficiency&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;– Mike Bains (&lt;a href="mailto:mikebains41@gmail.com"&gt;mikebains41@gmail.com&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>opensource</category>
      <category>performance</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I Built a Production GPU Energy Optimizer in One Day – From My Phone</title>
      <dc:creator>mikebains41-debug</dc:creator>
      <pubDate>Sun, 17 May 2026 19:24:31 +0000</pubDate>
      <link>https://dev.to/mikebains41debug/titlei-built-a-production-gpu-energy-optimizer-in-one-day-from-my-phone-1d2e</link>
      <guid>https://dev.to/mikebains41debug/titlei-built-a-production-gpu-energy-optimizer-in-one-day-from-my-phone-1d2e</guid>
      <description>&lt;p&gt;I Built a Production GPU Energy Optimizer in One Day — From My Phone&lt;/p&gt;

&lt;p&gt;Not from a MacBook. Not from a cloud VM. From my Android phone, using Termux.&lt;/p&gt;

&lt;p&gt;Here's what shipped by end of day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time GPU energy dashboard&lt;/li&gt;
&lt;li&gt;DESYNC &amp;amp; GHOST power anomaly detection&lt;/li&gt;
&lt;li&gt;Support for 17 cloud providers&lt;/li&gt;
&lt;li&gt;Per-user API keys&lt;/li&gt;
&lt;li&gt;Time-series metrics scaling to 100+ GPUs&lt;/li&gt;
&lt;li&gt;18/18 smoke tests passing&lt;/li&gt;
&lt;li&gt;60-second Docker install&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;GPU providers lie. Not intentionally — but telemetry desync is real.&lt;/p&gt;

&lt;p&gt;Two failure modes kill your energy budget:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DESYNC&lt;/strong&gt;: the GPU draws 420W but reports 8% utilization. &lt;br&gt;
You're paying full price for a GPU doing nothing useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GHOST power&lt;/strong&gt;: the GPU reports 98% utilization at a 40W draw. &lt;br&gt;
Physically impossible. Your scheduler is making decisions on fake data.&lt;/p&gt;

&lt;p&gt;We found both in the wild across AWS and Vast.ai during testing.&lt;/p&gt;
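&lt;p&gt;The two failure modes can be expressed as a simple threshold rule on one telemetry sample. The sketch below is only an illustration: the &lt;code&gt;classify&lt;/code&gt; function and all four threshold defaults are assumptions chosen so that the two examples above land in the right bucket, not the values used by the validator.&lt;/p&gt;

```python
def classify(util_pct: float, power_w: float,
             high_power_w: float = 300.0, low_power_w: float = 60.0,
             low_util_pct: float = 10.0, high_util_pct: float = 90.0) -> str:
    """Label one telemetry sample as DESYNC, GHOST, or OK (thresholds are placeholders)."""
    if power_w >= high_power_w and low_util_pct >= util_pct:
        return "DESYNC"  # high draw with almost no reported work
    if util_pct >= high_util_pct and low_power_w >= power_w:
        return "GHOST"   # near-full reported load at idle-level draw
    return "OK"


print(classify(8.0, 420.0))   # DESYNC
print(classify(98.0, 40.0))   # GHOST
print(classify(95.0, 380.0))  # OK
```

&lt;p&gt;Real thresholds would need to be set per GPU model, since idle floors and power limits differ between cards.&lt;/p&gt;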

&lt;h2&gt;The Solution&lt;/h2&gt;

&lt;p&gt;An open validation stack that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detects DESYNC and GHOST anomalies automatically&lt;/li&gt;
&lt;li&gt;Works across 17 GPU cloud providers&lt;/li&gt;
&lt;li&gt;Evicts bad workloads via Kubernetes or Run:ai&lt;/li&gt;
&lt;li&gt;Alerts via Slack&lt;/li&gt;
&lt;li&gt;Stores time-series data for 100+ GPUs&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;What We Built&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CEI Formal Specification&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana Dashboard&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU Agent Script&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-user API Keys&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-series DB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17-Provider Validator&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smoke Test 18/18&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker one-liner&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Why Termux&lt;/h2&gt;

&lt;p&gt;No laptop. No cloud IDE. Just an Android phone with Termux.&lt;/p&gt;

&lt;p&gt;This matters because it proves the stack is lightweight enough to run anywhere. If it builds and runs on a phone, it runs on any bare metal server, VPS, or edge node.&lt;/p&gt;

&lt;h2&gt;60-Second Install&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Install Docker (skip if already installed)
curl -fsSL https://get.docker.com | sh

# Clone and run
git clone https://github.com/mikebains41-debug/ai-gpu-energy-optimizer-
cd ai-gpu-energy-optimizer-
docker-compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
    </item>
  </channel>
</rss>
