
mikebains41-debug


From Ghost Power Discovery to Enterprise GPU Optimizer – How I Finished What I Started


This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

I built an open‑source GPU Energy Optimizer that detects a previously unknown telemetry anomaly: in my tests, NVIDIA A100 GPUs drew 146.66 W while reporting 0% utilization, sustained for more than 10 minutes (Tests 13 and 14). I call this ghost power.
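
One way to flag ghost power in a telemetry stream is to look for spans where utilization reads 0% while power stays high. The following is a minimal sketch rather than the project's actual detector: the `Sample` record, the `ghost_windows` helper, and the 100 W threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """One telemetry reading (hypothetical schema, not the project's)."""
    t: float         # seconds since the start of the capture
    power_w: float   # reported power draw in watts
    util_pct: float  # reported GPU utilization, 0-100


def ghost_windows(
    samples: Iterable[Sample],
    min_power_w: float = 100.0,     # assumed threshold; tune per GPU model
    min_duration_s: float = 600.0,  # "10+ minutes", as in Tests 13 and 14
) -> List[Tuple[float, float]]:
    """Return (start, end) spans with 0% utilization and sustained high power."""
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None
    for s in sorted(samples, key=lambda s: s.t):
        if s.util_pct == 0 and s.power_w >= min_power_w:
            if start is None:
                start = s.t  # a new candidate span begins here
            last = s.t
        else:
            # The span ended; keep it only if it lasted long enough.
            if start is not None and last - start >= min_duration_s:
                windows.append((start, last))
            start = None
    if start is not None and last - start >= min_duration_s:
        windows.append((start, last))
    return windows


# Example: 11 minutes of idle readings every 30 s, then one busy reading.
idle = [Sample(float(t), 146.66, 0.0) for t in range(0, 661, 30)]
busy = [Sample(690.0, 300.0, 95.0)]
print(ghost_windows(idle + busy))
```

Requiring a minimum duration keeps short idle gaps between kernels from being reported as ghost power.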

The project validates the anomaly with 35 hardware tests (24 A100 + 11 H100) and defines a new efficiency metric: CEI (Compute Energy Intensity) – FLOPs per joule. It includes a live API, a dashboard, a white paper, and a full enterprise‑scale architecture (TimescaleDB, batching, Prometheus, Kubernetes Helm, Morpheus pipeline).
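
Since CEI is defined as FLOPs per joule, it can be computed from a power trace and an operation count. The sketch below assumes energy is obtained by integrating sampled power over time with the trapezoidal rule. The function names `energy_joules` and `cei` are illustrative and may not match the repository.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power over time with the trapezoidal rule (W * s = J)."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def cei(flops: float, joules: float) -> float:
    """Compute Energy Intensity: floating-point operations per joule."""
    if joules <= 0:
        raise ValueError("energy must be positive")
    return flops / joules


# A ghost-power window: 10 minutes at 146.66 W with no useful work done.
idle_energy = energy_joules([0.0, 600.0], [146.66, 146.66])
print(idle_energy, cei(0.0, idle_energy))
```

Under this definition a ghost-power window scores a CEI of zero: energy is consumed, but no floating-point work is reported.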

🔗 GitHub repo (v1.0.1 release): github.com/mikebains41-debug/ai-gpu-energy-optimizer-

📄 White paper: WHITEPAPER.md

🚀 Live API: ai-gpu-brain-v3.onrender.com/docs

📊 Dashboard: ai-gpu-energy-optimizer.vercel.app

Demo

Before – early prototype (v0.1):

  • SQLite database (single file, no concurrency)
  • Direct HTTP POST per GPU (no batching)
  • No Prometheus metrics or dashboards
  • Manual deployment (docker‑compose only)
  • 35 hardware tests, but no automated platform suite

After – production‑ready (v1.0.1 – final release):

  • TimescaleDB (PostgreSQL + hypertables, continuous aggregates)
  • Batched agent (30s windows, Redis queue, Celery workers)
  • Prometheus exporter + Grafana dashboards (native GHOST metric)
  • Kubernetes Helm chart + DaemonSet for agent deployment
  • Morpheus pipeline for real‑time anomaly detection and auto‑alert
  • 75 tests (35 hardware + 40 platform) passing in CI


The Comeback Story

Where the project was before:

I started this as a personal validation on RunPod, running 24 A100 tests from my Samsung phone using Termux. The code was a loose collection of scripts around a single FastAPI instance backed by SQLite, with no way to scale. It proved the anomaly existed, but it wasn’t ready for real fleets.

What I changed, fixed, and added to finish it (v1.0.1):

Over the past month, I rewrote the entire stack:

  • Database: Migrated from SQLite to TimescaleDB (hypertables, continuous aggregates).
  • Agent: Added batching, retries, and async sending (30s windows).
  • Queuing: Integrated Redis + Celery to decouple ingestion from processing.
  • Observability: Built a Prometheus exporter with GHOST/DESYNC metrics.
  • Orchestration: Created a Kubernetes Helm chart and DaemonSet for agent deployment.
  • AI Pipeline: Wrote a Morpheus pipeline that pulls live API data, scores CEI, and auto‑alerts.
  • Testing: Grew from 35 hardware tests to 75 total (including 40 platform validation tests).
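
The batching idea from the list above can be sketched as follows. This is a simplified, synchronous illustration, not the project's agent: the real one is described as async and backed by Redis and Celery, and the `BatchingAgent` class, its `send` callback, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Dict, List

Reading = Dict[str, float]


class BatchingAgent:
    """Buffer readings and send them as one batch per time window."""

    def __init__(
        self,
        send: Callable[[List[Reading]], None],  # hypothetical transport callback
        window_s: float = 30.0,                 # 30 s windows, as in the post
        max_retries: int = 3,                   # assumed retry budget
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._window_s = window_s
        self._max_retries = max_retries
        self._clock = clock
        self._buffer: List[Reading] = []
        self._window_start = clock()

    def add(self, reading: Reading) -> None:
        """Buffer a reading and flush once the current window has elapsed."""
        self._buffer.append(reading)
        if self._clock() - self._window_start >= self._window_s:
            self.flush()

    def flush(self) -> bool:
        """Try to send the buffer; return True if nothing is left to send."""
        self._window_start = self._clock()
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for _ in range(self._max_retries):
            try:
                self._send(batch)
            except Exception:
                continue  # retry immediately; a real agent would back off
            self._buffer.clear()
            return True
        return False  # readings stay buffered and go out with the next flush
```

Keeping failed readings in the buffer means a short API outage delays data instead of dropping it, at the cost of memory growth if the outage is long.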

The finishing moment was running the full enterprise test suite on a simulated 1000‑GPU cluster (using the new Morpheus test harness) and seeing all 30 M1‑M30 tests pass – then tagging the v1.0.1 release on GitHub.

My Experience with GitHub Copilot

I used AI assistance (including GitHub Copilot) throughout the rewrite:

  • Copilot suggested the TimescaleDB hypertable syntax and the best indexing strategies for time‑partitioned data.
  • It auto‑completed the batched agent’s async methods – saving hours of debugging asyncio edge cases.
  • When writing the Helm chart, Copilot generated the correct YAML structure for GPU node tolerations and volume mounts.
  • For the Morpheus pipeline, it filled in the boilerplate for the GpuTelemetryProcessorStage and the CEI scoring logic.
  • It also helped refactor the monolithic main.py into modular models.py, prometheus_metrics.py, and morpheus/pipeline.py.

The most valuable part was pair‑debugging: I’d describe an error (e.g., SQLAlchemy connection pool timeouts), and Copilot would suggest a fix (adding pool_pre_ping=True, which makes the pool test each connection before handing it out). Without this, finishing the enterprise stack would have taken twice as long.


This project is my proof that a solo developer – even from a phone – can build production‑grade infrastructure. The “finish” isn’t the end; it’s the foundation for scaling to 1000 GPUs and beyond.

AI tools were used in drafting this article and generating code.
