From Ghost Power Discovery to Enterprise GPU Optimizer – How I Finished What I Started

#devchallenge #githubchallenge

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

I built an open‑source GPU Energy Optimizer that detects a previously unknown telemetry anomaly: NVIDIA A100 GPUs drawing 146.66 W while reporting 0% utilization, sustained for more than 10 minutes (Tests 13 and 14). I call this ghost power.
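The ghost power condition described above (high power draw, 0% reported utilization, sustained for 10+ minutes) can be sketched as a simple scan over telemetry samples. This is an illustrative sketch, not the project's detector: the `Sample` type, the `find_ghost_windows` name, and the 100 W power floor are assumptions made for this example; only the 0% utilization and 10-minute duration come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """One telemetry reading (hypothetical shape, for illustration only)."""
    t: float         # seconds since the start of the capture
    power_w: float   # reported power draw in watts
    util_pct: float  # reported GPU utilization in percent


def find_ghost_windows(
    samples: Iterable[Sample],
    min_power_w: float = 100.0,     # assumed threshold, not from the project
    min_duration_s: float = 600.0,  # 10 minutes, as in the description above
) -> List[Tuple[float, float]]:
    """Return (start, end) spans where power stays high while utilization reads 0%.

    Samples are assumed to be sorted by time; gaps in sampling are not handled.
    """
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None
    for s in samples:
        if s.util_pct == 0 and s.power_w >= min_power_w:
            if start is None:
                start = s.t  # a candidate ghost window opens here
            last = s.t
        else:
            # The run ended; keep it only if it lasted long enough.
            if start is not None and last - start >= min_duration_s:
                windows.append((start, last))
            start = None
    # Close a run that is still open at the end of the capture.
    if start is not None and last - start >= min_duration_s:
        windows.append((start, last))
    return windows
```

Feeding it one sample every 10 seconds at 146.66 W and 0% utilization for 10 minutes yields a single window covering the whole capture, while a shorter run yields none.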

The project validates the anomaly with 35 hardware tests (24 A100 + 11 H100) and defines a new efficiency metric: CEI (Compute Energy Intensity) – FLOPs per joule. It includes a live API, a dashboard, a white paper, and a full enterprise‑scale architecture (TimescaleDB, batching, Prometheus, Kubernetes Helm, Morpheus pipeline).
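Since CEI is defined as FLOPs per joule, it can be computed from a power trace and a FLOP count. The sketch below assumes energy is obtained by trapezoidal integration of sampled power over time; the function names `energy_joules` and `cei` are hypothetical and the white paper may define the measurement window differently.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def cei(total_flops: float, joules: float) -> float:
    """Compute Energy Intensity: floating-point operations per joule."""
    if joules <= 0:
        raise ValueError("joules must be positive")
    return total_flops / joules


if __name__ == "__main__":
    # A 10-minute ghost power window at a constant 146.66 W with no compute:
    idle_energy = energy_joules([0.0, 300.0, 600.0], [146.66, 146.66, 146.66])
    print(idle_energy, cei(0.0, idle_energy))
```

Under this definition, a ghost power window consumes energy while performing zero FLOPs, so its CEI is 0 regardless of how long it lasts.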

🔗 GitHub repo (v1.0.1 release): github.com/mikebains41-debug/ai-gpu-energy-optimizer-

📄 White paper: WHITEPAPER.md

🚀 Live API: ai-gpu-brain-v3.onrender.com/docs

📊 Dashboard: ai-gpu-energy-optimizer.vercel.app

Demo

Before – early prototype (v0.1):

SQLite database (single file, no concurrency)
Direct HTTP POST per GPU (no batching)
No Prometheus metrics or dashboards
Manual deployment (docker‑compose only)
35 hardware tests, but no automated platform suite

After – production‑ready (v1.0.1 – final release):

TimescaleDB (PostgreSQL + hypertables, continuous aggregates)
Batched agent (30s windows, Redis queue, Celery workers)
Prometheus exporter + Grafana dashboards (native GHOST metric)
Kubernetes Helm chart + DaemonSet for agent deployment
Morpheus pipeline for real‑time anomaly detection and auto‑alert
75 tests (35 hardware + 40 platform) passing in CI


The Comeback Story

Where the project was before:

I started this as a personal validation exercise on RunPod, running 24 A100 tests from my Samsung phone using Termux. The code was a loose collection of scripts around a single FastAPI instance backed by SQLite, with no way to scale. It proved the anomaly existed, but it wasn’t ready for real fleets.

What I changed, fixed, and added to finish it (v1.0.1):

Over the past month, I rewrote the entire stack:

Database: Migrated from SQLite to TimescaleDB (hypertables, continuous aggregates).
Agent: Added batching, retries, and async sending (30s windows).
Queuing: Integrated Redis + Celery to decouple ingestion from processing.
Observability: Built a Prometheus exporter with GHOST/DESYNC metrics.
Orchestration: Created a Kubernetes Helm chart and DaemonSet for agent deployment.
AI Pipeline: Wrote a Morpheus pipeline that pulls live API data, scores CEI, and auto‑alerts.
Testing: Grew from 35 hardware tests to 75 total (including 40 platform validation tests).
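The agent change listed above (batching, retries, and async sending in 30s windows) follows a common pattern that can be sketched with the standard library alone. This is a simplified illustration, not the project's agent: the `BatchingAgent` class, the injected `send` coroutine, and the exponential backoff policy are assumptions, and the Redis and Celery stages are left out.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Hypothetical transport: any coroutine that ships one batch (e.g. an HTTP POST).
Sender = Callable[[List[Dict[str, Any]]], Awaitable[None]]


class BatchingAgent:
    """Buffer telemetry samples and send them in one batch per window."""

    def __init__(self, send: Sender, window_s: float = 30.0,
                 max_retries: int = 3, backoff_s: float = 1.0) -> None:
        self._send = send
        self._window_s = window_s
        self._max_retries = max_retries
        self._backoff_s = backoff_s
        self._buffer: List[Dict[str, Any]] = []

    def add(self, sample: Dict[str, Any]) -> None:
        """Queue one sample for the next flush."""
        self._buffer.append(sample)

    async def flush(self) -> bool:
        """Send the buffered samples; return False if every attempt failed."""
        if not self._buffer:
            return True
        batch, self._buffer = self._buffer, []
        for attempt in range(self._max_retries):
            try:
                await self._send(batch)
                return True
            except Exception:  # broad on purpose in this sketch
                if attempt + 1 < self._max_retries:
                    # Exponential backoff between attempts (assumed policy).
                    await asyncio.sleep(self._backoff_s * 2 ** attempt)
        # Keep the unsent samples so the next window can retry them.
        self._buffer = batch + self._buffer
        return False

    async def run(self, stop: asyncio.Event) -> None:
        """Flush once per window until `stop` is set, then flush one last time."""
        while not stop.is_set():
            try:
                await asyncio.wait_for(stop.wait(), timeout=self._window_s)
            except asyncio.TimeoutError:
                pass  # window elapsed without a stop request
            await self.flush()
```

Compared with one HTTP POST per GPU reading, each agent makes at most one request per window, and a transient failure delays a batch instead of dropping it.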

The finishing moment was running the full enterprise test suite on a simulated 1000‑GPU cluster (using the new Morpheus test harness), seeing all 30 tests (M1 through M30) pass, and then tagging the v1.0.1 release on GitHub.

My Experience with GitHub Copilot

I used AI assistance (including GitHub Copilot) throughout the rewrite:

Copilot suggested the TimescaleDB hypertable syntax and indexing strategies suited to time‑partitioned data.
It auto‑completed the batched agent’s async methods – saving hours of debugging asyncio edge cases.
When writing the Helm chart, Copilot generated the correct YAML structure for GPU node tolerations and volume mounts.
For the Morpheus pipeline, it filled in the boilerplate for the GpuTelemetryProcessorStage and the CEI scoring logic.
It also helped refactor the monolithic main.py into modular models.py, prometheus_metrics.py, and morpheus/pipeline.py.

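The most valuable part was pair‑debugging: I’d describe an error (e.g., SQLAlchemy connection pool timeouts), and Copilot would suggest a fix (for example, passing pool_pre_ping=True to create_engine so that each pooled connection is checked before use and stale ones are replaced). Without this, I estimate that finishing the enterprise stack would have taken roughly twice as long.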

This project is my proof that a solo developer – even from a phone – can build production‑grade infrastructure. The “finish” isn’t the end; it’s the foundation for scaling to 1000 GPUs and beyond.

AI tools were used in drafting this article and generating code.