This is a submission for the GitHub Finish-Up-A-Thon Challenge
What I Built
I built an open‑source GPU Energy Optimizer that detects a previously unknown telemetry anomaly: NVIDIA A100 GPUs draw 146.66W while reporting 0% utilization – sustained for 10+ minutes (Tests 13 & 14). I call this ghost power.
The project validates the anomaly with 35 hardware tests (24 A100 + 11 H100) and defines a new efficiency metric: CEI (Compute Energy Intensity) – FLOPs per joule. It includes a live API, a dashboard, a white paper, and a full enterprise‑scale architecture (TimescaleDB, batching, Prometheus, Kubernetes Helm, Morpheus pipeline).
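As a rough illustration of the two ideas above, here is a minimal Python sketch of how a ghost-power window and a CEI value could be computed from raw telemetry samples. The `Sample` type, the function names, and the 100 W power floor are assumptions made for this example and are not taken from the project's code; only the "0% utilization while drawing power" condition and the "FLOPs per joule" definition come from the post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    t: float         # seconds since the start of the capture
    power_w: float   # reported board power draw, in watts
    util_pct: float  # reported GPU utilization, in percent


def longest_ghost_window(samples: List[Sample],
                         power_floor_w: float = 100.0) -> float:
    """Longest continuous span (seconds) with 0% utilization while power
    stays at or above power_floor_w (a hypothetical threshold)."""
    longest = 0.0
    start: Optional[float] = None
    for s in samples:
        if s.util_pct == 0 and s.power_w >= power_floor_w:
            if start is None:
                start = s.t          # a new ghost window begins here
            longest = max(longest, s.t - start)
        else:
            start = None             # any real activity ends the window
    return longest


def cei(total_flops: float, energy_joules: float) -> float:
    """Compute Energy Intensity as defined in the post: FLOPs per joule."""
    if energy_joules <= 0:
        raise ValueError("energy_joules must be positive")
    return total_flops / energy_joules
```

Fed one sample every 30 seconds for 10 minutes at 146.66 W and 0% utilization, `longest_ghost_window` reports a 600-second window, and the CEI for that interval is zero because no FLOPs were delivered for the energy spent.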
🔗 GitHub repo (v1.0.1 release): github.com/mikebains41-debug/ai-gpu-energy-optimizer-
📄 White paper: WHITEPAPER.md
🚀 Live API: ai-gpu-brain-v3.onrender.com/docs
📊 Dashboard: ai-gpu-energy-optimizer.vercel.app
Demo
Before – early prototype (v0.1):
- SQLite database (single file, no concurrency)
- Direct HTTP POST per GPU (no batching)
- No Prometheus metrics or dashboards
- Manual deployment (docker‑compose only)
- 35 hardware tests, but no automated platform suite
After – production‑ready (v1.0.1 – final release):
- TimescaleDB (PostgreSQL + hypertables, continuous aggregates)
- Batched agent (30s windows, Redis queue, Celery workers)
- Prometheus exporter + Grafana dashboards (native GHOST metric)
- Kubernetes Helm chart + DaemonSet for agent deployment
- Morpheus pipeline for real‑time anomaly detection and auto‑alert
- 75 tests (35 hardware + 40 platform) passing in CI
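To make the move from one HTTP POST per reading to 30-second batches more concrete, here is a simplified Python sketch of a windowed batcher with retries. It is synchronous and keeps everything in memory, so it stands in for the idea only and not for the project's actual async agent or its Redis and Celery path. The class name, the `send` callback, and the retry count are hypothetical.

```python
import time
from typing import Callable, Dict, List


class BatchingAgent:
    """Buffers samples and sends them as one batch per time window."""

    def __init__(self, send: Callable[[List[Dict]], None],
                 window_s: float = 30.0, max_retries: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send                # hypothetical transport callback
        self._window_s = window_s
        self._max_retries = max_retries
        self._clock = clock              # injectable so tests need no sleep
        self._buffer: List[Dict] = []
        self._window_start = clock()

    def record(self, sample: Dict) -> bool:
        """Buffer one sample; flush once the window has elapsed.
        Returns True if a batch was delivered by this call."""
        self._buffer.append(sample)
        if self._clock() - self._window_start >= self._window_s:
            return self.flush()
        return False

    def flush(self) -> bool:
        """Send the buffered samples as one batch, retrying on failure."""
        if not self._buffer:
            self._window_start = self._clock()
            return False
        batch = list(self._buffer)
        for _ in range(self._max_retries):
            try:
                self._send(batch)
            except Exception:
                continue                 # try again, up to max_retries
            self._buffer.clear()
            self._window_start = self._clock()
            return True
        # Every attempt failed: keep the samples for the next window.
        self._window_start = self._clock()
        return False
```

A real agent would likely add backoff between retries and cap the buffer size so that a long outage cannot exhaust memory; both are left out here to keep the sketch short.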
The Comeback Story
Where the project was before:
I started this as a personal validation on RunPod, running 24 A100 tests from my Samsung phone using Termux. The code was a loose collection of scripts, a single FastAPI instance with SQLite, and no scalability. It proved the anomaly existed – but it wasn’t ready for real fleets.
What I changed, fixed, and added to finish it (v1.0.1):
Over the past month, I rewrote the entire stack:
- Database: Migrated from SQLite to TimescaleDB (hypertables, continuous aggregates).
- Agent: Added batching, retries, and async sending (30s windows).
- Queuing: Integrated Redis + Celery to decouple ingestion from processing.
- Observability: Built a Prometheus exporter with GHOST/DESYNC metrics.
- Orchestration: Created a Kubernetes Helm chart and DaemonSet for agent deployment.
- AI Pipeline: Wrote a Morpheus pipeline that pulls live API data, scores CEI, and auto‑alerts.
- Testing: Grew from 35 hardware tests to 75 total (including 40 platform validation tests).
The finishing moment was running the full enterprise test suite on a simulated 1000‑GPU cluster (using the new Morpheus test harness) and seeing all 30 M1‑M30 tests pass – then tagging the v1.0.1 release on GitHub.
My Experience with GitHub Copilot
I used AI assistance (including GitHub Copilot) throughout the rewrite:
- Copilot suggested the TimescaleDB hypertable syntax and the best indexing strategies for time‑partitioned data.
- It auto‑completed the batched agent’s async methods – saving hours of debugging asyncio edge cases.
- When writing the Helm chart, Copilot generated the correct YAML structure for GPU node tolerations and volume mounts.
- For the Morpheus pipeline, it filled in the boilerplate for the GpuTelemetryProcessorStage and the CEI scoring logic.
- It also helped refactor the monolithic main.py into modular models.py, prometheus_metrics.py, and morpheus/pipeline.py.
The most valuable part was pair‑debugging: I’d describe an error (e.g., SQLAlchemy connection pool timeouts), and Copilot would suggest the fix (adding pool_pre_ping=True). Without this, finishing the enterprise stack would have taken twice as long.
This project is my proof that a solo developer – even from a phone – can build production‑grade infrastructure. The “finish” isn’t the end; it’s the foundation for scaling to 1000 GPUs and beyond.
AI tools were used in drafting this article and generating code.