MegabytesNYC
I built an open-source LLM benchmarking tool with an AI judge — here's how it works

When I started running local LLMs in my homelab, I kept hitting the same problem: tokens per second tells you how fast a model is. It doesn't tell you if the answer was any good.
So I built JudgeGPT — a self-hosted tool that benchmarks multiple models simultaneously and uses a second LLM to score every response.
The architecture
The orchestrator is a FastAPI service that uses the Docker SDK to spawn isolated ollama/ollama containers on demand — one per model, each on its own port. They all mount ~/.ollama so models you've already pulled don't re-download.
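In case it helps to picture it, here's a minimal sketch of that spawn step using the Docker SDK for Python. The function and container names are illustrative, not the repo's actual code; the real orchestrator also tracks container lifecycles and health.

```python
from pathlib import Path

def container_name(model: str) -> str:
    """Derive a Docker-safe container name from a model tag like 'llama3:8b'."""
    return "judgegpt-" + model.replace(":", "-").replace("/", "-")

def spawn_model_container(model: str, host_port: int):
    """Start one isolated Ollama container for `model` on its own host port."""
    import docker  # lazy import so container_name() works without the SDK installed
    client = docker.from_env()
    return client.containers.run(
        "ollama/ollama",
        name=container_name(model),
        detach=True,
        ports={"11434/tcp": host_port},  # one model per host port
        volumes={str(Path.home() / ".ollama"):
                 {"bind": "/root/.ollama", "mode": "rw"}},  # reuse pulled models
    )
```

Mounting the shared `~/.ollama` directory is what makes repeated runs cheap: the container sees the host's model cache instead of pulling gigabytes again.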
After benchmarks complete, a dedicated judge container (running qwen2.5:7b) uses Ollama's native JSON mode to score each response on five criteria: Accuracy, Clarity, Depth, Concision, and Examples. The judge runs in isolation so it doesn't compete for GPU with the models being benchmarked.
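A rough sketch of what a judge call could look like: Ollama's `/api/generate` endpoint accepts `"format": "json"` to constrain output to valid JSON. The prompt wording, port, and the `quality_score` helper here are my illustrations, not JudgeGPT's actual internals.

```python
import json
from urllib import request

CRITERIA = ["accuracy", "clarity", "depth", "concision", "examples"]

def judge(question: str, answer: str, host: str = "http://localhost:11500") -> dict:
    """Ask the judge model for per-criterion scores as structured JSON."""
    prompt = (
        "Score the answer to the question from 1-10 on each of: "
        + ", ".join(CRITERIA)
        + '. Respond only as JSON, e.g. {"accuracy": 7, ...}.\n'
        + f"Question: {question}\nAnswer: {answer}"
    )
    body = json.dumps({
        "model": "qwen2.5:7b",
        "prompt": prompt,
        "format": "json",   # Ollama's native JSON mode
        "stream": False,
    }).encode()
    req = request.Request(host + "/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(json.load(resp)["response"])

def quality_score(scores: dict) -> float:
    """Collapse the five criteria into a single 0-10 quality score."""
    return sum(scores[c] for c in CRITERIA) / len(CRITERIA)
```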
The final leaderboard score is a weighted blend: TPS at 35%, TTFT at 15%, and judge quality at 50%. You can also add your own human star rating, which blends into the quality component.
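The weighting can be sketched like this. The post doesn't say how raw TPS and TTFT are normalized before weighting, so the min-max scheme below (with TTFT inverted, since lower latency is better) is one plausible assumption, not the tool's exact formula.

```python
def normalize(value: float, lo: float, hi: float, invert: bool = False) -> float:
    """Min-max normalize into [0, 1]; invert for lower-is-better metrics."""
    span = (hi - lo) or 1.0
    x = max(0.0, min(1.0, (value - lo) / span))
    return 1.0 - x if invert else x

def leaderboard_score(tps: float, ttft: float, quality: float,
                      tps_range: tuple, ttft_range: tuple) -> float:
    """Weighted blend: TPS 35%, TTFT 15%, judge quality (0-10 scale) 50%."""
    tps_n = normalize(tps, *tps_range)
    ttft_n = normalize(ttft, *ttft_range, invert=True)  # lower latency is better
    return 0.35 * tps_n + 0.15 * ttft_n + 0.50 * (quality / 10.0)
```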
GPU metrics across platforms
One of the trickier parts was getting real-time GPU telemetry working across Metal, ROCm, and CUDA. The orchestrator detects the platform at startup and routes to the right tool:

macOS Apple Silicon → powermetrics
AMD → rocm-smi
NVIDIA → nvidia-smi

Each poller samples every 2 seconds during a benchmark run and streams readings to the frontend as live sparklines. The peak and average values roll up into the results summary.
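The routing step above is simple to sketch. This is a hedged approximation of the detection logic (the function name is mine; the real orchestrator also has to parse each tool's very different output format):

```python
import platform
import shutil
from typing import Optional

def gpu_telemetry_tool() -> Optional[str]:
    """Pick the GPU telemetry CLI for this host; parsing happens elsewhere."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "powermetrics"      # Apple Silicon / Metal
    if shutil.which("nvidia-smi"):
        return "nvidia-smi"        # NVIDIA / CUDA
    if shutil.which("rocm-smi"):
        return "rocm-smi"          # AMD / ROCm
    return None                    # CPU-only host: skip GPU sparklines
```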
Other features worth mentioning

Download Manager tab with SSE-streamed pull progress
Full benchmark history in SQLite with one-click restore
Sequential mode for low-VRAM setups
Playground for comparing two OpenAI-compatible endpoints side by side
Export as PDF report, JSON, or CSV
Prometheus /metrics endpoint

Stack: FastAPI, Docker SDK, React 18, Vite, Recharts, Ollama, nginx
Repo: https://github.com/MegaBytesllc/judgegpt
