<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MegabytesNYC</title>
    <description>The latest articles on DEV Community by MegabytesNYC (@megabytesllc).</description>
    <link>https://dev.to/megabytesllc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3822977%2F631b2ece-f294-4c1b-af42-b860eb7feda7.png</url>
      <title>DEV Community: MegabytesNYC</title>
      <link>https://dev.to/megabytesllc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/megabytesllc"/>
    <language>en</language>
    <item>
      <title>I built an open-source LLM benchmarking tool with an AI judge — here's how it works</title>
      <dc:creator>MegabytesNYC</dc:creator>
      <pubDate>Fri, 13 Mar 2026 19:32:39 +0000</pubDate>
      <link>https://dev.to/megabytesllc/i-built-an-open-source-llm-benchmarking-tool-with-an-ai-judge-heres-how-it-works-2df8</link>
      <guid>https://dev.to/megabytesllc/i-built-an-open-source-llm-benchmarking-tool-with-an-ai-judge-heres-how-it-works-2df8</guid>
      <description>&lt;p&gt;When I started running local LLMs in my homelab, I kept hitting the same problem: tokens per second tells you how fast a model is. It doesn't tell you if the answer was any good.&lt;br&gt;
So I built JudgeGPT — a self-hosted tool that benchmarks multiple models simultaneously and uses a second LLM to score every response.&lt;br&gt;
The architecture&lt;br&gt;
The orchestrator is a FastAPI service that uses the Docker SDK to spawn isolated ollama/ollama containers on demand — one per model, each on its own port. They all mount ~/.ollama so models you've already pulled don't re-download.&lt;br&gt;
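A minimal sketch of that container-per-model pattern with the Docker SDK for Python. The image and the ~/.ollama mount come from the setup above; the port scheme, container names, and helper names are illustrative, not JudgeGPT's actual code:&lt;br&gt;

```python
import os

def assign_ports(models, base_port=11500):
    """One host port per model, so each container is reachable separately."""
    return {model: base_port + i for i, model in enumerate(models)}

def spawn_ollama_containers(models, base_port=11500):
    import docker  # Docker SDK for Python (pip install docker)
    client = docker.from_env()
    ollama_dir = os.path.expanduser("~/.ollama")
    started = {}
    for model, port in assign_ports(models, base_port).items():
        started[model] = client.containers.run(
            "ollama/ollama",
            detach=True,
            name=f"bench-{model.replace(':', '-')}",
            ports={"11434/tcp": port},  # isolated port per model
            # shared mount so already-pulled models are not re-downloaded
            volumes={ollama_dir: {"bind": "/root/.ollama", "mode": "rw"}},
        )
    return started
```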
After benchmarks complete, a dedicated judge container (running qwen2.5:7b) uses Ollama's native JSON mode to score each response on five criteria: Accuracy, Clarity, Depth, Concision, and Examples. The judge runs in its own container so it doesn't compete for the GPU with the models being benchmarked.&lt;br&gt;
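The judging call might look roughly like this. The judge model and the JSON-mode flag match the description above; the prompt text, endpoint wiring, and score clamping are my assumptions:&lt;br&gt;

```python
import json
from urllib import request

CRITERIA = ["accuracy", "clarity", "depth", "concision", "examples"]

def judge_response(question, answer, judge_url="http://localhost:11434"):
    """Ask the judge model to score one answer via Ollama's JSON mode."""
    prompt = (
        "Score the answer on each criterion from 1 to 10. "
        f"Respond as JSON with keys {CRITERIA}.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    body = json.dumps({
        "model": "qwen2.5:7b",
        "prompt": prompt,
        "format": "json",   # Ollama's native JSON mode
        "stream": False,
    }).encode()
    req = request.Request(f"{judge_url}/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        raw = json.loads(resp.read())["response"]
    return parse_scores(raw)

def parse_scores(raw_json):
    """Validate the judge's JSON, clamp to 1..10, and average the criteria."""
    scores = json.loads(raw_json)
    clamped = {c: max(1, min(10, int(scores[c]))) for c in CRITERIA}
    clamped["overall"] = sum(clamped[c] for c in CRITERIA) / len(CRITERIA)
    return clamped
```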
The final leaderboard score is a weighted blend: TPS at 35%, TTFT at 15%, and quality at 50%. You can also add your own human star rating, which blends into the quality component.&lt;br&gt;
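One way that 35/15/50 weighting could be computed. Only the weights come from the description above; the normalization (each component scaled against the best run, TTFT inverted because a lower time to first token is better) is an assumption:&lt;br&gt;

```python
def leaderboard_score(tps, ttft, quality, tps_best, ttft_best):
    """Weighted composite per the 35/15/50 split.

    Assumed normalization: TPS and TTFT are scaled to 0..1 against the
    best run in the benchmark, and TTFT is inverted since lower is better.
    """
    tps_n = tps / tps_best if tps_best else 0.0
    ttft_n = ttft_best / ttft if ttft else 0.0
    quality_n = quality / 10.0  # judge scores are on a 1..10 scale
    return 0.35 * tps_n + 0.15 * ttft_n + 0.50 * quality_n
```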
GPU metrics across platforms&lt;br&gt;
One of the trickier parts was getting real-time GPU telemetry working across Metal, ROCm, and CUDA. The orchestrator detects the platform at startup and routes to the right tool:&lt;/p&gt;

&lt;p&gt;macOS Apple Silicon → powermetrics&lt;br&gt;
AMD → rocm-smi&lt;br&gt;
NVIDIA → nvidia-smi&lt;/p&gt;

&lt;p&gt;The orchestrator polls the active tool every 2 seconds during a benchmark run and streams the readings to the frontend as live sparklines. Peak and average values roll up into the results summary.&lt;br&gt;
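The platform routing and the 2-second polling loop, sketched with simplified stand-in flags for each CLI (the real telemetry parsing is richer than this):&lt;br&gt;

```python
import platform
import shutil
import subprocess
import time

def detect_gpu_tool():
    """Route to the right telemetry CLI for the current platform."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        # powermetrics requires sudo on macOS
        return ["powermetrics", "--samplers", "gpu_power", "-n", "1"]
    if shutil.which("rocm-smi"):
        return ["rocm-smi", "--showuse"]
    if shutil.which("nvidia-smi"):
        return ["nvidia-smi", "--query-gpu=utilization.gpu",
                "--format=csv,noheader,nounits"]
    return None

def poll_gpu(stop_event, interval=2.0):
    """Sample the detected tool every 2 seconds until told to stop."""
    cmd = detect_gpu_tool()
    while cmd and not stop_event.is_set():
        result = subprocess.run(cmd, capture_output=True, text=True)
        yield result.stdout.strip()  # raw reading; parse per tool upstream
        time.sleep(interval)
```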
Other features worth mentioning&lt;/p&gt;

&lt;p&gt;Download Manager tab with SSE-streamed pull progress&lt;br&gt;
Full benchmark history in SQLite with one-click restore&lt;br&gt;
Sequential mode for low-VRAM setups&lt;br&gt;
Playground for comparing two OpenAI-compatible endpoints side by side&lt;br&gt;
Export as PDF report, JSON, or CSV&lt;br&gt;
Prometheus /metrics endpoint&lt;/p&gt;
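&lt;p&gt;For the SSE-streamed pull progress, the wire format is plain Server-Sent Events. A sketch that turns Ollama's /api/pull status payloads into frames; the derived percent field and the function name are my own additions:&lt;/p&gt;

```python
import json

def pull_progress_frames(status_updates):
    """Turn Ollama pull status dicts into SSE data frames."""
    for update in status_updates:
        total = update.get("total")
        done = update.get("completed", 0)
        if total:
            # derived field for the frontend progress bar
            update["percent"] = round(100.0 * done / total, 1)
        yield "data: " + json.dumps(update) + "\n\n"  # one SSE frame each
```

&lt;p&gt;A FastAPI endpoint would hand this generator to a streaming response with media type text/event-stream.&lt;/p&gt;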

&lt;p&gt;Stack: FastAPI, Docker SDK, React 18, Vite, Recharts, Ollama, nginx&lt;br&gt;
Repo: &lt;a href="https://github.com/MegaBytesllc/judgegpt" rel="noopener noreferrer"&gt;https://github.com/MegaBytesllc/judgegpt&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
