Harish Kotra (he/him)

I Built a Local LLM Evaluator That Compares Models Side-by-Side - Here's How

TL;DR: I built llm-eval — an open-source tool that auto-detects your locally-installed Ollama models, fires the same prompt at all of them simultaneously, and gives you hard numbers on which one is fastest, most verbose, and cheapest. It has both a CLI and a streaming web UI with markdown rendering. Here's the full technical breakdown.

The Problem

If you run local LLMs with Ollama, you've probably been in this situation:

"I just pulled three new models. Which one should I actually use?"

You open a terminal, run ollama run llama3.2, type a prompt, wait. Then you run ollama run qwen3:4b, type the same prompt again, wait. Then you try to remember how fast the first one was. Was it 60 tokens/sec or 70? What was the time to first byte? How many tokens did it output?

It's tedious, error-prone, and unscalable.

I wanted a tool that would:

  1. Auto-detect every model I've pulled
  2. Send the same prompt to all of them at once
  3. Stream responses in real time
  4. Measure everything — latency, TTFB, tokens/sec, token counts, cost
  5. Highlight the winner so I can make informed decisions

So I built it.


Meet llm-eval

llm-eval is a TypeScript tool for evaluating and comparing Ollama models. It ships with two interfaces:

  • Interactive CLI — A REPL with commands for switching models, comparing all, exporting results, and managing conversation history
  • Web UI — A dark-themed dashboard with SSE streaming, markdown-rendered output, syntax-highlighted code blocks, and a comparison metrics table

The entire project is built on one key library: pi-ai.


The Secret Weapon: pi-ai

@mariozechner/pi-ai by Mario Zechner is a provider-agnostic TypeScript library for LLM interactions. Think of it as a universal adapter for LLMs — one API that works across 20+ providers: OpenAI, Anthropic, Google, Ollama, Groq, Mistral, and more.

Why pi-ai instead of calling the Ollama API directly?

1. Unified Streaming via Async Iterators

pi-ai's streamSimple() function returns an async iterable that yields structured events:

import { streamSimple } from '@mariozechner/pi-ai';

const eventStream = streamSimple(model, context, { apiKey: 'ollama' });

for await (const event of eventStream) {
    switch (event.type) {
        case 'text_delta':
            // A chunk of text arrived
            process.stdout.write(event.delta);
            break;
        case 'done':
            // Stream complete — usage stats available
            const { input, output, totalTokens } = event.message.usage;
            const cost = event.message.usage.cost?.total ?? 0;
            break;
        case 'error':
            // Structured error from the provider
            console.error(event.error.errorMessage);
            break;
    }
}

This is incredibly clean. No manual SSE parsing, no WebSocket management, no callback hell. Just a for await...of loop.

2. Structured Model Definitions

pi-ai uses typed Model objects that encode everything about a provider:

const model: Model<'openai-completions'> = {
    id: 'llama3.2:latest',
    name: 'llama3.2:latest',
    api: 'openai-completions',
    provider: 'ollama',
    baseUrl: 'http://localhost:11434/v1',
    reasoning: false,
    input: ['text'],
    cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
    contextWindow: 128000,
    maxTokens: 4096,
    compat: {
        supportsStore: false,
        supportsDeveloperRole: false,
        supportsReasoningEffort: false,
        supportsUsageInStreaming: true,
        maxTokensField: 'max_tokens',
        // ... more compatibility flags
    },
};

The compat field is genius — it lets pi-ai handle the quirks of each provider (Ollama doesn't support developer role, Anthropic uses max_tokens differently, etc.) without you having to care.

3. Built-in Token Tracking and Cost Estimation

The done event from pi-ai includes usage with input, output, totalTokens, and cost. For cloud providers, this gives you real cost numbers. For Ollama, cost is $0.00 — but the token counts are still invaluable for comparing model verbosity and efficiency.
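Even with local cost pinned at zero, those token counts let you project what the same response would cost elsewhere. A minimal sketch of the arithmetic (the function name and the per-million-token prices below are illustrative, not from llm-eval or pi-ai):

```typescript
// Hypothetical helper: project the cost of a response on a paid provider
// from the token counts pi-ai reports. Prices are illustrative only.
function projectCost(
    inputTokens: number,
    outputTokens: number,
    pricePerMTokIn: number,   // USD per 1M input tokens
    pricePerMTokOut: number,  // USD per 1M output tokens
): number {
    return (inputTokens / 1e6) * pricePerMTokIn + (outputTokens / 1e6) * pricePerMTokOut;
}

// 1200 input + 850 output tokens at $0.15 / $0.60 per 1M tokens:
console.log(projectCost(1200, 850, 0.15, 0.60).toFixed(6)); // → 0.000690
```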

4. Provider Swappability

Because pi-ai abstracts the provider, the same evaluation engine could work with OpenAI, Anthropic, or Groq by just swapping the Model definition. No code changes needed. This is the foundation for one of the most exciting enhancement possibilities: multi-provider benchmarking.
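Concretely, pointing the same engine at a cloud provider would just mean a different Model object. A sketch reusing the shape shown earlier (the model id, endpoint, and pricing numbers here are illustrative, and the exact compat flags would need checking against pi-ai's documentation):

```typescript
// Hypothetical cloud-provider definition with the same shape as the Ollama
// one above. Only the endpoint, provider name, and (illustrative) pricing
// change; the evaluation engine would consume it unmodified.
const cloudModel = {
    id: 'gpt-4o-mini',
    name: 'gpt-4o-mini',
    api: 'openai-completions' as const,
    provider: 'openai',
    baseUrl: 'https://api.openai.com/v1',
    reasoning: false,
    input: ['text'],
    cost: { input: 0.15, output: 0.60, cacheRead: 0, cacheWrite: 0 }, // USD per 1M tokens, illustrative
    contextWindow: 128000,
    maxTokens: 16384,
};

console.log(cloudModel.baseUrl); // the only infrastructure-level difference
```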


Architecture Deep Dive

Here's how all the pieces fit together:

┌─────────────────────────────────────────────────────────────┐
│                        llm-eval                             │
│                                                             │
│  ┌─────────────┐    ┌──────────────┐    ┌───────────────┐   │
│  │  CLI (REPL) │    │  Web Server  │    │  Web Frontend │   │
│  │  index.ts   │    │  server.ts   │    │  public/      │   │
│  │  session.ts │    │  Express +   │    │  HTML/CSS/JS  │   │
│  │             │    │  SSE         │    │  marked.js    │   │
│  └──────┬──────┘    └──────┬───────┘    └───────────────┘   │
│         │                  │                                 │
│         └────────┬─────────┘                                │
│                  ▼                                          │
│  ┌──────────────────────────────┐                           │
│  │      Evaluation Engine       │                           │
│  │  evaluator.ts  metrics.ts    │                           │
│  │  table.ts      types.ts      │                           │
│  └──────────────┬───────────────┘                           │
│                 ▼                                           │
│  ┌──────────────────────────────┐                           │
│  │     Model Discovery          │                           │
│  │  models.ts                   │                           │
│  │  `ollama list` → pi-ai Model │                           │
│  └──────────────┬───────────────┘                           │
│                 ▼                                           │
│  ┌──────────────────────────────┐                           │
│  │    @mariozechner/pi-ai       │                           │
│  │  streamSimple() → SSE events │                           │
│  └──────────────┬───────────────┘                           │
│                 ▼                                           │
│  ┌──────────────────────────────┐                           │
│  │     Ollama (localhost)       │                           │
│  │  OpenAI-compatible /v1 API   │                           │
│  └──────────────────────────────┘                           │
└─────────────────────────────────────────────────────────────┘

Layer 1: Model Discovery (models.ts)

The entry point is detectOllamaModels(), which shells out to ollama list:

const output = execSync('ollama list', { encoding: 'utf-8', timeout: 10000 });
const lines = output.trim().split('\n');
return lines.slice(1).map((line) => line.trim().split(/\s+/)[0]);

Simple but effective. It parses the first column (model name) from each row, skipping the header. The resulting names like llama3.2:latest are then wrapped into pi-ai Model objects via createOllamaModel().

Error handling is thoughtful — it distinguishes between "Ollama not installed" (ENOENT), "Ollama not running" (ECONNREFUSED), and generic failures, giving actionable troubleshooting guidance in each case.
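A sketch of what that triage can look like (the helper name and messages are mine, not lifted from models.ts):

```typescript
// Hypothetical helper mirroring the error triage described above: map the
// failure mode of `ollama list` to an actionable message.
function classifyOllamaError(err: { code?: string; message: string }): string {
    if (err.code === 'ENOENT') {
        return 'Ollama is not installed (or not on PATH). Install it from https://ollama.com';
    }
    if (err.code === 'ECONNREFUSED' || err.message.includes('ECONNREFUSED')) {
        return 'Ollama is installed but not running. Start it with `ollama serve`.';
    }
    return `Could not list Ollama models: ${err.message}`;
}

console.log(classifyOllamaError({ code: 'ENOENT', message: 'spawn ollama' }));
```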

Layer 2: Evaluation Engine (evaluator.ts)

This is the core. evaluateModel() does four things:

  1. Starts a timer with performance.now()
  2. Calls streamSimple() from pi-ai
  3. Captures time-to-first-token when the first text_delta event arrives
  4. Computes metrics when the done event fires

export async function evaluateModel(
    model: Model<'openai-completions'>,
    context: Context,
    options?: SimpleStreamOptions
): Promise<EvaluationResult> {
    const startTime = performance.now();
    let firstTokenTime: number | null = null;
    let fullOutput = '';

    const eventStream = streamSimple(model, context, { apiKey: 'ollama', ...options });

    for await (const event of eventStream) {
        switch (event.type) {
            case 'text_delta':
                if (firstTokenTime === null) firstTokenTime = performance.now();
                fullOutput += event.delta;
                process.stdout.write(event.delta);
                break;
            case 'done':
                const endTime = performance.now();
                const totalLatencyMs = Math.round(endTime - startTime);
                const ttfbMs = Math.round((firstTokenTime ?? endTime) - startTime);
                const tps = (event.message.usage.output / totalLatencyMs) * 1000;
                return { modelId: model.id, output: fullOutput, totalLatencyMs, ttfbMs, tps, ... };
        }
    }
}

The evaluateAllModels() function supports both sequential and concurrent evaluation modes.
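The difference between the two modes boils down to a for-of loop versus Promise.all. A generic sketch of the fan-out, not evaluateAllModels' actual signature:

```typescript
// Generic sketch of sequential vs. concurrent fan-out over async tasks.
// Sequential keeps models from competing for GPU/RAM; concurrent is faster
// when the machine can serve several models at once.
async function runAll<T>(
    tasks: Array<() => Promise<T>>,
    mode: 'sequential' | 'concurrent',
): Promise<T[]> {
    if (mode === 'concurrent') {
        return Promise.all(tasks.map((task) => task()));
    }
    const results: T[] = [];
    for (const task of tasks) {
        results.push(await task()); // one model at a time
    }
    return results;
}
```

For local models, sequential is usually the fairer benchmark: concurrent runs contend for the same GPU and skew each other's latency numbers.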

Layer 3: Metrics (metrics.ts)

Three utilities:

  • computeTokensPerSecond() — Simple division: (outputTokens / latencyMs) * 1000
  • findBest() — Finds the model with the best value for a given metric (lowest latency, highest TPS, etc.)
  • computeSimilarityScores() — Jaccard word-overlap similarity between model outputs. Useful for spotting models that give wildly different answers to the same prompt.
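Jaccard word overlap is just intersection size over union size. A sketch of the idea (computeSimilarityScores' actual tokenization and signature may differ):

```typescript
// Jaccard word-overlap between two outputs: |A ∩ B| / |A ∪ B|.
// 1.0 means identical word sets, 0.0 means no words in common.
function jaccardSimilarity(a: string, b: string): number {
    const wordsA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
    const wordsB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
    if (wordsA.size === 0 && wordsB.size === 0) return 1;
    let intersection = 0;
    for (const w of wordsA) if (wordsB.has(w)) intersection++;
    const union = wordsA.size + wordsB.size - intersection;
    return intersection / union;
}

console.log(jaccardSimilarity('TCP is a reliable protocol', 'TCP is an unreliable protocol'));
// → 0.42857… (3 shared words out of 7 distinct)
```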

Layer 4: Web Server (server.ts)

An Express 5 server with three endpoints:

Endpoint        Method   Description
/api/models     GET      Returns all detected Ollama models
/api/evaluate   POST     Evaluates one model, returns SSE stream
/api/compare    POST     Evaluates all models sequentially, returns SSE stream
The SSE endpoints use Express's res.write() to stream events:

res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
});

for await (const event of eventStream) {
    if (res.closed) break;  // Stop if browser disconnected

    switch (event.type) {
        case 'text_delta':
            res.write(`data: ${JSON.stringify({ type: 'text_delta', delta: event.delta })}\n\n`);
            break;
        case 'done':
            res.write(`data: ${JSON.stringify({ type: 'done', metrics: { ... } })}\n\n`);
            break;
    }
}

An important debugging lesson: I initially used req.closed to detect client disconnection. This breaks immediately on POST requests because req.closed becomes true as soon as the request body is consumed — which happens instantly for small JSON payloads. The fix was using res.closed, which correctly monitors the response stream connection. This is one of those subtle bugs that's easy to miss and hard to debug.

Layer 5: Web Frontend (public/)

Pure HTML/CSS/JS — no build step, no framework. The frontend handles:

  1. Model detection — Fetches /api/models and renders clickable chips
  2. SSE consumption — Uses the Fetch API's ReadableStream to parse SSE events
  3. Live streaming — During streaming, raw text is displayed with white-space: pre-wrap
  4. Markdown rendering — On completion, marked.parse() converts raw text to styled HTML, and highlight.js applies syntax highlighting to code blocks
  5. Metrics display — Per-card inline metrics and a comparison table with best-model highlighting

The SSE consumer is hand-written rather than using EventSource because EventSource only supports GET requests — we need POST for sending prompts:

async function consumeSSEStream(body, handlers) {
    const reader = body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() || '';

        for (const line of lines) {
            if (!line.startsWith('data: ')) continue;
            const data = JSON.parse(line.slice(6).trim());
            // Dispatch to handlers based on data.type
        }
    }
}
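The buffer-and-pop dance matters because network chunks can split an SSE line anywhere, including mid-JSON. A standalone illustration of the same trick:

```typescript
// Standalone sketch of the buffering above: keep the trailing partial
// line until the next chunk completes it.
function makeSSEParser() {
    let buffer = '';
    return (chunk: string): Array<Record<string, unknown>> => {
        buffer += chunk;
        const lines = buffer.split('\n');
        buffer = lines.pop() || ''; // hold the incomplete last line
        return lines
            .filter((line) => line.startsWith('data: '))
            .map((line) => JSON.parse(line.slice(6).trim()));
    };
}

const feed = makeSSEParser();
// A chunk boundary falls in the middle of the JSON payload:
console.log(feed('data: {"type":"text_del')); // → [] (incomplete, buffered)
console.log(feed('ta","delta":"Hi"}\n\n'));   // → [ { type: 'text_delta', delta: 'Hi' } ]
```

Without the buffer, the first chunk would hit JSON.parse with a truncated payload and throw.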

The Streaming Pipeline: End to End

Here's what happens when you click "Compare All" with three models selected:

Browser                    Express Server              Ollama
  │                             │                        │
  │  POST /api/compare          │                        │
  │  { prompt: "Explain TCP" }  │                        │
  │ ──────────────────────────► │                        │
  │                             │                        │
  │  SSE: compare_start         │                        │
  │ ◄────────────────────────── │                        │
  │                             │                        │
  │  ┌───── Model 1 ──────────────────────────────────┐  │
  │  │  SSE: model_start        │                      │  │
  │  │ ◄────────────────────── │  streamSimple(m1) ──►│  │
  │  │                         │                      │  │
  │  │  SSE: text_delta ×N     │  ◄── text chunks ─── │  │
  │  │ ◄────────────────────── │                      │  │
  │  │                         │                      │  │
  │  │  SSE: model_done        │  ◄── done ────────── │  │
  │  │   { metrics }           │                      │  │
  │  │ ◄────────────────────── │                      │  │
  │  └─────────────────────────────────────────────────┘  │
  │                             │                        │
  │  ┌───── Model 2 (repeat) ──────────────────────────┐  │
  │  │ ...                     │                       │  │
  │  └─────────────────────────────────────────────────┘  │
  │                             │                        │
  │  ┌───── Model 3 (repeat) ──────────────────────────┐  │
  │  │ ...                     │                       │  │
  │  └─────────────────────────────────────────────────┘  │
  │                             │                        │
  │  SSE: stream_end            │                        │
  │ ◄────────────────────────── │                        │
  │                             │                        │
  │  [Browser renders markdown  │                        │
  │   and shows metrics table]  │                        │

Models are evaluated sequentially in the comparison endpoint. Each model gets its own model_start → text_delta × N → model_done lifecycle. The browser renders each model's output in its own card, streaming text as it arrives, then converting to rendered markdown when model_done fires.


Metrics That Matter

For each model evaluation, llm-eval captures:

Metric          What It Measures                        Why It Matters
Latency         Total time from request to last token   Overall speed
TTFB            Time to first byte/token                Perceived responsiveness
TPS             Output tokens per second                Raw generation speed
Input Tokens    Tokens in the prompt                    Prompt efficiency
Output Tokens   Tokens generated                        Response verbosity
Total Tokens    Input + Output                          Resource consumption
Cost            Estimated cost (provider-dependent)     Budget planning
Similarity      Jaccard word overlap (CLI only)         Output consistency

In comparison mode, the best performer for each metric is automatically highlighted — green for lowest latency/TTFB, cyan for highest TPS.
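The winner-picking reduces to a min/max scan over the result rows. A sketch of logic like findBest() (its real signature and field names are likely different):

```typescript
// Hypothetical reduction over evaluation results: pick the entry with the
// lowest (latency, TTFB) or highest (TPS) value for a numeric metric.
interface Row { modelId: string; ttfbMs: number; tps: number }

function findBest(rows: Row[], key: 'ttfbMs' | 'tps', dir: 'min' | 'max'): Row {
    return rows.reduce((best, r) =>
        (dir === 'min' ? r[key] < best[key] : r[key] > best[key]) ? r : best,
    );
}

const rows: Row[] = [
    { modelId: 'llama3.2:latest', ttfbMs: 180, tps: 62 },
    { modelId: 'qwen3:4b', ttfbMs: 240, tps: 71 },
];
console.log(findBest(rows, 'ttfbMs', 'min').modelId); // → llama3.2:latest
console.log(findBest(rows, 'tps', 'max').modelId);    // → qwen3:4b
```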


What I'd Build Next

If I were to continue evolving this project, here's my priority list:

  1. Multi-provider support — pi-ai already supports OpenAI, Anthropic, Google, Groq, etc. Adding a provider dropdown would let you compare local Ollama models against cloud APIs.

  2. Automated scoring with a judge model — Send all outputs to GPT-4 or Claude with a rubric, get numerical scores. True apples-to-apples comparison.

  3. Prompt regression suites — Save a set of prompts as a "test suite," run them after each model update, track quality over time.

  4. Streaming latency charts — Real-time visualization of token arrival rate. Some models front-load tokens, others have a steady drip.

  5. Docker Compose — docker compose up → Ollama + llm-eval running, zero setup.


Run It Yourself

# Prerequisites: Node.js ≥ 20, Ollama with at least one model
ollama pull llama3.2
ollama pull qwen3:4b    # Optional — more models = more fun

git clone https://github.com/harishkotra/llm-eval.git
cd llm-eval
npm install
npm run server
# → http://localhost:3000

The CLI is available via npm start if you prefer terminal-native workflows.

Huge shout-out to Mario Zechner and his pi-ai library. Without it, this project would have been 3x the code and 10x the headache. If you're building anything with LLMs in TypeScript, seriously check it out — it's the best provider abstraction layer I've worked with.

The full source code is available at github.com/harishkotra/llm-eval. Star it, fork it, break it, improve it.
