TL;DR: I built llm-eval — an open-source tool that auto-detects your locally-installed Ollama models, fires the same prompt at all of them simultaneously, and gives you hard numbers on which one is fastest, most verbose, and cheapest. It has both a CLI and a streaming web UI with markdown rendering. Here's the full technical breakdown.
## The Problem
If you run local LLMs with Ollama, you've probably been in this situation:
"I just pulled three new models. Which one should I actually use?"
You open a terminal, run `ollama run llama3.2`, type a prompt, wait. Then you run `ollama run qwen3:4b`, type the same prompt again, wait. Then you try to remember how fast the first one was. Was it 60 tokens/sec or 70? What was the time to first byte? How many tokens did it output?
It's tedious, error-prone, and unscalable.
I wanted a tool that would:
- Auto-detect every model I've pulled
- Send the same prompt to all of them at once
- Stream responses in real time
- Measure everything — latency, TTFB, tokens/sec, token counts, cost
- Highlight the winner so I can make informed decisions
So I built it.
## Meet llm-eval
llm-eval is a TypeScript tool for evaluating and comparing Ollama models. It ships with two interfaces:
- Interactive CLI — A REPL with commands for switching models, comparing all, exporting results, and managing conversation history
- Web UI — A dark-themed dashboard with SSE streaming, markdown-rendered output, syntax-highlighted code blocks, and a comparison metrics table
The entire project is built on one key library: pi-ai.
## The Secret Weapon: pi-ai
@mariozechner/pi-ai by Mario Zechner is a provider-agnostic TypeScript library for LLM interactions. Think of it as a universal adapter for LLMs — one API that works across 20+ providers: OpenAI, Anthropic, Google, Ollama, Groq, Mistral, and more.
Why pi-ai instead of calling the Ollama API directly?
### 1. Unified Streaming via Async Iterators

pi-ai's `streamSimple()` function returns an async iterable that yields structured events:

```typescript
import { streamSimple } from '@mariozechner/pi-ai';

const eventStream = streamSimple(model, context, { apiKey: 'ollama' });

for await (const event of eventStream) {
  switch (event.type) {
    case 'text_delta':
      // A chunk of text arrived
      process.stdout.write(event.delta);
      break;
    case 'done': {
      // Stream complete — usage stats available
      const { input, output, totalTokens } = event.message.usage;
      const cost = event.message.usage.cost?.total ?? 0;
      break;
    }
    case 'error':
      // Structured error from the provider
      console.error(event.error.errorMessage);
      break;
  }
}
```
This is incredibly clean: no manual SSE parsing, no WebSocket management, no callback hell. Just a `for await...of` loop.
### 2. Structured Model Definitions

pi-ai uses typed `Model` objects that encode everything about a provider:

```typescript
const model: Model<'openai-completions'> = {
  id: 'llama3.2:latest',
  name: 'llama3.2:latest',
  api: 'openai-completions',
  provider: 'ollama',
  baseUrl: 'http://localhost:11434/v1',
  reasoning: false,
  input: ['text'],
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  contextWindow: 128000,
  maxTokens: 4096,
  compat: {
    supportsStore: false,
    supportsDeveloperRole: false,
    supportsReasoningEffort: false,
    supportsUsageInStreaming: true,
    maxTokensField: 'max_tokens',
    // ... more compatibility flags
  },
};
```
The `compat` field is genius — it lets pi-ai handle the quirks of each provider (Ollama doesn't support the developer role, Anthropic uses `max_tokens` differently, etc.) without you having to care.
### 3. Built-in Token Tracking and Cost Estimation

The `done` event from pi-ai includes `usage` with `input`, `output`, `totalTokens`, and `cost`. For cloud providers, this gives you real cost numbers. For Ollama, cost is $0.00 — but the token counts are still invaluable for comparing model verbosity and efficiency.
### 4. Provider Swappability

Because pi-ai abstracts the provider, the same evaluation engine could work with OpenAI, Anthropic, or Groq just by swapping the `Model` definition. No code changes needed. This is the foundation for one of the most exciting enhancement possibilities: multi-provider benchmarking.
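To make the swap concrete, here's a sketch of what a cloud-provider definition might look like. The field names follow the `Model` shape shown earlier in this post; the model id, base URL, and cost figures below are illustrative placeholders I chose, not values from the project:

```typescript
// Hypothetical: the same Model shape from the Ollama example, pointed at an
// OpenAI-compatible cloud endpoint instead of localhost. The id, baseUrl,
// and per-token costs are illustrative placeholders.
const groqModel = {
  id: 'llama-3.1-8b-instant',
  name: 'llama-3.1-8b-instant',
  api: 'openai-completions',
  provider: 'groq',
  baseUrl: 'https://api.groq.com/openai/v1',
  reasoning: false,
  input: ['text'],
  cost: { input: 0.05, output: 0.08, cacheRead: 0, cacheWrite: 0 }, // illustrative $/1M tokens
  contextWindow: 128000,
  maxTokens: 4096,
} as const;

// The evaluation loop stays identical — only the model object (and a real
// API key in place of the 'ollama' placeholder) changes.
console.log(`${groqModel.provider} → ${groqModel.baseUrl}`);
```

The only moving parts are data, which is exactly what makes multi-provider benchmarking a small step rather than a rewrite.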
## Architecture Deep Dive
Here's how all the pieces fit together:
```
┌─────────────────────────────────────────────────────────┐
│                        llm-eval                         │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │ CLI (REPL)  │  │ Web Server   │  │ Web Frontend  │   │
│  │ index.ts    │  │ server.ts    │  │ public/       │   │
│  │ session.ts  │  │ Express +    │  │ HTML/CSS/JS   │   │
│  │             │  │ SSE          │  │ marked.js     │   │
│  └──────┬──────┘  └──────┬───────┘  └───────────────┘   │
│         │                │                              │
│         └───────┬────────┘                              │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │      Evaluation Engine       │                       │
│  │  evaluator.ts   metrics.ts   │                       │
│  │  table.ts       types.ts     │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │       Model Discovery        │                       │
│  │          models.ts           │                       │
│  │  `ollama list` → pi-ai Model │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │     @mariozechner/pi-ai      │                       │
│  │  streamSimple() → SSE events │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │     Ollama (localhost)       │                       │
│  │  OpenAI-compatible /v1 API   │                       │
│  └──────────────────────────────┘                       │
└─────────────────────────────────────────────────────────┘
```
### Layer 1: Model Discovery (`models.ts`)
The entry point is `detectOllamaModels()`, which shells out to `ollama list`:

```typescript
const output = execSync('ollama list', { encoding: 'utf-8', timeout: 10000 });
const lines = output.trim().split('\n');
return lines.slice(1).map((line) => line.trim().split(/\s+/)[0]);
```
Simple but effective. It parses the first column (the model name) from each row, skipping the header. The resulting names, like `llama3.2:latest`, are then wrapped into pi-ai `Model` objects via `createOllamaModel()`.
Error handling is thoughtful — it distinguishes between "Ollama not installed" (`ENOENT`), "Ollama not running" (`ECONNREFUSED`), and generic failures, giving actionable troubleshooting guidance in each case.
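The parsing step is easy to isolate and test. As a minimal sketch (`parseOllamaList` is a name I'm inventing here, not the project's actual helper), with a fabricated sample of `ollama list` output:

```typescript
// Hypothetical helper (not the project's actual function name): parse the
// table printed by `ollama list` into model names, skipping the header row.
function parseOllamaList(output: string): string[] {
  return output
    .trim()
    .split('\n')
    .slice(1) // drop the "NAME  ID  SIZE  MODIFIED" header
    .map((line) => line.trim().split(/\s+/)[0])
    .filter((name) => name.length > 0);
}

// Fabricated sample output for illustration
const sample = [
  'NAME               ID              SIZE      MODIFIED',
  'llama3.2:latest    a80c4f17acd5    2.0 GB    2 days ago',
  'qwen3:4b           e55aed6fe643    2.6 GB    5 hours ago',
].join('\n');

console.log(parseOllamaList(sample)); // → [ 'llama3.2:latest', 'qwen3:4b' ]
```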
### Layer 2: Evaluation Engine (`evaluator.ts`)
This is the core. `evaluateModel()` does four things:

1. Starts a timer with `performance.now()`
2. Calls `streamSimple()` from pi-ai
3. Captures time-to-first-token when the first `text_delta` event arrives
4. Computes metrics when the `done` event fires
```typescript
export async function evaluateModel(
  model: Model<'openai-completions'>,
  context: Context,
  options?: SimpleStreamOptions
): Promise<EvaluationResult> {
  const startTime = performance.now();
  let firstTokenTime: number | null = null;
  let fullOutput = '';

  const eventStream = streamSimple(model, context, { apiKey: 'ollama', ...options });

  for await (const event of eventStream) {
    switch (event.type) {
      case 'text_delta':
        if (firstTokenTime === null) firstTokenTime = performance.now();
        fullOutput += event.delta;
        process.stdout.write(event.delta);
        break;
      case 'done': {
        const endTime = performance.now();
        const totalLatencyMs = Math.round(endTime - startTime);
        // If no text_delta ever arrived, fall back to total latency for TTFB
        const ttfbMs = Math.round((firstTokenTime ?? endTime) - startTime);
        const tps = (event.message.usage.output / totalLatencyMs) * 1000;
        return { modelId: model.id, output: fullOutput, totalLatencyMs, ttfbMs, tps /* ... */ };
      }
    }
  }
  throw new Error('Stream ended without a done event');
}
```
The `evaluateAllModels()` function supports both sequential and concurrent evaluation modes.
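The trade-off between the two modes can be sketched generically. The names below (`runSequential`, `runConcurrent`, `demoTasks`) are mine, standing in for the real `evaluateAllModels()` wrapping `evaluateModel()`:

```typescript
// Illustrative sketch of the two evaluation modes, using plain async tasks
// in place of real model calls (function names here are mine, not the project's).
async function runSequential<T>(tasks: Array<() => Promise<T>>): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task()); // one model at a time — fair timing on shared hardware
  }
  return results;
}

async function runConcurrent<T>(tasks: Array<() => Promise<T>>): Promise<T[]> {
  return Promise.all(tasks.map((task) => task())); // all models at once — faster wall clock
}

const demoTasks = ['a', 'b', 'c'].map((id) => async () => id.toUpperCase());
runSequential(demoTasks).then((r) => console.log(r)); // → [ 'A', 'B', 'C' ]
```

Sequential mode keeps latency numbers honest when all models share one GPU; concurrent mode finishes sooner but lets models contend for the same resources, which skews per-model timings.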
### Layer 3: Metrics (`metrics.ts`)
Three utilities:
- `computeTokensPerSecond()` — Simple division: `(outputTokens / latencyMs) * 1000`
- `findBest()` — Finds the model with the best value for a given metric (lowest latency, highest TPS, etc.)
- `computeSimilarityScores()` — Jaccard word-overlap similarity between model outputs. Useful for spotting models that give wildly different answers to the same prompt.
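Jaccard word overlap is just intersection over union of the two word sets. A sketch of the idea (the function name is mine; the real `computeSimilarityScores()` may tokenize differently):

```typescript
// Sketch of Jaccard word-overlap similarity: |A ∩ B| / |A ∪ B| over
// lowercased word sets. Illustrative, not the project's exact implementation.
function jaccardSimilarity(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = words(a);
  const setB = words(b);
  const intersection = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccardSimilarity('the cat sat', 'the cat ran'));
// → 0.5  (shared: "the", "cat"; union has 4 distinct words)
```

A score near 1 means two models gave essentially the same answer; a score near 0 flags a model that went off in a completely different direction.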
### Layer 4: Web Server (`server.ts`)
An Express 5 server with three endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/api/models` | GET | Returns all detected Ollama models |
| `/api/evaluate` | POST | Evaluates one model, returns SSE stream |
| `/api/compare` | POST | Evaluates all models sequentially, returns SSE stream |
The SSE endpoints use Express's `res.write()` to stream events:

```typescript
res.writeHead(200, {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  'Connection': 'keep-alive',
});

for await (const event of eventStream) {
  if (res.closed) break; // Stop if browser disconnected
  switch (event.type) {
    case 'text_delta':
      res.write(`data: ${JSON.stringify({ type: 'text_delta', delta: event.delta })}\n\n`);
      break;
    case 'done':
      res.write(`data: ${JSON.stringify({ type: 'done', metrics: { ... } })}\n\n`);
      break;
  }
}
```
An important debugging lesson: I initially used `req.closed` to detect client disconnection. This breaks immediately on POST requests, because `req.closed` becomes `true` as soon as the request body is consumed — which happens instantly for small JSON payloads. The fix was using `res.closed`, which correctly monitors the response stream's connection. This is one of those subtle bugs that's easy to miss and hard to debug.
### Layer 5: Web Frontend (`public/`)
Pure HTML/CSS/JS — no build step, no framework. The frontend handles:
- Model detection — Fetches `/api/models` and renders clickable chips
- SSE consumption — Uses the Fetch API's `ReadableStream` to parse SSE events
- Live streaming — During streaming, raw text is displayed with `white-space: pre-wrap`
- Markdown rendering — On completion, `marked.parse()` converts raw text to styled HTML, and `highlight.js` applies syntax highlighting to code blocks
- Metrics display — Per-card inline metrics and a comparison table with best-model highlighting
The SSE consumer is hand-written rather than using `EventSource`, because `EventSource` only supports GET requests — we need POST for sending prompts:

```javascript
async function consumeSSEStream(body, handlers) {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split('\n');
    buffer = lines.pop() || '';

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = JSON.parse(line.slice(6).trim());
      // Dispatch to handlers based on data.type
    }
  }
}
```
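The subtle part is the buffering: a network chunk can end mid-line, so the last (possibly partial) line is carried over to the next read. That logic can be isolated into a pure, testable function (`splitSSELines` is an illustrative name, not the project's actual helper):

```typescript
// Sketch isolating the carry-over buffering: split accumulated text into
// complete "data: ..." payloads, returning any trailing partial line so the
// caller can prepend it to the next chunk. Name and shape are illustrative.
function splitSSELines(buffer: string): { payloads: string[]; rest: string } {
  const lines = buffer.split('\n');
  const rest = lines.pop() ?? ''; // last element may be an incomplete line
  const payloads = lines
    .filter((line) => line.startsWith('data: '))
    .map((line) => line.slice(6).trim());
  return { payloads, rest };
}

// A chunk boundary can land mid-event; the partial line is carried forward.
const chunk1 = 'data: {"type":"text_delta","delta":"Hel"}\n\ndata: {"type":"te';
const step1 = splitSSELines(chunk1);
console.log(step1.payloads); // → [ '{"type":"text_delta","delta":"Hel"}' ]
console.log(step1.rest);     // → 'data: {"type":"te'
```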
## The Streaming Pipeline: End to End
Here's what happens when you click "Compare All" with three models selected:
```
Browser ──► Express:  POST /api/compare { prompt: "Explain TCP" }
Express ──► Browser:  SSE: compare_start

For each selected model (Model 1, 2, 3 — sequentially):
  Express ──► Browser:  SSE: model_start
  Express ──► Ollama:   streamSimple(modelN)
  Ollama  ──► Express:  text chunks (streamed)
  Express ──► Browser:  SSE: text_delta × N
  Ollama  ──► Express:  done
  Express ──► Browser:  SSE: model_done { metrics }

Express ──► Browser:  SSE: stream_end
Browser:  renders markdown and shows the metrics table
```
Models are evaluated sequentially in the comparison endpoint. Each model gets its own `model_start` → `text_delta` × N → `model_done` lifecycle. The browser renders each model's output in its own card, streaming text as it arrives, then converting it to rendered markdown when `model_done` fires.
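The browser-side bookkeeping this lifecycle implies can be sketched as a tiny reducer. All names and event shapes below are illustrative, inferred from the events described above — not the project's actual frontend code:

```typescript
// Sketch of the per-model state the comparison stream drives: one output
// buffer per model, appended on text_delta, finalized on model_done.
type CompareEvent =
  | { type: 'model_start'; modelId: string }
  | { type: 'text_delta'; modelId: string; delta: string }
  | { type: 'model_done'; modelId: string };

function applyEvent(state: Map<string, { text: string; done: boolean }>, event: CompareEvent) {
  switch (event.type) {
    case 'model_start':
      state.set(event.modelId, { text: '', done: false });
      break;
    case 'text_delta':
      state.get(event.modelId)!.text += event.delta; // append streamed chunk
      break;
    case 'model_done':
      state.get(event.modelId)!.done = true; // time to swap raw text for rendered markdown
      break;
  }
  return state;
}

const state = new Map<string, { text: string; done: boolean }>();
const events: CompareEvent[] = [
  { type: 'model_start', modelId: 'llama3.2:latest' },
  { type: 'text_delta', modelId: 'llama3.2:latest', delta: 'TCP is ' },
  { type: 'text_delta', modelId: 'llama3.2:latest', delta: 'reliable.' },
  { type: 'model_done', modelId: 'llama3.2:latest' },
];
events.forEach((e) => applyEvent(state, e));
console.log(state.get('llama3.2:latest')); // → { text: 'TCP is reliable.', done: true }
```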
## Metrics That Matter
For each model evaluation, llm-eval captures:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Latency | Total time from request to last token | Overall speed |
| TTFB | Time to first byte/token | Perceived responsiveness |
| TPS | Output tokens per second | Raw generation speed |
| Input Tokens | Tokens in the prompt | Prompt efficiency |
| Output Tokens | Tokens generated | Response verbosity |
| Total Tokens | Input + Output | Resource consumption |
| Cost | Estimated cost (provider-dependent) | Budget planning |
| Similarity | Jaccard word overlap (CLI only) | Output consistency |
In comparison mode, the best performer for each metric is automatically highlighted — green for lowest latency/TTFB, cyan for highest TPS.
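Picking the winner per metric is a one-liner over the results array. This sketch and its signature are my guess at the idea behind `findBest()`, not the project's exact code:

```typescript
// Illustrative best-model selection: pick the result with the lowest (or
// highest) value for a named numeric metric. Types and names are mine.
interface EvalMetrics {
  modelId: string;
  totalLatencyMs: number;
  tps: number;
}

function findBest(
  results: EvalMetrics[],
  metric: 'totalLatencyMs' | 'tps',
  direction: 'lowest' | 'highest'
): EvalMetrics | undefined {
  return results.reduce<EvalMetrics | undefined>((best, current) => {
    if (!best) return current;
    const better =
      direction === 'lowest' ? current[metric] < best[metric] : current[metric] > best[metric];
    return better ? current : best;
  }, undefined);
}

// Fabricated numbers for illustration
const results: EvalMetrics[] = [
  { modelId: 'llama3.2:latest', totalLatencyMs: 4200, tps: 62 },
  { modelId: 'qwen3:4b', totalLatencyMs: 3100, tps: 75 },
];
console.log(findBest(results, 'totalLatencyMs', 'lowest')?.modelId); // → qwen3:4b
console.log(findBest(results, 'tps', 'highest')?.modelId);           // → qwen3:4b
```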
## What I'd Build Next
If I were to continue evolving this project, here's my priority list:
- Multi-provider support — pi-ai already supports OpenAI, Anthropic, Google, Groq, etc. Adding a provider dropdown would let you compare local Ollama models against cloud APIs.
- Automated scoring with a judge model — Send all outputs to GPT-4 or Claude with a rubric, get numerical scores. True apples-to-apples comparison.
- Prompt regression suites — Save a set of prompts as a "test suite," run them after each model update, track quality over time.
- Streaming latency charts — Real-time visualization of token arrival rate. Some models front-load tokens, others have a steady drip.
- Docker Compose — `docker compose up` → Ollama + llm-eval running, zero setup.
## Run It Yourself
```bash
# Prerequisites: Node.js ≥ 20, Ollama with at least one model
ollama pull llama3.2
ollama pull qwen3:4b   # Optional — more models = more fun

git clone https://github.com/harishkotra/llm-eval.git
cd llm-eval
npm install
npm run server
# → http://localhost:3000
```
The CLI is available via `npm start` if you prefer terminal-native workflows.
Huge shout-out to Mario Zechner and his pi-ai library. Without it, this project would have been 3x the code and 10x the headache. If you're building anything with LLMs in TypeScript, seriously check it out — it's the best provider abstraction layer I've worked with.
The full source code is available at github.com/harishkotra/llm-eval. Star it, fork it, break it, improve it.