TL;DR: I built llm-eval — an open-source tool that auto-detects your locally-installed Ollama models, fires the same prompt at all of them simultaneously, and gives you hard numbers on which one is fastest, most verbose, and cheapest. It has both a CLI and a streaming web UI with markdown rendering. Here's the full technical breakdown.
## The Problem
If you run local LLMs with Ollama, you've probably been in this situation:
"I just pulled three new models. Which one should I actually use?"
You open a terminal, run `ollama run llama3.2`, type a prompt, wait. Then you run `ollama run qwen3:4b`, type the same prompt again, wait. Then you try to remember how fast the first one was. Was it 60 tokens/sec or 70? What was the time to first byte? How many tokens did it output?
It's tedious, error-prone, and unscalable.
I wanted a tool that would:
- Auto-detect every model I've pulled
- Send the same prompt to all of them at once
- Stream responses in real time
- Measure everything — latency, TTFB, tokens/sec, token counts, cost
- Highlight the winner so I can make informed decisions
So I built it.
## Meet llm-eval
llm-eval is a TypeScript tool for evaluating and comparing Ollama models. It ships with two interfaces:
- Interactive CLI — A REPL with commands for switching models, comparing all, exporting results, and managing conversation history
- Web UI — A dark-themed dashboard with SSE streaming, markdown-rendered output, syntax-highlighted code blocks, and a comparison metrics table
The entire project is built on one key library: pi-ai.
## The Secret Weapon: pi-ai
@mariozechner/pi-ai by Mario Zechner is a provider-agnostic TypeScript library for LLM interactions. Think of it as a universal adapter for LLMs — one API that works across 20+ providers: OpenAI, Anthropic, Google, Ollama, Groq, Mistral, and more.
Why pi-ai instead of calling the Ollama API directly?
### 1. Unified Streaming via Async Iterators

pi-ai's `streamSimple()` function returns an async iterable that yields structured events:

```typescript
import { streamSimple } from '@mariozechner/pi-ai';

const eventStream = streamSimple(model, context, { apiKey: 'ollama' });

for await (const event of eventStream) {
  switch (event.type) {
    case 'text_delta':
      // A chunk of text arrived
      process.stdout.write(event.delta);
      break;
    case 'done': {
      // Stream complete — usage stats available
      const { input, output, totalTokens } = event.message.usage;
      const cost = event.message.usage.cost?.total ?? 0;
      break;
    }
    case 'error':
      // Structured error from the provider
      console.error(event.error.errorMessage);
      break;
  }
}
```
This is incredibly clean: no manual SSE parsing, no WebSocket management, no callback hell. Just a `for await...of` loop.
### 2. Structured Model Definitions

pi-ai uses typed `Model` objects that encode everything about a provider:

```typescript
const model: Model<'openai-completions'> = {
  id: 'llama3.2:latest',
  name: 'llama3.2:latest',
  api: 'openai-completions',
  provider: 'ollama',
  baseUrl: 'http://localhost:11434/v1',
  reasoning: false,
  input: ['text'],
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
  contextWindow: 128000,
  maxTokens: 4096,
  compat: {
    supportsStore: false,
    supportsDeveloperRole: false,
    supportsReasoningEffort: false,
    supportsUsageInStreaming: true,
    maxTokensField: 'max_tokens',
    // ... more compatibility flags
  },
};
```
The `compat` field is genius — it lets pi-ai handle the quirks of each provider (Ollama doesn't support the developer role, Anthropic uses `max_tokens` differently, etc.) without you having to care.
### 3. Built-in Token Tracking and Cost Estimation

The `done` event from pi-ai includes `usage` with `input`, `output`, `totalTokens`, and `cost`. For cloud providers, this gives you real cost numbers. For Ollama, cost is $0.00 — but the token counts are still invaluable for comparing model verbosity and efficiency.
### 4. Provider Swappability

Because pi-ai abstracts the provider, the same evaluation engine could work with OpenAI, Anthropic, or Groq just by swapping the `Model` definition. No code changes needed. This is the foundation for one of the most exciting enhancement possibilities: multi-provider benchmarking.
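To make the swap concrete, here's a sketch of what a cloud-provider definition might look like. The field names follow the `Model` shape shown earlier in this post; the model id, base URL, and cost figures below are illustrative placeholders I chose, not values from the project:

```typescript
// Hypothetical: the same Model shape from the Ollama example, pointed at an
// OpenAI-compatible cloud endpoint instead of localhost. The id, baseUrl,
// and per-token costs are illustrative placeholders.
const groqModel = {
  id: 'llama-3.1-8b-instant',
  name: 'llama-3.1-8b-instant',
  api: 'openai-completions',
  provider: 'groq',
  baseUrl: 'https://api.groq.com/openai/v1',
  reasoning: false,
  input: ['text'],
  cost: { input: 0.05, output: 0.08, cacheRead: 0, cacheWrite: 0 }, // illustrative $/1M tokens
  contextWindow: 128000,
  maxTokens: 4096,
} as const;

// The evaluation loop stays identical — only the model object (and a real
// API key in place of the 'ollama' placeholder) changes.
console.log(`${groqModel.provider} → ${groqModel.baseUrl}`);
```

The only moving parts are data, which is exactly what makes multi-provider benchmarking a small step rather than a rewrite.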
## Architecture Deep Dive
Here's how all the pieces fit together:
```
┌─────────────────────────────────────────────────────────┐
│                        llm-eval                         │
│                                                         │
│  ┌─────────────┐  ┌──────────────┐  ┌───────────────┐   │
│  │ CLI (REPL)  │  │ Web Server   │  │ Web Frontend  │   │
│  │ index.ts    │  │ server.ts    │  │ public/       │   │
│  │ session.ts  │  │ Express +    │  │ HTML/CSS/JS   │   │
│  │             │  │ SSE          │  │ marked.js     │   │
│  └──────┬──────┘  └──────┬───────┘  └───────────────┘   │
│         │                │                              │
│         └───────┬────────┘                              │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │      Evaluation Engine       │                       │
│  │  evaluator.ts   metrics.ts   │                       │
│  │  table.ts       types.ts     │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │       Model Discovery        │                       │
│  │          models.ts           │                       │
│  │  `ollama list` → pi-ai Model │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │     @mariozechner/pi-ai      │                       │
│  │  streamSimple() → SSE events │                       │
│  └──────────────┬───────────────┘                       │
│                 ▼                                       │
│  ┌──────────────────────────────┐                       │
│  │     Ollama (localhost)       │                       │
│  │  OpenAI-compatible /v1 API   │                       │
│  └──────────────────────────────┘                       │
└─────────────────────────────────────────────────────────┘
```
### Layer 1: Model Discovery (`models.ts`)
The entry point is `detectOllamaModels()`, which shells out to `ollama list`:

```typescript
const output = execSync('ollama list', { encoding: 'utf-8', timeout: 10000 });
const lines = output.trim().split('\n');
return lines.slice(1).map((line) => line.trim().split(/\s+/)[0]);
```
Simple but effective. It parses the first column (the model name) from each row, skipping the header. The resulting names, like `llama3.2:latest`, are then wrapped into pi-ai `Model` objects via `createOllamaModel()`.
Error handling is thoughtful — it distinguishes between "Ollama not installed" (`ENOENT`), "Ollama not running" (`ECONNREFUSED`), and generic failures, giving actionable troubleshooting guidance in each case.
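The parsing step is easy to isolate and test. As a minimal sketch (`parseOllamaList` is a name I'm inventing here, not the project's actual helper), with a fabricated sample of `ollama list` output:

```typescript
// Hypothetical helper (not the project's actual function name): parse the
// table printed by `ollama list` into model names, skipping the header row.
function parseOllamaList(output: string): string[] {
  return output
    .trim()
    .split('\n')
    .slice(1) // drop the "NAME  ID  SIZE  MODIFIED" header
    .map((line) => line.trim().split(/\s+/)[0])
    .filter((name) => name.length > 0);
}

// Fabricated sample output for illustration
const sample = [
  'NAME               ID              SIZE      MODIFIED',
  'llama3.2:latest    a80c4f17acd5    2.0 GB    2 days ago',
  'qwen3:4b           e55aed6fe643    2.6 GB    5 hours ago',
].join('\n');

console.log(parseOllamaList(sample)); // → [ 'llama3.2:latest', 'qwen3:4b' ]
```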
### Layer 2: Evaluation Engine (`evaluator.ts`)
This is the core. `evaluateModel()` does four things:

1. Starts a timer with `performance.now()`
2. Calls `streamSimple()` from pi-ai
3. Captures time-to-first-token when the first `text_delta` event arrives
4. Computes metrics when the `done` event fires
```typescript
export async function evaluateModel(
  model: Model<'openai-completions'>,
  context: Context,
  options?: SimpleStreamOptions
): Promise<EvaluationResult> {
  const startTime = performance.now();
  let firstTokenTime: number | null = null;
  let fullOutput = '';

  const eventStream = streamSimple(model, context, { apiKey: 'ollama', ...options });

  for await (const event of eventStream) {
    switch (event.type) {
      case 'text_delta':
        if (firstTokenTime === null) firstTokenTime = performance.now();
        fullOutput += event.delta;
        process.stdout.write(event.delta);
        break;
      case 'done': {
        const endTime = performance.now();
        const totalLatencyMs = Math.round(endTime - startTime);
        // If no text_delta ever arrived, fall back to total latency for TTFB
        const ttfbMs = Math.round((firstTokenTime ?? endTime) - startTime);
        const tps = (event.message.usage.output / totalLatencyMs) * 1000;
        return { modelId: model.id, output: fullOutput, totalLatencyMs, ttfbMs, tps /* ... */ };
      }
    }
  }
  throw new Error('Stream ended without a done event');
}
```
The `evaluateAllModels()` function supports both sequential and concurrent evaluation modes.
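The trade-off between the two modes can be sketched generically. The names below (`runSequential`, `runConcurrent`, `demoTasks`) are mine, standing in for the real `evaluateAllModels()` wrapping `evaluateModel()`:

```typescript
// Illustrative sketch of the two evaluation modes, using plain async tasks
// in place of real model calls (function names here are mine, not the project's).
async function runSequential<T>(tasks: Array<() => Promise<T>>): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task()); // one model at a time — fair timing on shared hardware
  }
  return results;
}

async function runConcurrent<T>(tasks: Array<() => Promise<T>>): Promise<T[]> {
  return Promise.all(tasks.map((task) => task())); // all models at once — faster wall clock
}

const demoTasks = ['a', 'b', 'c'].map((id) => async () => id.toUpperCase());
runSequential(demoTasks).then((r) => console.log(r)); // → [ 'A', 'B', 'C' ]
```

Sequential mode keeps latency numbers honest when all models share one GPU; concurrent mode finishes sooner but lets models contend for the same resources, which skews per-model timings.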
### Layer 3: Metrics (`metrics.ts`)
Three utilities:
- `computeTokensPerSecond()` — Simple division: `(outputTokens / latencyMs) * 1000`
- `findBest()` — Finds the model with the best value for a given metric (lowest latency, highest TPS, etc.)
- `computeSimilarityScores()` — Jaccard word-overlap similarity between model outputs. Useful for spotting models that give wildly different answers to the same prompt.
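Jaccard word overlap is just intersection over union of the two word sets. A sketch of the idea (the function name is mine; the real `computeSimilarityScores()` may tokenize differently):

```typescript
// Sketch of Jaccard word-overlap similarity: |A ∩ B| / |A ∪ B| over
// lowercased word sets. Illustrative, not the project's exact implementation.
function jaccardSimilarity(a: string, b: string): number {
  const words = (s: string) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = words(a);
  const setB = words(b);
  const intersection = [...setA].filter((w) => setB.has(w)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

console.log(jaccardSimilarity('the cat sat', 'the cat ran'));
// → 0.5  (shared: "the", "cat"; union has 4 distinct words)
```

A score near 1 means two models gave essentially the same answer; a score near 0 flags a model that went off in a completely different direction.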
### Layer 4: Web Server (`server.ts`)
An Express 5 server with three endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/api/models` | GET | Returns all detected Ollama models |
| `/api/evaluate` | POST | Evaluates one model, returns SSE stream |
| `/api/compare` | POST | Evaluates all models sequentially, returns SSE stream |
The SSE endpoints use Express's `res.write()` to stream events:

```typescript
res.writeHead(200, {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  'Connection': 'keep-alive',
});

for await (const event of eventStream) {
  if (res.closed) break; // Stop if browser disconnected
  switch (event.type) {
    case 'text_delta':
      res.write(`data: ${JSON.stringify({ type: 'text_delta', delta: event.delta })}\n\n`);
      break;
    case 'done':
      res.write(`data: ${JSON.stringify({ type: 'done', metrics: { ... } })}\n\n`);
      break;
  }
}
```
An important debugging lesson: I initially used `req.closed` to detect client disconnection. This breaks immediately on POST requests, because `req.closed` becomes `true` as soon as the request body is consumed — which happens instantly for small JSON payloads. The fix was using `res.closed`, which correctly monitors the response stream's connection. This is one of those subtle bugs that's easy to miss and hard to debug.
### Layer 5: Web Frontend (`public/`)
Pure HTML/CSS/JS — no build step, no framework. The frontend handles:
- Model detection — Fetches `/api/models` and renders clickable chips
- SSE consumption — Uses the Fetch API's `ReadableStream` to parse SSE events
- Live streaming — During streaming, raw text is displayed with `white-space: pre-wrap`
- Markdown rendering — On completion, `marked.parse()` converts raw text to styled HTML, and `highlight.js` applies syntax highlighting to code blocks
- Metrics display — Per-card inline metrics and a comparison table with best-model highlighting
The SSE consumer is hand-written rather than using `EventSource`, because `EventSource` only supports GET requests — we need POST for sending prompts:

```javascript
async function consumeSSEStream(body, handlers) {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    const lines = buffer.split('\n');
    buffer = lines.pop() || '';

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = JSON.parse(line.slice(6).trim());
      // Dispatch to handlers based on data.type
    }
  }
}
```
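The subtle part is the buffering: a network chunk can end mid-line, so the last (possibly partial) line is carried over to the next read. That logic can be isolated into a pure, testable function (`splitSSELines` is an illustrative name, not the project's actual helper):

```typescript
// Sketch isolating the carry-over buffering: split accumulated text into
// complete "data: ..." payloads, returning any trailing partial line so the
// caller can prepend it to the next chunk. Name and shape are illustrative.
function splitSSELines(buffer: string): { payloads: string[]; rest: string } {
  const lines = buffer.split('\n');
  const rest = lines.pop() ?? ''; // last element may be an incomplete line
  const payloads = lines
    .filter((line) => line.startsWith('data: '))
    .map((line) => line.slice(6).trim());
  return { payloads, rest };
}

// A chunk boundary can land mid-event; the partial line is carried forward.
const chunk1 = 'data: {"type":"text_delta","delta":"Hel"}\n\ndata: {"type":"te';
const step1 = splitSSELines(chunk1);
console.log(step1.payloads); // → [ '{"type":"text_delta","delta":"Hel"}' ]
console.log(step1.rest);     // → 'data: {"type":"te'
```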
## The Streaming Pipeline: End to End
Here's what happens when you click "Compare All" with three models selected:
```
Browser ──► Express:  POST /api/compare { prompt: "Explain TCP" }
Express ──► Browser:  SSE: compare_start

For each selected model (Model 1, 2, 3 — sequentially):
  Express ──► Browser:  SSE: model_start
  Express ──► Ollama:   streamSimple(modelN)
  Ollama  ──► Express:  text chunks (streamed)
  Express ──► Browser:  SSE: text_delta × N
  Ollama  ──► Express:  done
  Express ──► Browser:  SSE: model_done { metrics }

Express ──► Browser:  SSE: stream_end
Browser:  renders markdown and shows the metrics table
```
Models are evaluated sequentially in the comparison endpoint. Each model gets its own `model_start` → `text_delta` × N → `model_done` lifecycle. The browser renders each model's output in its own card, streaming text as it arrives, then converting it to rendered markdown when `model_done` fires.
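The browser-side bookkeeping this lifecycle implies can be sketched as a tiny reducer. All names and event shapes below are illustrative, inferred from the events described above — not the project's actual frontend code:

```typescript
// Sketch of the per-model state the comparison stream drives: one output
// buffer per model, appended on text_delta, finalized on model_done.
type CompareEvent =
  | { type: 'model_start'; modelId: string }
  | { type: 'text_delta'; modelId: string; delta: string }
  | { type: 'model_done'; modelId: string };

function applyEvent(state: Map<string, { text: string; done: boolean }>, event: CompareEvent) {
  switch (event.type) {
    case 'model_start':
      state.set(event.modelId, { text: '', done: false });
      break;
    case 'text_delta':
      state.get(event.modelId)!.text += event.delta; // append streamed chunk
      break;
    case 'model_done':
      state.get(event.modelId)!.done = true; // time to swap raw text for rendered markdown
      break;
  }
  return state;
}

const state = new Map<string, { text: string; done: boolean }>();
const events: CompareEvent[] = [
  { type: 'model_start', modelId: 'llama3.2:latest' },
  { type: 'text_delta', modelId: 'llama3.2:latest', delta: 'TCP is ' },
  { type: 'text_delta', modelId: 'llama3.2:latest', delta: 'reliable.' },
  { type: 'model_done', modelId: 'llama3.2:latest' },
];
events.forEach((e) => applyEvent(state, e));
console.log(state.get('llama3.2:latest')); // → { text: 'TCP is reliable.', done: true }
```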
## Metrics That Matter
For each model evaluation, llm-eval captures:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Latency | Total time from request to last token | Overall speed |
| TTFB | Time to first byte/token | Perceived responsiveness |
| TPS | Output tokens per second | Raw generation speed |
| Input Tokens | Tokens in the prompt | Prompt efficiency |
| Output Tokens | Tokens generated | Response verbosity |
| Total Tokens | Input + Output | Resource consumption |
| Cost | Estimated cost (provider-dependent) | Budget planning |
| Similarity | Jaccard word overlap (CLI only) | Output consistency |
In comparison mode, the best performer for each metric is automatically highlighted — green for lowest latency/TTFB, cyan for highest TPS.
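Picking the winner per metric is a one-liner over the results array. This sketch and its signature are my guess at the idea behind `findBest()`, not the project's exact code:

```typescript
// Illustrative best-model selection: pick the result with the lowest (or
// highest) value for a named numeric metric. Types and names are mine.
interface EvalMetrics {
  modelId: string;
  totalLatencyMs: number;
  tps: number;
}

function findBest(
  results: EvalMetrics[],
  metric: 'totalLatencyMs' | 'tps',
  direction: 'lowest' | 'highest'
): EvalMetrics | undefined {
  return results.reduce<EvalMetrics | undefined>((best, current) => {
    if (!best) return current;
    const better =
      direction === 'lowest' ? current[metric] < best[metric] : current[metric] > best[metric];
    return better ? current : best;
  }, undefined);
}

// Fabricated numbers for illustration
const results: EvalMetrics[] = [
  { modelId: 'llama3.2:latest', totalLatencyMs: 4200, tps: 62 },
  { modelId: 'qwen3:4b', totalLatencyMs: 3100, tps: 75 },
];
console.log(findBest(results, 'totalLatencyMs', 'lowest')?.modelId); // → qwen3:4b
console.log(findBest(results, 'tps', 'highest')?.modelId);           // → qwen3:4b
```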
## What I'd Build Next
If I were to continue evolving this project, here's my priority list:
- Multi-provider support — pi-ai already supports OpenAI, Anthropic, Google, Groq, etc. Adding a provider dropdown would let you compare local Ollama models against cloud APIs.
- Automated scoring with a judge model — Send all outputs to GPT-4 or Claude with a rubric, get numerical scores. True apples-to-apples comparison.
- Prompt regression suites — Save a set of prompts as a "test suite," run them after each model update, track quality over time.
- Streaming latency charts — Real-time visualization of token arrival rate. Some models front-load tokens, others have a steady drip.
- Docker Compose — `docker compose up` → Ollama + llm-eval running, zero setup.
## Run It Yourself
```bash
# Prerequisites: Node.js ≥ 20, Ollama with at least one model
ollama pull llama3.2
ollama pull qwen3:4b   # Optional — more models = more fun

git clone https://github.com/harishkotra/llm-eval.git
cd llm-eval
npm install
npm run server
# → http://localhost:3000
```
The CLI is available via `npm start` if you prefer terminal-native workflows.
Huge shout-out to Mario Zechner and his pi-ai library. Without it, this project would have been 3x the code and 10x the headache. If you're building anything with LLMs in TypeScript, seriously check it out — it's the best provider abstraction layer I've worked with.
The full source code is available at github.com/harishkotra/llm-eval. Star it, fork it, break it, improve it.