How I built a multi-model Ollama comparison tool with zero dependencies

greedy — Tue, 26 May 2026 10:31:55 +0000

The problem

I kept opening multiple terminal tabs running ollama run to compare
how different local models handled the same prompt. Copy-pasting between
terminals doesn't scale. I needed a proper comparison tool — but everything
I found was either a heavy web UI or required a dozen pip packages.

The approach

Build a curses TUI that talks to Ollama directly. Python standard library
only — urllib for the API, curses for the interface, json/re for
parsing. One file. No pip install.

How it works

Side-by-side streaming. The streaming viewer renders tokens from
multiple models simultaneously in a curses grid. Each model gets a panel
with live stats (elapsed time, tokens generated, tokens/second).
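
As a rough illustration of the per-panel stats, here is a minimal sketch that feeds newline-delimited JSON chunks into a counter object. It is not the code from prompter.py: PanelStats and consume_stream are hypothetical names, the "response" and "done" fields follow Ollama's generate endpoint, and counting one chunk as one token is an approximation.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class PanelStats:
    """Live counters for one model panel (hypothetical helper)."""
    started: float = field(default_factory=time.monotonic)
    tokens: int = 0
    text: str = ""

    def elapsed(self) -> float:
        return time.monotonic() - self.started

    def tokens_per_second(self) -> float:
        secs = self.elapsed()
        return self.tokens / secs if secs > 0 else 0.0


def consume_stream(lines, stats):
    """Parse NDJSON lines (str or bytes), update stats, yield each text piece."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        piece = chunk.get("response", "")
        if piece:
            stats.tokens += 1  # assumption: one chunk is roughly one token
            stats.text += piece
            yield piece
        if chunk.get("done"):
            break  # the final chunk marks the end of the stream
```

The response object returned by urllib.request.urlopen() iterates line by line, so it could be passed straight in as lines. A viewer would then redraw the panel after each yielded piece.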

Evaluation modes beyond comparison. Simple side-by-side viewing wasn't
enough, so I added structured evaluation modes:

RALPH: self-review loop. The model answers, then critiques and improves
its own answer. Uses cosine similarity via Ollama embeddings to detect
convergence (stops when the model can't meaningfully improve).

Council: multi-persona debate. Three personas (Domain Expert,
Skeptic, Devil's Advocate) debate a question across rounds, then a
synthesis verdict follows. Each persona can run on a different model.

Tribunal: adversarial cross-examination. A defender model makes
claims, and a prosecutor model challenges them. An arbiter model rules
on each challenge.
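
The convergence check in RALPH mode can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, the 0.95 threshold is an arbitrary example value, and the embedding vectors are assumed to have already been fetched from Ollama.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def has_converged(prev_vec, curr_vec, threshold=0.95):
    """True when two consecutive answers embed almost identically.

    threshold is an assumed example value, not taken from the tool.
    """
    return cosine_similarity(prev_vec, curr_vec) >= threshold
```

A review loop would embed each new answer, compare it with the previous one, and stop once has_converged() returns True or a round limit is hit.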

Benchmark mode. Runs 20 standardised tests against any model:
counting, arithmetic, web search, URL fetching, shell commands, Python
execution, file reading, RSS parsing, embeddings, and more. Each test
checks if the model called the right tool or produced the right output.
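
One way to express such a check is shown below. It is a sketch only: BenchCase, passed, and the tool name web_search are hypothetical, and the message shape (a "content" string plus an optional "tool_calls" list with "function" entries) assumes Ollama's chat response format.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchCase:
    """One benchmark test (hypothetical structure)."""
    name: str
    prompt: str
    expect_tool: Optional[str] = None     # tool the model should call
    expect_pattern: Optional[str] = None  # regex the answer should match


def passed(case, message):
    """Check an assistant message dict against the case's expectations."""
    if case.expect_tool is not None:
        calls = message.get("tool_calls") or []
        called = [c.get("function", {}).get("name") for c in calls]
        if case.expect_tool not in called:
            return False
    if case.expect_pattern is not None:
        if not re.search(case.expect_pattern, message.get("content") or ""):
            return False
    return True
```

An arithmetic case would set only expect_pattern, while a web search case would set only expect_tool.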

Tool orchestration. Models can be given external tools. All of them are
opt-in, and each one is checked at startup:

Web search (SearXNG or DuckDuckGo)
URL fetching with summarisation
Shell command execution (sandboxed)
Python code execution
File reading with configurable root
Safe calculator (AST-based whitelist evaluation)
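
To show what AST-based whitelist evaluation means, here is a minimal sketch of the technique. It is not the calculator from prompter.py: safe_eval, the operator set, and the exponent cap are illustrative choices, and it assumes Python 3.8+, where numeric literals parse to ast.Constant.

```python
import ast
import operator

# Only these operators are allowed; every other node type is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr):
    """Evaluate an arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # type() rather than isinstance() so that bools are rejected too
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            left, right = walk(node.left), walk(node.right)
            if isinstance(node.op, ast.Pow) and abs(right) > 100:
                raise ValueError("exponent too large")  # assumed cap
            return _BIN_OPS[type(node.op)](left, right)
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed expression: " + type(node).__name__)

    return walk(ast.parse(expr, mode="eval"))
```

Names, calls, and attribute access never reach an operator table, so input such as __import__('os') is rejected before anything runs.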

Architecture

The source is modular under src/prompter/ with 12 modules. A bundler
assembles them into a single prompter.py for distribution:

src/prompter/
  header.py       # Shebang + module docstring
  config.py       # Environment variables, constants
  palette.py      # curses colour init + helpers
  tools.py        # Tool definitions, checks, execution
  api.py          # Ollama REST API (urllib)
  utils.py        # Text processing, regex patterns
  tui/
    utils.py      # Safe drawing wrappers
    viewer.py     # Streaming viewer widget
  modes/
    dataclasses.py # CouncilSession, BullySession
    runners.py    # run_*_mode() functions
  markdown.py     # Output file generation
  main.py         # Screen state machine + entry point

The bundler strips relative imports, deduplicates stdlib imports, and
concatenates in dependency order.
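
The three bundling steps can be sketched as follows. This is a simplified, line-based illustration rather than the project's bundler: the bundle function and its input shape are hypothetical, and it does not handle multi-line parenthesised imports.

```python
import re

# Top-level relative imports such as "from .config import HOST"
_RELATIVE = re.compile(r"^from\s+\.+[\w.]*\s+import\s+.+$")
# Top-level absolute imports such as "import json" or "from urllib import request"
_ABSOLUTE = re.compile(r"^(?:import\s+\S.*|from\s+[A-Za-z_][\w.]*\s+import\s+.+)$")


def bundle(sources):
    """Merge (name, source_text) pairs, already in dependency order."""
    imports, seen, bodies = [], set(), []
    for name, text in sources:
        kept = []
        for line in text.splitlines():
            if _RELATIVE.match(line):
                continue  # names will live in the same file after bundling
            if _ABSOLUTE.match(line):
                stmt = line.strip()
                if stmt not in seen:  # hoist each import statement once
                    seen.add(stmt)
                    imports.append(stmt)
                continue
            kept.append(line)
        bodies.append("# --- " + name + " ---\n" + "\n".join(kept).strip("\n"))
    return "\n".join(imports) + "\n\n" + "\n\n".join(bodies) + "\n"
```

Because relative imports are dropped rather than rewritten, the modules must not reuse the same top-level names, and the caller has to supply them in an order where definitions come before their use at import time.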

What I learned

curses is deceptively complex. Safe drawing wrappers are essential:
an out-of-bounds write raises curses.error, and an uncaught error takes
down the whole interface. Every addstr call needs a bounds check.

Token streaming over urllib is straightforward. Ollama's streaming
endpoint sends newline-delimited JSON chunks. You just read lines and
parse each one.

Context windows are a real constraint. In multi-round debate modes,
the message history grows fast. I implemented _truncate_to_context(),
which estimates the token count and, when the history is too long, keeps
the beginning and end of the content and drops the middle.

Cosine similarity works for self-review convergence. By embedding
consecutive answers and comparing them, RALPH mode can detect when
further review rounds won't meaningfully change the answer.
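
For the first lesson, a safe drawing wrapper might look like the sketch below. safe_addstr is a hypothetical name, not necessarily what tui/utils.py contains, and it measures width with len(), which is only correct for single-width characters.

```python
import curses


def safe_addstr(win, y, x, text, attr=0):
    """Draw text clipped to the window; return False instead of raising."""
    max_y, max_x = win.getmaxyx()
    if y < 0 or y >= max_y or x >= max_x:
        return False  # entirely off-screen
    if x < 0:
        text = text[-x:]  # drop the part left of the window
        x = 0
    # Leave the last column unused: writing into the bottom-right cell
    # raises curses.error even though the character is drawn.
    text = text[: max_x - x - 1]
    if not text:
        return False
    try:
        win.addstr(y, x, text, attr)
    except curses.error:
        return False  # e.g. the terminal was resized between the check and the draw
    return True
```

Routing every draw call through one function like this keeps the bounds logic in a single place, and the boolean return lets callers ignore failures during a resize.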
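
For the context window lesson, here is a minimal sketch of middle truncation. It is an illustration, not the body of _truncate_to_context(): estimate_tokens and truncate_middle are hypothetical names, and the four-characters-per-token ratio is a rough assumption that varies by model and language.

```python
CHARS_PER_TOKEN = 4  # assumed average, not a property of any tokenizer


def estimate_tokens(text):
    """Rough token estimate from character count."""
    if not text:
        return 0
    return max(1, len(text) // CHARS_PER_TOKEN)


def truncate_middle(text, max_tokens, marker="\n[... truncated ...]\n"):
    """Keep the start and end of text, replacing the middle with a marker."""
    if estimate_tokens(text) <= max_tokens:
        return text
    budget = max_tokens * CHARS_PER_TOKEN - len(marker)  # characters to keep
    if budget <= 0:
        return text[: max_tokens * CHARS_PER_TOKEN]  # no room for a marker
    head = budget // 2
    tail = budget - head  # always >= 1 here, so text[-tail:] is safe
    return text[:head] + marker + text[-tail:]
```

Keeping both ends preserves the original question at the start and the most recent turns at the end, which are usually the parts a debate round needs most.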

Try it

wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py

Requires Python 3.7+ and Ollama running locally. That's it.

GitHub: https://github.com/whonixnetworks/prompter

DEV Community: greedy
