How I built a multi-model Ollama comparison tool with zero dependencies
The problem
I kept opening multiple terminal tabs running ollama run to compare
how different local models handled the same prompt. Copy-pasting between
terminals doesn't scale. I needed a proper comparison tool — but everything
I found was either a heavy web UI or required a dozen pip packages.
The approach
Build a curses TUI that talks to Ollama directly. Python standard library
only — urllib for the API, curses for the interface, json/re for
parsing. One file. No pip install.
How it works
Side-by-side streaming. The streaming viewer renders tokens from
multiple models simultaneously in a curses grid. Each model gets a panel
with live stats (elapsed time, tokens generated, tokens/second).
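A minimal sketch of the per-panel plumbing, not the code from prompter.py: it assumes Ollama's streaming responses arrive as one JSON object per line, with the text under "response" (for /api/generate) or "message.content" (for /api/chat) and a "done" flag on the last chunk. The names iter_tokens and PanelStats are hypothetical, and counting one token per chunk is an approximation.

```python
import json
import time


def iter_tokens(lines):
    """Yield text fragments from newline-delimited JSON chunks (bytes or str)."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        # /api/generate uses "response"; /api/chat nests text under "message"
        text = chunk.get("response") or chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break


class PanelStats:
    """Live stats for one model panel (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.tokens = 0

    def add(self, n=1):
        self.tokens += n

    @property
    def elapsed(self):
        return self._clock() - self._start

    @property
    def tokens_per_second(self):
        elapsed = self.elapsed
        return self.tokens / elapsed if elapsed > 0 else 0.0
```

The response object returned by urllib.request.urlopen() can be iterated line by line, so it can be passed straight to iter_tokens(); each panel would then call stats.add() per fragment and redraw.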
Evaluation modes beyond comparison. Simple side-by-side viewing wasn't
enough, so I added structured evaluation modes:
- RALPH: self-review loop. The model answers, then critiques and improves
  its own answer. Uses cosine similarity via Ollama embeddings to detect
  convergence (stops when the model can't meaningfully improve).
- Council: multi-persona debate. Three personas (Domain Expert,
  Skeptic, Devil's Advocate) debate a question across rounds, then a
  synthesis verdict follows. Each persona can run on a different model.
- Tribunal: adversarial cross-examination. A defender model makes
  claims, a prosecutor model challenges them. An arbiter model rules on
  each challenge.
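The RALPH convergence check can be sketched in a few lines of pure Python. This is an illustration rather than the tool's own code: the function names and the 0.98 threshold are assumptions, and the embedding vectors are expected to come from Ollama's embeddings endpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as "not similar"
    return dot / (norm_a * norm_b)


def has_converged(prev_embedding, new_embedding, threshold=0.98):
    """True when two consecutive answers are nearly identical in meaning.

    The threshold is a hypothetical default; tune it per embedding model.
    """
    return cosine_similarity(prev_embedding, new_embedding) >= threshold
```

The review loop would embed each revised answer, compare it with the previous one, and stop as soon as has_converged() returns True or a round limit is reached.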
Benchmark mode. Runs 20 standardised tests against any model:
counting, arithmetic, web search, URL fetching, shell commands, Python
execution, file reading, RSS parsing, embeddings, and more. Each test
checks if the model called the right tool or produced the right output.
Tool orchestration. Models can be armed with external tools — all
opt-in, checked at startup:
- Web search (SearXNG or DuckDuckGo)
- URL fetching with summarisation
- Shell command execution (sandboxed)
- Python code execution
- File reading with configurable root
- Safe calculator (AST-based whitelist evaluation)
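For the last item, an AST-based whitelist evaluator might look roughly like this. It is a sketch under assumptions, not the calculator shipped in prompter.py: the set of allowed operators and the exponent cap are my choices, and it relies on ast.Constant nodes, which ast.parse produces on Python 3.8 and later.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression):
    """Evaluate an arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


def _eval(node):
    # Plain int/float literals only (bool, str, etc. are rejected)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        left, right = _eval(node.left), _eval(node.right)
        # Hypothetical guard against huge results such as 9 ** 9 ** 9
        if isinstance(node.op, ast.Pow) and abs(right) > 100:
            raise ValueError("exponent too large")
        return _BINARY[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
        return _UNARY[type(node.op)](_eval(node.operand))
    raise ValueError(f"disallowed expression: {type(node).__name__}")
```

Names, calls, attribute access and subscripts all fall through to the final ValueError, so an input like `__import__('os')` is never executed. Division by zero still raises ZeroDivisionError, which the caller would need to handle.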
Architecture
The source is modular under src/prompter/ with 12 modules. A bundler
assembles them into a single prompter.py for distribution:
src/prompter/
header.py # Shebang + module docstring
config.py # Environment variables, constants
palette.py # curses colour init + helpers
tools.py # Tool definitions, checks, execution
api.py # Ollama REST API (urllib)
utils.py # Text processing, regex patterns
tui/
utils.py # Safe drawing wrappers
viewer.py # Streaming viewer widget
modes/
dataclasses.py # CouncilSession, BullySession
runners.py # run_*_mode() functions
markdown.py # Output file generation
main.py # Screen state machine + entry point
The bundler strips relative imports, deduplicates stdlib imports, and
concatenates in dependency order.
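To make those three steps concrete, here is a deliberately small sketch of such a bundler. It is not the project's actual bundler: bundle() is a hypothetical name, the caller is assumed to pass sources already in dependency order, and it only understands single-line, unindented import statements (multi-line parenthesised imports would need real parsing).

```python
import re

# "from .config import X", "from . import api", "from ..tui.utils import y"
_RELATIVE = re.compile(r"^from\s+\.[\w.]*\s+import\s+")
# "import json" or "from urllib import request" at column 0
_ABSOLUTE = re.compile(r"^(import\s+\S|from\s+\w[\w.]*\s+import\s+)")


def bundle(sources):
    """Concatenate (name, source_text) pairs into one file's text."""
    imports = []  # unique absolute imports, in first-seen order
    seen = set()
    bodies = []
    for name, text in sources:
        kept = []
        for line in text.splitlines():
            if _RELATIVE.match(line):
                continue  # names resolve directly once everything is in one file
            if _ABSOLUTE.match(line):
                key = line.strip()
                if key not in seen:
                    seen.add(key)
                    imports.append(key)
                continue  # hoisted to the top of the bundle
            kept.append(line)
        bodies.append(f"# --- {name} ---\n" + "\n".join(kept).strip("\n"))
    return "\n".join(imports) + "\n\n\n" + "\n\n\n".join(bodies) + "\n"
```

Indented imports (inside functions) are left in place because both patterns anchor at the start of the line. Dropping relative imports only works if no two modules define the same top-level name, which is worth checking before concatenating.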
What I learned
curses is deceptively complex. Safe drawing wrappers are essential:
out-of-bounds writes crash silently. Every addstr call needs a bounds
check.
Token streaming over urllib is straightforward. Ollama's streaming
endpoint sends newline-delimited JSON chunks. You just read lines and
parse.
Context windows are a real constraint. In multi-round debate modes,
the message history grows fast. I implemented _truncate_to_context(),
which estimates the token count and, when the content is too long,
keeps the start and end of it and cuts from the middle.
Cosine similarity works for self-review convergence. By embedding
consecutive answers and comparing similarity, RALPH mode can detect
when further review rounds won't improve the answer.
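Two of these lessons are easy to show in code. The sketch below is illustrative only and is not lifted from prompter.py: safe_addstr, estimate_tokens and truncate_to_context are hypothetical stand-ins, and the "about four characters per token" heuristic is an assumption.

```python
import curses


def safe_addstr(win, y, x, text, attr=0):
    """Draw text clipped to the window; return False instead of raising."""
    height, width = win.getmaxyx()
    if y < 0 or y >= height or x < 0 or x >= width:
        return False
    clipped = text[: width - x]
    try:
        win.addstr(y, x, clipped, attr)
    except curses.error:
        # curses can still raise, e.g. when a write reaches the bottom-right cell
        return False
    return True


def estimate_tokens(text):
    # Rough heuristic (an assumption): about four characters per token
    return (len(text) + 3) // 4


def truncate_to_context(text, max_tokens, marker="\n[... truncated ...]\n"):
    """Keep the start and end of text, dropping the middle if it is too long."""
    if estimate_tokens(text) <= max_tokens:
        return text
    budget = max_tokens * 4 - len(marker)  # characters left for real content
    if budget <= 0:
        return text[: max_tokens * 4]
    head = budget // 2
    tail = budget - head
    return text[:head] + marker + text[len(text) - tail:]
```

Keeping both ends is a reasonable default for debate transcripts, where the original question sits at the top and the most recent turns sit at the bottom.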
Try it
wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py
Python 3.7+ and Ollama running locally. That's it.