DEV Community

greedy

How I built a multi-model Ollama comparison tool with zero dependencies

The problem

I kept opening multiple terminal tabs running ollama run to compare
how different local models handled the same prompt. Copy-pasting between
terminals doesn't scale. I needed a proper comparison tool — but everything
I found was either a heavy web UI or required a dozen pip packages.

The approach

Build a curses TUI that talks to Ollama directly. Python standard library
only — urllib for the API, curses for the interface, json/re for
parsing. One file. No pip install.
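As a rough sketch of what "urllib for the API" can look like (not the actual prompter code), here is a streaming call to Ollama's /api/generate endpoint. The names iter_tokens and stream_generate are mine, and the base URL assumes Ollama's default of localhost:11434.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; adjust for your setup


def iter_tokens(lines):
    """Yield text fragments from an iterable of newline-delimited JSON byte lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk carries done=true


def stream_generate(model, prompt, base_url=OLLAMA_URL):
    """POST a prompt to /api/generate and yield tokens as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        base_url + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object can be iterated line by line.
        yield from iter_tokens(resp)
```

Keeping the parsing in iter_tokens, separate from the network call, makes it possible to exercise it with canned byte lines and no running server.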

How it works

Side-by-side streaming. The streaming viewer renders tokens from
multiple models simultaneously in a curses grid. Each model gets a panel
with live stats (elapsed time, tokens generated, tokens/second).
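One way to carve the terminal into per-model panels is a near-square grid. The sketch below is an assumption about how such a layout could be computed, not the layout prompter actually uses; grid_layout and tokens_per_second are hypothetical helper names.

```python
import math


def grid_layout(n_panels, height, width):
    """Return a (y, x, h, w) tuple for each panel in a near-square grid."""
    if n_panels < 1:
        return []
    cols = math.ceil(math.sqrt(n_panels))
    rows = math.ceil(n_panels / cols)
    panel_h, panel_w = height // rows, width // cols
    return [
        ((i // cols) * panel_h, (i % cols) * panel_w, panel_h, panel_w)
        for i in range(n_panels)
    ]


def tokens_per_second(token_count, elapsed_seconds):
    """Throughput for the stats line; 0.0 until some time has elapsed."""
    return token_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
```

Each tuple maps directly onto curses.newwin(h, w, y, x), so every model can draw into its own window.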

Evaluation modes beyond comparison. Simple side-by-side viewing wasn't
enough, so I added structured evaluation modes:

  • RALPH — self-review loop. Model answers, then critiques and improves
    its own answer. Uses cosine similarity via Ollama embeddings to detect
    convergence (stops when the model can't meaningfully improve).

  • Council — multi-persona debate. Three personas (Domain Expert,
    Skeptic, Devil's Advocate) debate a question across rounds, then a
    synthesis verdict follows. Each persona can run on a different model.

  • Tribunal — adversarial cross-examination. A defender model makes
    claims, a prosecutor model challenges them. An arbiter model rules on
    each challenge.
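The convergence check in RALPH mode boils down to comparing two embedding vectors. A minimal sketch, assuming the embeddings have already been fetched from Ollama as plain lists of floats; the 0.98 threshold and the function names are illustrative guesses, not values taken from prompter.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as "not similar"
    return dot / (norm_a * norm_b)


def has_converged(prev_embedding, curr_embedding, threshold=0.98):
    """True when consecutive answers are close enough to stop the review loop."""
    return cosine_similarity(prev_embedding, curr_embedding) >= threshold
```

A review loop would embed each new answer, compare it with the previous one, and stop once has_converged returns True or a maximum round count is reached.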

Benchmark mode. Runs 20 standardised tests against any model:
counting, arithmetic, web search, URL fetching, shell commands, Python
execution, file reading, RSS parsing, embeddings, and more. Each test
checks if the model called the right tool or produced the right output.

Tool orchestration. Models can be armed with external tools — all
opt-in, checked at startup:

  • Web search (SearXNG or DuckDuckGo)
  • URL fetching with summarisation
  • Shell command execution (sandboxed)
  • Python code execution
  • File reading with configurable root
  • Safe calculator (AST-based whitelist evaluation)
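To illustrate what "AST-based whitelist evaluation" means for the calculator: parse the expression, walk the tree, and refuse any node that is not explicitly allowed. This is a sketch of the general technique rather than prompter's implementation. It assumes Python 3.8+ (where numbers parse as ast.Constant) and deliberately leaves out exponentiation, since something like 9**9**9 can hang the process.

```python
import ast
import operator

# Only these operators are allowed; everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"disallowed expression: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))
```

Names, calls, attribute access and string literals all fall through to the ValueError, so the model cannot smuggle code into a "calculation".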

Architecture

The source is modular under src/prompter/ with 12 modules. A bundler
assembles them into a single prompter.py for distribution:

src/prompter/
  header.py       # Shebang + module docstring
  config.py       # Environment variables, constants
  palette.py      # curses colour init + helpers
  tools.py        # Tool definitions, checks, execution
  api.py          # Ollama REST API (urllib)
  utils.py        # Text processing, regex patterns
  tui/
    utils.py      # Safe drawing wrappers
    viewer.py     # Streaming viewer widget
  modes/
    dataclasses.py # CouncilSession, BullySession
    runners.py    # run_*_mode() functions
  markdown.py     # Output file generation
  main.py         # Screen state machine + entry point

The bundler strips relative imports, deduplicates stdlib imports, and
concatenates in dependency order.
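The three bundler steps can be sketched in a few lines. This is a simplified illustration, not the real bundler: it assumes the modules are already listed in dependency order and that every import sits on a single unindented line (parenthesised multi-line imports would need real parsing). The bundle function and the section-marker comments are my own invention.

```python
import re

RELATIVE_IMPORT = re.compile(r"^from\s+\.[\w.]*\s+import\s+.+$")
ABSOLUTE_IMPORT = re.compile(
    r"^(?:import\s+\S.*|from\s+[A-Za-z_][\w.]*\s+import\s+.+)$"
)


def bundle(modules):
    """Merge (name, source) pairs, given in dependency order, into one file."""
    imports, bodies = [], []
    for name, source in modules:
        kept = []
        for line in source.splitlines():
            if RELATIVE_IMPORT.match(line):
                continue  # those names live in the same file after bundling
            if ABSOLUTE_IMPORT.match(line):
                if line not in imports:
                    imports.append(line)  # hoist once, drop repeats
                continue
            kept.append(line)
        bodies.append(f"# --- {name} ---\n" + "\n".join(kept).strip())
    return "\n".join(imports) + "\n\n" + "\n\n".join(bodies) + "\n"
```

Hoisting the deduplicated imports to the top keeps the output readable and means each module body can be pasted in unchanged.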

What I learned

  1. curses is deceptively complex. Safe drawing wrappers are essential:
    an out-of-bounds write raises an opaque curses.error that takes down the
    whole interface if uncaught. Every addstr call needs a bounds check.

  2. Token streaming over urllib is straightforward. Ollama's streaming
    endpoint sends newline-delimited JSON chunks. You just read lines and
    parse.

  3. Context windows are a real constraint. In multi-round debate modes,
    the message history grows fast. I implemented _truncate_to_context(),
    which estimates the token count and, when the content is too long, keeps
    the beginning and the end while dropping the middle.

  4. Cosine similarity works for self-review convergence. By embedding
    consecutive answers and comparing similarity, RALPH mode can detect
    when further review rounds won't improve the answer.
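For point 3, a head-and-tail truncation could look roughly like this. It is a sketch under stated assumptions, not the _truncate_to_context() from prompter: the four-characters-per-token estimate is a common rule of thumb rather than a real tokenizer, and the marker text is made up.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // chars_per_token) if text else 0


def truncate_to_context(text, max_tokens, chars_per_token=4,
                        marker="\n[... truncated ...]\n"):
    """Keep the start and end of text, dropping the middle if it is too long."""
    if estimate_tokens(text, chars_per_token) <= max_tokens:
        return text
    budget = max_tokens * chars_per_token - len(marker)
    if budget <= 0:
        # No room for the marker; fall back to a plain prefix.
        return text[: max_tokens * chars_per_token]
    head = budget // 2
    tail = budget - head
    return text[:head] + marker + text[len(text) - tail:]
```

Keeping both ends preserves the original question at the start and the most recent turns at the end, which are usually the parts a debate round needs most.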

Try it

wget https://raw.githubusercontent.com/whonixnetworks/prompter/main/prompter.py
chmod +x prompter.py
python3 prompter.py

You need Python 3.7+ and Ollama running locally. That's it.

GitHub: https://github.com/whonixnetworks/prompter
