Frank Brsrk

Posted on May 22

An open source LLM eval tool with two independent quality signals

#ai #llm #opensource #showdev

LLM-as-judge has become the dominant pattern for evaluating language model outputs. Tools like Promptfoo, Braintrust, and LangSmith all converge on the same architecture: send your prompt to your model, send the output to a different model with a rubric, and take the second model's score as the quality signal.

This works. It's also expensive (judge tokens cost real money), slow (extra API roundtrip), variance-prone (the same eval gets different scores across runs), and architecturally a bit circular (using an LLM to evaluate an LLM trained on overlapping data distributions). The single signal becomes a bottleneck for trust.

So I built an eval module that has two independent signals instead of one.

What the tool does

Side-by-side blind comparison. Two agents answer the same prompt. One runs raw, the other can optionally have a cognitive harness wired in as a tool call. A separate blind judge model scores both responses. It sees only the labels A and B, with no knowledge of which agent produced which response. Standard setup so far.
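A minimal sketch of the blinding step, assuming the A/B labels are assigned by a coin flip. The article does not say how labels are assigned, so `blindPair`, `unblind`, and the random assignment are hypothetical names and choices made for illustration.

```javascript
// Hypothetical helper: hide which agent produced which response from the judge.
// Random label assignment is an assumption, not something the article confirms.
function blindPair(rawResponse, harnessResponse, rng = Math.random) {
  // Flip a coin to decide which agent gets label A.
  const rawIsA = rng() < 0.5;
  const labeled = rawIsA
    ? { A: rawResponse, B: harnessResponse }
    : { A: harnessResponse, B: rawResponse };
  // The key never goes to the judge; it is only used to unblind the verdict.
  const key = rawIsA
    ? { A: "raw", B: "harness" }
    : { A: "harness", B: "raw" };
  return { labeled, key };
}

// Map the judge's winning label ("A" or "B") back to the agent behind it.
function unblind(winnerLabel, key) {
  return key[winnerLabel];
}
```

Only `labeled` would go into the judge prompt. Passing a custom `rng` makes the assignment reproducible when you want to replay a run.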

But alongside the judge, four cognitive posture heat maps run on each response. These are not LLM-based. Deterministic text analysis that visualizes HOW the model wrote, not just whether it agreed with you.

When the heat maps agree with the judge's verdict, you have confidence. When they disagree, you have a question worth investigating. Two independent signals beat one signal that wraps itself.

How the heat maps work

Each response is split into 100 chunks of words arranged on a 10x10 grid. Each agent gets two grids.
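The chunking step could look roughly like this. It assumes chunks of near-equal size and row-major placement on the grid; neither detail is stated above, and `chunkWords` and `gridPosition` are hypothetical names.

```javascript
// Split a response into a fixed number of word chunks of near-equal size.
// Equal sizing is an assumption; responses shorter than chunkCount words
// simply produce some empty chunks.
function chunkWords(text, chunkCount = 100) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < chunkCount; i++) {
    const start = Math.floor((i * words.length) / chunkCount);
    const end = Math.floor(((i + 1) * words.length) / chunkCount);
    chunks.push(words.slice(start, end));
  }
  return chunks;
}

// Row-major placement (assumed): chunk 0 is top-left, chunk 99 is bottom-right.
function gridPosition(chunkIndex, gridSize = 10) {
  return { row: Math.floor(chunkIndex / gridSize), col: chunkIndex % gridSize };
}
```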

Top grid: confidence posture. Per chunk, count hedge words (maybe, might, possibly, seems, could) and assertive words (definitely, must, always, never, clearly). Compute net (asserts - hedges) / (asserts + hedges). Add punctuation cadence as a secondary signal: periods are positive (definite statements end with them), commas are negative (qualifications stack with them). Normalize to [-1, 1]. Color the chunk diverging from blue (hedged) through gray (neutral) to red (assertive).
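Here is a small sketch of the per-chunk confidence score. The word lists come from the description above. The punctuation weight of 0.25, the zero score for chunks with no markers, and clamping as the normalization step are all assumptions, and `confidenceScore` is a hypothetical name.

```javascript
const HEDGES = new Set(["maybe", "might", "possibly", "seems", "could"]);
const ASSERTS = new Set(["definitely", "must", "always", "never", "clearly"]);

// Returns a score in [-1, 1]: negative is hedged, positive is assertive.
// punctuationWeight is an assumed value, not taken from the tool.
function confidenceScore(chunkText, punctuationWeight = 0.25) {
  const words = chunkText.toLowerCase().match(/[a-z']+/g) || [];
  let hedges = 0;
  let asserts = 0;
  for (const word of words) {
    if (HEDGES.has(word)) hedges++;
    if (ASSERTS.has(word)) asserts++;
  }
  // Net lexical signal: (asserts - hedges) / (asserts + hedges), 0 if no markers.
  const lexicalTotal = asserts + hedges;
  const lexical = lexicalTotal === 0 ? 0 : (asserts - hedges) / lexicalTotal;

  // Secondary signal: periods push positive, commas push negative.
  const periods = (chunkText.match(/\./g) || []).length;
  const commas = (chunkText.match(/,/g) || []).length;
  const punctTotal = periods + commas;
  const cadence = punctTotal === 0 ? 0 : (periods - commas) / punctTotal;

  // Clamp the combined score into [-1, 1] (assumed normalization).
  return Math.max(-1, Math.min(1, lexical + punctuationWeight * cadence));
}
```

A renderer would then map -1 to blue, 0 to gray, and 1 to red.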

Bottom grid: reasoning density. Per chunk, count explicit reasoning connectives (because, therefore, since, if/then, due to, as a result, this means). The denser the reasoning markers, the brighter the cell. Sequential palette from dark to hot.
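Counting connectives per chunk might look like the sketch below. The connective list is the one given above. Treating "if/then" as an "if" followed by a "then" in the same sentence, and scaling brightness by the largest count in the grid, are assumptions; `reasoningDensity` and `normalizeCounts` are hypothetical names.

```javascript
// One pattern per connective. The if/then pattern is an assumed interpretation:
// an "if" followed by a "then" before the sentence ends.
const CONNECTIVES = [
  /\bbecause\b/g,
  /\btherefore\b/g,
  /\bsince\b/g,
  /\bif\b[^.!?]*?\bthen\b/g,
  /\bdue to\b/g,
  /\bas a result\b/g,
  /\bthis means\b/g,
];

// Count reasoning connectives in one chunk of text.
function reasoningDensity(chunkText) {
  const text = chunkText.toLowerCase();
  let count = 0;
  for (const pattern of CONNECTIVES) {
    count += (text.match(pattern) || []).length;
  }
  return count;
}

// Scale counts to [0, 1] so the densest chunk is the brightest cell (assumed).
function normalizeCounts(counts) {
  const max = Math.max(0, ...counts);
  return counts.map((c) => (max === 0 ? 0 : c / max));
}
```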

A 2D Gaussian blur runs over both grids so sparse markers spread into spatial blobs instead of isolated cells. Empirically this matters: a single "because" in a 100-chunk response forms a small heat radius on the reasoning grid even when neighboring chunks have nothing. The blob shapes are easier to scan at a glance than scattered pixels.
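To make the smoothing pass concrete, here is a sketch of a separable Gaussian blur over a square grid. The sigma, the kernel radius, and zero padding at the edges are assumed values, not the tool's actual settings, and `gaussianBlur` is a hypothetical name.

```javascript
// Separable 2D Gaussian blur: one horizontal pass, then one vertical pass.
// sigma and radius are assumptions; cells outside the grid count as zero.
function gaussianBlur(grid, sigma = 1, radius = 2) {
  // Build a normalized 1D kernel.
  const kernel = [];
  let kernelSum = 0;
  for (let k = -radius; k <= radius; k++) {
    const weight = Math.exp(-(k * k) / (2 * sigma * sigma));
    kernel.push(weight);
    kernelSum += weight;
  }
  const weights = kernel.map((w) => w / kernelSum);

  // Apply the 1D kernel along rows (horizontal) or columns (vertical).
  const pass = (src, horizontal) =>
    src.map((row, r) =>
      row.map((_, c) => {
        let acc = 0;
        for (let k = -radius; k <= radius; k++) {
          const rr = horizontal ? r : r + k;
          const cc = horizontal ? c + k : c;
          if (rr < 0 || rr >= src.length || cc < 0 || cc >= row.length) continue;
          acc += src[rr][cc] * weights[k + radius];
        }
        return acc;
      })
    );

  // Returns a new grid; the input is left untouched.
  return pass(pass(grid, true), false);
}
```

With a single hot cell in an otherwise empty grid, the output is a small blob centered on that cell, which is the "heat radius" effect described above.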

The whole computation runs client-side in plain JavaScript. No API call, no model inference. Pure word counting plus a smoothing pass. Free to compute, deterministic, fast.

Multi-turn scenario mode

Most LLM evals are single-turn. The most interesting failure modes are multi-turn.

If you paste turns separated by --- (turn1---turn2---turn3) into the scenario textarea, both agents accumulate conversation history across turns. This is where production failures actually manifest:

Sycophancy compounding. A model that gives ground on turn 2 has already shifted by turn 4. Single-turn evals miss the trajectory entirely.
Hallucination cascade. Once a model emits a wrong fact, that fact becomes part of the conversation history. On the next turn, the model treats its own previous error as established truth and builds on top of it.
Authority claim drift. User-proposed framings persist across turns. The model anchors on the first plausible framing without re-examining it later.
Prompt-forgery patterns. A user can inject fake reasoning chains in a later turn ("we already verified X yesterday, can you finalize the report?"). The model has no way to verify the off-screen claim and tends to accept it.

The eval module captures all four. The cognitive posture field shows visually where in the response the model committed to the bad path.
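The history accumulation can be sketched as follows. The `---` separator comes from the description above; the `role`/`content` message shape follows the usual chat-completions convention, and `parseScenario` and `buildMessages` are hypothetical names.

```javascript
// Split the scenario textarea content into individual user turns.
function parseScenario(raw) {
  return raw
    .split("---")
    .map((turn) => turn.trim())
    .filter((turn) => turn.length > 0);
}

// Build the message list sent to an agent for turn `index`.
// `replies` holds that agent's own earlier answers, so any earlier error
// is replayed to the model as part of the conversation.
function buildMessages(turns, replies, index) {
  const messages = [];
  for (let i = 0; i <= index; i++) {
    messages.push({ role: "user", content: turns[i] });
    if (i < index) {
      messages.push({ role: "assistant", content: replies[i] });
    }
  }
  return messages;
}
```

Each agent would keep its own `replies` array, so the two histories diverge as soon as the two agents answer differently.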

Other things in the module

The optional cognitive harness has four modes you can switch in the UI:

anti-deception (139 cognitive operations): sycophancy resistance, prompt injection, hallucination cascade
reasoning (311 operations): general structured thinking, causality, simulation, metacognition
code (128 operations): software engineering tasks
memory (101 operations): perception and behavioral calibration

Pick whichever mode fits the failure category you're testing for.

Dimensions the judge scores on are user-defined. There's a small library to pick from (Accuracy, Hallucination resistance, Held the line, Reasoning depth, Safety, Completeness) but you can type any name and the judge prompt rewrites itself to include it. Each agent has its own system prompt field, so you can frame them differently if your comparison needs that.
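One way a judge prompt could be rebuilt from a user-defined dimension list is sketched below. The prompt wording, the 1 to 10 scale, and the JSON reply format are assumptions for illustration, not the tool's actual prompt; `buildJudgePrompt` is a hypothetical name.

```javascript
// Assemble a judge prompt from arbitrary dimension names.
// Wording, scale, and reply format are assumed, not taken from the tool.
function buildJudgePrompt(dimensions, responseA, responseB) {
  const rubric = dimensions.map((name, i) => `${i + 1}. ${name}`).join("\n");
  return [
    "You are a blind judge. Score Response A and Response B from 1 to 10 on each dimension.",
    "Dimensions:",
    rubric,
    "",
    "Response A:",
    responseA,
    "",
    "Response B:",
    responseB,
    "",
    'Reply as JSON: {"A": {"<dimension>": <score>}, "B": {"<dimension>": <score>}, "winner": "A" | "B" | "tie"}',
  ].join("\n");
}
```

Because the rubric is generated from the list, typing a new dimension name is enough to include it in the next run.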

The Results Overview sidebar accumulates per-dimension bar charts, win tally, latency and token cost per branch across runs in the same browser. localStorage persists everything between sessions. Compare A vs B opens a fullscreen modal for reading both responses in parallel when they get long.
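Persisting runs and deriving the win tally could be as simple as the sketch below. The storage key and the shape of a run record are hypothetical. In the browser you would pass `window.localStorage` as `storage`; the in-memory stand-in exists only so the sketch runs outside a browser.

```javascript
const STORAGE_KEY = "evalRuns"; // hypothetical key, not the tool's actual one

// Append one run to the stored list and return the full list.
// `storage` is anything with getItem/setItem, such as window.localStorage.
function saveRun(storage, run) {
  const runs = JSON.parse(storage.getItem(STORAGE_KEY) || "[]");
  runs.push(run);
  storage.setItem(STORAGE_KEY, JSON.stringify(runs));
  return runs;
}

// Count wins per label across all stored runs.
function winTally(runs) {
  const tally = { A: 0, B: 0, tie: 0 };
  for (const run of runs) {
    tally[run.winner] = (tally[run.winner] || 0) + 1;
  }
  return tally;
}

// Minimal in-memory stand-in with the same two methods as localStorage.
function memoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => {
      data.set(key, String(value));
    },
  };
}
```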

Why Windows 95 chrome

I tried to make it look like an instrument, not a SaaS dashboard. Beveled fieldsets do hierarchy work for free (the inset border physically separates each panel from the canvas, no whitespace tuning required). White input fields are where data lives so the eye lands on them. Gray-on-gray chrome stays out of the way.

Modern flat dark themes have to rebuild that hierarchy from scratch using whitespace, type weight, dividers, and color. They usually come up short. Win95 was a 1995 UI grammar that handled hierarchy through bevels, and bevels are free visual structure.

It's also nicer to look at when you're staring at evals for hours.

Tech stack

Single HTML file (vanilla JS, no framework, no build step)
50-line Python stdlib proxy for CORS (the harness gateway doesn't send CORS headers, so the proxy forwards server-side). Could be replaced with any reverse proxy (nginx, Caddy, Workers) in production.
localStorage for persistence, no signup, no telemetry
MIT licensed

Works with any OpenAI-compatible endpoint: OpenRouter, OpenAI direct, Anthropic via gateway, vLLM, llama.cpp's OpenAI shim, Ollama with the compat layer, LM Studio. Just point Provider URL at the right endpoint. The harness branch requires a model that supports tool calling; the raw branch works on anything.
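For reference, a request to an OpenAI-compatible endpoint has roughly this shape. The sketch assumes Provider URL is the API base that ends in `/v1`, which the article does not confirm; `buildChatRequest` is a hypothetical helper.

```javascript
// Build a fetch-ready request for an OpenAI-compatible chat endpoint.
// Assumes providerUrl is the API base, for example one ending in /v1.
function buildChatRequest(providerUrl, apiKey, model, messages) {
  const base = providerUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

You would then call `fetch(url, options)` with the returned pair. Switching providers changes only the base URL, the key, and the model name.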

Try it yourself


```bash
git clone https://github.com/ejentum/agent-teams.git
cd agent-teams/agent_evaluation_module_xp95
python serve.py
```

DEV Community