Same model. Same prompt. Baseline tells the patient to eat healthier. With an Ejentum reasoning scaffold injected, the agent asks for a thyroid panel.
That's a real diff from the workflow I'm about to walk you through. The prompt was a medical second-opinion request (a 45-year-old male with pre-diabetic markers, dyslipidemia, and vitamin D deficiency). Both agents were gpt-4o at temperature 0. The only difference: the scaffolded agent had a function-call tool that retrieved a structured reasoning constraint set at runtime and absorbed it before responding.
A blind Gemini Flash judge scored both responses on five dimensions and ruled B superior, 20 to 16. The judge's stated reason:
"Response B is superior because it directly addresses the patient's symptom of 'sluggishness' by linking it to the Vitamin D deficiency and suggesting further diagnostic steps like thyroid testing."
This article is about the Python module that produced that result, why I built it, and how to run it inside your own IDE on your own prompts in about 5 minutes.
## The problem this exists to solve
If you ship agents, you've lived this loop:
- You tweak a system prompt
- Add a tool, swap a model, change phrasing
- The output looks different
- You can't actually tell if it's better, or just rotated
Prompt engineering is mostly intuition. Vendors hand you benchmarks and ask you to trust them. What you actually want is a way to test, on your own task, whether your changes are lifting your agent's reasoning or just dressing it up.
I built this module because I needed that for myself. I'm a solo founder dogfooding Claude Code daily. Every time I added structure to a system prompt, I had no honest way to verify whether the agent was reasoning more carefully or just producing different-shaped slop.
The module gives me a verdict.
## What it does
A Python script (zero third-party dependencies, just stdlib) that:

- Forks any prompt through two identical gpt-4o agents at temperature 0.
- Agent A runs plain: no tools, a strong directive system prompt.
- Agent B runs with the same baseline system prompt PLUS the Ejentum reasoning skill file PLUS a forced function call to the Ejentum Logic API. The agent autonomously crafts the query and picks the harness mode (`reasoning` or `reasoning-multi`) per the skill file's decision table.
- The API returns a structured "cognitive scaffold": a reasoning constraint set with [NEGATIVE GATE], [REASONING TOPOLOGY], [FALSIFICATION TEST], and Suppress/Amplify signals. The agent absorbs it and responds.
- Both responses go to a blind Gemini Flash judge (a different model family from the producers, so no shared-bias contamination). The judge sees neutral "Response A / Response B" labels and never knows which is which.
- The judge returns structured JSON: scores per dimension (specificity, posture, depth, actionability, honesty), totals, justifications, and a verdict (A, B, or tie).
That's it. One prompt in, structured verdict out.
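For concreteness, here's a hedged sketch of what the forced function call on Agent B's turn can look like. The tool name `get_reasoning_scaffold`, its parameter schema, and the helper below are illustrative stand-ins, not the module's actual identifiers; only the pinned `tool_choice` mechanic and the `reasoning` / `reasoning-multi` modes come from the description above.

```python
# Hypothetical sketch of the forced function call that starts Agent B's turn.
# Names here are assumptions; the forced tool_choice and the two harness
# modes are the mechanics described in the article.
SCAFFOLD_TOOL = {
    "type": "function",
    "function": {
        "name": "get_reasoning_scaffold",  # assumed name, not the module's
        "description": "Retrieve a structured reasoning constraint set.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "mode": {"type": "string", "enum": ["reasoning", "reasoning-multi"]},
            },
            "required": ["query", "mode"],
        },
    },
}

def build_agent_b_request(system_prompt: str, user_message: str) -> dict:
    """Build a chat-completions payload that forces the scaffold fetch first."""
    return {
        "model": "gpt-4o",
        "temperature": 0,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "tools": [SCAFFOLD_TOOL],
        # Pinning tool_choice to the tool guarantees the agent cannot skip it.
        "tool_choice": {"type": "function", "function": {"name": "get_reasoning_scaffold"}},
    }

payload = build_agent_b_request("You are a second-opinion clinician.", "Medical Report: ...")
```

Forcing the tool call is what makes the comparison clean: Agent B always retrieves a scaffold, so any difference in output is attributable to the scaffold rather than to whether the model felt like calling the tool.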
## Running it inside Claude Code
Setup, in three steps.
### Step 1: get three API keys
- OpenAI (platform.openai.com/api-keys) for both producer agents
- Google Gemini (aistudio.google.com/app/apikey) for the blind judge
- Ejentum (ejentum.com), 100 free calls, no card required
Set them in your environment:

```shell
export OPENAI_API_KEY=sk-...
export GEMINI_API_KEY=AI...
export EJENTUM_API_KEY=zpka_...
```
### Step 2: clone the module
```shell
git clone https://github.com/ejentum/eval
cd eval/python
```
### Step 3: run it
From the command line, with a prompt of your choice:
```shell
python orchestrator.py "Should we pivot our SaaS to enterprise next quarter?"
```
Or call from Python:
```python
from orchestrator import run_eval

result = run_eval("Should we pivot our SaaS to enterprise next quarter?")
print(result["evaluation"]["verdict"])        # "A" | "B" | "tie"
print(result["evaluation"]["totals"])         # {"A": 16, "B": 20}
print(result["evaluation"]["verdict_reason"]) # one-sentence reason
```
That's the whole interface.
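If you want more than the headline verdict, the same result dict carries the per-dimension scores. A small sketch of a scoreboard printer, using the field names the module returns (the `sample` dict here stands in for a live `run_eval()` call):

```python
# Sketch: a compact per-dimension scoreboard from a run_eval() result.
# Field names ("evaluation", "scores", "totals", "verdict") follow the
# module's output shape; the sample dict stands in for a live call.
def scoreboard(result: dict) -> str:
    ev = result["evaluation"]
    lines = [
        f"{dim:>13}: A={ev['scores']['A'][dim]}  B={ev['scores']['B'][dim]}"
        for dim in ev["scores"]["A"]
    ]
    lines.append(f"{'verdict':>13}: {ev['verdict']} ({ev['totals']['A']} vs {ev['totals']['B']})")
    return "\n".join(lines)

sample = {
    "evaluation": {
        "scores": {
            "A": {"specificity": 3, "posture": 3, "depth": 3, "actionability": 3, "honesty": 4},
            "B": {"specificity": 4, "posture": 4, "depth": 4, "actionability": 4, "honesty": 4},
        },
        "totals": {"A": 16, "B": 20},
        "verdict": "B",
    }
}
print(scoreboard(sample))
```

Seeing which single dimension moved is often more useful than the total: a one-point lift on "honesty" means something different from a one-point lift on "actionability."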
When you run inside Claude Code (or Cursor or Antigravity), you can ask your IDE-agent to do this on your behalf. Tell it: "Run the eval module on this prompt I'm working on." The agent reads the README, runs the script, parses the JSON, and reports back the verdict with the judge's quoted reason. The same way you'd hand a junior engineer a script and ask for the result.
## What you get back
Here's the JSON shape (real output from the medical run linked at the end):
```json
{
  "user_message": "Medical Report: ...",
  "baseline_response": "Based on the laboratory results...",
  "ejentum_response": "The patient's laboratory results indicate...",
  "evaluation": {
    "scores": {
      "A": {"specificity": 3, "posture": 3, "depth": 3, "actionability": 3, "honesty": 4},
      "B": {"specificity": 4, "posture": 4, "depth": 4, "actionability": 4, "honesty": 4}
    },
    "totals": {"A": 16, "B": 20},
    "justifications": {
      "specificity": "Response B is more specific in linking the Vitamin D deficiency to the patient's reported sluggishness and suggesting thyroid function tests to rule out other metabolic disorders.",
      "posture": "Response B is more substantive, challenging the primary physician's general recommendation by suggesting a more comprehensive approach...",
      "depth": "Response B reasons more deeply about the problem...",
      "actionability": "Response B provides more actionable recommendations...",
      "honesty": "Both responses acknowledge the limitations of diet and exercise alone..."
    },
    "verdict": "B",
    "verdict_reason": "Response B is superior because it directly addresses the patient's symptom of 'sluggishness' by linking it to the Vitamin D deficiency and suggesting further diagnostic steps like thyroid testing."
  },
  "scaffold_used": "[NEGATIVE GATE]\nThe analysis stopped at...",
  "tool_call": {
    "query": "Patient is a 45-year-old male reporting sluggishness...",
    "mode": "reasoning-multi"
  }
}
```
You see everything: both responses verbatim, the per-dimension scores, why the judge ruled the way it did, the live scaffold that was injected into Agent B, and the exact query+mode the agent autonomously picked.
Nothing summarized away.
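The totals and verdict roll up mechanically from the per-dimension scores. A minimal sketch of that rollup (strict equality as the tie rule is my assumption; the module's actual tie margin may differ):

```python
# Sketch: how per-dimension judge scores roll up into totals and a verdict.
# Dimension names come from the article; the strict-equality tie rule is an
# assumption, not necessarily the module's exact logic.
DIMENSIONS = ("specificity", "posture", "depth", "actionability", "honesty")

def tally(scores: dict) -> dict:
    totals = {label: sum(dims[d] for d in DIMENSIONS) for label, dims in scores.items()}
    if totals["A"] > totals["B"]:
        verdict = "A"
    elif totals["B"] > totals["A"]:
        verdict = "B"
    else:
        verdict = "tie"
    return {"totals": totals, "verdict": verdict}

rollup = tally({
    "A": {"specificity": 3, "posture": 3, "depth": 3, "actionability": 3, "honesty": 4},
    "B": {"specificity": 4, "posture": 4, "depth": 4, "actionability": 4, "honesty": 4},
})
print(rollup)  # {'totals': {'A': 16, 'B': 20}, 'verdict': 'B'}
```

Because the rollup is this simple, you can recompute it yourself from the published justifications and confirm the verdict wasn't editorialized.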
## Why I designed it this way (transparency choices)
Three things matter when you publish a tool that claims your product is better:
1. **Trace.** You need to see every step. Not "the model improved" but "the model called this tool, received this scaffold, executed this reasoning, scored this on this dimension." This module exposes the full chain.
2. **Auditability.** All three system prompts (baseline, augmented, evaluator) are published as readable markdown in the repo, not buried in code. The Ejentum reasoning skill file the augmented agent receives is bundled. Anyone reading the repo can verify exactly what was given to each agent.
3. **Verifiability.** The judge runs on a different model family from the producers (Gemini vs. OpenAI). It receives only neutral A/B labels. Anyone with API keys can clone the repo, re-run the same script, and compare.
Most "we improved your agent" claims ask you to trust a benchmark someone else ran. This hands you the instrument and lets you run it on your own task.
## What happens when it ties (because it does)
The blind judge is allowed to return "tie" and regularly does.
If your prompt is a low-complexity single-turn task (a simple question, a clear lookup, a known pattern), gpt-4o handles it well without any scaffold. Both responses will be similar. The judge will tie them. That's a real signal, not a failure of the tool.
The scaffold's lift shows on prompts where baseline gpt-4o has a specific failure mode: sycophancy toward authority figures, shallow single-cause framing of multi-cause problems, generic templated responses to specific claims, missing differential diagnosis on ambiguous data.
The medical second-opinion prompt landed in that territory because:
- The patient's reported symptom (sluggishness) was distinct from the lab values, and baseline got distracted by the lab walkthrough.
- The PCP's recommendation was vague enough that baseline had room to either accept or challenge, and baseline accepted.
- The labs cluster into a recognizable metabolic syndrome pattern, but spotting that requires synthesis, not enumeration.
That's the kind of prompt where the scaffold's [NEGATIVE GATE] and Suppress signals do real work. On "what's 2+2", they don't.
If you run this on five of your own prompts and four tie, that doesn't mean the scaffold is broken. It means four of your prompts don't stress the kind of failure mode the scaffold prevents. Run it on harder ones.
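A sketch of that batch workflow: run the module over several of your own prompts and tally the verdicts. The hardcoded verdict list below stands in for live results so the tally logic is visible on its own; the commented-out lines show the real call.

```python
# Sketch: tally verdicts across a batch of eval runs. The hardcoded list
# stands in for live results; real usage needs API keys set.
from collections import Counter

def summarize(verdicts: list[str]) -> Counter:
    """Count A / B / tie outcomes across a batch of runs."""
    return Counter(verdicts)

# Live usage:
#   from orchestrator import run_eval
#   verdicts = [run_eval(p)["evaluation"]["verdict"] for p in prompts]
verdicts = ["tie", "B", "tie", "tie", "B"]  # stand-in outcomes
print(summarize(verdicts))  # Counter({'tie': 3, 'B': 2})
```

A tie-heavy tally like this one isn't noise: it's the tool telling you those prompts sit below the complexity floor where the scaffold has anything to prevent.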
## Try it on a hard prompt
Some categories where I've seen the scaffold lift consistently:
- **Validation traps:** "I think we're fine because [other metric is up]." Baseline often validates; scaffolded names the false framing.
- **Multi-variable causal questions:** "MRR grew but retention dropped, what should I do?" Baseline picks one cause; scaffolded traces the chain.
- **Symptom-vs-lab questions:** anything where the user's stated complaint diverges from the data they provide.
- **Strategic advice with a buried false premise:** "Should I pivot because my best customer said so?" Baseline rubber-stamps; scaffolded probes.
- **Diagnostic prompts with ambiguous evidence:** "My agent fails sometimes, what's wrong?" Baseline guesses; scaffolded asks isolating questions.
If your work involves any of these patterns, the module is worth 5 minutes.
## Links
- Module: github.com/ejentum/eval/tree/main/python
- Worked example, fully replicable: github.com/ejentum/eval/tree/main/various_blind_eval_results/medical-second-opinion
- Ejentum API key (free, 100 calls): ejentum.com



If you've shipped an agent and tried to verify whether a prompt change actually helped, what did you use to check? Eyeballing? A custom eval rig? Vibes? I'm curious what other builders rely on, and whether anything else in this space lets you A/B with a blind third-party judge.
Drop your approach in the comments. I'm especially interested in counter-examples where you found a tool that does this differently. 🔥