Built an n8n eval workflow that A/B tests any prompt through plain GPT-4o vs GPT-4o + a reasoning scaffold, judged by a blind Gemini evaluator
Solo founder here. I've been building a cognitive infrastructure API (Ejentum) and needed a way for builders to evaluate it on their own agent tasks instead of trusting my benchmarks. So I published the eval as an n8n workflow.
What it is
A three-agent n8n workflow. You paste any prompt in the chat trigger. The prompt fans out through two identical GPT-4o agents (one plain, one with an Ejentum reasoning scaffold injected via an HTTP tool). A blind Gemini Flash evaluator scores both responses on five dimensions (specificity, posture, depth, actionability, honesty) and returns structured JSON with a verdict.
The evaluator is allowed to return "tie" and regularly does. The point is that you test it on your own tasks and decide for yourself.
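To make the verdict concrete, here is a sketch of what the evaluator's structured JSON can look like, plus a simple tie-aware decision rule. The field names, score scale, and tie band are illustrative assumptions, not the workflow's exact schema:

```typescript
// Illustrative verdict shape; the actual evaluator prompt in the
// workflow may use different field names or a different score scale.
type Dimension = "specificity" | "posture" | "depth" | "actionability" | "honesty";

interface EvalVerdict {
  scoresA: Record<Dimension, number>; // plain GPT-4o
  scoresB: Record<Dimension, number>; // GPT-4o + scaffold
  verdict: "A" | "B" | "tie";
  rationale: string;
}

// Hypothetical decision rule: average the five dimension scores per arm
// and call it a tie when the gap is inside a small band.
function decide(a: number[], b: number[], tieBand = 0.5): "A" | "B" | "tie" {
  const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const diff = avg(b) - avg(a);
  if (Math.abs(diff) <= tieBand) return "tie";
  return diff > 0 ? "B" : "A";
}
```

The tie band is why low-complexity prompts frequently come back "tie": a small per-dimension edge isn't treated as a win.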
What it's actually testing
Whether the cognitive scaffold changes output posture on a given task
Whether the scaffolded agent engages the specific claims in your prompt or stays generic
How the scaffold affects sycophancy, depth, and diagnostic procedure
Whether different harness modes (reasoning, anti-deception, memory, code) stress different task types. Mode is editable in the HTTP tool's JSON body
The difference is often subtle on easy prompts and more pronounced on dual-load prompts (mixed emotional and cognitive claims), advice prompts with a buried false premise, or multi-variable causal reasoning. Low-complexity single-turn tasks often produce ties because GPT-4o handles them well without a scaffold.
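The mode names above come straight from the HTTP tool's JSON body; which mode suits which task is your call. As a starting point, here's an illustrative (not documented) mapping from task text to harness mode:

```typescript
// Mode names are from the post; the keyword heuristics are an
// illustrative sketch, not guidance from the Ejentum docs.
type HarnessMode = "reasoning" | "anti-deception" | "memory" | "code";

function modeForTask(task: string): HarnessMode {
  if (/review|debug|stack trace/i.test(task)) return "code";
  if (/earlier|previous|recall/i.test(task)) return "memory";
  if (/claims?|premise|evidence/i.test(task)) return "anti-deception";
  return "reasoning"; // default harness mode
}
```

In the workflow itself you'd just edit the `mode` field in the HTTP tool's JSON body between runs rather than compute it.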
Where you might apply this pattern
Customer support agents: test whether the scaffold reduces rubber-stamping and increases specificity on customer complaints
Code review or diagnostic agents: test whether it catches the failure modes you actually care about
Content or research workflows: test whether it reduces generic output on your topics
Multi-agent systems: wrap any single agent call in the fork to see the effect before integrating permanently
Prompt engineering A/B tests: measure the effect of a cognitive layer against your own prompt iterations
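The fork pattern itself is workflow-agnostic. A minimal sketch outside n8n, assuming two producer functions and a blind judge you supply (all type names here are illustrative):

```typescript
// Generic A/B fork: the same prompt goes to two producers; the judge
// sees both outputs in randomized order so it can't tell which arm is
// which (the "blind" part), then the result is mapped back.
type Producer = (prompt: string) => Promise<string>;
type Judge = (prompt: string, x: string, y: string) => Promise<"X" | "Y" | "tie">;

async function abFork(
  prompt: string,
  plain: Producer,
  scaffolded: Producer,
  judge: Judge
): Promise<"plain" | "scaffolded" | "tie"> {
  const [a, b] = await Promise.all([plain(prompt), scaffolded(prompt)]);
  const swap = Math.random() < 0.5; // randomize presentation order
  const raw = await judge(prompt, swap ? b : a, swap ? a : b);
  if (raw === "tie") return "tie";
  const pickedFirst = raw === "X";
  // Undo the swap to recover which arm actually won.
  return pickedFirst !== swap ? "plain" : "scaffolded";
}
```

Wrapping an existing agent call as the `plain` producer and the same call plus the scaffold as `scaffolded` gives you the before-integration test described above.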
Setup
Import Reasoning_Harness_Eval_Workflow.json
Set three credentials: OpenAI (both producer agents), Google Gemini (blind evaluator), Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)
Paste a prompt in the chat trigger
Workflow diagram:
[attach screenshots/eval_workflow.png]
A vs B output from one run:
[attach screenshots/A_vs_B.png]
Blind evaluator verdict JSON from the same run:
[attach screenshots/A_B__blind_eval.png]
Workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor): https://github.com/ejentum/eval