Frank Brsrk

Eval workflow for agentic builders: fork any prompt through baseline vs. scaffolded agents, with a blind third-party judge

I built an n8n eval workflow that A/B tests any prompt through plain GPT-4o vs. GPT-4o plus a reasoning scaffold, judged by a blind Gemini evaluator.

Solo founder here. I've been building a cognitive infrastructure API (Ejentum) and needed a way for builders to evaluate it on their own agent tasks instead of trusting my benchmarks. So I published the eval as an n8n workflow.

What it is
A three-agent n8n workflow. You paste any prompt into the chat trigger, and it fans out to two otherwise identical GPT-4o agents: one plain, one with an Ejentum reasoning scaffold injected via an HTTP tool. A blind Gemini Flash evaluator scores both responses on five dimensions (specificity, posture, depth, actionability, honesty) and returns structured JSON with a verdict.

The evaluator is allowed to return "tie" and regularly does. The point is that you test on your own tasks and decide for yourself.
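For a concrete picture of what comes back, here is a rough sketch of the verdict as a TypeScript type. The field names are inferred from the five dimensions above, not the workflow's exact schema:

```typescript
// Rough shape of the blind evaluator's verdict. Field names are inferred
// from the five scoring dimensions; the workflow's actual schema may differ.
interface DimensionScores {
  specificity: number;
  posture: number;
  depth: number;
  actionability: number;
  honesty: number;
}

interface EvalVerdict {
  a: DimensionScores; // plain GPT-4o
  b: DimensionScores; // GPT-4o + scaffold
  verdict: "A" | "B" | "tie"; // "tie" is a legitimate, common outcome
  rationale: string;
}
```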

What it's actually testing
Whether the cognitive scaffold changes output posture on a given task at all

Whether the scaffolded agent engages the specific claims in your prompt or stays generic

How the scaffold affects sycophancy, depth, and diagnostic procedure

Whether different harness modes (reasoning, anti-deception, memory, code) stress different task types. The mode is editable in the HTTP tool's JSON body; a sketch of that call follows
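To show where the mode lives, here is that HTTP tool call written out as plain TypeScript. The endpoint path, header name, and body field names are my illustrative assumptions, not the documented Ejentum API; check the imported workflow's HTTP tool node for the real values.

```typescript
// Sketch of the HTTP tool call that fetches the scaffold. Endpoint path,
// header name, and body fields are assumptions -- the imported workflow's
// HTTP tool node has the real values.
const userPrompt = "Our churn doubled after a price increase; fix my onboarding.";

const res = await fetch("https://api.ejentum.com/v1/scaffold", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "X-Api-Key": process.env.EJENTUM_API_KEY ?? "", // the Header Auth credential in n8n
  },
  body: JSON.stringify({
    mode: "reasoning", // swap for "anti-deception", "memory", or "code"
    prompt: userPrompt, // the prompt pasted into the chat trigger
  }),
});
const scaffold = await res.json(); // injected as system context on branch B
```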

The difference is often subtle on easy prompts and more pronounced on dual-load prompts (mixed emotional and cognitive claims), advice prompts with a buried false premise, or multi-variable causal reasoning. Low-complexity single-turn tasks often produce ties because GPT-4o handles them well without a scaffold.

Where you might apply this pattern
Customer support agents: test whether the scaffold reduces rubber-stamping and increases specificity on customer complaints

Code review or diagnostic agents: test whether it catches the failure modes you actually care about

Content or research workflows: test whether it reduces generic output on your topics

Multi-agent systems: wrap any single agent call in the fork to see the effect before integrating permanently (see the sketch after this list)

Prompt engineering A/B tests: measure the effect of a cognitive layer against your own prompt iterations
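If you'd rather read the fork as code than as an n8n canvas, here is a minimal TypeScript sketch of the same pattern using the official OpenAI SDK. `getScaffold()` and `judge()` are hypothetical stand-ins for the Ejentum HTTP call above and the blind Gemini evaluator; nothing here is the repo's actual implementation.

```typescript
// Minimal sketch of the A/B fork outside n8n. Only the OpenAI calls are real
// API usage; getScaffold() and judge() are hypothetical stubs for the
// Ejentum HTTP tool and the blind Gemini Flash evaluator.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

declare function getScaffold(prompt: string): Promise<string>; // Ejentum call (stub)
declare function judge(a: string, b: string): Promise<"A" | "B" | "tie">; // blind evaluator (stub)

async function runFork(prompt: string): Promise<"A" | "B" | "tie"> {
  const scaffold = await getScaffold(prompt);

  // Fan out: same model, same user prompt; the only difference is the
  // injected scaffold on branch B.
  const [plain, scaffolded] = await Promise.all([
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
    }),
    openai.chat.completions.create({
      model: "gpt-4o",
      messages: [
        { role: "system", content: scaffold },
        { role: "user", content: prompt },
      ],
    }),
  ]);

  // The judge sees both answers but never which branch produced them.
  return judge(
    plain.choices[0].message.content ?? "",
    scaffolded.choices[0].message.content ?? "",
  );
}
```

The TypeScript port in the repo (linked below) is where to look for a real implementation; this sketch only shows the shape of the fork.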

Setup
Import Reasoning_Harness_Eval_Workflow.json

Set three credentials: OpenAI (both producer agents), Google Gemini (blind evaluator), and Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)

Paste a prompt in the chat trigger

Workflow diagram:
[attach screenshots/eval_workflow.png]

A vs B output from one run:
[attach screenshots/A_vs_B.png]

Blind evaluator verdict JSON from the same run:
[attach screenshots/A_B__blind_eval.png]

Workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor): https://github.com/ejentum/eval

