How do you know if your router is actually better? A code-oriented framework for A/B testing routing logic, not just models.
If you're building second-half routing—routers that reason, search, and learn—you need a way to evaluate them. Not just the models behind the router. The routing logic itself.
This post walks through a concrete, code-oriented design for RouterEval, a small framework for answering questions like:
- "Is Router B actually better than Router A under the same constraints?"
- "Which routing policy gives me the best cost/quality frontier?"
- "How does my router behave under multi-turn, adversarial, or high-risk tasks?"
(This is Part 2. For the routing patterns themselves, see Second-Half Routing: From Traffic Control to Collective Intelligence.)
1. Goals and Core Idea
We're not evaluating individual models. We're evaluating routing policies:
- Router A: simple cost-based model selection
- Router B: semantic routing + strategy tree
- Router C: semantic + verifier + reflexion
Key properties of the harness:
- Same dataset across routers
- Same constraints (budget, latency bounds)
- Logs routed paths, not just final outputs
- Supports online and offline A/B
- Tracks cost/quality/risk tradeoffs
2. Data Model: What You Evaluate On
At minimum, you need a routing eval dataset:
```typescript
type RoutingEvalCase = {
  id: string;
  prompt: string;
  expected?: string; // for tasks with ground truth
  meta?: {
    domain?: string;
    difficulty?: "low" | "medium" | "high";
    risk?: "low" | "medium" | "high";
    type?: "qa" | "summarization" | "generation" | "tooling";
  };
};
```
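A concrete case might look like this (the values are purely illustrative):

```typescript
// Illustrative eval case (made-up values); open-ended tasks simply omit `expected`
const exampleCase: RoutingEvalCase = {
  id: "case-0042",
  prompt: "Summarize the incident report and list the top three root causes.",
  meta: {
    domain: "ops",
    difficulty: "medium",
    risk: "high",
    type: "summarization",
  },
};
```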
And a way to store router decisions and outcomes:
```typescript
type RouterDecisionTrace = {
  routerId: string; // "router-A", "router-B", etc.
  caseId: string;
  strategy: string; // e.g., "SLM_DIRECT", "LLM_WITH_VERIFIER"
  path: string[]; // sequence of nodes / steps taken
  modelCalls: {
    model: string;
    costTokens: number;
    latencyMs: number;
    role: string; // "analysis" | "reasoning" | "verifier" | ...
  }[];
  finalOutput: string;
  metadata?: any; // e.g., self-eval scores, uncertainty estimates
};
```
And evaluation results per run:
```typescript
type RouterEvalResult = {
  routerId: string;
  caseId: string;
  qualityScore: number; // 0-1
  pass: boolean; // thresholded
  costTokensTotal: number;
  latencyTotalMs: number;
  riskFlags?: string[]; // e.g., "hallucination", "policy_violation"
};
```
3. The Router Interface
Define a clear interface so you can plug in different routing policies:
```typescript
type RouteContext = {
  prompt: string;
  meta?: RoutingEvalCase["meta"];
  // optional: session state, user id, etc.
};

type RouterOutput = {
  answer: string;
  trace: RouterDecisionTrace;
};

interface Router {
  id: string;
  route(ctx: RouteContext): Promise<RouterOutput>;
}
```
Example implementation:
```typescript
class SimpleRouter implements Router {
  id = "simple-router";

  async route(ctx: RouteContext): Promise<RouterOutput> {
    const model = pickCheapModel();
    const { output, costTokens, latencyMs } = await callModel(model, ctx.prompt);

    const trace: RouterDecisionTrace = {
      routerId: this.id,
      caseId: "", // filled by harness
      strategy: "CHEAP_DIRECT",
      path: ["CHEAP_DIRECT"],
      modelCalls: [
        { model, costTokens, latencyMs, role: "direct" },
      ],
      finalOutput: output,
    };

    return { answer: output, trace };
  }
}
```
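pickCheapModel and callModel are assumed helpers, not part of any real SDK; their shapes might look roughly like this:

```typescript
// Assumed helper shapes for the sketch above (placeholders, not a real provider API)
function pickCheapModel(): string {
  // e.g., return the cheapest model id from your own registry/config
  return "small-model-v1";
}

async function callModel(
  model: string,
  prompt: string,
): Promise<{ output: string; costTokens: number; latencyMs: number }> {
  const start = Date.now();
  // Call your provider SDK or gateway here; this stub just echoes the prompt.
  const output = `stubbed response from ${model} for: ${prompt.slice(0, 40)}...`;
  return { output, costTokens: 0, latencyMs: Date.now() - start };
}
```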
4. Quality Scoring: Ground Truth + LLM as Judge
You'll typically mix:
- Exact / fuzzy string matching (for deterministic tasks)
- LLM-as-judge scoring (for open-ended tasks)
- Specialized verifiers (factuality, safety, style)
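Only the LLM-as-judge path gets code below, so here is a minimal sketch of the exact/fuzzy bucket; the normalization rules and partial-credit score are assumptions you would tune:

```typescript
// Minimal exact/fuzzy scorer for deterministic tasks (normalization is an assumption)
function scoreExactOrFuzzy(expected: string, answer: string): number {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  const e = norm(expected);
  const a = norm(answer);
  if (a === e) return 1;                          // exact match after normalization
  if (a.includes(e) || e.includes(a)) return 0.8; // crude containment heuristic
  return 0;                                       // hand off to the LLM judge for partial credit
}
```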
4.1 LLM-as-Judge Shape
```typescript
async function scoreWithLLM(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  if (!expected) {
    // no ground truth: score relevance and usefulness
    const judgment = await callLLM("judge-model", {
      system: "Score the answer from 0 to 1 for usefulness and correctness.",
      user: JSON.stringify({ prompt, answer }),
    });
    return parseFloatScore(judgment);
  }

  const judgment = await callLLM("judge-model", {
    system: "Score the answer from 0 to 1 based on how well it matches expected.",
    user: JSON.stringify({ expected, answer }),
  });
  return parseFloatScore(judgment);
}
```
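parseFloatScore is assumed above; a defensive version might look like this (the regex and clamping are assumptions):

```typescript
// Defensive parsing of a judge response into a 0-1 score (assumed helper)
function parseFloatScore(judgment: string): number {
  const match = judgment.match(/\d+(\.\d+)?/); // first number in the judge output
  if (!match) return 0;                        // unparseable judgment counts as a fail
  const value = parseFloat(match[0]);
  return Math.min(1, Math.max(0, value));      // clamp into [0, 1]
}
```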
4.2 Full Evaluation Step for One Router + Case
```typescript
async function evaluateCaseWithRouter(
  router: Router,
  testCase: RoutingEvalCase,
): Promise<{ trace: RouterDecisionTrace; result: RouterEvalResult }> {
  const { answer, trace } = await router.route({
    prompt: testCase.prompt,
    meta: testCase.meta,
  });

  // back-fill caseId
  trace.caseId = testCase.id;

  const qualityScore = await scoreWithLLM(
    testCase.expected,
    answer,
    testCase.prompt,
  );

  const costTokensTotal = trace.modelCalls.reduce(
    (sum, call) => sum + call.costTokens,
    0,
  );
  const latencyTotalMs = trace.modelCalls.reduce(
    (sum, call) => sum + call.latencyMs,
    0,
  );

  const result: RouterEvalResult = {
    routerId: trace.routerId,
    caseId: trace.caseId,
    qualityScore,
    pass: qualityScore >= 0.8, // configurable
    costTokensTotal,
    latencyTotalMs,
    riskFlags: [], // fill via guardrail/safety checks if desired
  };

  return { trace, result };
}
```
5. Offline A/B: Batch Evaluation Across Routers
5.1 Basic Harness
```typescript
async function evaluateRoutersOnDataset(
  routers: Router[],
  dataset: RoutingEvalCase[],
) {
  const allResults: RouterEvalResult[] = [];
  const allTraces: RouterDecisionTrace[] = [];

  for (const testCase of dataset) {
    for (const router of routers) {
      const { trace, result } = await evaluateCaseWithRouter(router, testCase);
      allTraces.push(trace);
      allResults.push(result);
    }
  }

  return { allTraces, allResults };
}
```
5.2 Aggregation and Comparison
```typescript
type AggregatedMetrics = {
  routerId: string;
  passRate: number;
  avgQuality: number;
  avgCostTokens: number;
  avgLatencyMs: number;
  cases: number;
};

function aggregateResults(results: RouterEvalResult[]): AggregatedMetrics[] {
  const byRouter = new Map<string, RouterEvalResult[]>();
  for (const r of results) {
    if (!byRouter.has(r.routerId)) byRouter.set(r.routerId, []);
    byRouter.get(r.routerId)!.push(r);
  }

  const aggregates: AggregatedMetrics[] = [];
  for (const [routerId, list] of byRouter.entries()) {
    const cases = list.length;
    const passRate = list.filter(r => r.pass).length / cases;
    const avgQuality =
      list.reduce((sum, r) => sum + r.qualityScore, 0) / cases;
    const avgCostTokens =
      list.reduce((sum, r) => sum + r.costTokensTotal, 0) / cases;
    const avgLatencyMs =
      list.reduce((sum, r) => sum + r.latencyTotalMs, 0) / cases;

    aggregates.push({
      routerId,
      passRate,
      avgQuality,
      avgCostTokens,
      avgLatencyMs,
      cases,
    });
  }

  return aggregates;
}
```
You can extend this to:
- Group by domain, difficulty, risk
- Compute Pareto frontiers (quality vs cost)
- Highlight cases where routers diverge sharply
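For example, a quality-vs-cost Pareto frontier over the aggregates (the second item above) takes only a few lines; this version treats each router as a single point, but a per-case frontier works the same way:

```typescript
// Routers that no other router beats on both quality and cost form the frontier
function paretoFrontier(aggregates: AggregatedMetrics[]): AggregatedMetrics[] {
  return aggregates.filter(a =>
    !aggregates.some(
      b =>
        b.routerId !== a.routerId &&
        b.avgQuality >= a.avgQuality &&
        b.avgCostTokens <= a.avgCostTokens &&
        // must strictly improve on at least one axis to dominate
        (b.avgQuality > a.avgQuality || b.avgCostTokens < a.avgCostTokens),
    ),
  );
}
```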
6. Online A/B: Shadow Mode for Routers
Offline eval is necessary but not sufficient. You also need to see how routers behave on live traffic.
6.1 Shadow Evaluation Pattern
Production flow:
- Online router (Router A) handles real user request
- In the background, Router B runs on the same input (no user impact)
- Compare results offline with user feedback, cost, and guardrail behavior
```typescript
async function handleUserRequest(prompt: string, userId: string) {
  // primary router
  const primaryRouter = getPrimaryRouter();
  const { answer, trace } = await primaryRouter.route({ prompt });

  // fire-and-forget shadow eval
  void runShadowRouters(prompt, trace, userId);

  return answer;
}

async function runShadowRouters(
  prompt: string,
  primaryTrace: RouterDecisionTrace,
  userId: string,
) {
  const shadowRouters = getShadowRouters();
  for (const router of shadowRouters) {
    const { answer, trace } = await router.route({ prompt });
    await logShadowComparison({
      userId,
      prompt,
      primary: primaryTrace,
      shadow: trace,
      shadowAnswer: answer,
    });
  }
}
```
Later, you can:
- Compare cost/quality distributions
- Identify cases where shadow router clearly dominates
- Build confidence before switching primary
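A minimal offline pass over the shadow log might look like this; ShadowComparison is assumed to mirror what logShadowComparison stored, and the win margin is an arbitrary assumption:

```typescript
// Assumed shape of one logged shadow comparison
type ShadowComparison = {
  userId: string;
  prompt: string;
  primary: RouterDecisionTrace;
  shadow: RouterDecisionTrace;
  shadowAnswer: string;
};

// Score both answers offline and report how often (and at what cost) the shadow wins
async function summarizeShadowLog(log: ShadowComparison[]) {
  let shadowWins = 0;
  let shadowCostDelta = 0;
  const cost = (t: RouterDecisionTrace) =>
    t.modelCalls.reduce((sum, c) => sum + c.costTokens, 0);

  for (const entry of log) {
    const primaryScore = await scoreWithLLM(undefined, entry.primary.finalOutput, entry.prompt);
    const shadowScore = await scoreWithLLM(undefined, entry.shadowAnswer, entry.prompt);
    if (shadowScore > primaryScore + 0.05) shadowWins++; // win margin is an assumption
    shadowCostDelta += cost(entry.shadow) - cost(entry.primary);
  }

  return {
    shadowWinRate: log.length ? shadowWins / log.length : 0,
    avgExtraCostTokens: log.length ? shadowCostDelta / log.length : 0,
  };
}
```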
7. Evaluating Router Behavior, Not Just Outputs
For second-half routing, outputs are not the only thing that matters. You also care about:
- Strategy choice: Did it pick the right path given risk/difficulty?
- Escalation behavior: Did it escalate when it should?
- Guardrail behavior: Did it refuse when policy required?
- Over-spend vs under-spend: Did it burn budget on low-value tasks, or skimp on hard ones?
You can encode these as behavioral assertions.
7.1 Behavioral Checks
```typescript
type BehaviorCheckResult = {
  routerId: string;
  caseId: string;
  checkName: string;
  passed: boolean;
  details?: string;
};

function checkEscalationBehavior(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult {
  const isHighRisk = testCase.meta?.risk === "high";
  const escalated =
    trace.modelCalls.some(call => call.role === "reasoning") &&
    trace.modelCalls.some(call => call.role === "verifier");

  const passed = !isHighRisk || escalated;
  return {
    routerId: trace.routerId,
    caseId: trace.caseId,
    checkName: "HIGH_RISK_ESCALATION",
    passed,
    details: passed
      ? "OK"
      : "High-risk case did not escalate to reasoning + verifier",
  };
}
```
You can write similar checks for:
- "Low-risk tasks should use cheap paths"
- "Guardrails must trigger on forbidden content"
- "Clarification must happen for ambiguous tasks"
7.2 Integrating Behavior Checks
```typescript
function runBehaviorChecks(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult[] {
  return [
    checkEscalationBehavior(trace, testCase),
    // add more checks here...
  ];
}
```
Integrate into `evaluateCaseWithRouter`:
```typescript
const behaviorChecks = runBehaviorChecks(trace, testCase);
// store alongside RouterEvalResult
```
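If you collect the checks across a run, a per-check pass rate is cheap to compute and feeds directly into the reporting in the next section; a sketch:

```typescript
// Pass rate per behavior check across all routers/cases in a run
function summarizeBehaviorChecks(checks: BehaviorCheckResult[]) {
  const byCheck = new Map<string, { passed: number; total: number }>();
  for (const check of checks) {
    const entry = byCheck.get(check.checkName) ?? { passed: 0, total: 0 };
    entry.total++;
    if (check.passed) entry.passed++;
    byCheck.set(check.checkName, entry);
  }
  return [...byCheck.entries()].map(([checkName, { passed, total }]) => ({
    checkName,
    passRate: passed / total,
    total,
  }));
}
```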
Now you're not just measuring what the router answered, but how it behaved.
8. Reporting: What You Actually Look At
The harness should give you:
- Per-router aggregates (pass rate, avg cost, avg latency)
- Per-domain/difficulty breakdowns
- Behavior check summaries (how often policies were respected)
- Case-level diffs where routers disagree strongly
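The last item is cheap to compute from RouterEvalResult alone; here is a sketch that surfaces the cases where two routers disagree most (the router ids are placeholders):

```typescript
// Cases where two routers' quality scores diverge the most
function biggestDisagreements(
  results: RouterEvalResult[],
  routerA: string,
  routerB: string,
  topN = 10,
) {
  // Collect quality scores per case, keyed by router id
  const byCase = new Map<string, Record<string, number | undefined>>();
  for (const r of results) {
    const entry = byCase.get(r.caseId) ?? {};
    entry[r.routerId] = r.qualityScore;
    byCase.set(r.caseId, entry);
  }
  // Rank cases by the absolute quality gap between the two routers
  return [...byCase.entries()]
    .filter(([, s]) => s[routerA] !== undefined && s[routerB] !== undefined)
    .map(([caseId, s]) => ({ caseId, gap: (s[routerA] ?? 0) - (s[routerB] ?? 0) }))
    .sort((a, b) => Math.abs(b.gap) - Math.abs(a.gap))
    .slice(0, topN);
}
```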
Think in terms of:
| Dimension | Question |
|---|---|
| Quality vs Cost | Is Router B worth the extra tokens? |
| Quality vs Latency | Where does strategy tree search become too slow? |
| Behavioral Compliance | Does the router respect risk policies? |
You can dump `AggregatedMetrics` and behavior checks into a basic dashboard (or CSV → notebook):
```typescript
const { allTraces, allResults } = await evaluateRoutersOnDataset(routers, dataset);
const aggregates = aggregateResults(allResults);
// render aggregates + key case diffs
```
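If a notebook is your dashboard, dumping the aggregates to CSV is enough to get started; a sketch using Node's fs (escaping is deliberately naive, since these fields are numeric or simple ids):

```typescript
import { writeFileSync } from "node:fs";

// Naive CSV dump of the per-router aggregates
function writeAggregatesCsv(aggregates: AggregatedMetrics[], path: string) {
  const header = "routerId,passRate,avgQuality,avgCostTokens,avgLatencyMs,cases";
  const rows = aggregates.map(a =>
    [a.routerId, a.passRate, a.avgQuality, a.avgCostTokens, a.avgLatencyMs, a.cases].join(","),
  );
  writeFileSync(path, [header, ...rows].join("\n"));
}

// e.g., writeAggregatesCsv(aggregates, "router-eval-aggregates.csv");
```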
9. Design Principles for a Router Evaluation Harness
- Router-first abstraction: Define a clear `Router` interface with traces as first-class citizens.
- Same dataset, same constraints: Always compare routers under identical conditions.
- Measure behavior, not just outputs: Encode escalation/guardrail policies as behavior checks.
- Use LLM-as-judge wisely: Cache judgments. Use small judges where possible. Don't double-spend.
- Make traces explorable: You will need to inspect paths to debug routing policies.
- Support offline and online modes: Batch eval for design; shadow eval for deployment confidence.
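The "cache judgments" point is worth making concrete: identical (expected, answer, prompt) triples should never hit the judge twice. A minimal in-memory sketch (a persistent store would be the real thing):

```typescript
// In-memory cache keyed on the judge inputs; swap for a persistent store in practice
const judgeCache = new Map<string, number>();

async function scoreWithLLMCached(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  const key = JSON.stringify({ expected, answer, prompt });
  const cached = judgeCache.get(key);
  if (cached !== undefined) return cached;
  const score = await scoreWithLLM(expected, answer, prompt);
  judgeCache.set(key, score);
  return score;
}
```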
What's Next
If you're building multi-model systems, agent frameworks, or safety-critical LLM apps, you need to evaluate the routing layer—not just the models.
This harness gives you a starting point:
- Define your `Router` interface
- Build an eval dataset with risk/difficulty metadata
- Run offline A/B across routing policies
- Add behavior checks for escalation, guardrails, cost discipline
- Deploy shadow routing to build confidence before switching
The goal isn't to find the "best" router. It's to understand which router behaves correctly under which conditions—and to keep learning as those conditions change.
For the routing patterns this harness is designed to evaluate, see Second-Half Routing: From Traffic Control to Collective Intelligence.
For what this all means outside infrastructure—including a family-level application of collective intelligence patterns—see my Substack post: From One Big Brain to the Family Brain.