RouterEval: An Evaluation Harness for LLM Routing Policies

Narnaiezzsshaa Truong

How do you know if your router is actually better? A code-oriented framework for A/B testing routing logic, not just models.


If you're building second-half routing—routers that reason, search, and learn—you need a way to evaluate them. Not just the models behind the router. The routing logic itself.

This post walks through a concrete, code-oriented design for RouterEval: a small framework for answering questions like:

  • "Is Router B actually better than Router A under the same constraints?"
  • "Which routing policy gives me the best cost/quality frontier?"
  • "How does my router behave under multi-turn, adversarial, or high-risk tasks?"

(This is Part 2. For the routing patterns themselves, see Second-Half Routing: From Traffic Control to Collective Intelligence.)


1. Goals and Core Idea

We're not evaluating individual models. We're evaluating routing policies:

  • Router A: simple cost-based model selection
  • Router B: semantic routing + strategy tree
  • Router C: semantic + verifier + reflexion

Key properties of the harness:

  • Same dataset across routers
  • Same constraints (budget, latency bounds)
  • Logs routed paths, not just final outputs
  • Supports online and offline A/B
  • Tracks cost/quality/risk tradeoffs

2. Data Model: What You Evaluate On

At minimum, you need a routing eval dataset:

type RoutingEvalCase = {
  id: string;
  prompt: string;
  expected?: string; // for tasks with ground truth
  meta?: {
    domain?: string;
    difficulty?: "low" | "medium" | "high";
    risk?: "low" | "medium" | "high";
    type?: "qa" | "summarization" | "generation" | "tooling";
  };
};

And a way to store router decisions and outcomes:

type RouterDecisionTrace = {
  routerId: string;       // "router-A", "router-B", etc.
  caseId: string;
  strategy: string;       // e.g., "SLM_DIRECT", "LLM_WITH_VERIFIER"
  path: string[];         // sequence of nodes / steps taken
  modelCalls: {
    model: string;
    costTokens: number;
    latencyMs: number;
    role: string; // "analysis" | "reasoning" | "verifier" | ...
  }[];
  finalOutput: string;
  metadata?: any;         // e.g., self-eval scores, uncertainty estimates
};

And evaluation results per run:

type RouterEvalResult = {
  routerId: string;
  caseId: string;
  qualityScore: number;     // 0-1
  pass: boolean;            // thresholded
  costTokensTotal: number;
  latencyTotalMs: number;
  riskFlags?: string[];     // e.g., "hallucination", "policy_violation"
};

3. The Router Interface

Define a clear interface so you can plug in different routing policies:

type RouteContext = {
  prompt: string;
  meta?: RoutingEvalCase["meta"];
  // optional: session state, user id, etc.
};

type RouterOutput = {
  answer: string;
  trace: RouterDecisionTrace;
};

interface Router {
  id: string;
  route(ctx: RouteContext): Promise<RouterOutput>;
}

Example implementation:

class SimpleRouter implements Router {
  id = "simple-router";

  async route(ctx: RouteContext): Promise<RouterOutput> {
    const model = pickCheapModel();
    const { output, costTokens, latencyMs } = await callModel(model, ctx.prompt);

    const trace: RouterDecisionTrace = {
      routerId: this.id,
      caseId: "", // filled by harness
      strategy: "CHEAP_DIRECT",
      path: ["CHEAP_DIRECT"],
      modelCalls: [
        { model, costTokens, latencyMs, role: "direct" },
      ],
      finalOutput: output,
    };

    return { answer: output, trace };
  }
}

4. Quality Scoring: Ground Truth + LLM as Judge

You'll typically mix:

  • Exact / fuzzy string matching (for deterministic tasks; see the sketch after this list)
  • LLM-as-judge scoring (for open-ended tasks)
  • Specialized verifiers (factuality, safety, style)
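
For the first bullet, a deterministic scorer can be as small as normalized string comparison with a crude token-overlap fallback. A minimal sketch (the normalization rules are assumptions; adjust for your task format):

// Deterministic scoring for tasks with ground truth: exact match after
// normalization, falling back to token overlap against the expected answer.
function scoreDeterministic(expected: string, answer: string): number {
  const normalize = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");

  if (normalize(expected) === normalize(answer)) return 1;

  const expectedTokens = new Set(normalize(expected).split(" "));
  const answerTokens = new Set(normalize(answer).split(" "));
  let overlap = 0;
  for (const token of expectedTokens) {
    if (answerTokens.has(token)) overlap++;
  }
  return expectedTokens.size === 0 ? 0 : overlap / expectedTokens.size;
}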

4.1 LLM-as-Judge Shape

async function scoreWithLLM(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  if (!expected) {
    // no ground truth: score relevance and usefulness
    const judgment = await callLLM("judge-model", {
      system: "Score the answer from 0 to 1 for usefulness and correctness.",
      user: JSON.stringify({ prompt, answer }),
    });
    return parseFloatScore(judgment);
  }

  const judgment = await callLLM("judge-model", {
    system: "Score the answer from 0 to 1 based on how well it matches expected.",
    user: JSON.stringify({ expected, answer }),
  });

  return parseFloatScore(judgment);
}

4.2 Full Evaluation Step for One Router + Case

async function evaluateCaseWithRouter(
  router: Router,
  testCase: RoutingEvalCase,
): Promise<{ trace: RouterDecisionTrace; result: RouterEvalResult }> {
  const { answer, trace } = await router.route({
    prompt: testCase.prompt,
    meta: testCase.meta,
  });

  // back-fill caseId
  trace.caseId = testCase.id;

  const qualityScore = await scoreWithLLM(
    testCase.expected,
    answer,
    testCase.prompt,
  );

  const costTokensTotal = trace.modelCalls.reduce(
    (sum, call) => sum + call.costTokens,
    0,
  );
  const latencyTotalMs = trace.modelCalls.reduce(
    (sum, call) => sum + call.latencyMs,
    0,
  );

  const result: RouterEvalResult = {
    routerId: trace.routerId,
    caseId: trace.caseId,
    qualityScore,
    pass: qualityScore >= 0.8, // configurable
    costTokensTotal,
    latencyTotalMs,
    riskFlags: [], // fill via guardrail/safety checks if desired
  };

  return { trace, result };
}

5. Offline A/B: Batch Evaluation Across Routers

5.1 Basic Harness

async function evaluateRoutersOnDataset(
  routers: Router[],
  dataset: RoutingEvalCase[],
) {
  const allResults: RouterEvalResult[] = [];
  const allTraces: RouterDecisionTrace[] = [];

  for (const testCase of dataset) {
    for (const router of routers) {
      const { trace, result } = await evaluateCaseWithRouter(router, testCase);
      allTraces.push(trace);
      allResults.push(result);
    }
  }

  return { allTraces, allResults };
}

5.2 Aggregation and Comparison

type AggregatedMetrics = {
  routerId: string;
  passRate: number;
  avgQuality: number;
  avgCostTokens: number;
  avgLatencyMs: number;
  cases: number;
};

function aggregateResults(results: RouterEvalResult[]): AggregatedMetrics[] {
  const byRouter = new Map<string, RouterEvalResult[]>();

  for (const r of results) {
    if (!byRouter.has(r.routerId)) byRouter.set(r.routerId, []);
    byRouter.get(r.routerId)!.push(r);
  }

  const aggregates: AggregatedMetrics[] = [];
  for (const [routerId, list] of byRouter.entries()) {
    const cases = list.length;
    const passRate = list.filter(r => r.pass).length / cases;
    const avgQuality =
      list.reduce((sum, r) => sum + r.qualityScore, 0) / cases;
    const avgCostTokens =
      list.reduce((sum, r) => sum + r.costTokensTotal, 0) / cases;
    const avgLatencyMs =
      list.reduce((sum, r) => sum + r.latencyTotalMs, 0) / cases;

    aggregates.push({
      routerId,
      passRate,
      avgQuality,
      avgCostTokens,
      avgLatencyMs,
      cases,
    });
  }

  return aggregates;
}

You can extend this to:

  • Group by domain, difficulty, risk
  • Compute Pareto frontiers (quality vs cost; see the sketch after this list)
  • Highlight cases where routers diverge sharply
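
For the Pareto frontier, one sketch over the AggregatedMetrics from above (higher quality is better, lower token cost is better):

// A router is on the frontier if no other router is at least as good on both
// quality and cost, and strictly better on at least one of them.
function paretoFrontier(aggregates: AggregatedMetrics[]): AggregatedMetrics[] {
  return aggregates.filter(candidate => {
    const dominated = aggregates.some(other =>
      other !== candidate &&
      other.avgQuality >= candidate.avgQuality &&
      other.avgCostTokens <= candidate.avgCostTokens &&
      (other.avgQuality > candidate.avgQuality ||
        other.avgCostTokens < candidate.avgCostTokens)
    );
    return !dominated;
  });
}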

6. Online A/B: Shadow Mode for Routers

Offline eval is necessary but not sufficient. You also need to see how routers behave on live traffic.

6.1 Shadow Evaluation Pattern

Production flow:

  1. Online router (Router A) handles real user request
  2. In the background, Router B runs on the same input (no user impact)
  3. Compare results offline with user feedback, cost, and guardrail behavior

async function handleUserRequest(prompt: string, userId: string) {
  // primary router
  const primaryRouter = getPrimaryRouter();
  const { answer, trace } = await primaryRouter.route({ prompt });

  // fire-and-forget shadow eval
  void runShadowRouters(prompt, trace, userId);

  return answer;
}

async function runShadowRouters(
  prompt: string,
  primaryTrace: RouterDecisionTrace,
  userId: string,
) {
  const shadowRouters = getShadowRouters();

  for (const router of shadowRouters) {
    const { answer, trace } = await router.route({ prompt });

    await logShadowComparison({
      userId,
      prompt,
      primary: primaryTrace,
      shadow: trace,
      shadowAnswer: answer,
    });
  }
}

Later, you can:

  • Compare cost/quality distributions
  • Identify cases where shadow router clearly dominates
  • Build confidence before switching primary
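
A minimal sketch of that comparison step, assuming each shadow log entry keeps both traces (the ShadowComparison shape below is an assumption; quality comparison still requires judging both answers offline):

type ShadowComparison = {
  primary: RouterDecisionTrace;
  shadow: RouterDecisionTrace;
};

function summarizeShadowCost(entries: ShadowComparison[]) {
  // Total token cost of one routed request.
  const totalTokens = (trace: RouterDecisionTrace) =>
    trace.modelCalls.reduce((sum, call) => sum + call.costTokens, 0);

  const shadowCheaper = entries.filter(
    e => totalTokens(e.shadow) < totalTokens(e.primary),
  ).length;

  return {
    cases: entries.length,
    shadowCheaperRate: entries.length === 0 ? 0 : shadowCheaper / entries.length,
  };
}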

7. Evaluating Router Behavior, Not Just Outputs

For second-half routing, outputs are not the only thing that matters. You also care about:

  • Strategy choice: Did it pick the right path given risk/difficulty?
  • Escalation behavior: Did it escalate when it should?
  • Guardrail behavior: Did it refuse when policy required?
  • Over-spend vs under-spend: Did it waste cost on low-value tasks?

You can encode these as behavioral assertions.

7.1 Behavioral Checks

type BehaviorCheckResult = {
  routerId: string;
  caseId: string;
  checkName: string;
  passed: boolean;
  details?: string;
};

function checkEscalationBehavior(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult {
  const isHighRisk = testCase.meta?.risk === "high";
  const escalated =
    trace.modelCalls.some(call => call.role === "reasoning") &&
    trace.modelCalls.some(call => call.role === "verifier");

  const passed = !isHighRisk || escalated;

  return {
    routerId: trace.routerId,
    caseId: trace.caseId,
    checkName: "HIGH_RISK_ESCALATION",
    passed,
    details: passed
      ? "OK"
      : "High-risk case did not escalate to reasoning + verifier",
  };
}

You can write similar checks for:

  • "Low-risk tasks should use cheap paths"
  • "Guardrails must trigger on forbidden content"
  • "Clarification must happen for ambiguous tasks"

7.2 Integrating Behavior Checks

function runBehaviorChecks(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult[] {
  return [
    checkEscalationBehavior(trace, testCase),
    // add more checks here...
  ];
}

Integrate into evaluateCaseWithRouter:

const behaviorChecks = runBehaviorChecks(trace, testCase);
// store alongside RouterEvalResult

Now you're not just measuring what the router answered, but how it behaved.


8. Reporting: What You Actually Look At

The harness should give you:

  • Per-router aggregates (pass rate, avg cost, avg latency)
  • Per-domain/difficulty breakdowns
  • Behavior check summaries (how often policies were respected)
  • Case-level diffs where routers disagree strongly
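
For the case-level diffs, a minimal sketch that pairs results by caseId for two routers and surfaces large quality gaps (the 0.3 threshold is arbitrary):

function findDivergentCases(
  results: RouterEvalResult[],
  routerA: string,
  routerB: string,
  threshold = 0.3,
): { caseId: string; delta: number }[] {
  const byCase = new Map<string, { a?: number; b?: number }>();

  for (const r of results) {
    let entry = byCase.get(r.caseId);
    if (!entry) {
      entry = {};
      byCase.set(r.caseId, entry);
    }
    if (r.routerId === routerA) entry.a = r.qualityScore;
    if (r.routerId === routerB) entry.b = r.qualityScore;
  }

  const divergent: { caseId: string; delta: number }[] = [];
  for (const [caseId, { a, b }] of byCase.entries()) {
    if (a === undefined || b === undefined) continue;
    const delta = Math.abs(a - b);
    if (delta >= threshold) divergent.push({ caseId, delta });
  }
  return divergent.sort((x, y) => y.delta - x.delta);
}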

Think in terms of:

| Dimension | Question |
| --- | --- |
| Quality vs Cost | Is Router B worth the extra tokens? |
| Quality vs Latency | Where does strategy tree search become too slow? |
| Behavioral Compliance | Does the router respect risk policies? |

You can dump AggregatedMetrics and behavior checks into a basic dashboard (or CSV → notebook):

const { allTraces, allResults } = await evaluateRoutersOnDataset(routers, dataset);
const aggregates = aggregateResults(allResults);
// render aggregates + key case diffs

9. Design Principles for a Router Evaluation Harness

Router-first abstraction:

Define a clear Router interface with traces as first-class citizens.

Same dataset, same constraints:

Always compare routers under identical conditions.

Measure behavior, not just outputs:

Encode escalation/guardrail policies as behavior checks.

Use LLM-as-judge wisely:

Cache judgments. Use small judges where possible. Don't double-spend. (See the caching sketch after these principles.)

Make traces explorable:

You will need to inspect paths to debug routing policies.

Support offline and online modes:

Batch eval for design; shadow eval for deployment confidence.
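
On the judge-caching point: a minimal in-memory sketch that wraps scoreWithLLM (a real harness would likely persist this, keyed on a hash of the inputs):

// Memoize judge calls so repeated (expected, answer, prompt) triples are
// only scored once per run.
const judgmentCache = new Map<string, number>();

async function scoreWithLLMCached(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  const key = JSON.stringify([expected ?? null, answer, prompt]);
  const cached = judgmentCache.get(key);
  if (cached !== undefined) return cached;

  const score = await scoreWithLLM(expected, answer, prompt);
  judgmentCache.set(key, score);
  return score;
}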


What's Next

If you're building multi-model systems, agent frameworks, or safety-critical LLM apps, you need to evaluate the routing layer—not just the models.

This harness gives you a starting point:

  1. Define your Router interface
  2. Build an eval dataset with risk/difficulty metadata
  3. Run offline A/B across routing policies
  4. Add behavior checks for escalation, guardrails, cost discipline
  5. Deploy shadow routing to build confidence before switching

The goal isn't to find the "best" router. It's to understand which router behaves correctly under which conditions—and to keep learning as those conditions change.


For the routing patterns this harness is designed to evaluate, see Second-Half Routing: From Traffic Control to Collective Intelligence.

For what this all means outside infrastructure—including a family-level application of collective intelligence patterns—see my Substack post: From One Big Brain to the Family Brain.
