How do you know if your router is actually better? A code-oriented framework for A/B testing routing logic, not just models.
If you're building second-half routing—routers that reason, search, and learn—you need a way to evaluate them. Not just the models behind the router. The routing logic itself.
This post walks through a concrete, code-oriented design for RouterEval, a small framework for answering questions like:
- "Is Router B actually better than Router A under the same constraints?"
- "Which routing policy gives me the best cost/quality frontier?"
- "How does my router behave under multi-turn, adversarial, or high-risk tasks?"
(This is Part 2. For the routing patterns themselves, see Second-Half Routing: From Traffic Control to Collective Intelligence.)
1. Goals and Core Idea
We're not evaluating individual models. We're evaluating routing policies:
- Router A: simple cost-based model selection
- Router B: semantic routing + strategy tree
- Router C: semantic + verifier + reflexion
Key properties of the harness:
- Same dataset across routers
- Same constraints (budget, latency bounds)
- Logs routed paths, not just final outputs
- Supports online and offline A/B
- Tracks cost/quality/risk tradeoffs
2. Data Model: What You Evaluate On
At minimum, you need a routing eval dataset:
```typescript
type RoutingEvalCase = {
  id: string;
  prompt: string;
  expected?: string; // for tasks with ground truth
  meta?: {
    domain?: string;
    difficulty?: "low" | "medium" | "high";
    risk?: "low" | "medium" | "high";
    type?: "qa" | "summarization" | "generation" | "tooling";
  };
};
```
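A concrete case might look like this (the values are purely illustrative):

```typescript
// Illustrative eval case (made-up values); open-ended tasks simply omit `expected`
const exampleCase: RoutingEvalCase = {
  id: "case-0042",
  prompt: "Summarize the incident report and list the top three root causes.",
  meta: {
    domain: "ops",
    difficulty: "medium",
    risk: "high",
    type: "summarization",
  },
};
```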
And a way to store router decisions and outcomes:
```typescript
type RouterDecisionTrace = {
  routerId: string; // "router-A", "router-B", etc.
  caseId: string;
  strategy: string; // e.g., "SLM_DIRECT", "LLM_WITH_VERIFIER"
  path: string[]; // sequence of nodes / steps taken
  modelCalls: {
    model: string;
    costTokens: number;
    latencyMs: number;
    role: string; // "analysis" | "reasoning" | "verifier" | ...
  }[];
  finalOutput: string;
  metadata?: any; // e.g., self-eval scores, uncertainty estimates
};
```
And evaluation results per run:
```typescript
type RouterEvalResult = {
  routerId: string;
  caseId: string;
  qualityScore: number; // 0-1
  pass: boolean; // thresholded
  costTokensTotal: number;
  latencyTotalMs: number;
  riskFlags?: string[]; // e.g., "hallucination", "policy_violation"
};
```
3. The Router Interface
Define a clear interface so you can plug in different routing policies:
```typescript
type RouteContext = {
  prompt: string;
  meta?: RoutingEvalCase["meta"];
  // optional: session state, user id, etc.
};

type RouterOutput = {
  answer: string;
  trace: RouterDecisionTrace;
};

interface Router {
  id: string;
  route(ctx: RouteContext): Promise<RouterOutput>;
}
```
Example implementation:
```typescript
class SimpleRouter implements Router {
  id = "simple-router";

  async route(ctx: RouteContext): Promise<RouterOutput> {
    const model = pickCheapModel();
    const { output, costTokens, latencyMs } = await callModel(model, ctx.prompt);

    const trace: RouterDecisionTrace = {
      routerId: this.id,
      caseId: "", // filled by harness
      strategy: "CHEAP_DIRECT",
      path: ["CHEAP_DIRECT"],
      modelCalls: [
        { model, costTokens, latencyMs, role: "direct" },
      ],
      finalOutput: output,
    };

    return { answer: output, trace };
  }
}
```
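pickCheapModel and callModel are assumed helpers, not part of any real SDK; their shapes might look roughly like this:

```typescript
// Assumed helper shapes for the sketch above (placeholders, not a real provider API)
function pickCheapModel(): string {
  // e.g., return the cheapest model id from your own registry/config
  return "small-model-v1";
}

async function callModel(
  model: string,
  prompt: string,
): Promise<{ output: string; costTokens: number; latencyMs: number }> {
  const start = Date.now();
  // Call your provider SDK or gateway here; this stub just echoes the prompt.
  const output = `stubbed response from ${model} for: ${prompt.slice(0, 40)}...`;
  return { output, costTokens: 0, latencyMs: Date.now() - start };
}
```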
4. Quality Scoring: Ground Truth + LLM as Judge
You'll typically mix:
- Exact / fuzzy string matching (for deterministic tasks)
- LLM-as-judge scoring (for open-ended tasks)
- Specialized verifiers (factuality, safety, style)
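Only the LLM-as-judge path gets code below, so here is a minimal sketch of the exact/fuzzy bucket; the normalization rules and partial-credit score are assumptions you would tune:

```typescript
// Minimal exact/fuzzy scorer for deterministic tasks (normalization is an assumption)
function scoreExactOrFuzzy(expected: string, answer: string): number {
  const norm = (s: string) => s.trim().toLowerCase().replace(/\s+/g, " ");
  const e = norm(expected);
  const a = norm(answer);
  if (a === e) return 1;                          // exact match after normalization
  if (a.includes(e) || e.includes(a)) return 0.8; // crude containment heuristic
  return 0;                                       // hand off to the LLM judge for partial credit
}
```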
4.1 LLM-as-Judge Shape
```typescript
async function scoreWithLLM(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  if (!expected) {
    // no ground truth: score relevance and usefulness
    const judgment = await callLLM("judge-model", {
      system: "Score the answer from 0 to 1 for usefulness and correctness.",
      user: JSON.stringify({ prompt, answer }),
    });
    return parseFloatScore(judgment);
  }

  const judgment = await callLLM("judge-model", {
    system: "Score the answer from 0 to 1 based on how well it matches expected.",
    user: JSON.stringify({ expected, answer }),
  });
  return parseFloatScore(judgment);
}
```
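parseFloatScore is assumed above; a defensive version might look like this (the regex and clamping are assumptions):

```typescript
// Defensive parsing of a judge response into a 0-1 score (assumed helper)
function parseFloatScore(judgment: string): number {
  const match = judgment.match(/\d+(\.\d+)?/); // first number in the judge output
  if (!match) return 0;                        // unparseable judgment counts as a fail
  const value = parseFloat(match[0]);
  return Math.min(1, Math.max(0, value));      // clamp into [0, 1]
}
```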
4.2 Full Evaluation Step for One Router + Case
```typescript
async function evaluateCaseWithRouter(
  router: Router,
  testCase: RoutingEvalCase,
): Promise<{ trace: RouterDecisionTrace; result: RouterEvalResult }> {
  const { answer, trace } = await router.route({
    prompt: testCase.prompt,
    meta: testCase.meta,
  });

  // back-fill caseId
  trace.caseId = testCase.id;

  const qualityScore = await scoreWithLLM(
    testCase.expected,
    answer,
    testCase.prompt,
  );

  const costTokensTotal = trace.modelCalls.reduce(
    (sum, call) => sum + call.costTokens,
    0,
  );
  const latencyTotalMs = trace.modelCalls.reduce(
    (sum, call) => sum + call.latencyMs,
    0,
  );

  const result: RouterEvalResult = {
    routerId: trace.routerId,
    caseId: trace.caseId,
    qualityScore,
    pass: qualityScore >= 0.8, // configurable
    costTokensTotal,
    latencyTotalMs,
    riskFlags: [], // fill via guardrail/safety checks if desired
  };

  return { trace, result };
}
```
5. Offline A/B: Batch Evaluation Across Routers
5.1 Basic Harness
```typescript
async function evaluateRoutersOnDataset(
  routers: Router[],
  dataset: RoutingEvalCase[],
) {
  const allResults: RouterEvalResult[] = [];
  const allTraces: RouterDecisionTrace[] = [];

  for (const testCase of dataset) {
    for (const router of routers) {
      const { trace, result } = await evaluateCaseWithRouter(router, testCase);
      allTraces.push(trace);
      allResults.push(result);
    }
  }

  return { allTraces, allResults };
}
```
5.2 Aggregation and Comparison
```typescript
type AggregatedMetrics = {
  routerId: string;
  passRate: number;
  avgQuality: number;
  avgCostTokens: number;
  avgLatencyMs: number;
  cases: number;
};

function aggregateResults(results: RouterEvalResult[]): AggregatedMetrics[] {
  const byRouter = new Map<string, RouterEvalResult[]>();
  for (const r of results) {
    if (!byRouter.has(r.routerId)) byRouter.set(r.routerId, []);
    byRouter.get(r.routerId)!.push(r);
  }

  const aggregates: AggregatedMetrics[] = [];
  for (const [routerId, list] of byRouter.entries()) {
    const cases = list.length;
    const passRate = list.filter(r => r.pass).length / cases;
    const avgQuality =
      list.reduce((sum, r) => sum + r.qualityScore, 0) / cases;
    const avgCostTokens =
      list.reduce((sum, r) => sum + r.costTokensTotal, 0) / cases;
    const avgLatencyMs =
      list.reduce((sum, r) => sum + r.latencyTotalMs, 0) / cases;

    aggregates.push({
      routerId,
      passRate,
      avgQuality,
      avgCostTokens,
      avgLatencyMs,
      cases,
    });
  }

  return aggregates;
}
```
You can extend this to:
- Group by domain, difficulty, risk
- Compute Pareto frontiers (quality vs cost)
- Highlight cases where routers diverge sharply
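For example, a quality-vs-cost Pareto frontier over the aggregates (the second item above) takes only a few lines; this version treats each router as a single point, but a per-case frontier works the same way:

```typescript
// Routers that no other router beats on both quality and cost form the frontier
function paretoFrontier(aggregates: AggregatedMetrics[]): AggregatedMetrics[] {
  return aggregates.filter(a =>
    !aggregates.some(
      b =>
        b.routerId !== a.routerId &&
        b.avgQuality >= a.avgQuality &&
        b.avgCostTokens <= a.avgCostTokens &&
        // must strictly improve on at least one axis to dominate
        (b.avgQuality > a.avgQuality || b.avgCostTokens < a.avgCostTokens),
    ),
  );
}
```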
6. Online A/B: Shadow Mode for Routers
Offline eval is necessary but not sufficient. You also need to see how routers behave on live traffic.
6.1 Shadow Evaluation Pattern
Production flow:
- Online router (Router A) handles real user request
- In the background, Router B runs on the same input (no user impact)
- Compare results offline with user feedback, cost, and guardrail behavior
```typescript
async function handleUserRequest(prompt: string, userId: string) {
  // primary router
  const primaryRouter = getPrimaryRouter();
  const { answer, trace } = await primaryRouter.route({ prompt });

  // fire-and-forget shadow eval
  void runShadowRouters(prompt, trace, userId);

  return answer;
}

async function runShadowRouters(
  prompt: string,
  primaryTrace: RouterDecisionTrace,
  userId: string,
) {
  const shadowRouters = getShadowRouters();
  for (const router of shadowRouters) {
    const { answer, trace } = await router.route({ prompt });
    await logShadowComparison({
      userId,
      prompt,
      primary: primaryTrace,
      shadow: trace,
      shadowAnswer: answer,
    });
  }
}
```
Later, you can:
- Compare cost/quality distributions
- Identify cases where shadow router clearly dominates
- Build confidence before switching primary
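A minimal offline pass over the shadow log might look like this; ShadowComparison is assumed to mirror what logShadowComparison stored, and the win margin is an arbitrary assumption:

```typescript
// Assumed shape of one logged shadow comparison
type ShadowComparison = {
  userId: string;
  prompt: string;
  primary: RouterDecisionTrace;
  shadow: RouterDecisionTrace;
  shadowAnswer: string;
};

// Score both answers offline and report how often (and at what cost) the shadow wins
async function summarizeShadowLog(log: ShadowComparison[]) {
  let shadowWins = 0;
  let shadowCostDelta = 0;
  const cost = (t: RouterDecisionTrace) =>
    t.modelCalls.reduce((sum, c) => sum + c.costTokens, 0);

  for (const entry of log) {
    const primaryScore = await scoreWithLLM(undefined, entry.primary.finalOutput, entry.prompt);
    const shadowScore = await scoreWithLLM(undefined, entry.shadowAnswer, entry.prompt);
    if (shadowScore > primaryScore + 0.05) shadowWins++; // win margin is an assumption
    shadowCostDelta += cost(entry.shadow) - cost(entry.primary);
  }

  return {
    shadowWinRate: log.length ? shadowWins / log.length : 0,
    avgExtraCostTokens: log.length ? shadowCostDelta / log.length : 0,
  };
}
```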
7. Evaluating Router Behavior, Not Just Outputs
For second-half routing, outputs are not the only thing that matters. You also care about:
- Strategy choice: Did it pick the right path given risk/difficulty?
- Escalation behavior: Did it escalate when it should?
- Guardrail behavior: Did it refuse when policy required?
- Over-spend vs under-spend: Did it burn budget on low-value tasks, or skimp on hard ones?
You can encode these as behavioral assertions.
7.1 Behavioral Checks
```typescript
type BehaviorCheckResult = {
  routerId: string;
  caseId: string;
  checkName: string;
  passed: boolean;
  details?: string;
};

function checkEscalationBehavior(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult {
  const isHighRisk = testCase.meta?.risk === "high";
  const escalated =
    trace.modelCalls.some(call => call.role === "reasoning") &&
    trace.modelCalls.some(call => call.role === "verifier");

  const passed = !isHighRisk || escalated;
  return {
    routerId: trace.routerId,
    caseId: trace.caseId,
    checkName: "HIGH_RISK_ESCALATION",
    passed,
    details: passed
      ? "OK"
      : "High-risk case did not escalate to reasoning + verifier",
  };
}
```
You can write similar checks for:
- "Low-risk tasks should use cheap paths"
- "Guardrails must trigger on forbidden content"
- "Clarification must happen for ambiguous tasks"
7.2 Integrating Behavior Checks
```typescript
function runBehaviorChecks(
  trace: RouterDecisionTrace,
  testCase: RoutingEvalCase,
): BehaviorCheckResult[] {
  return [
    checkEscalationBehavior(trace, testCase),
    // add more checks here...
  ];
}
```
Integrate into `evaluateCaseWithRouter`:
```typescript
const behaviorChecks = runBehaviorChecks(trace, testCase);
// store alongside RouterEvalResult
```
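If you collect the checks across a run, a per-check pass rate is cheap to compute and feeds directly into the reporting in the next section; a sketch:

```typescript
// Pass rate per behavior check across all routers/cases in a run
function summarizeBehaviorChecks(checks: BehaviorCheckResult[]) {
  const byCheck = new Map<string, { passed: number; total: number }>();
  for (const check of checks) {
    const entry = byCheck.get(check.checkName) ?? { passed: 0, total: 0 };
    entry.total++;
    if (check.passed) entry.passed++;
    byCheck.set(check.checkName, entry);
  }
  return [...byCheck.entries()].map(([checkName, { passed, total }]) => ({
    checkName,
    passRate: passed / total,
    total,
  }));
}
```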
Now you're not just measuring what the router answered, but how it behaved.
8. Reporting: What You Actually Look At
The harness should give you:
- Per-router aggregates (pass rate, avg cost, avg latency)
- Per-domain/difficulty breakdowns
- Behavior check summaries (how often policies were respected)
- Case-level diffs where routers disagree strongly
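The last item is cheap to compute from RouterEvalResult alone; here is a sketch that surfaces the cases where two routers disagree most (the router ids are placeholders):

```typescript
// Cases where two routers' quality scores diverge the most
function biggestDisagreements(
  results: RouterEvalResult[],
  routerA: string,
  routerB: string,
  topN = 10,
) {
  // Collect quality scores per case, keyed by router id
  const byCase = new Map<string, Record<string, number | undefined>>();
  for (const r of results) {
    const entry = byCase.get(r.caseId) ?? {};
    entry[r.routerId] = r.qualityScore;
    byCase.set(r.caseId, entry);
  }
  // Rank cases by the absolute quality gap between the two routers
  return [...byCase.entries()]
    .filter(([, s]) => s[routerA] !== undefined && s[routerB] !== undefined)
    .map(([caseId, s]) => ({ caseId, gap: (s[routerA] ?? 0) - (s[routerB] ?? 0) }))
    .sort((a, b) => Math.abs(b.gap) - Math.abs(a.gap))
    .slice(0, topN);
}
```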
Think in terms of:
| Dimension | Question |
|---|---|
| Quality vs Cost | Is Router B worth the extra tokens? |
| Quality vs Latency | Where does strategy tree search become too slow? |
| Behavioral Compliance | Does the router respect risk policies? |
You can dump `AggregatedMetrics` and behavior checks into a basic dashboard (or CSV → notebook):
```typescript
const { allTraces, allResults } = await evaluateRoutersOnDataset(routers, dataset);
const aggregates = aggregateResults(allResults);
// render aggregates + key case diffs
```
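If a notebook is your dashboard, dumping the aggregates to CSV is enough to get started; a sketch using Node's fs (escaping is deliberately naive, since these fields are numeric or simple ids):

```typescript
import { writeFileSync } from "node:fs";

// Naive CSV dump of the per-router aggregates
function writeAggregatesCsv(aggregates: AggregatedMetrics[], path: string) {
  const header = "routerId,passRate,avgQuality,avgCostTokens,avgLatencyMs,cases";
  const rows = aggregates.map(a =>
    [a.routerId, a.passRate, a.avgQuality, a.avgCostTokens, a.avgLatencyMs, a.cases].join(","),
  );
  writeFileSync(path, [header, ...rows].join("\n"));
}

// e.g., writeAggregatesCsv(aggregates, "router-eval-aggregates.csv");
```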
9. Design Principles for a Router Evaluation Harness
- Router-first abstraction: Define a clear `Router` interface with traces as first-class citizens.
- Same dataset, same constraints: Always compare routers under identical conditions.
- Measure behavior, not just outputs: Encode escalation/guardrail policies as behavior checks.
- Use LLM-as-judge wisely: Cache judgments. Use small judges where possible. Don't double-spend.
- Make traces explorable: You will need to inspect paths to debug routing policies.
- Support offline and online modes: Batch eval for design; shadow eval for deployment confidence.
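The "cache judgments" point is worth making concrete: identical (expected, answer, prompt) triples should never hit the judge twice. A minimal in-memory sketch (a persistent store would be the real thing):

```typescript
// In-memory cache keyed on the judge inputs; swap for a persistent store in practice
const judgeCache = new Map<string, number>();

async function scoreWithLLMCached(
  expected: string | undefined,
  answer: string,
  prompt: string,
): Promise<number> {
  const key = JSON.stringify({ expected, answer, prompt });
  const cached = judgeCache.get(key);
  if (cached !== undefined) return cached;
  const score = await scoreWithLLM(expected, answer, prompt);
  judgeCache.set(key, score);
  return score;
}
```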
What's Next
If you're building multi-model systems, agent frameworks, or safety-critical LLM apps, you need to evaluate the routing layer—not just the models.
This harness gives you a starting point:
- Define your `Router` interface
- Build an eval dataset with risk/difficulty metadata
- Run offline A/B across routing policies
- Add behavior checks for escalation, guardrails, cost discipline
- Deploy shadow routing to build confidence before switching
The goal isn't to find the "best" router. It's to understand which router behaves correctly under which conditions—and to keep learning as those conditions change.
For the routing patterns this harness is designed to evaluate, see Second-Half Routing: From Traffic Control to Collective Intelligence.
For what this all means outside infrastructure—including a family-level application of collective intelligence patterns—see my Substack post: From One Big Brain to the Family Brain.