Saurav Bhattacharya

Posted on Jun 5

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

#ai #testing #agents #evaluation

The Core Problem

You shipped an AI agent. It works in demos. Then it runs 10,000 times in production, and you realize you have no idea which runs were good.

This is the agent evaluation problem, and most teams approach it backwards. They reach for model-as-judge ("ask GPT-4 if the output is good") because it feels natural. But this is like using a microscope when you needed a ruler first.

Here's my thesis: a tiered evaluation architecture—deterministic checks first, model-as-judge only where necessary—catches more failures, costs less, and gives you actionable signal faster.

The Three Tiers

After building eval infrastructure for production agents, I've landed on a three-tier model:

Tier 1: Deterministic Assertions (catches ~60% of failures)

These are the boring checks. They're also the most valuable.

interface DeterministicCheck {
  name: string;
  check: (output: AgentOutput) => { pass: boolean; reason?: string };
}

const structuralChecks: DeterministicCheck[] = [
  {
    name: 'valid-json-output',
    check: (output) => {
      try {
        JSON.parse(output.raw);
        return { pass: true };
      } catch (e) {
        return { pass: false, reason: `Invalid JSON: ${e.message}` };
      }
    }
  },
  {
    name: 'no-hallucinated-urls',
    check: (output) => {
      const urls = extractUrls(output.raw);
      const knownDomains = getAllowedDomains(output.context);
      const unknown = urls.filter(u => !knownDomains.includes(new URL(u).hostname));
      return {
        pass: unknown.length === 0,
        reason: unknown.length > 0 ? `Unknown domains: ${unknown.join(', ')}` : undefined
      };
    }
  },
  {
    name: 'required-fields-present',
    check: (output) => {
      const parsed = JSON.parse(output.raw);
      const missing = output.schema.required.filter(f => !(f in parsed));
      return {
        pass: missing.length === 0,
        reason: missing.length > 0 ? `Missing: ${missing.join(', ')}` : undefined
      };
    }
  }
];

Why start here? Three reasons:

Latency: These run in <1ms. You can evaluate every single agent run, not a sample.
Determinism: No flaky results. A JSON parse error is a JSON parse error.
Debuggability: When a check fails, you know exactly what broke.

Tier 2: Heuristic Scoring (catches ~20% more)

These are computed properties that don't require an LLM but capture quality signals:

interface HeuristicScore {
  name: string;
  score: (output: AgentOutput) => number; // 0-1
  threshold: number;
}

const heuristics: HeuristicScore[] = [
  {
    name: 'response-conciseness',
    score: (output) => {
      const tokenCount = estimateTokens(output.raw);
      const expectedRange = output.context.expectedTokenRange;
      if (tokenCount < expectedRange[0]) return tokenCount / expectedRange[0];
      if (tokenCount > expectedRange[1]) return expectedRange[1] / tokenCount;
      return 1.0;
    },
    threshold: 0.7
  },
  {
    name: 'context-utilization',
    score: (output) => {
      const contextEntities = extractEntities(output.context.provided);
      const outputEntities = extractEntities(output.raw);
      const used = contextEntities.filter(e => outputEntities.includes(e));
      return used.length / Math.max(contextEntities.length, 1);
    },
    threshold: 0.3
  }
];

These aren't pass/fail—they're continuous scores. Track them over time. When the distribution shifts, something changed.

Tier 3: Model-as-Judge (the remaining ~20%)

Only now—after deterministic and heuristic checks pass—do you invoke an LLM judge:

async function modelJudge(output: AgentOutput, criteria: JudgeCriteria): Promise<Judgment> {
  const prompt = `
You are evaluating an AI agent's output for: ${criteria.dimension}

Context provided to agent:
${output.context.summary}

Agent output:
${output.raw}

Score 1-5 on ${criteria.dimension}. Respond with JSON:
{"score": <int>, "reasoning": "<one sentence>"}
`;

  const result = await judge.complete(prompt, { temperature: 0 });
  return JSON.parse(result);
}

Critical design decisions for model-as-judge:

Use it for subjective dimensions only: tone, helpfulness, nuance. Not for things you can check deterministically.
Temperature 0, always: You want reproducibility, not creativity.
Structured output: Force JSON responses. Parse failures become Tier 1 check failures on your judge itself.
Judge the judge: Run your model-as-judge on known-good and known-bad examples. Track its accuracy. It drifts.

Why Order Matters

The tiered approach isn't just about efficiency. It's about failure attribution.

When a Tier 1 check fails, you know the agent produced structurally invalid output. You can fix the prompt, the schema enforcement, or the output parser. The fix is concrete.

When a model-as-judge scores something 2/5, you know... something is subjectively wrong. Maybe. Unless the judge is having a bad day. The signal is useful in aggregate but dangerous for individual decisions.

async function evaluateRun(output: AgentOutput): Promise<EvalResult> {
  // Tier 1: Fast, deterministic. Gate everything else.
  const tier1Results = structuralChecks.map(c => c.check(output));
  if (tier1Results.some(r => !r.pass)) {
    return { tier: 1, pass: false, results: tier1Results, skipHigherTiers: true };
  }

  // Tier 2: Heuristic scores. Flag but don't block.
  const tier2Results = heuristics.map(h => ({ 
    name: h.name, 
    score: h.score(output), 
    threshold: h.threshold 
  }));

  // Tier 3: Only if Tier 1 passes and context warrants it.
  const tier3Results = output.context.requiresSubjectiveEval 
    ? await runModelJudge(output)
    : null;

  return { tier: 3, pass: true, results: { tier1Results, tier2Results, tier3Results } };
}

The Economics

Let's make it concrete. Say you run 10,000 agent invocations per day:

Approach	Cost/day	Latency added	Signal quality
Model-as-judge on everything	~$50-150	2-5s per eval	High but noisy
Tiered (deterministic first)	~$5-15	<10ms for 80%	Higher and cleaner

The tiered approach is 10x cheaper because you only invoke the expensive judge on the ~20% of runs that pass structural checks AND require subjective evaluation.

What This Looks Like in Practice

In agent-eval, we implement this as a pipeline where checks are composable:

const pipeline = createEvalPipeline([
  tier('deterministic', [jsonValid, schemaConforms, noHallucinatedLinks]),
  tier('heuristic', [conciseness, contextUsage, formatAdherence]),
  tier('judge', [helpfulness, technicalAccuracy], { sampleRate: 0.2 })
]);

// Run on every agent output
const result = await pipeline.evaluate(agentOutput);

Note the sampleRate: 0.2 on the judge tier. Even within Tier 3, you don't need to judge everything. Sample, aggregate, alert on distribution shifts.

The Takeaway

If you're building agent evaluation, resist the temptation to start with model-as-judge. Start with the dumbest possible checks:

Did the agent return valid JSON?
Are all required fields present?
Are the URLs real?
Is the response length reasonable?

These catch the majority of production failures. They're fast, cheap, deterministic, and debuggable. Save the expensive subjective evaluation for where it actually matters.

What's your experience with agent evaluation in production? Are you running model-as-judge on everything, or have you found ways to tier your checks? I'd love to hear what patterns have worked.

DEV Community