Building Evals That Don't Lie: How to Make AI Evaluation Reliable in Production

#ai #evaluation #production #llm

The first time I ran an LLM scoring pipeline against a large batch of job listings, the results looked great on paper. Every listing had a score. Every score had a confidence level. The numbers were well distributed. I felt good about it for a while.

Then I spot-checked some random outputs. Most of them were wrong. The LLM had given high scores to irrelevant listings, low scores to perfect matches, and fabricated entire categories of data. The confidence levels meant nothing. The system was confidently wrong at scale.

That's the eval trap. You build a pipeline, it runs, it produces numbers, and you think you're done. You're not. Evaluating AI outputs is itself an engineering challenge, and if you treat it as an afterthought, your AI feature will ship broken.

I've been building production AI systems for a while now, including a high-traffic job board that processes over 10,000 listings daily with an LLM scoring pipeline. Here's what I've learned about making evals that actually tell you the truth.

Structured Output Is Your First Line of Defense

The first mistake most teams make is asking an LLM for freeform text and then trying to parse it with regex or string matching. That's fragile. One model update changes the phrasing, and your parser breaks silently.

The fix is function calling with a strict schema. You tell the LLM exactly what shape the output must take. If it can't produce valid JSON that matches the schema, you reject the response and retry.

Here's a real schema I use for scoring job listings against a candidate profile:

import { z } from 'zod';

const JobScoreSchema = z.object({
  relevanceScore: z.number().min(0).max(100).describe('Overall relevance 0-100'),
  hasRequiredSkills: z.boolean().describe('Does the listing require the candidate skills?'),
  requiredSkills: z.array(z.string()).describe('Skills explicitly required'),
  hasLocationMatch: z.boolean(),
  location: z.string().optional(),
  hasSalaryData: z.boolean(),
  salaryRange: z.object({
    min: z.number().positive(),
    max: z.number().positive(),
    currency: z.string().length(3)
  }).optional(),
  reasons: z.array(z.string()).max(5).describe('Top 3-5 reasons for this score')
});

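Notice the has* boolean guards. They're critical. If the listing doesn't mention salary, hasSalaryData should be false and the salaryRange object absent, so the LLM has an explicit way to say "not present" instead of inventing a value to fill a field. One caveat: .optional() on its own doesn't tie the two fields together, so a response with hasSalaryData set to false and a salaryRange attached would still parse. Enforcing that takes a cross-field check, for example a .refine on the object. This pattern comes from the anti-hallucination architecture I used in a resume tailoring tool, and it works across domains.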
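
The guards only help if something checks that each flag and its field agree. Here is a minimal sketch of that check in plain TypeScript. The SalaryFields type and the checkSalaryGuard name are hypothetical, introduced only for illustration, and the rule that hasSalaryData set to true requires a salaryRange is my assumption about the intended contract. In a Zod setup the same logic could live in a .refine on the object.

```typescript
// Hypothetical trimmed-down shape: only the salary-related fields of a parsed score.
interface SalaryFields {
  hasSalaryData: boolean;
  salaryRange?: { min: number; max: number; currency: string };
}

// Returns a list of problems; an empty list means the guard and the field agree.
function checkSalaryGuard(score: SalaryFields): string[] {
  const problems: string[] = [];
  if (!score.hasSalaryData && score.salaryRange !== undefined) {
    problems.push('salaryRange present although hasSalaryData is false');
  }
  if (score.hasSalaryData && score.salaryRange === undefined) {
    problems.push('hasSalaryData is true but salaryRange is missing');
  }
  if (score.salaryRange !== undefined && score.salaryRange.min > score.salaryRange.max) {
    problems.push('salaryRange.min is greater than salaryRange.max');
  }
  return problems;
}
```

Any non-empty result can be treated the same way as a schema validation failure.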

When the LLM returns this structured output, you validate it against the Zod schema before you trust a single field. If validation fails, you log the raw response, flag it for review, and optionally retry with a stronger prompt. This stops malformed, incomplete, and out-of-range responses before they hit your database.
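
The validate, log, and retry loop described above could look roughly like this. It is a sketch under assumptions, not the pipeline's actual code: generate stands in for whatever LLM call you use, validate is an injected function that could wrap JobScoreSchema.safeParse, and the names validateWithRetry, Validation, and RetryResult are hypothetical.

```typescript
// Outcome of validating one raw model response.
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

interface RetryResult<T> {
  value: T | null;    // null when every attempt failed validation
  attempts: number;   // how many responses were requested
  rejected: string[]; // raw responses that failed, kept for logging and review
}

async function validateWithRetry<T>(
  generate: (attempt: number) => Promise<string>, // hypothetical LLM call
  validate: (raw: string) => Validation<T>,
  maxAttempts = 3
): Promise<RetryResult<T>> {
  const rejected: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(attempt);
    const result = validate(raw);
    if (result.ok) {
      return { value: result.value, attempts: attempt, rejected };
    }
    // Keep the raw response so a human can see why it was rejected
    rejected.push(raw);
  }
  return { value: null, attempts: maxAttempts, rejected };
}
```

Passing the attempt number to generate leaves room to switch to a stricter prompt on later tries. A null value means nothing usable came back and the item should go to review.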

Embedding Similarity as a Drift Detector

Structured output validation catches format errors. It doesn't catch semantic drift. An LLM can produce a perfectly valid JSON object that is semantically garbage. The relevance score might be high, but the actual match is a completely unrelated job.

To catch semantic drift, I compute embeddings for the LLM output and compare them against a reference set of known-good outputs. If the cosine similarity drops below a threshold, something changed.

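import { OpenAI } from 'openai';

// The client reads OPENAI_API_KEY from the environment by default
const openai = new OpenAI();

// Similarity cutoff; tune it against human-rated examples
const DRIFT_THRESHOLD = 0.85;

// Returns true when the output is still close to the known-good baseline,
// false when it has drifted and should be flagged for review
async function checkSemanticDrift(output: string, referenceEmbedding: number[]): Promise<boolean> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: output
  });

  const similarity = cosineSimilarity(response.data[0].embedding, referenceEmbedding);
  return similarity >= DRIFT_THRESHOLD;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embeddings must have the same dimension');
  }
  const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magnitudeA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  // A zero vector has no direction; report no similarity instead of dividing by zero
  if (magnitudeA === 0 || magnitudeB === 0) return 0;
  return dotProduct / (magnitudeA * magnitudeB);
}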

I built a reference embedding for each type of output. For job scoring, I took a set of human-verified good outputs, averaged their embeddings, and used that as the baseline. When the pipeline started scoring new listings, any output whose similarity to that baseline fell below the threshold was routed to manual review.
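
Averaging the verified embeddings into one baseline vector can be sketched as follows. The averageEmbeddings name is hypothetical, and the sketch assumes every embedding comes from the same model and so has the same dimension.

```typescript
// Average a set of equal-length embeddings into one baseline (centroid) vector.
function averageEmbeddings(embeddings: number[][]): number[] {
  if (embeddings.length === 0) {
    throw new Error('Need at least one reference embedding');
  }
  const dim = embeddings[0].length;
  const centroid: number[] = new Array<number>(dim).fill(0);
  for (const emb of embeddings) {
    if (emb.length !== dim) {
      throw new Error('All embeddings must share one dimension');
    }
    for (let i = 0; i < dim; i++) {
      // Each embedding contributes an equal share to the mean
      centroid[i] += emb[i] / embeddings.length;
    }
  }
  return centroid;
}
```

The averaged vector is generally not unit length, which is fine here because cosine similarity normalizes both vectors.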

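Suppose an LLM update subtly changes the tone or structure of outputs. If the change is large enough to move the embedding, the similarity check flags it even though the output still passes structural validation. That is the gap it fills: schemas check shape, embeddings check whether the content still resembles what you verified.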

The Human Threshold: When to Route

You can't review every output manually. At 10,000 listings a day, that's not feasible. But you can't trust every output blindly either. The solution is a tiered threshold system.

I use three zones based on the embedding similarity score:

High similarity: Auto-approve. The output closely matches known-good patterns.
Medium similarity: Auto-approve but log for review. Flag it in the dashboard for spot checks.
Low similarity: Route to human review. Block the output from being used until someone signs off.
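
The three zones map onto a small routing function. This is a sketch: the Route labels, the ZoneThresholds shape, and any concrete cutoffs you pass in are hypothetical, since real values come from calibration.

```typescript
type Route = 'auto_approve' | 'approve_and_log' | 'human_review';

// Hypothetical cutoffs: at or above `high` is the high zone, below `low` is the low zone.
interface ZoneThresholds {
  high: number;
  low: number;
}

function routeBySimilarity(similarity: number, t: ZoneThresholds): Route {
  if (t.low > t.high) {
    throw new Error('low threshold must not exceed high threshold');
  }
  if (similarity >= t.high) return 'auto_approve';
  if (similarity >= t.low) return 'approve_and_log';
  // Everything else, including NaN, is blocked until a human signs off
  return 'human_review';
}
```

Because both comparisons are false for NaN, a broken similarity value fails closed into human review instead of being approved.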

The thresholds are not magic numbers. You tune them by running a batch of examples, having a human rate each one, and finding the similarity cutoff with the best trade-off: high enough to catch most bad outputs, low enough that precision stays acceptable and the review queue stays manageable. For my pipeline, a tight threshold kept the daily review volume manageable while still catching the vast majority of bad outputs.
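
One way to make that tuning concrete is to score a few candidate cutoffs against the human-rated batch. The sketch below is an assumption about how you might do it, not a record of the original calibration: RatedExample, evaluateCutoff, pickCutoff, and the maxReviewRate budget are all hypothetical names, and the selection rule (best recall within the review budget, ties broken by precision) is one reasonable choice among several.

```typescript
interface RatedExample {
  similarity: number;
  isBad: boolean; // the human verdict
}

interface CutoffStats {
  cutoff: number;
  precision: number;  // of flagged outputs, the share that were bad
  recall: number;     // of bad outputs, the share that were flagged
  reviewRate: number; // share of all outputs sent to review
}

// Outputs with similarity below the cutoff count as flagged for review.
function evaluateCutoff(examples: RatedExample[], cutoff: number): CutoffStats {
  const flagged = examples.filter((e) => e.similarity < cutoff);
  const badTotal = examples.filter((e) => e.isBad).length;
  const caught = flagged.filter((e) => e.isBad).length;
  return {
    cutoff,
    precision: flagged.length === 0 ? 1 : caught / flagged.length,
    recall: badTotal === 0 ? 1 : caught / badTotal,
    reviewRate: examples.length === 0 ? 0 : flagged.length / examples.length,
  };
}

// Best recall among candidates that fit the review budget; null if none fit.
function pickCutoff(
  examples: RatedExample[],
  candidates: number[],
  maxReviewRate: number
): CutoffStats | null {
  let best: CutoffStats | null = null;
  for (const candidate of candidates) {
    const stats = evaluateCutoff(examples, candidate);
    if (stats.reviewRate > maxReviewRate) continue;
    if (
      best === null ||
      stats.recall > best.recall ||
      (stats.recall === best.recall && stats.precision > best.precision)
    ) {
      best = stats;
    }
  }
  return best;
}
```

A null result means the review budget is too tight for every candidate, which is itself a useful signal when planning reviewer capacity.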

This isn't a set-it-and-forget-it number. As the model changes or the data distribution shifts, you need to recalibrate. I run a calibration batch after any model update or when I detect a shift in the output distribution.

Monitoring Eval Drift Over Time

The hardest lesson I learned is that your eval system itself will drift. The embedding baseline becomes stale. The human reviewers get fatigued. The LLM's behavior changes with API updates.

You need to monitor the monitor. I track three metrics:

Precision: Of the outputs flagged as bad, how many were actually bad?
Recall: Of the truly bad outputs, how many did the eval catch?
Review Agreement: When two human reviewers rate the same output, how often do they agree?

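I store every flagged output along with the final human decision. Precision can be computed from those records alone. Recall also needs human ratings on a sample of outputs that weren't flagged, which is what the spot checks on the medium zone provide. On a regular schedule, I run a report that compares the eval's verdict against the human verdict. If precision or recall drops noticeably from the previous report, I know something is off. Usually it's drift in the embedding baseline, and I regenerate it from the latest batch of verified outputs.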
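
The three metrics reduce to simple counts over the stored decisions. A minimal sketch, assuming a hypothetical ReviewedOutput record that pairs the eval's verdict with the human one, and that the reviewed set includes some outputs the eval did not flag:

```typescript
interface ReviewedOutput {
  flaggedByEval: boolean; // the eval said "bad"
  humanSaysBad: boolean;  // the final human decision
}

// Returns null for a metric whose denominator is zero (nothing to measure yet).
function precisionRecall(
  reviewed: ReviewedOutput[]
): { precision: number | null; recall: number | null } {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const r of reviewed) {
    if (r.flaggedByEval && r.humanSaysBad) truePositives++;
    else if (r.flaggedByEval && !r.humanSaysBad) falsePositives++;
    else if (!r.flaggedByEval && r.humanSaysBad) falseNegatives++;
  }
  const flagged = truePositives + falsePositives;
  const actuallyBad = truePositives + falseNegatives;
  return {
    precision: flagged === 0 ? null : truePositives / flagged,
    recall: actuallyBad === 0 ? null : truePositives / actuallyBad,
  };
}

// Raw agreement rate: share of outputs where two reviewers gave the same verdict.
function reviewAgreement(pairs: Array<[boolean, boolean]>): number | null {
  if (pairs.length === 0) return null;
  const agreed = pairs.filter(([a, b]) => a === b).length;
  return agreed / pairs.length;
}
```

Raw agreement is the simplest reading of the third metric. It does not correct for agreement that would happen by chance.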

I also track the distribution of similarity scores over time. If the average similarity suddenly shifts, it's a signal that the LLM's output distribution changed. Imagine a model provider updates their default behavior. The average similarity might drop sharply overnight. If you're watching the trend, you catch it within hours instead of weeks.
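
Watching the trend can start as a comparison between a baseline window of similarity scores and the most recent window. The sketch below is deliberately simple and rests on assumptions: detectSimilarityShift, ShiftCheck, and the fixed maxDelta tolerance are hypothetical, and a production setup might prefer a proper statistical test over a fixed difference in means.

```typescript
interface ShiftCheck {
  baselineMean: number;
  recentMean: number;
  shifted: boolean; // true when the means differ by more than the tolerance
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compares the average similarity of a recent window against a baseline window.
function detectSimilarityShift(
  baseline: number[],
  recent: number[],
  maxDelta: number // hypothetical tolerance, e.g. derived from normal day-to-day variation
): ShiftCheck {
  if (baseline.length === 0 || recent.length === 0) {
    throw new Error('Both windows need at least one score');
  }
  const baselineMean = mean(baseline);
  const recentMean = mean(recent);
  return {
    baselineMean,
    recentMean,
    shifted: Math.abs(recentMean - baselineMean) > maxDelta,
  };
}
```

Running this on a short recent window, such as the last hour of scores, is what turns a weeks-long blind spot into an alert within hours.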

If your team is wrestling with unreliable AI outputs and shipping slower because of it, that's the kind of thing I help with. I build production AI pipelines that don't lie to you. Happy to compare notes.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.