Abdul Rehman

Posted on Jun 26

Ship AI Features Without the Fire Drill: Write the Eval First

#ai #llm #testing #production

I've watched teams spend weeks refining an LLM scoring pipeline, only to run it against real data and discover that many of the scores are useless. The model rewards keyword density over actual relevance. The output looks structured. The numbers are in range. But the results don't match what a human would judge.

That's the moment you realize: you don't ship AI features by writing prompts first. You ship them by writing the evaluation first.

The 80/20 Trap Nobody Talks About

Most teams building AI features follow the same pattern. They wire up an LLM call, test it on three examples, and call it done. Then production hits and they discover the model hallucinates on edge cases, ignores instructions, or produces output that looks right but is wrong.

The problem isn't the model. It's that you evaluated your system on the wrong thing.

In my experience, the 80/20 rule applies differently to AI features than to traditional software. The first 20% of the work gets you 80% of the way there. The remaining 80% of the work is all evaluation: catching edge cases, measuring quality, and deciding whether to accept or reject outputs.

If you don't define your evaluation criteria before you write a single prompt, you'll spend that 80% in a reactive fire drill. You'll fix bugs as they surface rather than preventing them.

How I Structure an Eval-First Pipeline

On a job board platform I worked on, the system processes 10,000+ listings daily. Each listing needs a relevance score for each candidate profile. The LLM generates a structured JSON output with a score and reasoning.

Here's the eval structure I built before writing the scoring prompt:

interface ListingScoringEval {
  scoreRange: {
    min: number;  // 0
    max: number;  // 100
  };
  requiredFields: string[];  // ['score', 'reasoning', 'matchedSkills']
  edgeCases: {
    emptyDescription: 'reject';
    nonEnglishText: 'reject';
    duplicateListing: 'deduplicate';
    partialMatch: 'accept_with_lower_score';
  };
  qualityThresholds: {
    minimumScore: 10;  // below this = no match
    reasoningRequired: true;
    reasoningMinWords: 15;
  };
  hallucinationGuards: {
    noFabricatedSkills: true;
    noCompanyNamesNotInInput: true;
    scoreMustMatchReasoning: true;
  };
}

This isn't a prompt. It's a contract. It tells me exactly what valid output looks like before the model generates anything.

The eval does three things. First, it validates that the output structure is correct (right fields, right types). Second, it checks that the output is internally consistent (the reasoning justifies the score). Third, it rejects known failure modes (empty input, fabricated data).
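
One way to make the first of those checks executable is to keep a runtime copy of the contract and drive the structure check from it. The sketch below is an illustration under assumptions, not the production code: `scoringContract` and `checkStructure` are hypothetical names, and the camelCase field names are a guess at what the model is prompted to emit.

```typescript
// Hypothetical runtime copy of the contract; names and values are illustrative.
const scoringContract = {
  scoreRange: { min: 0, max: 100 },
  requiredFields: ['score', 'reasoning', 'matchedSkills'],
} as const;

// Returns the list of structural problems found in a parsed model response.
function checkStructure(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['Output is not a JSON object'];
  }
  const record = raw as Record<string, unknown>;
  const problems: string[] = [];

  // Every field named in the contract must be present.
  for (const field of scoringContract.requiredFields) {
    if (!(field in record)) {
      problems.push(`Missing required field: ${field}`);
    }
  }

  // If a numeric score is present, it must fall inside the contract's range.
  const { min, max } = scoringContract.scoreRange;
  const score = record['score'];
  if (typeof score === 'number' && (score < min || score > max)) {
    problems.push(`Score ${score} outside ${min}-${max}`);
  }
  return problems;
}
```

Keeping the field list and range in one object means the prompt, the validator, and the tests can all read from the same source instead of drifting apart.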

The Real Test: Edge Cases You Didn't Think Of

The eval caught something I hadn't considered. Some job listings contained only a company name and a title with no description. The LLM would generate a score anyway, but the reasoning would be generic and meaningless.

The eval flagged these in two ways. The minimum score of 10 meant anything below that was automatically rejected. More importantly, the reasoning check caught the empty listings, because the model couldn't generate 15 meaningful words about a job with no content.
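
The same edge cases can also be screened on the input side, before any model call is made. This is a minimal sketch of the contract's `emptyDescription` and `duplicateListing` rules, assuming a listing has `title` and `description` strings. `makePreCheck` is a hypothetical helper, and exact-match deduplication after normalization is a simplification; real duplicates may need fuzzier matching.

```typescript
interface ListingInput {
  title: string;
  description: string;
}

type PreCheckDecision = 'reject' | 'deduplicate' | 'score';

// Builds a pre-check that remembers which listings it has already seen.
function makePreCheck(): (listing: ListingInput) => PreCheckDecision {
  const seen = new Set<string>();

  return (listing) => {
    // Treat whitespace-only descriptions as empty.
    const description = listing.description.trim();
    if (description.length === 0) {
      return 'reject';
    }

    // Normalize case and whitespace so trivial variations count as duplicates.
    const key = `${listing.title}\n${description}`
      .toLowerCase()
      .replace(/\s+/g, ' ');
    if (seen.has(key)) {
      return 'deduplicate';
    }
    seen.add(key);
    return 'score';
  };
}
```

Only listings that come back as `'score'` would be sent to the model, which saves tokens on inputs the output validator would reject anyway.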

Here's the validation function that runs before the output reaches the database:

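// Shapes assumed for this example; adapt them to your own schema.
interface LLMScoringOutput {
  score?: number;
  reasoning?: string;
  matchedSkills?: string[];
}
interface JobListing {
  title: string;
  description: string;
}
interface EvalResult {
  passed: boolean;
  errors: string[];
  score?: number;
}

function validateScoringOutput(
  output: LLMScoringOutput,
  input: JobListing
): EvalResult {
  const errors: string[] = [];

  // Normalize optional fields once so later checks cannot throw on missing data
  const score = output.score;
  const reasoning = typeof output.reasoning === 'string' ? output.reasoning : '';
  const matchedSkills = Array.isArray(output.matchedSkills) ? output.matchedSkills : [];

  // Structure check: test the type, because `!output.score` would reject a valid 0
  if (typeof score !== 'number' || Number.isNaN(score)) {
    errors.push('Missing or invalid score field');
  }
  // Count words, not characters, to match reasoningMinWords in the contract
  const wordCount = reasoning.split(/\s+/).filter(Boolean).length;
  if (wordCount < 15) {
    errors.push('Reasoning too short or missing');
  }

  // Range check
  if (typeof score === 'number' && (score < 0 || score > 100)) {
    errors.push(`Score ${score} outside valid range`);
  }

  // Consistency check
  if (
    typeof score === 'number' &&
    score > 80 &&
    reasoning.toLowerCase().includes('no relevant skills')
  ) {
    errors.push('Score contradicts reasoning');
  }

  // Hallucination guard: every matched skill must appear in the listing text
  const inputText = `${input.title} ${input.description}`.toLowerCase();
  const fabricatedSkills = matchedSkills.filter(
    skill => !inputText.includes(skill.toLowerCase())
  );
  if (fabricatedSkills.length > 0) {
    errors.push(`Fabricated skills detected: ${fabricatedSkills.join(', ')}`);
  }

  return {
    passed: errors.length === 0,
    errors,
    score
  };
}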

This function runs on every single output. It rejects anything that doesn't pass. No exceptions. If the eval fails, the output doesn't reach the user.
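
To make "the output doesn't reach the user" concrete, the validator can sit behind a small gate that either returns the accepted output or hands the rejection to a logger. This is a sketch under assumptions: `gateOutput`, `Verdict`, and the `onReject` callback are hypothetical names, and the validator is passed in as a parameter so the snippet stands alone.

```typescript
interface Verdict {
  passed: boolean;
  errors: string[];
}

// Returns the output only if the validator passes it. Otherwise it reports the
// rejection and returns null, so callers cannot use a bad output by accident.
function gateOutput<T>(
  output: T,
  validate: (candidate: T) => Verdict,
  onReject: (candidate: T, errors: string[]) => void
): T | null {
  const verdict = validate(output);
  if (verdict.passed) {
    return output;
  }
  onReject(output, verdict.errors);
  return null;
}

// Usage with a toy validator that only checks the upper bound of the score.
const rejectionLog: string[][] = [];
const gated = gateOutput(
  { score: 120 },
  (o) =>
    o.score <= 100
      ? { passed: true, errors: [] }
      : { passed: false, errors: ['Score outside valid range'] },
  (_candidate, errors) => {
    rejectionLog.push(errors);
  }
);
// `gated` is null here, and the reason is recorded in `rejectionLog`.
```

Returning `null` instead of throwing keeps the decision about what to show the user (nothing, a fallback, a retry) with the caller.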

What Happens When You Skip This Step

Suppose you skip the eval and ship directly. The first week looks fine. Then a recruiter searches for "React developer" and gets a listing for a Java backend role scored at 85. They click it, waste time, and lose trust in the platform.

That's a single bad output. The real cost is cumulative. Every bad output trains your users to ignore the AI feature. They stop trusting the scores. They stop using the filters. You've built a feature that actively degrades the user experience.

I've seen this pattern repeat across multiple projects. Teams ship an AI feature, it works on the happy path, then it quietly fails on edge cases until someone notices. The fix is always the same: add evaluation. But by then you're retrofitting guards onto a system that wasn't designed for them.

The Eval-First Workflow

Here's the workflow I use now. It takes longer upfront but saves weeks of debugging later.

1. Write the eval contract before the prompt. Define valid output shape, ranges, and rejection criteria.
2. Build the validation function. It should reject bad outputs automatically.
3. Write the prompt against the eval. You know exactly what the output needs to look like.
4. Test on 100 real examples, not 3. Run the eval on every one.
5. Iterate the prompt until the eval passes on the vast majority of cases.
6. Ship with the eval running in production. Log every rejection.
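
The test-and-iterate steps above boil down to running the validator over a fixed set of examples and watching the pass rate between prompt revisions. A minimal sketch of that bookkeeping follows; `summarizeEval` and its types are hypothetical, and it assumes each example has already been run through a validator that returns `passed` and `errors`.

```typescript
interface CaseResult {
  passed: boolean;
  errors: string[];
}

interface EvalSummary {
  total: number;
  passed: number;
  passRate: number;
  failuresByError: Record<string, number>;
}

// Aggregates per-example results into a pass rate and a failure breakdown.
function summarizeEval(results: CaseResult[]): EvalSummary {
  const failuresByError: Record<string, number> = {};
  let passed = 0;

  for (const result of results) {
    if (result.passed) {
      passed += 1;
      continue;
    }
    // Group by exact message; messages that embed values will fragment the
    // counts, so stable error codes may work better in practice.
    for (const error of result.errors) {
      failuresByError[error] = (failuresByError[error] ?? 0) + 1;
    }
  }

  const total = results.length;
  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
    failuresByError,
  };
}
```

A release gate could then be as simple as comparing `passRate` to a threshold. The right threshold is a product decision, since the workflow only asks for "the vast majority of cases". The failure breakdown shows which rule to target in the next prompt revision.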

The key insight is that the eval isn't a testing step. It's a production guard. It runs on every output, every time. If the model drifts or a new edge case appears, the eval catches it before it reaches the user.
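
Logged rejections are most useful when something watches the rate. As one possible sketch (not something the system described here is confirmed to do), a sliding window over recent outcomes can flag when rejections climb, which is a cheap signal for model drift or a new edge case. `RejectionMonitor`, the window size, and the threshold are all hypothetical.

```typescript
class RejectionMonitor {
  private readonly outcomes: boolean[] = []; // true = rejected
  private readonly windowSize: number;
  private readonly alertThreshold: number; // e.g. 0.2 means 20% rejected

  constructor(windowSize: number, alertThreshold: number) {
    this.windowSize = windowSize;
    this.alertThreshold = alertThreshold;
  }

  // Record one eval result, dropping the oldest once the window is full.
  record(rejected: boolean): void {
    this.outcomes.push(rejected);
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift();
    }
  }

  // Fraction of outcomes in the current window that were rejections.
  rejectionRate(): number {
    if (this.outcomes.length === 0) {
      return 0;
    }
    const rejected = this.outcomes.filter(Boolean).length;
    return rejected / this.outcomes.length;
  }

  // Only alert on a full window so a few early rejections do not trigger it.
  shouldAlert(): boolean {
    return (
      this.outcomes.length === this.windowSize &&
      this.rejectionRate() > this.alertThreshold
    );
  }
}
```

In use, each validated output would call `record(!result.passed)`, and a periodic job or the request path itself would check `shouldAlert()`.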

Why This Matters for Founders

If you're building an AI feature, the quality of your output determines whether users trust it. A feature that works most of the time is worse than no feature at all, because the failures erode trust faster than the successes build it.

The eval-first approach forces you to define what "good" looks like before you start. It makes the failure modes explicit. And it gives you a mechanism to catch bad outputs in production, not just in testing.

If your team is shipping AI features and hitting quality issues in production, that's exactly the kind of thing I help with. Happy to compare notes on how to structure an eval pipeline for your specific use case.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

Top comments (2)

Mateo Ruiz • Jun 26

A lot of teams discover this the hard way: agent failures are rarely infrastructure failures—they're visibility failures. If every step returns "success" but one hallucinated assumption slips through early, the final outcome can still be completely wrong.

What's interesting here isn't just tracing prompts and responses, it's treating agent workflows like distributed systems: step-level observability, drift detection, circuit breakers, cost controls, and replayability. That's where production reliability starts to emerge.

This is also why many organizations moving beyond prototypes invest heavily in observability and governance layers before adding more agents or bigger models. Without that visibility, scaling agents often just scales hidden failure modes. Teams working on production AI systems, including projects we've supported at IT Path Solutions, usually find that better monitoring and validation deliver bigger reliability gains than switching models.

David Flores Flores • Jun 26

Good framing. The part that stands out to me is treating the eval as a contract, not as an after-the-fact metric.

From a QA perspective, I would add one extra layer before the model prompt: map the eval criteria back to the user story or acceptance criteria. That makes it much easier to catch cases where the model output is valid JSON but still misses the business requirement.

A practical flow I like is:

1. Define expected output shape and hard failure rules
2. Map each rule to a requirement or risk
3. Add negative and edge-case examples before prompt tuning
4. Review generated outputs against that map, not just against schema validity

Otherwise the team can end up with a technically passing eval that still gives users a wrong or incomplete answer. That is usually where QA review adds the most value.