I've watched teams spend weeks refining an LLM scoring pipeline, only to run it against real data and discover that many of the scores are useless. The model rewards keyword density over actual relevance. The output looks structured. The numbers are in range. But the results don't match what a human would judge.
That's the moment you realize: you don't ship AI features by writing prompts first. You ship them by writing the evaluation first.
The 80/20 Trap Nobody Talks About
Most teams building AI features follow the same pattern. They wire up an LLM call, test it on three examples, and call it done. Then production hits and they discover the model hallucinates on edge cases, ignores instructions, or produces output that looks right but is wrong.
The problem isn't the model. It's that you evaluated your system on the wrong thing.
In my experience, the 80/20 rule applies differently to AI features than to traditional software. The first 20% of the work gets you 80% of the way there. The remaining 80% of the work is all evaluation: catching edge cases, measuring quality, and deciding whether to accept or reject outputs.
If you don't define your evaluation criteria before you write a single prompt, you'll spend that 80% in a reactive fire drill. You'll fix bugs as they surface rather than preventing them.
How I Structure an Eval-First Pipeline
On a job board platform I worked on, the system processes 10,000+ listings daily. Each listing needs a relevance score for each candidate profile. The LLM generates a structured JSON output with a score and reasoning.
Here's the eval structure I built before writing the scoring prompt:
interface ListingScoringEval {
  scoreRange: {
    min: number; // 0
    max: number; // 100
  };
  requiredFields: string[]; // ['score', 'reasoning', 'matchedSkills']
  edgeCases: {
    emptyDescription: 'reject';
    nonEnglishText: 'reject';
    duplicateListing: 'deduplicate';
    partialMatch: 'accept_with_lower_score';
  };
  qualityThresholds: {
    minimumScore: 10; // below this = no match
    reasoningRequired: true;
    reasoningMinWords: 15;
  };
  hallucinationGuards: {
    noFabricatedSkills: true;
    noCompanyNamesNotInInput: true;
    scoreMustMatchReasoning: true;
  };
}
This isn't a prompt. It's a contract. It tells me exactly what valid output looks like before the model generates anything.
The eval does three things. First, it validates that the output structure is correct (right fields, right types). Second, it checks that the output is internally consistent (the reasoning justifies the score). Third, it rejects known failure modes (empty input, fabricated data).
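To make the first of those checks concrete, here is a minimal sketch of a structure check driven by a runtime copy of the contract. The names `contract` and `checkStructure` are hypothetical, not part of the original pipeline, and the sketch assumes the model emits the camelCase `matchedSkills` key that the validation function later in this post reads.

```typescript
// Hypothetical runtime counterpart to the ListingScoringEval contract.
// The limits come from the interface above; `contract` and `checkStructure`
// are illustrative names, and `matchedSkills` is an assumed key name.
const contract = {
  scoreRange: { min: 0, max: 100 },
  requiredFields: ['score', 'reasoning', 'matchedSkills'],
} as const;

function checkStructure(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['Output is not a JSON object'];
  }
  const record = raw as Record<string, unknown>;
  const errors: string[] = [];

  // Right fields: every required key must be present.
  for (const field of contract.requiredFields) {
    if (!(field in record)) {
      errors.push(`Missing field: ${field}`);
    }
  }

  // Right types: only check fields that are present, so a missing field
  // is reported once rather than twice.
  const { score, reasoning, matchedSkills } = record;
  if ('score' in record) {
    if (typeof score !== 'number' || !Number.isFinite(score)) {
      errors.push('score must be a finite number');
    } else if (score < contract.scoreRange.min || score > contract.scoreRange.max) {
      errors.push(
        `score ${score} outside ${contract.scoreRange.min}-${contract.scoreRange.max}`
      );
    }
  }
  if ('reasoning' in record && typeof reasoning !== 'string') {
    errors.push('reasoning must be a string');
  }
  if ('matchedSkills' in record && !Array.isArray(matchedSkills)) {
    errors.push('matchedSkills must be an array');
  }
  return errors;
}
```

An empty array means the shape is acceptable. Taking `unknown` as the parameter type is deliberate: parsed model output carries no type guarantees, so every field is checked at runtime before anything downstream trusts it.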
The Real Test: Edge Cases You Didn't Think Of
The eval caught something I hadn't considered. Some job listings contained only a company name and a title with no description. The LLM would generate a score anyway, but the reasoning would be generic and meaningless.
The eval flagged these. Any score below the minimum of 10 was treated as no match and dropped. But more importantly, the reasoning check caught the empty listings, because the model couldn't generate 15 meaningful words about a job with no content.
Here's the validation function that runs before the output reaches the database:
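// Minimal shapes assumed for this example; adjust to your actual schema.
interface LLMScoringOutput {
  score: number;
  reasoning: string;
  matchedSkills: string[];
}

interface JobListing {
  title: string;
  description: string;
}

interface EvalResult {
  passed: boolean;
  errors: string[];
  score: number;
}

function validateScoringOutput(
  output: LLMScoringOutput,
  input: JobListing
): EvalResult {
  const errors: string[] = [];

  // Structure check. The output is parsed JSON, so the declared types are
  // not guaranteed at runtime. Test the type directly: `!output.score`
  // would wrongly reject a legitimate score of 0.
  const hasScore = typeof output.score === 'number' && Number.isFinite(output.score);
  if (!hasScore) {
    errors.push('Missing or invalid score field');
  }
  // Count words, not characters, to match reasoningMinWords in the contract.
  const reasoning = typeof output.reasoning === 'string' ? output.reasoning.trim() : '';
  const wordCount = reasoning === '' ? 0 : reasoning.split(/\s+/).length;
  if (wordCount < 15) {
    errors.push('Reasoning too short or missing');
  }
  if (!Array.isArray(output.matchedSkills)) {
    errors.push('Missing or invalid matchedSkills field');
  }

  // Range check (only meaningful once the score is known to be a number)
  if (hasScore && (output.score < 0 || output.score > 100)) {
    errors.push(`Score ${output.score} outside valid range`);
  }

  // Consistency check (case-insensitive)
  if (hasScore && output.score > 80 && reasoning.toLowerCase().includes('no relevant skills')) {
    errors.push('Score contradicts reasoning');
  }

  // Hallucination guard: every matched skill must appear in the input text
  const inputText = `${input.title} ${input.description}`.toLowerCase();
  const matchedSkills: string[] = Array.isArray(output.matchedSkills)
    ? output.matchedSkills
    : [];
  const fabricatedSkills = matchedSkills.filter(
    skill => !inputText.includes(skill.toLowerCase())
  );
  if (fabricatedSkills.length > 0) {
    errors.push(`Fabricated skills detected: ${fabricatedSkills.join(', ')}`);
  }

  return {
    passed: errors.length === 0,
    errors,
    score: output.score
  };
}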
This function runs on every single output. It rejects anything that doesn't pass. No exceptions. If the eval fails, the output doesn't reach the user.
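The contract also lists input-side rules (`emptyDescription: 'reject'`, `duplicateListing: 'deduplicate'`) that an output validator never gets to see. One way to enforce them is a pre-check that runs before the LLM call. This is a rough sketch under stated assumptions: `preCheckListing`, the `PreCheck` type, the `id` field, the 20-word threshold, and the exact-text duplicate key are all hypothetical choices, not the original system's behavior.

```typescript
// Minimal listing shape assumed for illustration; `id` is a hypothetical field.
interface JobListing {
  id: string;
  title: string;
  description: string;
}

type PreCheck =
  | { action: 'score' }
  | { action: 'reject'; reason: string }
  | { action: 'deduplicate'; duplicateOf: string };

// Assumed threshold: tune it against real listings.
const MIN_DESCRIPTION_WORDS = 20;

function preCheckListing(
  listing: JobListing,
  seen: Map<string, string> // normalized text -> id of the first listing seen
): PreCheck {
  const description = listing.description.trim();
  const wordCount = description === '' ? 0 : description.split(/\s+/).length;

  // Listings with little or no description cannot be scored meaningfully.
  if (wordCount < MIN_DESCRIPTION_WORDS) {
    return { action: 'reject', reason: `Description has only ${wordCount} words` };
  }

  // Naive duplicate detection: identical title + description after
  // lowercasing and collapsing whitespace. Near-duplicates are not caught.
  const key = `${listing.title} ${description}`.toLowerCase().replace(/\s+/g, ' ');
  const existingId = seen.get(key);
  if (existingId !== undefined) {
    return { action: 'deduplicate', duplicateOf: existingId };
  }

  seen.set(key, listing.id);
  return { action: 'score' };
}
```

Rejecting title-only listings here means the model is never asked to score them, which saves a call and removes the chance of a plausible-looking score slipping through on generic reasoning.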
What Happens When You Skip This Step
Suppose you skip the eval and ship directly. The first week looks fine. Then a recruiter searches for "React developer" and gets a listing for a Java backend role scored at 85. They click it, waste time, and lose trust in the platform.
That's a single bad output. The real cost is cumulative. Every bad output trains your users to ignore the AI feature. They stop trusting the scores. They stop using the filters. You've built a feature that actively degrades the user experience.
I've seen this pattern repeat across multiple projects. Teams ship an AI feature, it works on the happy path, then it quietly fails on edge cases until someone notices. The fix is always the same: add evaluation. But by then you're retrofitting guards onto a system that wasn't designed for them.
The Eval-First Workflow
Here's the workflow I use now. It takes longer upfront but saves weeks of debugging later.
- Write the eval contract before the prompt. Define valid output shape, ranges, and rejection criteria.
- Build the validation function. It should reject bad outputs automatically.
- Write the prompt against the eval. You know exactly what the output needs to look like.
- Test on 100 real examples, not 3. Run the eval on every one.
- Iterate the prompt until the eval passes on the vast majority of cases.
- Ship with the eval running in production. Log every rejection.
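The "test on 100 real examples" and "iterate" steps need a number to iterate against. A minimal sketch of how a batch of eval results could be summarized into a pass rate and a breakdown of failure reasons follows. `summarizeEvalRun` and `EvalSummary` are hypothetical names, and `EvalResult` here is a subset of the shape returned by the validation function above.

```typescript
// Subset of the result shape returned by validateScoringOutput.
interface EvalResult {
  passed: boolean;
  errors: string[];
}

interface EvalSummary {
  total: number;
  passed: number;
  passRate: number; // 0..1; reported as 0 for an empty run
  errorCounts: Record<string, number>;
}

function summarizeEvalRun(results: EvalResult[]): EvalSummary {
  const errorCounts: Record<string, number> = {};
  let passed = 0;

  for (const result of results) {
    if (result.passed) {
      passed += 1;
    }
    for (const error of result.errors) {
      // Group by the text before the first colon, so messages that differ
      // only in their detail ("Fabricated skills detected: go") share a bucket.
      const key = error.split(':', 1)[0] ?? error;
      errorCounts[key] = (errorCounts[key] ?? 0) + 1;
    }
  }

  const total = results.length;
  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
    errorCounts,
  };
}
```

The error breakdown is the useful part when iterating on a prompt: a pass rate says whether a change helped, while the counts say which failure mode to attack next.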
The key insight is that the eval isn't a testing step. It's a production guard. It runs on every output, every time. If the model drifts or a new edge case appears, the eval catches it before it reaches the user.
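As a rough illustration of the eval acting as a production guard, the wrapper below passes an output through only if validation succeeds and logs every rejection otherwise. `guardOutput` and `RejectionLogEntry` are hypothetical names. The `validate` parameter stands in for a function like `validateScoringOutput`, and `log` for whatever logging sink your system already has.

```typescript
// Subset of the result shape returned by validateScoringOutput.
interface EvalResult {
  passed: boolean;
  errors: string[];
}

interface RejectionLogEntry {
  listingId: string;
  errors: string[];
  at: string; // ISO 8601 timestamp
}

// Hypothetical guard: returns the output only when the eval passes.
function guardOutput<T>(
  listingId: string,
  output: T,
  validate: (candidate: T) => EvalResult,
  log: (entry: RejectionLogEntry) => void
): T | null {
  const result = validate(output);
  if (result.passed) {
    return output;
  }
  // Log every rejection so drift and new edge cases show up in the data.
  log({ listingId, errors: result.errors, at: new Date().toISOString() });
  return null; // callers must treat null as "no score available"
}
```

Returning `null` instead of throwing keeps one bad output from failing a whole batch, and the `T | null` return type forces each caller to decide explicitly what to show when no trustworthy score exists.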
Why This Matters for Founders
If you're building an AI feature, the quality of your output determines whether users trust it. A feature that works most of the time is worse than no feature at all, because the failures erode trust faster than the successes build it.
The eval-first approach forces you to define what "good" looks like before you start. It makes the failure modes explicit. And it gives you a mechanism to catch bad outputs in production, not just in testing.
If your team is shipping AI features and hitting quality issues in production, that's exactly the kind of thing I help with. Happy to compare notes on how to structure an eval pipeline for your specific use case.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.