You changed one word in your system prompt and now 30% of your outputs are garbage. You wouldn't know that, though, because you tested it by running three examples and thinking "yeah, looks fine." I've done this. More than once. On a Friday afternoon before a deploy.
When I started building an AI-powered dev tool, I quickly realized that manually checking AI outputs doesn't scale past about day three. You need evals. Not the academic kind with 47-page papers behind them. The practical kind that tell you "this prompt change made things better" or "this prompt change broke everything."
Why You Can't Just Unit Test AI
Traditional testing works because functions are deterministic. Same input, same output. Write an assertion, move on.
LLMs don't work that way. Ask GPT to summarize the same paragraph twice and you'll get two different summaries. Both might be correct. Or one might hallucinate a detail that wasn't in the source. Or both might be fine but one is 400 tokens longer than you wanted.
Say you're building a feature that extracts action items from meeting notes:
const prompt = `Extract action items from these meeting notes.
Return JSON array of strings.
Notes: "We need to migrate the database by Friday.
Sarah will handle the API docs. John to review the PR."`;
Run this ten times. You might get:
Run 1:
["Migrate the database by Friday", "Sarah will handle API docs", "John to review the PR"]
Run 2:
["Database migration by Friday", "Sarah: handle API documentation", "John: PR review"]
Run 3:
["Migrate database (deadline: Friday)", "Sarah - API docs", "John - review PR", "Team to coordinate migration"]
All three are arguably correct. But run 3 invented a fourth action item that wasn't in the notes. How do you catch that in a test? You can't write expect(result).toEqual(...) because the exact strings change every time.
Three Types of Evals
Before writing code, you need the mental model. There are three approaches, and you'll probably end up using all of them.
Rule-based evals check things you can verify deterministically. Did the output parse as valid JSON? Is it under 500 tokens? Does it contain forbidden words? Dumb but fast. They catch the dumbest failures, which happen to be the most embarrassing ones in production.
Then there are model-graded evals, which flip the script: you use a second LLM to judge the first one's output. Send the output plus a scoring rubric to a grading model and ask "on a scale of 1-5, how accurate is this?" Sounds circular. In practice, it works because the grading model uses a completely different prompt and doesn't share the same blind spots.
And finally, human evals. You or someone on your team looks at outputs and scores them manually. Slowest by far, but nothing else catches subtle quality issues the way a human can. Use these to calibrate your automated evals, not as your day-to-day testing method.
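Calibration can be as simple as checking whether your automated scores correlate with a handful of human scores on the same outputs. A minimal sketch (the Pearson helper and the sample score arrays are illustrative, not from any library):

```typescript
// Hypothetical calibration check: do automated eval scores track human judgment?
function pearson(xs: number[], ys: number[]): number {
  const mean = (a: number[]) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Scores for the same five outputs, both normalized to 0-1 (made-up numbers)
const humanScores = [0.9, 0.4, 0.8, 0.2, 0.7];
const autoScores = [0.85, 0.5, 0.75, 0.3, 0.6];

const r = pearson(humanScores, autoScores);
console.log(`correlation: ${r.toFixed(2)}`);
// If r is low, your automated evals aren't measuring what humans care about.
```

If the correlation is weak, fix the rubric or the rule checks before trusting them in CI.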
This post focuses on the first two. Those are the ones you can automate and run in CI.
Rule-Based Evals: The Quick Wins
These are the if statements of the eval world. They're unglamorous and they'll save you from the dumbest bugs. They catch the cases where your LLM returns something structurally wrong before you even need to think about whether the content is good.
Start with the obvious checks:
interface EvalResult {
  name: string;
  passed: boolean;
  score: number; // 0 to 1
  detail?: string;
}

function evalValidJson(output: string): EvalResult {
  try {
    JSON.parse(output);
    return { name: "valid-json", passed: true, score: 1 };
  } catch (e) {
    return {
      name: "valid-json",
      passed: false,
      score: 0,
      detail: `Parse error: ${(e as Error).message}`,
    };
  }
}

function evalMaxLength(output: string, maxTokens: number): EvalResult {
  // rough approximation: 1 token ≈ 4 chars for English
  const estimatedTokens = Math.ceil(output.length / 4);
  const passed = estimatedTokens <= maxTokens;
  return {
    name: "max-length",
    passed,
    score: passed ? 1 : Math.max(0, 1 - (estimatedTokens - maxTokens) / maxTokens),
    detail: `${estimatedTokens} estimated tokens (limit: ${maxTokens})`,
  };
}
Those are trivial, but they catch real problems. I've had an LLM return a markdown code fence wrapped around JSON instead of raw JSON, which broke my parser downstream. A rule-based eval would've flagged that instantly.
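That fenced-JSON failure is easy to guard against with one more tiny rule eval. A sketch (the function name and shape are mine, mirroring the checks above):

```typescript
// Hypothetical rule eval: flag output that wraps its JSON in a markdown code fence
// instead of returning raw JSON. Returns an EvalResult-shaped object.
const FENCE_RE = /^\s*`{3}(?:json)?\s*([\s\S]*?)\s*`{3}\s*$/;

function evalNoCodeFence(output: string) {
  const fenced = FENCE_RE.test(output);
  return {
    name: "no-code-fence",
    passed: !fenced,
    score: fenced ? 0 : 1,
    detail: fenced ? "Output is wrapped in a markdown code fence" : undefined,
  };
}
```

You could also use the captured group to strip the fence and recover the payload, but in an eval you usually want to fail loudly rather than silently repair.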
Now something more useful. Say your action-item extractor should never invent items that aren't in the source text:
function evalNoHallucinations(
  output: string[],
  sourceText: string
): EvalResult {
  const sourceLower = sourceText.toLowerCase();
  const hallucinated = output.filter((item) => {
    // check if key nouns from the action item appear in source
    const words = item
      .toLowerCase()
      .split(/\s+/)
      .filter((w) => w.length > 3);
    const matchRatio =
      words.filter((w) => sourceLower.includes(w)).length / words.length;
    return matchRatio < 0.5; // less than half the words found in source
  });
  // guard against empty output so the score doesn't become NaN
  const score =
    output.length === 0 ? 1 : 1 - hallucinated.length / output.length;
  return {
    name: "no-hallucinations",
    passed: hallucinated.length === 0,
    score,
    detail:
      hallucinated.length > 0
        ? `Possibly hallucinated: ${JSON.stringify(hallucinated)}`
        : undefined,
  };
}
Is this perfect? No. It's a rough heuristic based on word overlap. But it catches the obvious cases where the model invents action items out of thin air, and it runs in milliseconds. You'd be surprised how many problems a dumb string-matching check catches before you need anything smarter.
Model-Graded Evals: Using AI to Judge AI
Send the original input, the LLM's output, and a scoring rubric to a different model. Ask it to grade the output.
I know. Using AI to evaluate AI sounds like asking a student to grade their own exam. But you're using a different prompt (the rubric) and typically a different model, so the grading model doesn't share the same failure modes as the one that generated the output. It's more like asking a different student to grade the exam with an answer key.
The implementation:
import OpenAI from "openai";

const openai = new OpenAI();

interface GradingCriteria {
  name: string;
  description: string;
  scale: string; // e.g., "1-5 where 1 is terrible and 5 is perfect"
}

async function modelGradedEval(
  input: string,
  output: string,
  criteria: GradingCriteria
): Promise<EvalResult> {
  const gradingPrompt = `You are evaluating an AI assistant's output.
ORIGINAL INPUT:
${input}
AI OUTPUT:
${output}
CRITERIA: ${criteria.description}
SCALE: ${criteria.scale}
Respond with ONLY a JSON object: {"score": <number>, "reason": "<brief explanation>"}
Do not include any other text.`;

  const response = await openai.chat.completions.create({
    model: "gpt-4.1-mini", // cheaper model is fine for grading
    messages: [{ role: "user", content: gradingPrompt }],
    temperature: 0, // we want consistent grading
  });

  const content = response.choices[0].message.content ?? "";
  try {
    const parsed = JSON.parse(content);
    const normalized = parsed.score / 5; // normalize to 0-1 (assumes a 1-5 scale)
    return {
      name: criteria.name,
      passed: normalized >= 0.6,
      score: normalized,
      detail: parsed.reason,
    };
  } catch {
    return {
      name: criteria.name,
      passed: false,
      score: 0,
      detail: `Grading model returned unparseable response: ${content}`,
    };
  }
}
Two things to notice. First, temperature: 0. You want the grading model to be as consistent as possible. Non-deterministic grading defeats the purpose. Second, I'm using gpt-4.1-mini for grading, not the most expensive model available. You'll run these evals hundreds of times. With 30 test cases and 3 grading criteria, that's 90 LLM calls per run. At mini pricing it's pennies, but use a big model for grading and you'll feel it on your bill fast.
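Ninety sequential grading calls is also slow. A small concurrency limiter keeps eval runs fast without hammering rate limits. A sketch (mapWithConcurrency is my own helper, not part of the OpenAI SDK):

```typescript
// Hypothetical helper: run async jobs with at most `limit` in flight at once
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker() {
    // workers pull indices from a shared counter until the queue is empty
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.max(1, Math.min(limit, items.length)) }, worker)
  );
  return results;
}

// e.g. grade 90 (input, output) pairs, 5 at a time:
// const graded = await mapWithConcurrency(pairs, 5, (p) =>
//   modelGradedEval(p.input, p.output, accuracyCriteria)
// );
```

The shared-counter trick works because JavaScript is single-threaded: there is no await between reading and incrementing `next`, so two workers never grab the same index.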
Using it for the action items example:
const accuracyCriteria: GradingCriteria = {
  name: "accuracy",
  description:
    "Are the extracted action items accurate and present in the source text? Are any action items missing? Are any hallucinated?",
  scale:
    "1-5 where 1 means major hallucinations or missing items, 5 means perfectly accurate extraction",
};

const clarityCriteria: GradingCriteria = {
  name: "clarity",
  description:
    "Are the action items written clearly and concisely? Would a team member understand exactly what to do from each item?",
  scale:
    "1-5 where 1 is vague and confusing, 5 is crystal clear with owner and deadline when available",
};

// Run both evals
const accuracyResult = await modelGradedEval(meetingNotes, output, accuracyCriteria);
const clarityResult = await modelGradedEval(meetingNotes, output, clarityCriteria);
The rubric is everything. A vague rubric ("is this good?") gives you vague scores. A specific rubric ("are action items accurate, complete, and attributed to the correct person?") gives you scores you can actually compare across prompt versions.
Putting It Together: An Eval Runner
Now let's wire rule-based and model-graded evals into a single pipeline. Give it a set of test cases, a prompt, and your eval functions. It runs everything and spits out a score report.
import OpenAI from "openai";

const openai = new OpenAI();

interface TestCase {
  input: string;
  // optional: expected output for comparison
  expected?: string;
}

interface EvalSuite {
  name: string;
  systemPrompt: string;
  testCases: TestCase[];
  ruleEvals: ((output: string, testCase: TestCase) => EvalResult)[];
  modelEvals: GradingCriteria[];
}

async function runEvalSuite(suite: EvalSuite) {
  console.log(`\n=== Running eval suite: ${suite.name} ===\n`);
  const allResults: { input: string; evals: EvalResult[] }[] = [];

  for (const testCase of suite.testCases) {
    const response = await openai.chat.completions.create({
      model: "gpt-4.1",
      messages: [
        { role: "system", content: suite.systemPrompt },
        { role: "user", content: testCase.input },
      ],
    });
    const output = response.choices[0].message.content ?? "";
    const evals: EvalResult[] = [];

    // run rule-based evals
    for (const ruleFn of suite.ruleEvals) {
      evals.push(ruleFn(output, testCase));
    }

    // run model-graded evals
    for (const criteria of suite.modelEvals) {
      evals.push(await modelGradedEval(testCase.input, output, criteria));
    }

    allResults.push({ input: testCase.input.slice(0, 80), evals });
  }

  // print results
  for (const result of allResults) {
    console.log(`Input: "${result.input}..."`);
    for (const ev of result.evals) {
      const status = ev.passed ? "PASS" : "FAIL";
      console.log(`  [${status}] ${ev.name}: ${ev.score.toFixed(2)} ${ev.detail ?? ""}`);
    }
    console.log();
  }

  // summary
  const allEvals = allResults.flatMap((r) => r.evals);
  const avgScore = allEvals.reduce((sum, e) => sum + e.score, 0) / allEvals.length;
  const passRate = allEvals.filter((e) => e.passed).length / allEvals.length;
  console.log(`--- Summary ---`);
  console.log(`Total test cases: ${suite.testCases.length}`);
  console.log(`Average score: ${avgScore.toFixed(2)}`);
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
}
And here's how you'd wire it up:
const actionItemSuite: EvalSuite = {
  name: "action-item-extraction-v2",
  systemPrompt: `Extract action items from meeting notes.
Return a JSON array of strings. Each item should include
the owner's name and deadline if mentioned.`,
  testCases: [
    {
      input: `We need to migrate the database by Friday.
Sarah will handle the API docs. John to review the PR.`,
    },
    {
      input: `No decisions were made. The team discussed
the roadmap but agreed to revisit next week.`,
    },
    {
      input: `Alice will deploy v2.3 to staging by EOD Tuesday.
Bob needs to fix the auth bug before launch.
Carol is blocked on the design review — needs input from Dave.`,
    },
  ],
  ruleEvals: [
    (output) => evalValidJson(output),
    (output) => evalMaxLength(output, 200),
  ],
  modelEvals: [accuracyCriteria, clarityCriteria],
};

await runEvalSuite(actionItemSuite);
That second test case is sneaky. When there are no real action items, a lot of prompts will hallucinate some anyway ("Team to revisit roadmap"). Your eval catches that.
Run this pipeline against prompt version A, tweak the prompt, run it again against version B. Now you have numbers. Not vibes. Numbers.
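Eyeballing two console dumps gets old fast, so it's worth diffing the summaries directly. A sketch (compareRuns and the minimal EvalScore shape are my own, not part of the runner above):

```typescript
interface EvalScore {
  name: string;
  score: number; // 0 to 1
}

// Hypothetical diff: average score per eval name, version A vs version B
function compareRuns(a: EvalScore[], b: EvalScore[]): Record<string, number> {
  const avgByName = (results: EvalScore[]) => {
    const sums = new Map<string, { total: number; count: number }>();
    for (const r of results) {
      const e = sums.get(r.name) ?? { total: 0, count: 0 };
      e.total += r.score;
      e.count += 1;
      sums.set(r.name, e);
    }
    return sums;
  };
  const avgA = avgByName(a);
  const deltas: Record<string, number> = {};
  for (const [name, { total, count }] of avgByName(b)) {
    const prev = avgA.get(name);
    const prevAvg = prev ? prev.total / prev.count : 0;
    deltas[name] = total / count - prevAvg; // positive = version B improved
  }
  return deltas;
}
```

Feed it the flattened eval results from two runs and you get a per-criterion delta: "accuracy +0.15, clarity -0.05" tells you far more than a single overall number.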
Gotchas I Learned the Hard Way
Your eval dataset is too small. Three test cases isn't an eval suite. It's a smoke test. You need at least 20-30 diverse cases to get real signal. Include edge cases: empty input, very long input, ambiguous input, input in a different language if your users might send that. Building the dataset is the boring part. It's also where most of the value comes from.
Version your eval prompts. Seriously. Your grading rubric is code, and when you change the rubric, scores shift even if output quality stayed the same. I keep my eval criteria in version-controlled files next to the prompts they grade. When someone asks "why did our accuracy score drop on March 15th," a quick git log tells me if we changed the grading or if the model actually got worse.
Watch out for factual blind spots in model-graded evals. LLMs are surprisingly bad at catching subtle factual errors. If your output says "Python was created in 1989" (it was 1991), a model-graded eval probably won't flag it. For factual accuracy, you need rule-based checks against a known-good reference when you have one.
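When you do have a known-good reference, encode it as a rule eval rather than trusting the grader. A sketch (evalRequiredFacts is illustrative, following the same shape as the rule evals earlier):

```typescript
// Hypothetical rule eval: every known-good fact string must appear in the output.
// Returns an EvalResult-shaped object.
function evalRequiredFacts(output: string, requiredFacts: string[]) {
  const lower = output.toLowerCase();
  const missing = requiredFacts.filter((f) => !lower.includes(f.toLowerCase()));
  return {
    name: "required-facts",
    passed: missing.length === 0,
    score:
      requiredFacts.length === 0
        ? 1
        : 1 - missing.length / requiredFacts.length,
    detail: missing.length > 0 ? `Missing facts: ${missing.join(", ")}` : undefined,
  };
}

// e.g. require "1991" in any output about Python's creation date:
// evalRequiredFacts(output, ["1991"]);
```

Substring matching is crude, but for dates, version numbers, and names it beats a grading model that happily waves "1989" through.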
And finally: run evals continuously, not once. An eval suite that runs once is a demo. An eval suite that runs on every prompt change in CI is a quality gate. If the pass rate drops below your threshold, block the deploy. Treat prompt changes the same way you treat code changes.
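The gate itself is a few lines. A sketch, assuming you extend runEvalSuite to return its pass rate (the version above only logs it, so treat checkGate as illustrative):

```typescript
// Hypothetical CI gate: fail the build when the pass rate drops below a threshold
function checkGate(passRate: number, threshold: number): boolean {
  const ok = passRate >= threshold;
  console.log(
    ok
      ? `Gate passed: ${(passRate * 100).toFixed(1)}% >= ${(threshold * 100).toFixed(1)}%`
      : `Gate FAILED: ${(passRate * 100).toFixed(1)}% < ${(threshold * 100).toFixed(1)}%`
  );
  return ok;
}

// In CI, exit nonzero so the pipeline blocks the deploy:
// if (!checkGate(passRate, 0.85)) process.exit(1);
```

A nonzero exit code is all GitHub Actions (or any CI system) needs to mark the job red and stop the deploy.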
Bonus: a cheap trick for bootstrapping your test dataset
If you don't have 30 real examples yet, use your production logs. Pull the last 100 real inputs your system received, run them through your eval pipeline, and manually check the 10 lowest-scoring outputs. Those 10 become your first golden test cases. Repeat weekly and your dataset grows from real usage, not made-up examples.
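Picking those candidates is a one-liner once you have scored outputs. A sketch (the ScoredLog shape is assumed, not something the pipeline above produces directly):

```typescript
interface ScoredLog {
  input: string;
  output: string;
  score: number; // average eval score, 0 to 1
}

// Hypothetical: take the N worst-scoring production logs as golden-test candidates
function worstCases(logs: ScoredLog[], n: number): ScoredLog[] {
  // copy before sorting so the original log array is left untouched
  return [...logs].sort((a, b) => a.score - b.score).slice(0, n);
}
```

Review those by hand, write down what the correct output should have been, and you have real, adversarial test cases instead of invented ones.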
What's Next
You now have enough to build a real eval pipeline. Not a production-grade MLOps platform, but something that actually tells you whether prompt change X made things better or worse. That's the 80/20.
If you want to go deeper, look at running evals as part of CI (GitHub Actions works fine), tracking scores over time in a spreadsheet or database, and A/B testing prompts in production with eval scores as the metric instead of user clicks.
I've been building these patterns into my workflow while working on Hermes IDE (GitHub). Evals changed how I think about shipping AI features. They went from "I think this works" to "I can show you the numbers." If you're building AI into your product and you don't have evals yet, start with the rule-based ones this week. They take 20 minutes and the first time they catch a broken output before your users do, you'll be sold.
What's your eval setup look like? Still vibes-checking, or have you built something more structured? I'm always curious how other teams handle this.