Akram Bakhouche

Posted on May 28 • Originally published at bak-dev.com

How I evaluate Claude SDK features before shipping them to production

#claudesdk #aiagents #testing #promptengineering

The riskiest line of code in your codebase is not the database transaction or
the third-party API call. It's the LLM prompt — because when it silently
regresses, nothing visible breaks. The endpoint still returns 200. The
JSON still parses. The test suite is still green. The model just started giving
worse answers, and you find out from a customer.

I have shipped enough Claude SDK features in production to be afraid of this.
Here is the eval harness pattern I now ship with every LLM-powered feature.
It takes a few hours to build per feature and has saved me from at least
three subtle regressions in Career-OS alone.

The risk you can't see

A normal regression looks like this: you change a function, a test fails,
you fix it.

An LLM regression looks like this: you tweak the system prompt to fix a tone
issue on Monday. The fix works. Three weeks later, your scores have drifted down by
12% on a class of inputs you didn't think about. No test failed. No alert
fired. You only notice because a downstream metric (apply-rate, conversion,
support-ticket close time) is off, and that's if you're watching.

The four common silent regressions I have personally hit:

Tweaking the system prompt broke a tone you weren't testing. You added "be friendly" and lost "be precise about prices."
A new Anthropic model version (e.g. Sonnet 4.5 → 4.6) was calibrated differently. Same prompt, same input, and the score moved by 8 points.
A library upgrade silently dropped a parameter (max_tokens, temperature). Defaults kicked in. Outputs got longer and looser.
Cache invalidation drift. You added a field to the system prompt; the cache hash changed; cost per call jumped 6× before you noticed.

All four are invisible to unit tests. All four are caught by a single eval
harness that takes a few hours to build.
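
The first three categories show up as score drift. The fourth is a cost problem, so it needs a check on token usage, not on the output. A minimal sketch, assuming you keep the `usage` block that the Anthropic Messages API returns with each response (`input_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); the function names and the 0.5 threshold are illustrative and not taken from Career-OS:

```python
def cache_hit_ratio(usage: dict) -> float:
    """Share of prompt tokens served from the prompt cache (0.0 to 1.0)."""
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    total = uncached + written + read
    return read / total if total else 0.0


def cache_regressed(usages: list[dict], min_ratio: float = 0.5) -> bool:
    """True when the average hit ratio across a run falls below min_ratio."""
    if not usages:
        return False
    avg = sum(cache_hit_ratio(u) for u in usages) / len(usages)
    return avg < min_ratio
```

Averaging over the whole run matters because the first call after a prompt change always writes the cache and reads nothing from it; a single cold call should not fail the build, but a run where every call is cold should.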

The pattern

You need three things:

A fixtures file — frozen inputs with expected outputs.
A scorer — measures how far each actual output drifted from expected.
A runner — replays every fixture through the live model, fails the build if any fixture drifted past tolerance.

That's it. No fancy framework. Anything more is YAGNI.

Step 1 — Fixtures

Pick 8–15 hand-labeled cases. Cover the edge cases you actually care about,
not just the happy path. For a job-fit scorer, mine looks like:

// tests/fixtures/scored_jobs.jsonl

{"id":"f001","input":{...},"expected":{"fit":88,"angle":"strong fullstack + AI"},"tolerance":{"fit":5}}
{"id":"f002","input":{...},"expected":{"fit":35,"angle":"junior role, decline"},"tolerance":{"fit":5}}
{"id":"f003","input":{...},"expected":{"fit":72,"angle":"good freelance fit"},"tolerance":{"fit":7}}
{"id":"f004","input":{...},"expected":{"fit":15,"angle":"crypto, disqualify"},"tolerance":{"fit":3}}
// …

Per-fixture tolerance is the move I most often see skipped. Different inputs
have different acceptable variance. A clear "strong fit" might tolerate ±5
points. A clear "disqualify" should be tight — ±3 — because we want certainty
on rejection. A borderline case might tolerate ±10 because it's genuinely
ambiguous.

Label them yourself. Don't generate fixtures with another LLM. The whole
point is that you, the human, know what right looks like.
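
Hand-labeled files pick up typos, so it can help to validate the fixtures before spending API calls on them. A sketch of one way to do that, using the field names from the fixture lines above (`id`, `input`, `expected`, `tolerance`); `load_fixtures` is a hypothetical helper, and it assumes the real file contains only JSON records, since JSON has no comment syntax:

```python
import json

REQUIRED_KEYS = {"id", "input", "expected", "tolerance"}


def load_fixtures(text: str) -> list[dict]:
    """Parse JSONL fixture text, rejecting incomplete or duplicate entries."""
    fixtures, seen = [], set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are allowed between records
        fx = json.loads(line)
        missing = REQUIRED_KEYS - fx.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
        if fx["id"] in seen:
            raise ValueError(f"line {lineno}: duplicate id {fx['id']!r}")
        seen.add(fx["id"])
        fixtures.append(fx)
    return fixtures
```

Failing on a missing `tolerance` is deliberate here: a fixture that silently falls back to a default tolerance defeats the per-fixture idea.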

Step 2 — Scorer

For structured-output features, the scorer is plain arithmetic.

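# tests/eval/score_fit.py

import re
from dataclasses import dataclass

# EvalResult, tokenize, and jaccard are not shown in the post; these are minimal stand-ins.
@dataclass
class EvalResult:
    passed: bool
    fit_delta: float = 0.0
    angle_overlap: float = 0.0

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if (a or b) else 1.0

def score(actual: dict, expected: dict, tolerance: dict) -> EvalResult:
    fit_delta = abs(actual["fit"] - expected["fit"])
    fit_ok = fit_delta <= tolerance.get("fit", 5)  # per-fixture tolerance, default ±5

    angle_overlap = jaccard(
        tokenize(actual["angle"]),
        tokenize(expected["angle"]),
    )
    angle_ok = angle_overlap >= 0.3  # loose word overlap, not an exact match

    return EvalResult(
        passed=fit_ok and angle_ok,
        fit_delta=fit_delta,
        angle_overlap=angle_overlap,
    )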

For free-form outputs (cover letters, descriptions), don't try to match the
text exactly. Score for constraints:

Did it use the requested language?
Did it stay under the word count?
Did it mention every required fact?
Did it avoid the forbidden phrases?

These are deterministic checks against the output text. They're easier to
write than people fear. For Career-OS's outreach drafter:

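# language_matches and word_count are project helpers that are not shown here.
def score_outreach(actual: str, expected: dict) -> EvalResult:
    lowered = actual.lower()
    return EvalResult(
        passed=all([
            language_matches(actual, expected["language"]),
            word_count(actual) <= expected["max_words"],
            all(fact in actual for fact in expected["required_facts"]),
            # lowercase both sides, or a capitalized forbidden phrase never matches
            not any(phrase.lower() in lowered for phrase in expected["forbidden_phrases"]),
        ]),
    )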
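
A single `passed` flag says that a fixture failed but not which constraint broke. One possible variant, sketched here with hypothetical names (`check_outreach`, `failed_checks`), returns one named flag per constraint so the summary can point at the cause. The language check is left out because it depends on a detector the post does not show:

```python
def check_outreach(actual: str, expected: dict) -> dict[str, bool]:
    """Return one named pass/fail flag per constraint."""
    lowered = actual.lower()
    return {
        # whitespace-separated word count
        "max_words": len(actual.split()) <= expected["max_words"],
        # facts are matched exactly, as in the scorer above
        "required_facts": all(f in actual for f in expected["required_facts"]),
        # forbidden phrases are matched case-insensitively
        "forbidden_phrases": not any(
            p.lower() in lowered for p in expected["forbidden_phrases"]
        ),
    }


def failed_checks(checks: dict[str, bool]) -> list[str]:
    """Names of the constraints that did not hold, in insertion order."""
    return [name for name, ok in checks.items() if not ok]
```

The overall verdict is still `all(checks.values())`, so this drops into the same pass/fail gate while giving a failing build something more specific to print.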

You can also do an "LLM-as-judge" pattern for fuzzy criteria — but only as a
last resort. It's slower, more expensive, and adds another silent-regression
surface (the judge model can drift too). Use deterministic checks whenever
the criterion is decidable.

Step 3 — Runner

A 30-line script that loops fixtures, calls the feature live, scores each one,
prints a summary table.

# tests/eval/run.py

import json
import sys
from pathlib import Path

from rich.console import Console
from rich.table import Table

from career_os.scorer import score_job
from tests.eval.score_fit import score as eval_one

FIXTURES = Path("tests/fixtures/scored_jobs.jsonl")

console = Console()
fixtures = [json.loads(line) for line in FIXTURES.read_text().splitlines() if line.strip()]

table = Table(title="Eval results")
for column in ("id", "expected_fit", "actual_fit", "delta", "status"):
    table.add_column(column)

failed = 0
for fx in fixtures:
    actual = score_job(fx["input"])  # the live call
    # .dict() is the Pydantic v1 spelling; if the result is a Pydantic v2 model, use .model_dump()
    res = eval_one(actual.dict(), fx["expected"], fx["tolerance"])
    if not res.passed:
        failed += 1
    status = "[green]PASS" if res.passed else "[red]FAIL"
    table.add_row(fx["id"], str(fx["expected"]["fit"]), str(actual.fit), str(res.fit_delta), status)

console.print(table)
if failed:
    console.print(f"\n[red]{failed}[/red] failed of {len(fixtures)}")
else:
    console.print(f"[green]all {len(fixtures)} passed[/green]")
sys.exit(1 if failed else 0)  # a non-zero exit code fails the CI job

The exit(1 if failed else 0) is what makes this an actual test, not just a
debugging tool. Wire it into CI.

Wiring it into CI

# .github/workflows/eval.yml

name: eval
on:
  pull_request:
    paths: ['src/career_os/scorer/**', 'tests/fixtures/**']
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - run: pip install -e ".[dev]"
      - run: python -m tests.eval.run  # module form, so the tests.eval imports resolve from the repo root
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

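This runs the eval on every PR that touches the scorer or the fixtures. It
costs a few cents per run. Score drift from a prompt change is caught
directly. To cover the other categories with the same job, add your
dependency manifest to the trigger paths, run the workflow manually
(workflow_dispatch) before bumping the model, and track token usage
alongside the scores.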

When to extend the harness

You will hit cases the eval can't catch. That's fine: every miss becomes a
new fixture. The discipline:

Bug reported in production? Add a fixture that reproduces it, then fix it. The eval now catches that regression next time.
You hand-tuned the prompt for 3 hours? Save 5 of those tuning cases as fixtures with the now-correct expected outputs.
Anthropic ships a new model? Run the eval against the new model before bumping it in production. The delta tells you whether the new model is a free upgrade or a calibration redo.
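
The model comparison in the last item can be reduced to a few numbers. A rough sketch, assuming you have already run the fixtures once per model and collected the `fit` score per fixture id; `compare_models` and its output keys are hypothetical names, not part of Career-OS:

```python
from statistics import mean


def compare_models(expected: dict[str, int],
                   baseline: dict[str, int],
                   candidate: dict[str, int]) -> dict[str, float]:
    """Summarise how a candidate model's fit scores shift against a baseline.

    Each argument maps fixture id -> fit score.
    """
    ids = sorted(expected)
    shift = [candidate[i] - baseline[i] for i in ids]  # signed move per fixture
    base_err = [abs(baseline[i] - expected[i]) for i in ids]
    cand_err = [abs(candidate[i] - expected[i]) for i in ids]
    return {
        "mean_shift": mean(shift),       # a consistent sign suggests recalibration
        "baseline_mae": mean(base_err),  # mean absolute error against the labels
        "candidate_mae": mean(cand_err),
    }
```

A candidate error close to the baseline error with a mean shift near zero reads as a free upgrade. A large shift in one direction points to a calibration change, which usually means revisiting the prompt before the bump, not widening tolerances.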

The fixtures grow with the system. After a year, you have a regression suite
that encodes most of what you know about how the feature should behave —
and any future engineer (or future-you, three months from now) can change
the prompt with confidence because the eval will catch them if they break it.

What this is worth to a client

Most teams shipping LLM features today do not have this. They ship a Claude
call, watch metrics, hope. When the system slowly degrades, they can't tell
whether it was the prompt change, the model bump, the user-input distribution
shift, or something else entirely. Triage time on a silent LLM regression in
production is measured in days.

A 4-hour investment in an eval harness collapses that to minutes. If you
ship a Claude feature without one, you are building technical debt you can't
see and can't measure.

If you have a Laravel / PrestaShop / Python app shipping an AI feature and
nobody has built the eval harness for it yet, that is a 1-week scoped
engagement I take on. The shape is on the hire-me page.

For the full architecture context where this pattern lives, including the
fixture file format and the eval runner, see the Career-OS architecture
walkthrough. The same discipline applies to the Laravel patterns in "5 places
to bolt AI" and the PrestaShop module in "the 5-file pattern": every Claude
call in those posts is a place where the eval harness pattern earns its keep.


Top comments (2)

Adam Lewis • May 30

Per-fixture tolerance is the one I see most teams skip when they first build an eval suite. A rejection case at fit=15 with a tolerance of ±3 is doing real work. The same suite running ±10 across the board tells you almost nothing about whether anything moved.

The "label them yourself" rule lands the same way for me. The moment another model writes the expected, the harness is measuring agreement between two models, which is a different question to the one you wanted answered. Same applies to using the production model as its own judge for the borderline cases.

Harjot Singh • May 31

Having an actual evaluation gate before a new SDK feature hits production is the discipline most teams skip, and it's exactly why their AI features are flaky - they ship the demo. A new model capability or SDK feature can look great in a notebook and quietly regress your real workload (latency, cost, edge-case behavior, output format drift), so a repeatable eval harness with representative cases, a cost/latency budget, and a regression baseline is what separates "we tried it" from "we shipped it safely." The hard part is having good eval cases that reflect production, not cherry-picked happy paths.

This is squarely the worldview I build from - don't trust a capability, verify it against your real cases before you rely on it. It's core to Moonshift, the thing I work on: a multi-agent pipeline that takes a prompt to a deployed SaaS, where a verify layer evaluates each step against expected behavior rather than trusting the model's output. Same instinct as your pre-prod eval, just baked into the runtime. Multi-model routing keeps a build ~$3 flat, first run's free no card. Really aligned post. What does your eval set look like - frozen golden cases with assertions, or LLM-as-judge scoring? And do you gate on cost/latency regression too, or mostly correctness? The cost regression is the one that silently kills you at scale.