Most LLM apps ship with zero evals because engineers confuse demos with quality
Three eval types matter: regression, rubric-based, and live production sampling
LLM-as-judge works if you version the grader prompt and audit it monthly
The 2026 toolchain is promptfoo for local, Braintrust for team, PostHog for live traffic
Run a 30-example golden set in CI and sample 1 percent of production for under 5 EUR a month
I shipped my first Claude-powered feature in 2024 with zero evals. It looked great in the demo and broke in production the same week. A prompt tweak that helped one case silently regressed five others. I only found out when a user emailed me a screenshot.
Running LLM evals in production is the difference between a prototype and a product. Most teams skip evals because the blog posts make it sound like a six-month research project. It does not have to be. This is the setup I use on every RAXXO tool that touches a language model, and it works just as well for a solo indie dev or a small team.
Why LLM Evals Matter More in 2026 Than They Did in 2024
In 2024 you could get away with vibes. Models were weaker, prompts were shorter, and features were simpler. A human could eyeball ten outputs and call it a day. That does not scale to 2026.
Three things changed. Models got better at looking correct while being wrong. Agents started chaining five or six calls together, so a small error compounds. And the pricing collapsed, which means you ship more LLM features, which means you have more surface area to regress.
I have a rule now. If a feature uses an LLM and I cannot show a scored evaluation of its quality, the feature is not shippable. Not because I am a purist, but because every time I skipped this rule I got burned. A prompt change that felt smart at 2am broke three customer-facing flows by morning. I had no way to know because I had no baseline.
The distinction is simple. Running LLM evals in production is not testing. Tests check that code runs. Evals check that output is good. Those are different problems and they need different tools.
The Three Eval Types Every LLM App Needs
There is a lot of taxonomy in the eval world. Pairwise comparisons, reference-based, reference-free, heuristic, model-graded, human-graded, the list keeps growing. Ignore 80 percent of it. In practice you need three things.
Regression evals. A fixed set of 20 to 50 example inputs with expected outputs or expected behaviors. You run this every time you change the prompt, the model, or the parsing logic. If the score drops, you know before you ship. My regression set for the RAXXO content generator has 34 examples. It takes 90 seconds to run and costs about 8 cents per run with Claude Haiku. I run it on every pull request.
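In promptfoo terms, a regression set is just a YAML file of test cases checked into the repo. A minimal sketch, where the prompt file, model id, and assertions are placeholders for your own:

```yaml
# promptfooconfig.yaml -- minimal regression-set sketch.
# File paths, the model id, and the test cases are all placeholders.
prompts:
  - file://prompts/summarize.txt        # your prompt, with {{input}} vars
providers:
  - anthropic:messages:claude-haiku-4-5 # whatever Haiku id you run
tests:
  - vars:
      input: "Customer asks for a refund after 45 days."
    assert:
      - type: icontains
        value: refund
  - vars:
      input: "Customer reports the export button does nothing."
    assert:
      - type: llm-rubric
        value: Acknowledges the bug and offers a concrete next step.
```

Each entry is one fixed input plus the behavior you expect; the set only grows when a real failure teaches you a new expectation.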
Rubric-based evals. For open-ended outputs where there is no single right answer, you write a rubric. Things like "does the response stay in first-person voice," "does the response cite a specific number," "is the response under 150 words." A grader model reads the output and scores each criterion from 0 to 1. This is how I catch voice drift in the humanizer skill. If the rubric score for "sounds human" drops below 0.8, the prompt gets reverted.
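The mechanics of rubric scoring fit in a few lines. In this sketch, ask_grader is a stub standing in for a real grader-model call; only the word-count criterion is checked directly, and the model-graded criteria are hard-coded to pass for illustration:

```python
# Binary rubric scoring sketch. ask_grader() is a stub standing in for
# a real grader-model call; everything else is plain Python.
RUBRIC = [
    "stays in first-person voice",
    "cites at least one specific number",
    "is under 150 words",
]

def ask_grader(output: str, criterion: str) -> int:
    """Stub: a real version would prompt the grader model to answer
    yes/no for this criterion and map the answer to 1/0."""
    checks = {
        "is under 150 words": int(len(output.split()) < 150),
    }
    # Pretend the model-graded criteria pass, for this sketch only.
    return checks.get(criterion, 1)

def rubric_score(output: str) -> float:
    """Fraction of rubric criteria passed, 0.0 to 1.0."""
    passed = sum(ask_grader(output, c) for c in RUBRIC)
    return passed / len(RUBRIC)
```

The key design choice is the return type: each criterion is a 0 or a 1, and the score is the sum divided by the count, so a drop from 1.0 to 0.67 tells you exactly one criterion broke.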
Production sampling. Real traffic is where the weird cases live. I sample 1 percent of production calls, run them through a grader, and log the score to PostHog. Once a week I look at the lowest-scoring outputs. That is where I find the failure modes I never thought to write a regression for. Every one of those becomes a new regression example, and the set grows.
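One cheap way to pick that 1 percent is to hash a request id instead of rolling a random number, so retries and replays of the same request always get the same sampling decision. A sketch, assuming request_id is whatever correlation id you already log:

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: hash the request id into [0, 1) and
    compare against the rate. Same id always gives the same answer,
    which keeps retries and replays consistent."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Calls that pass the check get queued for grading; everything else is logged and skipped.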
If you only do one of the three, do regression. If you can afford two, add production sampling. Rubric evals come third because they need the most work to design properly.
LLM-as-Judge: The Trick Everyone Gets Wrong
The default way to score open-ended outputs in 2026 is to ask another LLM to grade them. This is fine. It is also where most teams mess up.
The mistake is treating the grader prompt like a throwaway. People write "rate this response from 1 to 10" and move on. Then six months later the scores drift, the prompt changed three times, and nobody can explain why the quality chart is going up when users are clearly complaining more.
Three rules that have saved me embarrassment:
The grader prompt is versioned and stored next to the regression set. If you change the grader, you rerun the full history. Otherwise you cannot compare last month to this month.
The grader outputs structured scores, not a number between 1 and 10. Scores on a 10-point scale are basically random. Binary scores, "passes this criterion, yes or no," are reliable. I use 0 or 1 for each rubric line, then sum.
Once a month, I audit the grader by hand-scoring 20 random examples and comparing. If my score and the grader score agree less than 80 percent of the time, the grader is broken, and any quality metric built on it is noise.
This last one is what separates real evals from theater. Everyone talks about it. Almost nobody does it. It takes 20 minutes a month and saves you from building a quarter's worth of decisions on a broken metric.
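The audit arithmetic itself is trivial once you have the two score lists side by side. A sketch, assuming binary scores in matched order:

```python
def grader_agreement(human: list[int], grader: list[int]) -> float:
    """Fraction of examples where the hand score and the grader's
    score match. Below ~0.8, stop trusting the quality chart."""
    assert len(human) == len(grader), "score lists must be aligned"
    matches = sum(h == g for h, g in zip(human, grader))
    return matches / len(human)
```

Hand-score 20 random examples, run this against the grader's scores for the same 20, and write the number down so next month's audit has something to compare against.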
The sneaky failure mode here is that graders get worse when the model under test gets better. A stronger model writes more elaborate outputs, and a weaker grader starts rewarding length over quality. I caught this on my humanizer skill when I upgraded from Sonnet 4.4 to 4.6 and the rubric score jumped 15 points with no real change in output quality. The grader was impressed by longer sentences. That was the moment I started versioning grader prompts in git.
The 2026 Eval Toolchain I Actually Use
I have tried most of the paid tools. Braintrust, Humanloop, LangSmith, Phoenix, Arize, Galileo. They all work. Most are overkill for a solo builder or a small team.
Here is what I actually run on RAXXO projects, cheapest first:
Promptfoo. Free, open source, runs in your terminal, works with every major model provider. This is my default. Regression evals live as YAML files in the repo. One command, promptfoo eval, runs the whole set and dumps a diff against the last run. If you are just starting, start here.
PostHog for live traffic. I send every LLM call as a PostHog event with input, output, model, cost, and latency. Then I attach the grader score as a separate event. PostHog has dashboards out of the box and is cheap enough that my eval logging costs under 3 EUR a month on modest traffic.
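The event payload is the part worth getting right; the send itself is one client call. A sketch of the properties I mean (the field names and truncation limits here are my own choices, not a PostHog requirement):

```python
def llm_event_properties(model: str, input_text: str, output_text: str,
                         cost_usd: float, latency_ms: int) -> dict:
    """Properties for a per-call analytics event. Truncate the large
    text fields so event payloads stay small and cheap to store."""
    return {
        "model": model,
        "input": input_text[:2000],
        "output": output_text[:2000],
        "cost_usd": round(cost_usd, 6),
        "latency_ms": latency_ms,
    }

# Sending is one line with a configured posthog client, e.g.:
#   posthog.capture("server", "llm_call",
#                   properties=llm_event_properties(...))
```

The grader score goes out later as its own event, keyed to the same call, so slow grading never sits in the request path.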
Braintrust when I am working with a collaborator. The moment two people need to look at eval results together, shared dashboards pay for themselves. Their free tier handles the small RAXXO projects. I hit the paid tier when I started running daily evals on the humanizer skill.
Claude Haiku as the default grader. Haiku 4.5 is cheap, fast, and good enough for most rubric scoring. I only reach for Sonnet when the rubric requires nuance, which is rare. Running a 50-example rubric eval with Haiku is under 5 cents.
Things I tried and dropped. LangSmith felt over-engineered for solo work. Humanloop's UI is nice but I never used the team features. Anything that requires me to rewrite my code to use their SDK is a non-starter. Promptfoo runs against my existing code without modification, which is why it stuck.
Running Evals in CI Without Going Broke
The cost question scares people off evals more than the setup question. Here is the honest math from my actual projects.
Regression set of 34 examples against Claude Haiku. 8 cents per run. I run it twice a day on average (once per PR, once on merge). 16 cents a day, about 5 EUR a month.
Rubric evals on a 20-example set with Haiku as grader and Sonnet as the model under test. About 12 cents per run. I run it once a day on main. 4 EUR a month.
Production sampling at 1 percent of traffic, scored with Haiku. Varies with traffic but for the RAXXO tools this is under 3 EUR a month.
Total eval budget across every LLM feature I ship is under 15 EUR a month. The cost of not having evals is one bad prompt change that drives a customer to a competitor. I paid that cost twice. 15 EUR is cheap insurance.
The CI setup is boring, which is the point. GitHub Actions runs promptfoo eval on every pull request, uploads the result as an artifact, and fails the build if the regression score drops more than 5 points. That is it. No magic. No vendor lock-in. The YAML is in the repo next to the code.
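The whole workflow fits in a dozen lines of YAML. A sketch, assuming the promptfoo config lives at evals/promptfooconfig.yaml; adjust paths and the action versions to your repo:

```yaml
# .github/workflows/evals.yml -- sketch; paths are assumptions.
name: llm-evals
on: [pull_request]
jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo@latest eval -c evals/promptfooconfig.yaml -o results.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: results.json
```

promptfoo exits nonzero when assertions fail, so a failing regression set fails the build with no extra glue.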
One thing I learned the slow way. Do not run the full production-sampling eval in CI. That belongs in a cron job. CI should be fast and deterministic, and live traffic is neither. Keep the two pipelines separate.
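The sampling pipeline can be its own scheduled workflow, fully separate from the PR checks. A sketch; the grading script here is hypothetical, standing in for whatever pulls your sampled calls and scores them:

```yaml
# .github/workflows/sampling.yml -- scheduled, off the PR critical path.
name: production-sampling
on:
  schedule:
    - cron: "0 6 * * *"   # once a day is plenty for a 1% sample
jobs:
  grade-sample:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/grade_production_sample.py  # hypothetical script
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          POSTHOG_API_KEY: ${{ secrets.POSTHOG_API_KEY }}
```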
What To Do This Week
If you have an LLM feature in production and no evals, do this in order. It takes one afternoon.
Write 20 example inputs and expected outputs for your most important flow. Save them as a YAML file. Run promptfoo against your current prompt. That is your baseline.
Pick a grader model. Haiku is my default. Write a rubric of 3 to 5 criteria. Run it against the same 20 examples. Save the score.
Set up GitHub Actions to run both on every pull request. Fail the build if either score drops.
Add one PostHog event to your LLM calls. Log input, output, model, and cost. You will not use it immediately. You will be grateful in three months when you need to debug a production regression.
That is the whole system. Four hours of work. Under 15 EUR a month. It catches about 80 percent of the prompt regressions I used to ship by accident.
A few gotchas worth naming before you start. Do not let your regression set balloon past 100 examples in the first month. Every example you add is one more you have to maintain when the expected behavior changes, and beginners always over-collect. Keep it tight, keep it relevant, retire examples when they stop finding bugs. Do not seed your set entirely from happy-path cases. The most valuable examples are the ones where the model used to fail. Save customer complaints, failed outputs, and weird edge cases. Those are the examples that actually move the score.
And do not let the eval score become a vanity metric. I have seen teams celebrate a rubric score going from 0.82 to 0.87 when the underlying product is worse. The score is a proxy. The rubric is a proxy. The thing you care about is whether users get value, and no eval system measures that directly. Read the low-scoring outputs every week. Read a sample of the high-scoring outputs too. The number by itself lies.
Bottom Line
Evals are not optional anymore. The models are too good at looking correct while being wrong. You need a fixed regression set, a versioned grader, and a 1 percent sample of live traffic. Start with promptfoo and Haiku. Add PostHog for production. Graduate to Braintrust if and when you have a teammate. Spend 15 EUR a month and save yourself a bad week.
The teams that ship reliable LLM features in 2026 are not smarter. They just have a baseline they trust. Build yours this week.