
Alex Cloudstar

Posted on • Originally published at alexcloudstar.com

AI Evals for Solo Developers: How to Actually Know Your AI Feature Works

The scariest message I ever got from a user was not a bug report. It was a compliment.

"Love the new summaries, they feel sharper this week."

I had not changed anything that week. No code, no prompts, no model version. The only thing that had changed was that a model provider had silently rolled out a minor update. That was the moment I realized something uncomfortable: I had no idea whether my AI feature was getting better or worse over time. I was just assuming. Users were telling me things, and I was interpreting those things like tea leaves.

That message sent me down a rabbit hole that eventually became a small but reliable eval system. It is the kind of thing enterprise ML teams take for granted and solo developers almost never build, because every guide on the topic assumes you have a research org, a data labeling budget, and six months of runway to get it right.

You do not. Here is what actually works when it is just you.


Why Evals Matter More for Solo Developers, Not Less

There is a common assumption that evals are overhead you add once you are big enough to afford it. The opposite is true. If you are a solo developer shipping an AI feature, evals matter more, not less, because you have fewer safety nets.

An enterprise team has support staff who catch drifting quality through ticket volume. They have product managers watching retention cohorts. They have QA teams running manual spot checks. When a model silently degrades, someone somewhere notices before the problem becomes existential.

You have none of that. You have yourself, your inbox, and a handful of users who might tell you something is wrong, or might just churn without saying a word. The first signal that your AI feature got worse might be an MRR dip six weeks after the fact. By then the damage is done.

Evals are how you trade that invisible risk for a visible one. They catch regressions when you can still do something about them. They tell you whether a prompt change actually helped before you ship it. They give you a baseline you can defend when you are trying to decide whether a new model is worth the switch.

For a solo developer, that is not academic rigor. That is survival.


What an Eval Actually Is

Let me cut through the jargon. An eval is a test that asks a simple question: given this input, did the AI produce an output that meets my quality bar?

That is it. The complexity comes from figuring out what "quality bar" means for your specific feature, and from building enough examples that one-off weirdness does not distort your view of the whole.

A traditional unit test checks whether a function returns the exact expected value. An eval is fuzzier. The output of an LLM is not deterministic and does not need to be exact to be correct. Two different summaries of the same article can both be good. Two different structured extractions can both be valid. An eval is closer to a rubric than a pass or fail checkbox.

The mistake most solo developers make is thinking they need a research-grade benchmark to start. You do not. You need ten to fifty carefully chosen examples that represent what your users actually send, a clear definition of what "good" looks like for each one, and a way to run them automatically whenever you change anything that might affect quality.

That is the minimum viable eval. Everything else is optimization.


The Three Types of Evals You Actually Need

Not all evals do the same job. The terminology in the industry is a mess, so I am going to use practical categories instead of formal ones.

Golden set evals. These are your hand-curated examples, the ones you know by heart. You have looked at the input, written down what the ideal output looks like, and will notice immediately when something breaks. Start with ten of these. Grow to fifty if your product grows. This is your early warning system.

Regression evals. These catch the thing you just fixed. Every time you fix a bug or handle a weird edge case, save that input and the correct output to your regression set. It is easy for prompt changes to re-break something you fixed three weeks ago, and regression evals are the only cheap way to notice.

Production sample evals. Pull a small random sample of real production traffic, anonymize it, and run it through your current setup alongside any proposed change. This tells you whether the change helps or hurts on real-world data, not just the inputs you dreamed up. This is where most solo developers get uncomfortable because it requires being honest about what their users actually send, and what their users actually send is rarely what the pristine golden set looks like.

Each type catches different failures. Golden sets catch obvious quality drops. Regression evals catch old bugs coming back. Production samples catch the gap between what you think your users do and what they actually do.
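You can keep all three types in one file by tagging each case. A minimal sketch of that idea, with invented names (`TaggedCase`, `kind`) and toy inputs:

```typescript
// Hypothetical shape: one case file, three eval types, filterable by kind.
type TaggedCase = {
  id: string;
  kind: 'golden' | 'regression' | 'production';
  input: string;
  mustInclude: string[];
};

const cases: TaggedCase[] = [
  { id: 'g1', kind: 'golden', input: 'Customer email about the refund policy...', mustInclude: ['refund'] },
  { id: 'r1', kind: 'regression', input: '', mustInclude: ['no content'] },
];

// Run just one slice, e.g. regressions before merging a prompt change.
const regressions = cases.filter((c) => c.kind === 'regression');
```

Filtering by `kind` means the fast regression slice can run on every prompt change while the full set runs on a schedule.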


How to Build Your First Golden Set in One Afternoon

If you have never built an eval, the hardest part is starting. Here is a sequence that gets you to a working golden set in about three hours.

Open a spreadsheet or a JSON file. Pick your highest-value AI feature. The one where quality degradation would hurt the most. Everything else can wait.

Write down ten real inputs that represent the diversity of what your feature handles. Not ten variations of the same thing. Ten genuinely different cases. A short input, a long one, a messy one, a clean one, an ambiguous one, a case where the right answer is "I do not know," an edge case you have seen fail before, a common case, a rare case, a tricky case.

For each input, write down what a good output looks like. You do not need to write the exact expected output. You need to write the criteria that make an output acceptable. "Summary must mention the refund policy. Must not exceed three sentences. Must not invent details not present in the source."

Save this as structured data. Your format can look something like this:

```typescript
type EvalCase = {
  id: string;
  input: string;
  criteria: {
    mustInclude?: string[];
    mustNotInclude?: string[];
    maxLength?: number;
    customCheck?: (output: string) => boolean;
  };
  notes?: string;
};
```
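A filled-in case might look like this. The input, criteria values, and the rough "three sentences" heuristic are all invented for illustration:

```typescript
// A concrete case matching the EvalCase shape above (values are illustrative).
const refundSummaryCase = {
  id: 'golden-001',
  input: 'Customer email asking whether an accidental annual-plan purchase can be refunded.',
  criteria: {
    mustInclude: ['refund'],
    mustNotInclude: ['guarantee'],
    maxLength: 400,
    // Rough proxy for "no more than three sentences".
    customCheck: (output: string) => output.split('.').filter(Boolean).length <= 3,
  },
  notes: 'Summary must mention the refund policy and stay short.',
};
```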

You now have something you can run. It will not be perfect, but it will catch a surprising number of real problems. This is the lowest-effort version of evals that actually moves the needle, and it is more than what ninety percent of solo developers have in production today.


Running Evals: Code Checks vs LLM-as-Judge

Once you have test cases, you need a way to score the outputs. There are two main approaches, and you should use both, because each catches things the other misses.

Code-based checks are deterministic assertions about the output. Does it contain the required phrase? Is it under the length limit? Does it parse as valid JSON? Does it include a specific field? These are cheap, fast, and unambiguous. They run in milliseconds, cost nothing, and never lie about what they found.
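A sketch of what such checks could look like against the criteria shape from the golden set above. The function name and return shape are mine, not a library API:

```typescript
type Criteria = {
  mustInclude?: string[];
  mustNotInclude?: string[];
  maxLength?: number;
  customCheck?: (output: string) => boolean;
};

// Deterministic assertions: cheap, fast, and unambiguous.
function runCodeChecks(output: string, criteria: Criteria) {
  const failures: string[] = [];
  const lower = output.toLowerCase();

  for (const s of criteria.mustInclude ?? []) {
    if (!lower.includes(s.toLowerCase())) failures.push(`missing required phrase: ${s}`);
  }
  for (const s of criteria.mustNotInclude ?? []) {
    if (lower.includes(s.toLowerCase())) failures.push(`contains forbidden phrase: ${s}`);
  }
  if (criteria.maxLength !== undefined && output.length > criteria.maxLength) {
    failures.push(`too long: ${output.length} > ${criteria.maxLength}`);
  }
  if (criteria.customCheck && !criteria.customCheck(output)) {
    failures.push('custom check failed');
  }

  return { passed: failures.length === 0, failures };
}
```

Returning the list of failures, not just a boolean, makes the daily report readable at a glance.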

The limitation is that many quality attributes are not reducible to a string match. "Is this summary faithful to the source?" is not a deterministic check. "Is this response helpful?" is not a deterministic check.

LLM-as-judge evaluations use a language model to score outputs against your criteria. You send the input, the output, and your rubric to a model and ask it to produce a score and a justification. This approach handles the fuzzy judgment calls that code cannot make.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Pull the text out of the first content block of the response.
function extractText(result: Anthropic.Message): string {
  const block = result.content[0];
  return block.type === 'text' ? block.text : '';
}

async function judgeOutput(
  input: string,
  output: string,
  rubric: string
): Promise<{ score: number; reasoning: string }> {
  const result = await anthropic.messages.create({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 512,
    system: 'You are an evaluator. Score outputs strictly against the rubric. Return JSON: {"score": 1-5, "reasoning": "..."}',
    messages: [{
      role: 'user',
      content: `Input:\n${input}\n\nOutput:\n${output}\n\nRubric:\n${rubric}`
    }]
  });

  return JSON.parse(extractText(result));
}
```

The limitations of LLM-as-judge are real and worth knowing. Judges have their own biases. They tend to prefer longer outputs. They can be inconsistent across runs. They are not free, which matters if you are running hundreds of eval cases on every commit. Use a cheaper model as the judge than the one producing the output, and pair it with code checks for anything deterministic.

The right setup combines both. Code checks handle the strict rules. LLM-as-judge handles the subjective ones. You get fast, cheap verification of the hard constraints and slower, more expensive verification of the soft ones.


The Eval Loop: Making This Part of Your Workflow

A golden set that lives in a spreadsheet you open twice a year is not an eval system. It is a graveyard. For evals to matter, they have to run automatically when it counts.

There are three moments when your evals need to fire. Before you merge a change to your AI prompts or logic. After a model provider releases an update you are about to adopt. And on a daily schedule that catches silent provider drift even when you have not changed anything.

The last one is the most important and the most neglected. Model providers update their models without telling you. Behavior changes. Scores on your golden set shift. If you are not running evals on a schedule, you will miss this entirely.

Here is a minimal daily runner:

```typescript
// goldenSet, productionAIFeature, runCodeChecks, aggregate, logEvalRun, and
// alertMe are your app-specific pieces; judgeOutput is defined above.
async function runGoldenSet() {
  const results = await Promise.all(
    goldenSet.map(async (testCase) => {
      const output = await productionAIFeature(testCase.input);
      const codeScore = runCodeChecks(output, testCase.criteria);
      const judgeScore = await judgeOutput(
        testCase.input,
        output,
        testCase.criteria.rubric ?? '' // rubric: an optional extension of the criteria shape
      );

      return {
        id: testCase.id,
        codePassed: codeScore.passed,
        judgeScore: judgeScore.score,
        output,
      };
    })
  );

  const summary = aggregate(results);
  await logEvalRun(summary);

  if (summary.regressionDetected) {
    await alertMe(summary);
  }
}
```

Wire this into a scheduled function that runs once a day. Log the results. Alert yourself if the aggregate score drops below a threshold. This pairs naturally with the kind of production observability setup every solo developer running AI in production should have by now.
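The `aggregate` step in that runner does not need to be clever. A sketch, where the score floor is an assumption you would tune to your own baseline:

```typescript
type CaseResult = { id: string; codePassed: boolean; judgeScore: number };

// Assumed threshold: flag a regression if any code check fails or the
// average judge score drops below a floor you set from your baseline runs.
function aggregate(results: CaseResult[], minAvgJudgeScore = 4.0) {
  const codePassRate =
    results.filter((r) => r.codePassed).length / results.length;
  const avgJudgeScore =
    results.reduce((sum, r) => sum + r.judgeScore, 0) / results.length;

  return {
    codePassRate,
    avgJudgeScore,
    regressionDetected: codePassRate < 1 || avgJudgeScore < minAvgJudgeScore,
  };
}
```

Start strict (any code-check failure is a regression) and loosen only if the noise proves unbearable.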

The point is not to achieve a perfect score every day. The point is to notice the day your score drops without warning, because that day tells you something upstream of you has changed.


Evals for Different Types of AI Features

The shape of your evals changes depending on what your AI feature actually does. A one-size-fits-all rubric does not work because different tasks have different failure modes.

Generative text features. Summaries, drafts, rewrites, explanations. The failure modes are hallucination, length violations, missing key information, tone mismatches, and factual drift from the source. Your rubric should check faithfulness to the source, length, required fields, and forbidden patterns. LLM-as-judge is essential here. Code checks can catch structural issues but not faithfulness.

Structured extraction features. Pulling fields out of messy documents, classifying inputs, extracting entities. The failure modes are missing fields, wrong types, hallucinated values, and format violations. Code checks do most of the work here. Define a schema, validate against it, and check that extracted values actually appear in the source where appropriate.
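The "extracted values actually appear in the source" check is only a few lines. A sketch with invented field names:

```typescript
// Returns the names of extracted fields whose values are not literally present
// in the source text, which catches the most common extraction hallucinations.
function ungroundedFields(
  extracted: Record<string, string>,
  source: string
): string[] {
  const lowerSource = source.toLowerCase();
  return Object.entries(extracted)
    .filter(([, value]) => !lowerSource.includes(value.toLowerCase()))
    .map(([field]) => field);
}
```

This will not catch values that were legitimately reformatted (dates, currency), so exempt those fields or normalize them first.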

Conversational features. Chatbots, multi-turn agents, support assistants. The failure modes are context loss across turns, hallucinated capabilities, unsafe responses, and going off-topic. Evals for conversational features need multi-turn test cases, which are harder to construct. Start with single-turn cases covering the most common first-turn queries and grow from there.
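Multi-turn cases can still fit the golden-set pattern. A sketch of one possible shape (the type and field names are mine):

```typescript
// Hypothetical shape for a multi-turn eval case: scripted user turns,
// with criteria applied to the final assistant reply.
type ConversationEvalCase = {
  id: string;
  turns: string[];
  finalCriteria: { mustInclude?: string[]; mustNotInclude?: string[] };
};

const contextRetentionCase: ConversationEvalCase = {
  id: 'conv-001',
  turns: [
    'I am on the Pro plan.',
    'What does my plan include?',
  ],
  // The second answer must use the plan named in the first turn,
  // which is exactly the context-loss failure mode described above.
  finalCriteria: { mustInclude: ['Pro'] },
};
```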

Agent workflows. AI systems that call tools, chain actions, or make decisions. This is the hardest category to eval. Failures happen in the middle of a trace, not just in the final output. You need trace-level evaluation that checks whether individual tool calls were correct, not just whether the final result looks reasonable. This matters a lot for anyone building with the patterns from agentic coding, because those agents can fail silently in ways a final-output eval will never catch.

The more stateful and tool-heavy your feature is, the more investment your evals require. If you are doing basic text generation, you can get away with a small golden set and simple judges. If you are running multi-step agents, you need trace-level evaluation infrastructure, and the bar for rolling your own goes up.


What to Do When Your Evals Fail

Building evals is half the work. Acting on them is the other half. The first time your eval suite flags a regression, you will be tempted to assume the eval is wrong. Sometimes it is. Usually it is not.

The response loop that works is boring but effective. When an eval fails, look at the specific case that failed and read the output with fresh eyes. Is the output actually bad, or is your rubric too strict? If the output is bad, figure out what changed. Was it a prompt you edited? A model version update? A library upgrade that affected your parsing? A production data shift that your golden set did not capture?

If the rubric was too strict, fix the rubric. Evals are living documents. Writing them once and never updating them is how they drift from reality.

If the output is genuinely worse, diagnose the cause before you change anything. The temptation will be to immediately edit the prompt to recover the lost quality. Resist that until you actually understand what broke. Blind prompt editing under pressure is how you paper over one failure with two new ones.

Keep a log of eval failures and their resolutions. Over time this log becomes more valuable than the evals themselves. It tells you what your system is most likely to fail at, which tells you where to focus the next round of improvements.


The Cost of Evals, and How to Keep It Sane

Evals cost money. If you run a fifty-case golden set through an expensive model with LLM-as-judge scoring every day, you will burn through a surprising amount of API budget on test infrastructure alone. For a solo developer, this matters.

A few practical moves keep eval costs from spiraling. Use a cheaper model as the judge than as the generator. A premium model producing outputs does not need a premium model scoring them. A mid-tier or small model handles judgment well enough for most cases.

Run full eval suites on schedule, not on every commit. Daily or even twice-daily runs are usually enough. You do not need eval feedback in thirty seconds. You need it before silent drift becomes a user complaint.

Cache the static parts of your judge prompts. Your rubric does not change across test cases. The rubric plus the judging instructions are the perfect candidate for prompt caching, which pairs with the patterns from the LLM cost optimization playbook. Cache hits on the judge prompt alone can cut eval costs by more than half.
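With the Anthropic API, that means putting the static instructions and rubric in the system prompt as a block with a `cache_control` breakpoint. A sketch, with an invented rubric:

```typescript
const rubric = 'Summary must mention the refund policy and stay under three sentences.';

// Everything static (judge instructions + rubric) gets the cache breakpoint,
// so repeated judge calls only pay full price for the per-case user message.
const judgeSystem = [
  {
    type: 'text' as const,
    text: `You are an evaluator. Score outputs strictly against the rubric.\n\nRubric:\n${rubric}`,
    cache_control: { type: 'ephemeral' as const },
  },
];

// Passed as `system: judgeSystem` in anthropic.messages.create(...).
```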

Cap your golden set size to what you can afford to run daily. Fifty cases is usually plenty. One hundred is the ceiling for most solo projects. If you feel like you need a thousand cases to be confident, you probably do not need a thousand. You need better cases.


What Not to Do

The wrong way to approach evals as a solo developer is to try to replicate what a research team does. You do not need leaderboards. You do not need formal statistical significance testing. You do not need to pay a labeling service to annotate ten thousand examples. All of that is appropriate if you are running an AI-first company with investors breathing down your neck. It is absolutely not appropriate if you are one person shipping a feature inside a small SaaS.

Do not build a custom evaluation framework from scratch before you have ten test cases. Start with a spreadsheet or a JSON file. Graduate to a real tool only after you outgrow the simple version.

Do not let perfect be the enemy of started. A rough eval suite run daily is infinitely more useful than a perfect eval suite still in planning.

Do not forget to run your evals before you change providers or model versions. The moment you are most likely to introduce a silent regression is the moment you are also most excited about a new model. That excitement is exactly when your evals earn their keep.

And do not use evals to justify decisions you already made. Run them honestly and let the results surprise you. The whole point is to hear things you were not expecting.


The Solo Developer Playbook: A 14-Day Rollout

If you have an AI feature in production and no evals today, here is a practical path to a working system in two weeks without derailing your roadmap.

Days one to three: pick the feature and write ten golden cases. Spend one afternoon on this. Do not try to cover every edge case. Cover the cases that matter most to your users.

Days four to six: build a runner and score manually. Get a script that runs your ten cases against your current setup and produces a pass or fail for each. Score them yourself. This gives you a baseline.

Days seven to nine: add code checks and LLM-as-judge. Automate the scoring. Pick a cheap model for the judge. Test that your automated scores roughly match your manual scores.

Days ten to twelve: wire it into a daily scheduled run. Set up logging and alerts. Run it for a few days and confirm it is stable.

Days thirteen and fourteen: add your first regression cases. When the next bug report comes in, save that input and expected output as a regression case. Start building the habit.

By the end of two weeks, you have a working eval system that catches silent regressions, protects you from provider drift, and gives you a real signal when you are deciding whether to ship a prompt change. That is worth a lot more than it cost.


The Honest Bottom Line

Nobody builds AI features hoping they will quietly get worse over time. But without evals, that is the default trajectory, and you will never see it happen until it is too late.

The gap between having no evals and having a basic working eval suite is one afternoon of focused work. The gap between a basic eval suite and a sophisticated one is months of optimization that most solo developers do not need. Start with the afternoon. Build the basic version. Run it daily.

The developers who take AI quality seriously do not do it because they have time or budget to spare. They do it because they have been burned by the alternative. Shipping an AI feature without evals is shipping something you cannot measure, and you cannot fix what you cannot see.

You are not too small to have evals. You are the exact size where they matter most.
