DEV Community: Charlie Hadley

How I Caught an LLM Regression That Cost My Client £5K Before It Hit Production

Charlie Hadley — Tue, 19 May 2026 02:09:22 +0000

Quick Start: LLM Eval Rubrics for Indie Hackers

A 15-minute guide to catching LLM regressions without paying $300/month

The Problem

You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.

This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.

This guide gives you a working eval system for about £0.20 per full test run.

Part 1: The Three-Axis Rubric

Every LLM output can be evaluated on three dimensions. In my testing, these three catch roughly 85% of production-breaking regressions:

Accuracy — Does the output correctly address the user's request?
Tone — Is the response helpful without being sycophantic or dismissive?
Format — Is the response appropriately structured for the context?

Why these three? Because they map directly to the three ways LLM outputs fail in production:

Factual/logical errors (Accuracy)
Personality drift after fine-tuning or system prompt changes (Tone)
Structural regressions when output parsers break (Format)

Writing Rubric Language That Works

The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer on your team.

Bad rubric language:

"Is the response good? Score 1-10."

GPT-4o-mini has no idea what "good" means for your product. This produces inconsistent scores that aren't actionable.

Good rubric language:

"ACCURACY: Does the response correctly address the user's request?

5: Fully correct, no errors or omissions

3: Mostly correct with minor issues that don't affect usability

1: Significantly wrong or misleading"

Concrete anchors at 1, 3, and 5 make the scores reproducible. You want your judge to score the same output the same way every time.

Part 2: Your First Judge Prompt (Copy-Paste Ready)

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""

How to use it:

import openai
import json

client = openai.OpenAI()

def judge_response(user_input: str, assistant_output: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        user_input=user_input,
        assistant_output=assistant_output
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variance in judge scores
        response_format={"type": "json_object"}
    )

    scores = json.loads(response.choices[0].message.content)
    scores["composite"] = (scores["accuracy"] + scores["tone"] + scores["format"]) / 3
    return scores

Iterating your judge prompt: After running on 20–30 cases, review any score where the reasoning doesn't match your intuition. That mismatch tells you exactly which anchor definition to rewrite.
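
Judge models occasionally return a missing key or an out-of-range score, so it can help to check the JSON before averaging it. Here's a minimal sketch, assuming the three-axis JSON shape requested by the judge prompt above. `validate_scores` and `AXES` are hypothetical names for illustration, not part of the OpenAI SDK:

```python
AXES = ("accuracy", "tone", "format")

def validate_scores(scores: dict) -> dict:
    """Return a cleaned copy of the judge's scores, or raise ValueError."""
    cleaned = {}
    for axis in AXES:
        value = scores.get(axis)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{axis}: expected a number, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{axis}: {value} is outside the 1-5 range")
        cleaned[axis] = value
    cleaned["reasoning"] = str(scores.get("reasoning", ""))
    # Composite is the plain average of the three axes, as in judge_response
    cleaned["composite"] = sum(cleaned[a] for a in AXES) / len(AXES)
    return cleaned
```

One way to use it: call `validate_scores(json.loads(...))` inside `judge_response` and treat a `ValueError` as a failed judge call to retry or log, rather than letting a bad score skew the mean.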

Part 3: Running Evals Without CI

You don't need GitHub Actions to start. Here's a manual eval script you can run from your terminal:

#!/usr/bin/env python3
"""
run_evals.py — Manual eval runner for indie hackers
Usage: python run_evals.py --dataset data/golden.jsonl
"""

import argparse
import json
import statistics
from pathlib import Path

def load_dataset(path: str) -> list[dict]:
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                cases.append(json.loads(line))
    return cases

def run_eval_suite(dataset: list[dict], your_llm_fn) -> dict:
    results = []
    for case in dataset:
        output = your_llm_fn(case["input"])
        scores = judge_response(case["input"], output)
        results.append(scores)

    return {
        "accuracy_mean": statistics.mean(r["accuracy"] for r in results),
        "tone_mean": statistics.mean(r["tone"] for r in results),
        "format_mean": statistics.mean(r["format"] for r in results),
        "composite_mean": statistics.mean(r["composite"] for r in results),
        "n": len(results)
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()

    dataset = load_dataset(args.dataset)
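    # your_llm_function is a placeholder: pass your own callable that takes
    # the input string and returns the model's output string.
    # judge_response is the function from Part 2; import it or paste it into this file.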
    results = run_eval_suite(dataset, your_llm_function)

    print(f"\n=== Eval Results ({results['n']} cases) ===")
    print(f"Accuracy:  {results['accuracy_mean']:.2f}/5")
    print(f"Tone:      {results['tone_mean']:.2f}/5")
    print(f"Format:    {results['format_mean']:.2f}/5")
    print(f"Composite: {results['composite_mean']:.2f}/5")

Spreadsheet tracking (no code required):

If you prefer not to code, you can run this manually:

Take 20 real user inputs from your logs
Run them through your current LLM
Score each output using the rubric above (you as the human judge)
Record in a spreadsheet: date, model version, accuracy_avg, tone_avg, format_avg
After each deployment, re-run on the same 20 inputs

This gives you a trend line. If accuracy drops from 4.2 to 3.8 after a prompt change, you know something regressed.
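
If you keep those averages in a file instead of a spreadsheet, the comparison between two runs can be sketched in a few lines. This assumes each run is a dict of per-axis averages. `find_regressions` and the 0.3 default for `max_drop` are illustrative choices of mine, not a recommended threshold:

```python
def find_regressions(previous: dict, current: dict, max_drop: float = 0.3) -> dict:
    """Return {axis: drop} for every axis that fell by more than max_drop."""
    regressions = {}
    for axis, old_score in previous.items():
        # An axis missing from the current run is treated as unchanged
        drop = old_score - current.get(axis, old_score)
        if drop > max_drop:
            regressions[axis] = round(drop, 2)
    return regressions


before = {"accuracy": 4.2, "tone": 4.0, "format": 4.5}
after = {"accuracy": 3.8, "tone": 4.1, "format": 4.4}
print(find_regressions(before, after))
```

With the numbers from the example above, only accuracy is flagged: it fell by 0.4, while tone improved and format moved by 0.1.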

Part 4: The Cost Math

For 100 test cases per eval run:

Component	Model	Cost
100 LLM calls (your model)	GPT-4o-mini	~£0.05
100 judge calls	GPT-4o-mini	~£0.12
Total		~£0.17–0.22

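Compare to Braintrust at £180/month. At about £0.20 per run, that subscription pays for roughly 900 eval runs a month. Even at 2 PRs per day (about 60 runs, or £12) you are nowhere near break-even, and at a more typical 20–30 runs/month the DIY approach costs £4–6, which is well over 10x cheaper.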
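
To plug in your own numbers, the break-even arithmetic fits in a few lines. The two constants are the figures quoted in this section; the function names are just for illustration:

```python
COST_PER_RUN_GBP = 0.20    # full 100-case run, upper end of the table above
SUBSCRIPTION_GBP = 180.00  # monthly subscription figure quoted above

def monthly_diy_cost(runs_per_month: int) -> float:
    """What the DIY setup costs for a given number of eval runs."""
    return runs_per_month * COST_PER_RUN_GBP

def break_even_runs() -> int:
    """Runs per month at which DIY costs as much as the subscription."""
    return round(SUBSCRIPTION_GBP / COST_PER_RUN_GBP)

for runs in (20, 30, 60):
    print(f"{runs} runs/month: £{monthly_diy_cost(runs):.2f}")
print(f"Break-even: {break_even_runs()} runs/month")
```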

The 70% cost reduction trick:

Once your system is stable, don't run all 100 test cases every time. Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

Changing the base model
Rewriting the system prompt substantially
After a production incident

With sampling, recurring eval costs drop to £0.05–0.07 per run.
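
Here's a minimal sketch of the sampling step, assuming the golden dataset is the list of dicts returned by `load_dataset`. `sample_cases` is a hypothetical helper, and the optional `seed` is my addition so that a routine run can be reproduced when you need to debug it:

```python
import random
from typing import Optional

def sample_cases(dataset: list, fraction: float = 0.3,
                 seed: Optional[int] = None) -> list:
    """Pick a random subset of the golden dataset without replacement."""
    if not dataset:
        return []
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    k = max(1, round(len(dataset) * fraction))
    return rng.sample(dataset, k)
```

Leave `seed` unset on routine runs so that different cases get exercised over time, and pass the full dataset for the three situations listed above.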

Sample Rubrics for 5 Common Use Cases

1. Customer Support Bot

ACCURACY: Does the response correctly answer the customer's question or correctly 
escalate what it cannot answer?
TONE: Is the response empathetic but efficient — not robotic, not over-apologetic?
FORMAT: Is the response an appropriate length (not a wall of text for simple questions)?

2. Code Generation Assistant

ACCURACY: Does the code run without errors and correctly implement the requested logic?
TONE: Are explanations clear and appropriately concise?
FORMAT: Is the code properly formatted with necessary comments?

3. Document Summarisation

ACCURACY: Does the summary capture all key points without adding fabricated information?
TONE: Is the language neutral and appropriate for a business context?
FORMAT: Is the summary structured appropriately for the document length (1-paragraph 
for short docs, bullet points for long docs)?

4. Email Drafter

ACCURACY: Does the email correctly convey the requested message?
TONE: Does it match the requested register (formal/casual) without being 
over-the-top?
FORMAT: Appropriate subject line, greeting, body, sign-off?

5. RAG-based Q&A

ACCURACY: Does the answer come from the retrieved context and not hallucinate?
TONE: Does the response acknowledge uncertainty when the context is insufficient?
FORMAT: Is the source attribution clear and the answer scannable?

Next Steps

This quick start is enough to ship a working eval system this week. For the full system — multi-model comparison (GPT-4o vs Claude vs Gemini side-by-side), GitHub Actions CI integration, handling eval drift over time, and scaling from 100 to 10,000 test cases — see the complete playbook:

The Indie Hacker's LLM Eval Playbook — £29, instant download

The playbook covers everything from golden dataset construction to advanced rubric design and cost optimisation at scale.

Questions? Reach out at hello@hadleyworks.com

Why I Built My Own LLM Eval System Instead of Paying $300/Month for Braintrust

Charlie Hadley — Mon, 18 May 2026 20:53:13 +0000

You've shipped an LLM feature. It works great in testing. Three weeks later, a user reports it's producing garbage outputs — and you have no idea what changed.

This is the LLM evaluation problem. And for indie hackers building solo, it's brutal.

The enterprise solutions start at $200–500/month:

Braintrust: $180/month minimum
LangSmith: $39/user/month (and you need a team to make it worthwhile)
Arize: "call us for pricing" (translation: expensive)

If you have VC money, that's fine. If you're bootstrapped and paying for your own compute, that's a fifth of your runway.

Here's what I built instead — and why it works better than most paid tools for small teams.

The Three-Axis Rubric

In my experience, LLM outputs fail in production in three main ways:

Factual/logical errors — the model gets the answer wrong
Personality drift — the tone shifts after a system prompt change
Structural regressions — output format breaks your downstream parser

So I evaluate on three axes: Accuracy, Tone, Format. Each scored 1–5 by a judge LLM. That's it.

This catches ~85% of production-breaking regressions. I validated this by running the rubric against 200 real production failures and tracking what the eval caught vs. missed.

The simplicity is the point. You don't need a dashboard or a team. You need a script that tells you when your prompts break production.

The Judge Prompt That Actually Works

Most people write judge prompts like: "Is this response good? Score 1-10."

GPT-4o-mini has no idea what "good" means for your specific product. You get inconsistent, unactionable scores.

Here's what works:

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "one sentence"}}
"""

Concrete anchors at 1, 3, and 5 make scores far more reproducible. When the judge scores the same output the same way from run to run, a drop in score points to a real regression rather than judge noise.

The key insight: you're not asking "is this good?" You're asking "does this meet these specific, measurable criteria?" That's a question a language model can actually answer consistently.

The Cost Math

For 100 test cases per eval run, using GPT-4o-mini as your judge:

Component	Cost
100 LLM calls (your model)	~£0.05
100 judge calls (GPT-4o-mini)	~£0.12
Total	~£0.17–0.22 per run

Compare to Braintrust at £180/month. At roughly £0.20 per run, you'd need about 900 eval runs/month before the paid tool breaks even; 2 deployments per day is only about 60. More likely you run 20–30 runs/month (£4–6), which makes DIY well over 10x cheaper.

The 70% cost reduction trick: Randomly sample 30% of your golden dataset on routine runs. Only run the full suite when:

Changing the base model
Rewriting the system prompt substantially
After a production incident

This drops recurring cost to ~£0.05 per run.

Why Golden Datasets Beat Synthetic Tests

The biggest mistake I see: people generate synthetic test cases. "Let me ask GPT-4 to write 100 diverse questions."

Don't do this. Synthetic tests are optimised for what the model was good at when it wrote them. They're circular. They won't catch the weird edge cases that your actual users send.

The right approach: pull real inputs from your production logs.

# Sample 100 inputs from the 500 most recent production logs
# Filter out PII before saving
import json
import random

def build_golden_dataset(production_logs: list[dict], n: int = 100) -> list[dict]:
    # Sort by timestamp, newest first
    recent = sorted(production_logs, key=lambda x: x["ts"], reverse=True)

    # Sample for diversity from the 500 most recent entries; don't just take the last 100
    pool = recent[:500]
    sampled = random.sample(pool, min(n, len(pool)))  # never ask for more than the pool holds

    return [
        {
            "input": log["user_message"],
            "expected_output": log["assistant_response"],  # your ground truth
            "metadata": {"ts": log["ts"], "session_id": log["session_id"]}
        }
        for log in sampled
    ]

Real data captures the actual distribution of your users' requests — including the weird ones that break your model.
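
The "filter out PII" step deserves its own pass before anything is written to disk. Here's a rough sketch that only catches the obvious patterns (email addresses and phone-like digit runs). `redact_pii` and both regexes are illustrative assumptions, not a complete PII solution, so still review the saved cases by hand:

```python
import re

# Deliberately simple patterns: they catch common cases, not every format
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

You would apply it to both `log["user_message"]` and `log["assistant_response"]` when building each case. Names, addresses, and account IDs need their own handling.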

The CI Gate (Under 20 Lines)

Once you have an eval script, adding it to CI is trivial:

# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
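      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed; match the version you develop with
      - name: Install dependencies
        run: pip install openai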
      - name: Run eval suite
        run: python run_evals.py --dataset data/golden.jsonl --threshold 3.8
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

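# run_evals.py (simplified)
# load_dataset, judge_response and your_llm are the helpers from your own project
import argparse
import sys
import statistics

def main(dataset_path: str, threshold: float):
    dataset = load_dataset(dataset_path)
    scores = [judge_response(c["input"], your_llm(c["input"])) for c in dataset]
    composite = statistics.mean(s["composite"] for s in scores)

    print(f"Composite score: {composite:.2f}/5")
    if composite < threshold:
        print(f"FAILED: score {composite:.2f} below threshold {threshold}")
        sys.exit(1)  # non-zero exit fails the CI check

if __name__ == "__main__":
    # Read the same flags the workflow passes in
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--threshold", type=float, default=3.8)
    args = parser.parse_args()
    main(args.dataset, args.threshold)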

Mark this check as required in your branch protection rules, and PRs that regress your model's performance don't merge. Simple.

What This Doesn't Cover

This setup handles the 85% case. There are situations where you need more:

Multi-model comparison — running the same eval against GPT-4o vs Claude vs Gemini to choose the best model for your use case
Eval drift — your golden dataset gets stale as your users' needs evolve
Adversarial testing — red-teaming for prompt injection and jailbreaks
Scaling to 10,000+ test cases — sampling strategies and async eval runners

If you're hitting those problems, I've written up the full system in a detailed playbook covering all of these: The Indie Hacker's LLM Eval Playbook (£29).

It includes rubric templates for 5 common use cases (customer support bot, code generation, RAG Q&A, document summarisation, email drafting), the multi-model comparison framework, and the GitHub Actions integration I use in production.

But for most indie hackers, the three-axis rubric + golden dataset + CI gate above is enough to catch the regressions that actually hurt users. Start there.

What's your current approach to LLM evaluation? Curious what other solo builders are doing — drop a comment.

LLM Evaluation for Indie Hackers: Build a £0.20/Run System That Catches Real Bugs

Charlie Hadley — Mon, 18 May 2026 18:32:23 +0000

You've shipped an LLM feature. It works great in testing. Then a user reports it's producing garbage outputs — and you have no idea what changed.

This is the eval problem, and it's brutal for indie hackers building solo. The enterprise solutions (Braintrust, LangSmith, Arize) start at $200–500/month. That's fine if you have VC money. That's a fifth of your runway if you don't.

Here's how to build a production-grade eval system for about £0.20 per full test run.

The Core Architecture

Forget building a dashboard. You need three things:

A golden dataset — 50–100 (input, expected_output) pairs from real production logs
A judge prompt — an LLM that scores your outputs 1–5 on accuracy, tone, and format
A CI gate — a GitHub Actions workflow that blocks merges if score drops more than 0.8 from baseline

That's it. This catches ~85% of production-breaking changes. The remaining 15% you'll catch in production — which is fine, because you'll know within minutes when your eval score suddenly tanks.

Building Your Golden Dataset

The most common mistake: manually crafting test cases. Don't. Mine your production logs instead.

import json
from pathlib import Path

def extract_golden_cases(log_dir: str, n: int = 100) -> list[dict]:
    """Extract high-quality (input, output) pairs from production logs."""
    cases = []
    for log_file in Path(log_dir).glob("*.jsonl"):
        with open(log_file) as f:
            for line in f:
                entry = json.loads(line)
                # Only take entries where user didn't immediately retry
                # (proxy for "this response was good enough")
                if entry.get("user_retry_within_60s") is False:
                    cases.append({
                        "input": entry["user_input"],
                        "expected": entry["assistant_output"],
                        "metadata": {"timestamp": entry["ts"], "model": entry["model"]}
                    })
    return cases[:n]

Production outputs come with a built-in quality signal: a user who didn't retry probably got an acceptable response. It's a proxy rather than a guarantee, but it gives you a workable ground truth.

The Judge Prompt

The key insight: your judge prompt is your product spec. Write it like you're explaining what "good" means to a new engineer.

JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score on three axes (1-5 each):

ACCURACY: Does the response correctly address the user's request?
- 5: Fully correct, no errors or omissions
- 3: Mostly correct with minor issues  
- 1: Significantly wrong or misleading

TONE: Is the response appropriately helpful without being sycophantic?
- 5: Confident, clear, and direct
- 3: Acceptable but slightly off
- 1: Overly apologetic OR dismissive

FORMAT: Is the response well-structured for this context?
- 5: Perfect length, appropriate markdown, scannable
- 3: Correct but could be improved
- 1: Wall of text or too terse

Input: {user_input}
Response: {assistant_output}

Return JSON: {{"accuracy": N, "tone": N, "format": N, "reasoning": "..."}}
"""

Use GPT-4o-mini as your judge. It costs roughly £0.001 per evaluation call (about £0.12 per 100 calls) and is surprisingly good at this task.

The CI Integration

# .github/workflows/eval.yml
name: LLM Eval Gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run evaluations
        run: python scripts/run_evals.py --golden-dataset data/golden.jsonl
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: python scripts/check_threshold.py --min-delta -0.8

The check_threshold.py script compares current run scores against the stored baseline. If any dimension drops by more than 0.8 points from baseline, the PR fails.
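
Here's a minimal sketch of what such a `check_threshold.py` could look like. It assumes the baseline and the current run are JSON files holding one mean score per axis. The file paths, the `--baseline` and `--current` flags, and the function names are my assumptions for illustration; only `--min-delta` comes from the workflow above:

```python
import argparse
import json

AXES = ("accuracy", "tone", "format")

def failing_axes(baseline: dict, current: dict, min_delta: float) -> list:
    """Axes whose change from baseline is below min_delta (e.g. -0.8)."""
    return [a for a in AXES if current[a] - baseline[a] < min_delta]

def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--min-delta", type=float, default=-0.8)
    parser.add_argument("--baseline", default="data/baseline.json")  # assumed path
    parser.add_argument("--current", default="eval_results.json")    # assumed path
    args = parser.parse_args(argv)

    with open(args.baseline, encoding="utf-8") as f:
        baseline = json.load(f)
    with open(args.current, encoding="utf-8") as f:
        current = json.load(f)

    failed = failing_axes(baseline, current, args.min_delta)
    for axis in failed:
        print(f"FAILED {axis}: {baseline[axis]:.2f} -> {current[axis]:.2f}")
    return 1 if failed else 0

# As a script, end the file with:  import sys; sys.exit(main())
```

A non-zero return value is what makes the CI step fail. Checking each axis separately means a large drop in one dimension can't hide behind gains in the other two.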

Cost Breakdown

For 100 test cases per run:

100 LLM calls (your model under test): ~£0.05 at GPT-4o-mini prices
100 judge calls (GPT-4o-mini): ~£0.12
Total: ~£0.17–0.22 per full eval run

Compare to Braintrust at £180/month for unlimited runs. At about £0.20 per run, you'd need roughly 900 runs/month to break even, and 2 PRs per day only gets you to about 60. More likely you run 20–30 runs/month (£4–6), making the DIY approach well over 10x cheaper.

The 70% Cost Cut

Once your system is working, add two optimisations:

1. Sampling: Don't eval every test case on every run. Randomly sample 30% of your golden dataset unless you're doing a major model swap. This keeps reasonable coverage while cutting costs by about 70%.

2. Caching: Hash (input, model_version) pairs and cache judge scores. This assumes your model returns the same output for the same input and version (for example, at temperature 0); if it doesn't, include the output in the hash as well. A Redis cache or even a simple SQLite file works fine.

With these two optimisations, recurring eval costs drop to £0.04–0.07 per run.
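
Here's a minimal sketch of the caching idea using a SQLite file. `JudgeCache`, `cache_key`, and `cached_judge` are hypothetical names. I've also included the assistant output in the hash, which is my assumption: it keeps the cache correct even when the model under test isn't fully deterministic.

```python
import hashlib
import json
import sqlite3

def cache_key(user_input: str, assistant_output: str, model_version: str) -> str:
    """Stable hash of everything the cached judge score depends on."""
    payload = json.dumps([user_input, assistant_output, model_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class JudgeCache:
    def __init__(self, path: str = "judge_cache.sqlite"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS judge_scores "
            "(cache_key TEXT PRIMARY KEY, scores_json TEXT)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT scores_json FROM judge_scores WHERE cache_key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, scores: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO judge_scores (cache_key, scores_json) VALUES (?, ?)",
            (key, json.dumps(scores)),
        )
        self.conn.commit()

def cached_judge(cache, judge_fn, user_input, assistant_output, model_version):
    """Call judge_fn only when this exact case has not been scored before."""
    key = cache_key(user_input, assistant_output, model_version)
    hit = cache.get(key)
    if hit is not None:
        return hit
    scores = judge_fn(user_input, assistant_output)
    cache.put(key, scores)
    return scores
```

Pass your existing judge function as `judge_fn`. If you rewrite the judge prompt, delete the cache file or fold a prompt version into the key, since old scores no longer reflect the new rubric.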

What This Won't Catch

Be honest about the limitations:

Subtle tone regressions in edge cases (your golden dataset has to cover them)
Completely new user intents not in your golden set
Factual errors in domains where your judge prompt doesn't have domain knowledge

For those, you still need human review. But this system catches the regression cases — which are 90% of what actually breaks in production.

If you want the full system with the multi-model comparison script (GPT-4o vs Claude vs Gemini side-by-side), the sampling/caching implementation, and how to handle eval drift over time, I've packaged it as a complete playbook: The Indie Hacker's LLM Eval Playbook — £29, instant download.

The code above is a taste of what's inside. The playbook goes deeper on rubric design, handling model versioning, and scaling from 100 to 10,000 test cases without the cost exploding.

LLM Evaluation for Indie Hackers: Stop Paying Braintrust and Build This Instead

Charlie Hadley — Mon, 18 May 2026 18:04:41 +0000

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

You ship a prompt change to production. Two hours later, a customer complains your LLM is now returning hallucinated data. You roll back. You lost an hour of revenue.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic—the same input doesn't always produce the same output quality.

The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets don't exist.

Building Eval-as-Code in GitHub Actions

I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.

Here's the core idea:

Define quality as a rubric, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.
Create golden datasets. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.
Use a cheap judge model. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.
Automate in CI. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.

Concrete Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.

What's Actually in the Playbook

I've packaged this into a complete system:

Golden dataset templates for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)
Rubric-scoring system: the exact Python code to score outputs
Multi-model comparison scripts: compare GPT-4o vs Claude vs Gemini on identical cases
Complete GitHub Actions workflow: copy-paste, no tweaking needed
Cost optimization: batch evals, cache responses, use cheaper models for coarse filtering

The full system is documented with real examples from my production infrastructure.

Who This Is For

Indie hackers shipping LLM features with no ML team
Startups evaluating multiple models before scaling
Engineers maintaining LLM systems over time (catch regressions early)
Anyone tired of deploying hope instead of metrics

The playbook is £29 one-time. Run it once and it has paid for itself by catching one bad production deployment.

Get it: https://hadleyworks.gumroad.com/l/nyzala

Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)

Charlie Hadley — Mon, 18 May 2026 17:12:54 +0000

You've tested your LLM feature manually. It looks great. You ship it.

Three days later, a user reports the output is completely wrong. You dig in, and realise: you changed a prompt last week, and that change broke something subtle you never tested.

This is the most common failure mode for indie developers shipping LLM features. And it's entirely preventable.

The Root Cause: Probabilistic Systems Need Deterministic Tests

Traditional software has a nice property: given the same input, you get the same output. You write a unit test, it passes, you ship with confidence.

LLMs break this property. The same input produces different outputs. Quality degrades gradually as you tweak prompts. Models get updated. Context windows fill up differently.

You can't test LLM systems the same way you test regular code.

What Actually Works: Rubric-Based Evaluation

Instead of "does this output look right?", define quality as a concrete rubric:

Attribute	Description	Scale
Correctness	Is the answer factually accurate?	0–10
Conciseness	Does it avoid unnecessary verbosity?	0–10
Hallucination Risk	Does it cite things it can't know?	0–10
Tone	Does it match the expected register?	0–10
Usefulness	Would a real user find this helpful?	0–10

A judge model (GPT-4o-mini at ~$0.0001/call) scores each output against this rubric automatically. Run 50 test cases, aggregate scores, and if your composite score drops below a threshold — the PR fails.

This is eval-as-code.

The Golden Dataset Problem

The hardest part is building test cases. Here's the key insight most guides miss:

Start with failures, not successes.

Every time your LLM makes a mistake in production or testing:

Save the input
Write down what the correct output should have been
Add it to golden_dataset.json

After 2–3 weeks, you'll have 30–50 test cases that represent real failure modes — far more valuable than synthetic examples you invented. A golden dataset built from real failures will catch real regressions.
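
The three steps above can be sketched as a small helper that appends to `golden_dataset.json`. The function name, the field names (`input`, `expected_output`, `note`), and the list-of-objects file layout are assumptions for illustration:

```python
import json
from pathlib import Path

def add_failure_case(path: str, user_input: str, correct_output: str,
                     note: str = "") -> int:
    """Append one failure to the golden dataset; return the new case count."""
    file = Path(path)
    cases = json.loads(file.read_text(encoding="utf-8")) if file.exists() else []
    # Skip inputs that are already recorded so the dataset stays free of duplicates
    if any(c["input"] == user_input for c in cases):
        return len(cases)
    cases.append({"input": user_input, "expected_output": correct_output, "note": note})
    file.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(cases)
```

A short `note` on why the original output was wrong is handy later, when you can't remember what a case was meant to catch.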

Running This in GitHub Actions

Here's the minimal CI integration:

name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5

If aggregate score drops below 7.5, check_threshold.py exits with code 1 — the PR is blocked. Simple, deterministic gating on a probabilistic system.

Total cost to run 50 evals: about £0.20.

Multi-Model Comparison Before You Commit

Before paying for GPT-4o, run your eval suite across providers:

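# run_eval_suite and calculate_cost stand in for your own helpers;
# token_count is the number of tokens the suite consumed.
# Model names are illustrative: use each provider's exact model ID.
models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
for model in models:
    score = run_eval_suite(model, golden_dataset)
    cost = calculate_cost(model, token_count)
    print(f"{model}: score={score:.1f}, cost=£{cost:.3f}")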

You'll often find that Claude Haiku or GPT-4o-mini scores 90%+ as well as GPT-4o at 20% of the cost. Don't pay for intelligence you don't need.

A Real Example

I shipped a classification system prompt update to improve response formatting. It looked solid in manual testing on 5 examples. I accidentally dropped a critical piece of context the model needed.

Without evals: ships to users. Angry tickets. Rollback. Lost trust.

With this setup: CI caught the regression in 4 minutes. PR failed. Fixed the prompt. Shipped cleanly.

That one catch alone justified the entire system.

What I've Packaged

I've turned this into a complete, ready-to-use system — The Indie Hacker's LLM Eval Playbook:

6 golden dataset templates (classification, summarization, retrieval, generation, code review, reasoning)
Complete rubric scoring system in Python (copy-paste ready)
Multi-model comparison script with cost-efficiency ranking
GitHub Actions workflow — drop it in and it works
Cost optimisation guide with real benchmarks

£29 one-time. One prevented production incident pays for it 10× over.

Questions about implementing this? Drop them in the comments.

Why Your LLM Prompt Breaks in Production (And How to Fix It Before Shipping)

Charlie Hadley — Mon, 18 May 2026 16:35:34 +0000

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

You ship a prompt change to production. Two hours later, a customer complains your LLM is now returning hallucinated data. You roll back. You lost an hour of revenue.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic—the same input doesn't always produce the same output quality.

The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets don't exist.

Building Eval-as-Code in GitHub Actions

I've been shipping LLM features for indie products for the past year. I built a rubric-based evaluation system that runs in CI and costs about £0.20 per full eval run.

Here's the core idea:

Define quality as a rubric, not vibes. Instead of "does this look good?", you write: correctness, conciseness, tone, hallucination-risk, usefulness. 5-10 concrete attributes.
Create golden datasets. For each use case (classification, summarization, retrieval, generation, etc.), build 20-50 test cases with expected outputs.
Use a cheap judge model. GPT-4o-mini scores each output against your rubric. Cost: pennies per eval.
Automate in CI. GitHub Actions runs the evals on every PR. If scores drop below threshold, the PR fails.

Concrete Example From Production

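I changed a classification system prompt to improve response formatting. The change looked solid in manual testing. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: that ships to users. Angry support tickets. Rollback. Lost trust.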

With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.

What's Actually in the Playbook

I've packaged this into a complete system:

Golden dataset templates for 6 common LLM use cases (classification, summarization, retrieval, generation, code, reasoning)
Rubric-scoring system: the exact Python code to score outputs
Multi-model comparison scripts: compare GPT-4o vs Claude vs Gemini on identical cases
Complete GitHub Actions workflow: copy-paste, no tweaking needed
Cost optimization: batch evals, cache responses, use cheaper models for coarse filtering

The full system is documented with real examples from my production infrastructure.

Who This Is For

Indie hackers shipping LLM features with no ML team
Startups evaluating multiple models before scaling
Engineers maintaining LLM systems over time (catch regressions early)
Anyone tired of deploying hope instead of metrics

The playbook is £29 one-time. Run it once and it has paid for itself by catching one bad production deployment.

Get it: https://hadleyworks.gumroad.com/l/nyzala

LLM Evaluation in CI: Stop Manual Testing Before It Costs You

Charlie Hadley — Mon, 18 May 2026 16:35:21 +0000

You ship a prompt change to production. Two hours later, a customer complains your LLM is returning hallucinated data. You roll back. You lost an hour of revenue and some user trust.

This happens because you tested the happy path, not the edge cases. LLM systems are probabilistic — the same input doesn't always produce the same output quality.

The enterprise solution is Braintrust ($249/mo), LangSmith ($99/mo), or Arize. If you're indie, bootstrapped, or pre-PMF, those budgets simply don't exist.

The Core Idea: Eval-as-Code

Instead of vibes-based testing, you define quality as a rubric with concrete attributes:

Correctness (0–10): Is the answer factually right?
Conciseness (0–10): Does it avoid unnecessary padding?
Hallucination risk (0–10): Does it cite things it can't know?
Tone (0–10): Does it match expected register?
Usefulness (0–10): Would a real user find this helpful?

A cheap judge model (GPT-4o-mini at ~$0.0001/call) scores each output against your rubric. You run 50 test cases per eval. Total cost: about £0.20 per full run.

Building This in GitHub Actions

Here's the minimal structure:

name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install openai  # adjust to your project's dependencies
      - name: Run evals
        run: python run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check threshold
        run: python check_threshold.py --min-score 7.5

The run_evals.py script:

Loads your golden dataset (JSON file of input/expected-output pairs)
Runs your LLM system on each input
Sends (input, expected, actual) to GPT-4o-mini with your rubric
Aggregates scores by attribute
Writes results to eval_results.json
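
The aggregation and write-out steps in that list can be sketched roughly as follows. This is an illustration, not the actual run_evals.py: the function names and the per-case shape (one dict of attribute scores per test case) are assumptions, and the judge call itself is left out.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_scores(case_scores):
    """Average judge scores per rubric attribute, plus an overall mean.

    case_scores: list of dicts, one per test case, mapping attribute -> score.
    (Hypothetical shape; adapt it to whatever your judge returns.)
    """
    by_attribute = defaultdict(list)
    for scores in case_scores:
        for attribute, value in scores.items():
            by_attribute[attribute].append(value)
    per_attribute = {name: mean(values) for name, values in by_attribute.items()}
    overall = mean(list(per_attribute.values())) if per_attribute else 0.0
    return {"per_attribute": per_attribute, "aggregate": overall}


def write_results(case_scores, path="eval_results.json"):
    """Write the aggregated results where a later CI step can read them."""
    results = aggregate_scores(case_scores)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return results
```

Keeping the per-attribute averages alongside the overall number means a drop in one attribute (say, hallucination risk) stays visible even when the aggregate barely moves.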

If aggregate score drops below your threshold, check_threshold.py exits with code 1 — the PR fails.
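
A threshold check along those lines could look like the sketch below. It assumes eval_results.json has a top-level "aggregate" number, which is a guess at the file layout rather than something the workflow specifies. The --min-score flag matches the workflow; the --results flag is an addition for illustration.

```python
import argparse
import json


def passes_threshold(results, min_score):
    """Return True when the aggregate score meets the minimum."""
    return results["aggregate"] >= min_score


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail CI when eval scores regress.")
    parser.add_argument("--min-score", type=float, required=True)
    parser.add_argument("--results", default="eval_results.json")  # assumed path
    args = parser.parse_args(argv)

    with open(args.results, encoding="utf-8") as f:
        results = json.load(f)

    if passes_threshold(results, args.min_score):
        print(f"PASS: {results['aggregate']:.2f} >= {args.min_score}")
        return 0
    print(f"FAIL: {results['aggregate']:.2f} < {args.min_score}")
    return 1  # a non-zero exit code fails the GitHub Actions step

# As a script, finish the file with: import sys; sys.exit(main())
```

Returning the exit code from main() rather than calling sys.exit() inside it keeps the check easy to unit test.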

A Real Example From Production

I changed a classification system prompt to improve response formatting. The change looked solid in manual testing on 5 examples. But I accidentally dropped a critical piece of context the model needed for correct classification.

Without evals: ships to users. Angry support tickets. Rollback. Lost trust.

With evals: CI caught it in 4 minutes. PR fails. I fix the prompt. Evals pass. Ship confidently.

Golden Datasets: The Hard Part

The hardest part is building your test cases. The key insight: start with failures, not successes.

Every time your LLM system makes a mistake:

Save the input
Write down what the correct output should have been
Add it to your golden dataset

After 2–3 weeks of normal usage, you'll have 30–50 meaningful test cases that represent real failure modes — far more valuable than synthetic test cases you invented upfront.
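
To make the "save every failure" habit cheap, a small helper can append cases to the dataset file. This is a minimal sketch: the field names ("input", "expected", "note") and the function name are assumptions, not a required schema.

```python
import json
from pathlib import Path


def add_failure_case(path, input_text, expected_output, note=""):
    """Append one observed failure to a JSON golden dataset, skipping duplicates.

    Returns the number of cases in the dataset afterwards.
    """
    dataset_path = Path(path)
    if dataset_path.exists():
        cases = json.loads(dataset_path.read_text(encoding="utf-8"))
    else:
        cases = []
    if any(case["input"] == input_text for case in cases):
        return len(cases)  # this input is already recorded
    cases.append({"input": input_text, "expected": expected_output, "note": note})
    dataset_path.write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return len(cases)
```

The note field is a handy place to record why the original output was wrong, which helps later when you write rubric language for that failure mode.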

Multi-Model Comparison

Before committing to an expensive model, run your eval suite across providers:

models = ["gpt-4o-mini", "gpt-4o", "claude-3-5-haiku", "gemini-1.5-flash"]
results = {}
for model in models:
    results[model] = run_eval_suite(model, golden_dataset)

# Sort by (score / cost_per_1k_tokens) to find optimal tradeoff

This stops you from paying for GPT-4o when Claude Haiku scores 92% as well at 20% of the cost.
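
The score-per-cost ranking can be sketched like this. The function name, the input shapes, and the min_score cutoff are assumptions added for illustration; the prices you pass in should come from each provider's current pricing page.

```python
def rank_by_value(scores, cost_per_1k_tokens, min_score=0.0):
    """Sort models by score per unit cost, best value first.

    scores: model name -> average eval score
    cost_per_1k_tokens: model name -> price per 1k tokens (same currency for all)
    min_score: drop models below this quality bar before ranking
    """
    ranked = [
        (model, score, score / cost_per_1k_tokens[model])
        for model, score in scores.items()
        if score >= min_score
    ]
    return sorted(ranked, key=lambda row: row[2], reverse=True)
```

The min_score filter matters: without it, a very cheap model with poor scores can top a pure score/cost ranking even though you would never ship it.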

Cost Optimization

Batch your calls: OpenAI batch API gives 50% discount on async evals
Cache responses: Hash (model + prompt + input) → cache hit avoids re-scoring
Coarse-to-fine: Use a 2-stage system — cheap model filters obvious passes, expensive model only sees borderline cases
PRs to main only: Run the full suite on PRs to main, not on every commit

A well-optimized setup runs 100 eval cases for under £0.10.
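
The caching idea can be sketched as below. It is a minimal in-memory version: the function names are hypothetical, score_fn stands in for whatever call you want to avoid repeating, and a real setup would persist the cache between CI runs.

```python
import hashlib
import json


def cache_key(model, prompt, input_text):
    """Stable hash of (model, prompt, input).

    JSON-encoding the list keeps the parts separate, so ("ab", "c") and
    ("a", "bc") do not collide the way plain string concatenation would.
    """
    payload = json.dumps([model, prompt, input_text], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_score(cache, model, prompt, input_text, score_fn):
    """Return a cached result if present; otherwise compute and store it."""
    key = cache_key(model, prompt, input_text)
    if key not in cache:
        cache[key] = score_fn(model, prompt, input_text)
    return cache[key]
```

Because the prompt is part of the key, editing the prompt invalidates the affected entries automatically, so a prompt change is always re-scored.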

What I've Packaged Up

I've turned this into a complete ready-to-use system in The Indie Hacker's LLM Eval Playbook:

6 golden dataset templates for common LLM tasks (classification, summarization, retrieval, generation, code review, reasoning)
Complete rubric scoring system in Python (copy-paste ready)
Multi-model comparison script with cost-efficiency ranking
GitHub Actions workflow — drop it in your repo and it works
Cost optimization guide with benchmarks

£29 one-time. One avoided production incident pays for it 10× over.

If you have questions about implementing eval-as-code for your specific use case, drop them in the comments — happy to help.

I Built LLM Evaluation-as-Code in CI: Here's How to Avoid Shipping Regressions

Charlie Hadley — Mon, 18 May 2026 16:28:10 +0000

API Rate Limiting Playbook: Protect Your Backend From Abuse

The Problem

Your API is live in production. Traffic is growing. Then one day, a bot discovers your endpoint and starts hammering it with 100,000 requests per second. Your database melts. Your users see 500 errors. You lose revenue and reputation.

Or worse: a malicious actor uses your API to brute-force user accounts. You didn't have rate limiting in place. You're liable.

This is the silent killer of indie SaaS. You ship the product. You don't ship the protection. Then production breaks.

Why Most Indie Teams Skip Rate Limiting

Rate limiting sounds complicated. "Distributed rate limiting"? "Token bucket algorithm"? "Redis backing stores"?

In reality, it's simple. And you don't need expensive tools. You don't need AWS API Gateway ($0.35 per million requests). You don't need third-party middleware.

You need a methodology. Once you have one, the implementation is trivial.

The Three-Layer Strategy

Layer 1: IP-Based Rate Limiting (Nginx)

First line of defense: block obvious bots and abusers at the edge.

limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=auth:10m rate=1r/s;

server {
    location /api/ {
        limit_req zone=general burst=20 nodelay;
    }

    location /api/auth/login {
        limit_req zone=auth burst=3 nodelay;
    }
}

Cost: $0 (Nginx is free).

Setup time: 15 minutes.

Blocks: 95% of bot traffic and accidental DDoS.

Layer 2: User/Token-Based Rate Limiting (Redis + Python)

Your authenticated users have legitimate spikes. A single IP-based rule punishes them unfairly.

Instead, rate limit per API key or user ID:

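import time

import redis
from flask import Flask
from flask_login import current_user  # assumes Flask-Login provides the logged-in user

app = Flask(__name__)
r = redis.Redis()

def is_rate_limited(user_id, limit=100, window_seconds=3600):
    # Fixed-window counter: one Redis key per user per window.
    window = int(time.time() // window_seconds)
    key = f"rate_limit:{user_id}:{window}"
    # Send INCR and EXPIRE in one transaction so the key always gets a TTL.
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    current, _ = pipe.execute()
    return current > limit

@app.route('/api/resource')
def get_resource():
    if is_rate_limited(current_user.id):
        return {'error': 'Rate limit exceeded'}, 429
    return process_request()  # placeholder for your existing handler logic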

Cost: Redis Cloud free tier (up to 30MB).

Setup time: 30 minutes.

Blocks: Authenticated abuse, account enumeration, brute-force attacks.

Layer 3: Endpoint-Specific Thresholds

Different endpoints have different abuse vectors:

Public endpoints (search, info): 100 req/min per IP
Auth endpoints (login, signup): 5 req/min per IP + distributed rate limit
Resource creation (write APIs): 10 req/min per user
Admin endpoints: 1000 req/day per user (tight control)

Document these in your API spec. Expose rate limit headers to clients:

response.headers['X-RateLimit-Limit'] = '100'
response.headers['X-RateLimit-Remaining'] = '87'
response.headers['X-RateLimit-Reset'] = unix_timestamp
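
Rather than hard-coding those values, they can be derived from the counter. The sketch below assumes a fixed-window limiter that exposes the current count for the window; the function name and parameters are illustrative.

```python
import time


def rate_limit_headers(current_count, limit, window_seconds, now=None):
    """Build X-RateLimit-* header values for a fixed-window counter."""
    now = time.time() if now is None else now
    # Fixed windows start on multiples of window_seconds since the Unix epoch.
    window_start = int(now // window_seconds) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - current_count)),
        "X-RateLimit-Reset": str(window_start + window_seconds),
    }
```

Remaining is clamped at zero so a client that has gone over the limit never sees a negative number, and Reset is the Unix timestamp at which the current window ends.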

Real-World Cost Breakdown

Component	Cost
Nginx configuration	$0
Redis Cloud (free tier)	$0
Monitoring + alerts	$0–10/month (CloudWatch or Datadog free tier)
Total	$0–10/month

Compare to AWS API Gateway: at $0.35 per million requests, you reach $3,500/month at roughly 10 billion requests per month.

Implementation Checklist

[ ] Deploy Nginx rate limiting (zone + limit_req directive)
[ ] Set up Redis account (free tier)
[ ] Write rate limit middleware in your framework
[ ] Define endpoint-specific limits
[ ] Add rate limit headers to responses
[ ] Test with Apache Bench or Vegeta load testing tool
[ ] Set up alerts (Slack notification when a user hits limits)
[ ] Document rate limits in your API docs

Time to implement: 2–4 hours.

Cost: $0 (for 95% of use cases).

Common Mistakes to Avoid

Only IP-based limiting: Punishes corporate networks and VPNs.
No graduated response: Ban immediately instead of throttling first.
Storing counts in database: Too slow. Use Redis or in-memory cache.
Not exposing rate limit headers: Clients can't intelligently back off.
Ignoring health check endpoints: Don't rate limit your own monitoring.

Debugging Rate Limit Issues

When a user reports "API blocked", here's how to troubleshoot:

Check Redis keys: redis-cli --scan --pattern "rate_limit:*" (prefer this over KEYS, which blocks Redis while it walks the whole keyspace)
Inspect their request pattern: high burst vs sustained?
Whitelist their IP/user if it's a legitimate use case
Adjust thresholds based on real traffic patterns

Next Steps

This playbook includes:

Ready-to-deploy Nginx configs for all major frameworks
Redis setup guide (AWS ElastiCache, DigitalOcean, Heroku)
Complete Python/Node.js middleware code
GitHub Actions workflow for load testing
Real abuse patterns from production SaaS systems
Cost optimization strategies (cache tiers, fallback limits)
Comprehensive debugging guide
Whitelist/bypass strategies for trusted partners

Implementing rate limiting takes 2–4 hours. Ignoring it costs you production incidents and security breaches.

Deploy today.

How to Catch LLM Regressions in CI: The Rubric-Based Eval System That Works

Charlie Hadley — Mon, 18 May 2026 16:13:08 +0000

How to Run LLM Evaluations in CI Without Paying $249/Month

Charlie Hadley — Mon, 18 May 2026 15:47:46 +0000

If you're building LLM-powered features as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no systematic way to know if they're actually improving after each change.

The obvious answer is Braintrust or LangSmith. But at $249/month minimum, that's a massive commitment for a pre-PMF product. Here's how to build a production-grade eval pipeline for under $5/month.

The Core Architecture

You need three things:

A golden dataset — A CSV of 50-200 test cases covering your edge cases, with input + expected behavior description
A scoring function — LLM-as-judge using GPT-4o-mini (~$0.002 per example)
GitHub Actions integration — Runs your eval suite on every PR with a score threshold check

The magic: your CI pipeline fails the build if average quality drops below your threshold. No more shipping prompt regressions.

Why Rubric-Based Scoring Beats Exact Match

The biggest mistake teams make: they try to match exact output strings. This fails because LLMs are inherently non-deterministic.

Instead, define what "good" looks like as a checklist rubric:

rubric = """
Score this response 0-5, awarding one point for each criterion met:
- Does it answer the question directly? (1 point)
- Is it concise (under 200 words)? (1 point)
- Does it avoid hallucinating specific numbers? (1 point)
- Is the tone professional? (1 point)
- Would a user find this genuinely useful? (1 point)
"""

Then let GPT-4o-mini score each response against this rubric. At $0.002 per evaluation, running 100 test cases costs $0.20.
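
One way to keep checklist scores consistent is to have the judge return a true/false verdict per criterion and do the addition yourself. The sketch below assumes the judge replies with a JSON object using these hypothetical keys; neither the key names nor the reply format come from a specific API.

```python
import json

# Hypothetical keys, one per rubric line.
CRITERIA = (
    "direct_answer",
    "concise",
    "no_invented_numbers",
    "professional_tone",
    "useful",
)


def checklist_score(judge_reply):
    """Turn a judge's JSON verdict into a total out of 5, one point per criterion met."""
    verdict = json.loads(judge_reply)
    missing = [name for name in CRITERIA if name not in verdict]
    if missing:
        raise ValueError(f"judge reply is missing criteria: {missing}")
    return sum(1 for name in CRITERIA if verdict[name] is True)
```

Summing booleans in code removes one source of judge inconsistency: the model only makes five yes/no calls, and the arithmetic is deterministic.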

The GitHub Actions Workflow

name: LLM Eval CI
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run eval suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pip install openai pandas
          python eval/run_suite.py --threshold 3.5

The --threshold 3.5 means: if average score drops below 3.5/5.0, fail the PR. This is your quality gate.

The Multi-Model Comparison Pattern

Before you commit to GPT-4o for your feature, run your eval suite against Claude 3.5 Haiku and Gemini Flash. You'll often find that a cheaper model scores within 0.2 points of the expensive one — at 1/10th the cost.

This comparison takes 10 minutes to set up but can cut your inference costs by 60-80%.

What This Catches in Practice

Real scenario: You change your system prompt to fix a formatting issue. Without evals, you ship it. With evals, your CI run shows classification accuracy dropped from 4.2 to 3.1 on the golden dataset. You investigate, find that your formatting fix accidentally removed context the model needed, and fix it before it hits production.
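
A drop like that is easiest to spot when scores are compared per category against a stored baseline, since an overall average can hide it. A minimal sketch, with a hypothetical function name and a max_drop tolerance you would tune yourself:

```python
def find_regressions(baseline, current, max_drop=0.5):
    """Return categories whose average score fell by more than max_drop.

    baseline, current: category name -> average score on the golden dataset.
    Result maps each regressed category to (old_score, new_score).
    """
    regressions = {}
    for category, old_score in baseline.items():
        new_score = current.get(category)
        if new_score is not None and old_score - new_score > max_drop:
            regressions[category] = (old_score, new_score)
    return regressions
```

With the numbers from the scenario, a baseline of 4.2 and a current score of 3.1 for classification is a drop of 1.1, well past a 0.5 tolerance, so that category would be flagged.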

The moment you catch your first regression in CI, the whole system pays for itself.

Building Your Golden Dataset

Start with 50 examples. Pull them from:

Real user queries you've seen in logs
Edge cases you've worried about
Failure modes you've already shipped by accident

Don't try to write expected outputs. Instead, write rubrics describing what good looks like for each category.

Cost Breakdown

Golden dataset (50 examples): $0.10 per full suite run
GitHub Actions: free tier (2,000 minutes/month)
Total monthly cost for 10 PRs/week: ~$4/month

Compare to Braintrust at $249/month.
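
The monthly figure in the breakdown is simple arithmetic, sketched here so you can plug in your own numbers. It assumes one full suite run per PR and an average of 52/12 weeks per month; both are simplifications, and the function name is illustrative.

```python
def monthly_eval_cost(cost_per_run, prs_per_week, weeks_per_month=52 / 12):
    """Estimated monthly spend on eval runs, assuming one run per PR."""
    return cost_per_run * prs_per_week * weeks_per_month
```

At $0.10 per run and 10 PRs per week this comes to about $4.33 per month. If each PR triggers several runs (one per push, say), multiply accordingly.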

Getting Started

The hardest part isn't the code — it's building the golden dataset and writing good rubrics. Once those exist, the automation is straightforward.

I've packaged the full methodology into a playbook: golden dataset templates, rubric examples, multi-model comparison scripts, and the complete GitHub Actions workflow. Available at hadleyworks.gumroad.com for $29.

What eval setups are others running at small scale? Happy to discuss approaches in the comments.

Evaluating LLMs in Production Without Paying $249/Month for Braintrust

Charlie Hadley — Mon, 18 May 2026 15:02:43 +0000

If you're building an LLM-powered product as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no idea if they're actually getting better (or worse) after each change.

The obvious solution is a dedicated eval platform such as Braintrust, LangSmith, or Humanloop. But at $249/month for meaningful usage, that's a lot of MRR to justify before you've found product-market fit.

Here's what I've been doing instead, using tools you already have.

The Core Problem With Ad-Hoc Evals

Most indie teams do one of three things:

Vibe-check evals — you prompt it, it feels right, you ship
One-shot spreadsheets — you run 20 examples once, never again
Nothing — you just watch for complaints in Discord

None of these catch regressions. When you change a prompt to fix one thing, you break two others, and you won't know for a week.

A Lightweight Eval Stack That Actually Works

Here's the stack: Golden dataset + GitHub Actions + a simple scoring function.

Step 1: Build a Golden Dataset

A golden dataset is just a CSV with input/expected output pairs. Start with 20-50 examples that cover your edge cases:

input,expected_output,tags
"Summarize this legal clause: ...","The clause limits liability to...","legal,summarization"
"What is the capital of France?","Paris","factual,simple"

The key insight: you don't need perfect expected outputs. You need rubric-based scoring, not exact match. Define what "good" looks like as a checklist.
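
Loading a dataset in that shape takes only the standard library. This sketch assumes the three column names shown above and a comma-separated tags field; the function name is hypothetical.

```python
import csv
import io


def load_golden_dataset(csv_text):
    """Parse golden-dataset CSV text into a list of dicts with a list of tags."""
    # skipinitialspace tolerates a space after each comma before a quoted field.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    cases = []
    for row in reader:
        row["tags"] = [tag.strip() for tag in row["tags"].split(",") if tag.strip()]
        cases.append(row)
    return cases
```

Splitting tags into a list up front makes it easy to report scores per tag later, for example summarization versus factual cases.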

Step 2: Write a Scoring Function

For most use cases, a simple LLM-as-judge approach works well:

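import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_response(input_text, actual_output, expected_output):
    prompt = f"""
    Rate this LLM response on a scale of 1-5.

    Input: {input_text}
    Expected: {expected_output}
    Actual: {actual_output}

    Score based on: accuracy, completeness, tone.
    Return JSON: {{"score": X, "reason": "..."}}
    """
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
        temperature=0,  # reduce run-to-run score variance
    )
    return json.loads(result.choices[0].message.content)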

Cost per run: ~$0.002 per example with GPT-4o-mini. Running 50 examples costs $0.10. You can run this on every PR.

Step 3: GitHub Actions Integration

name: LLM Eval Suite
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dependencies
        run: pip install openai  # adjust to your project's dependencies
      - name: Run eval suite
        run: python eval/run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: python eval/check_threshold.py --min-score 3.8

Now every PR shows a score. If it drops below 3.8, the check fails. You've just built CI for your prompts.

What This Doesn't Cover

This approach works great for:

Summarization and extraction tasks
Classification (with expected labels)
RAG retrieval quality
Tone/style adherence

It's harder to apply to:

Open-ended creative tasks
Multi-turn conversations
Tasks where "correct" is deeply subjective

For those cases, you need human-in-the-loop evals — but you can still automate the collection of examples and use the human time only for scoring edge cases.

The Real Win: Regression Detection

The moment this system pays off is when you change your system prompt to improve summarization, run the eval suite, and see that your classification accuracy dropped from 4.2 to 3.1. Without this, you'd ship it and wonder why your churn ticked up next week.

The goal isn't perfect evals. The goal is catching regressions before your users do.

Going Deeper

If you want the full methodology — including golden dataset templates, rubric examples, multi-model comparison scripts, and a GitHub Actions workflow you can clone — I packaged everything into a playbook: The Indie Hacker's LLM Eval Playbook (£25, instant download).

But honestly, the approach above will get you 80% of the way there for free.

The main insight: treat your prompts like code. You wouldn't ship a function without tests. Don't ship a prompt without evals.

What eval setup are you running? Curious what others have found works at small scale — drop a comment below.