Charlie Hadley

Posted on May 18

Evaluating LLMs in Production Without Paying $249/Month for Braintrust

#llm #startup #ai #productivity

Evaluating LLMs in Production Without Paying $249/Month for Braintrust

If you're building an LLM-powered product as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no idea if they're actually getting better (or worse) after each change.

The obvious solution is a dedicated eval platform — Braintrust, Langsmith, Humanloop. But at $249/month for meaningful usage, that's a lot of MRR to justify before you've found product-market fit.

Here's what I've been doing instead, using tools you already have.

The Core Problem With Ad-Hoc Evals

Most indie teams do one of three things:

Vibe-check evals — you prompt it, it feels right, you ship
One-shot spreadsheets — you run 20 examples once, never again
Nothing — you just watch for complaints in Discord

None of these catch regressions. When you change a prompt to fix one thing, you break two others, and you won't know for a week.

A Lightweight Eval Stack That Actually Works

Here's the stack: Golden dataset + GitHub Actions + a simple scoring function.

Step 1: Build a Golden Dataset

A golden dataset is just a CSV with input/expected output pairs. Start with 20-50 examples that cover your edge cases:

input,expected_output,tags
"Summarize this legal clause: ...", "The clause limits liability to...", "legal,summarization"
"What is the capital of France?", "Paris", "factual,simple"

The key insight: you don't need perfect expected outputs. You need rubric-based scoring, not exact match. Define what "good" looks like as a checklist.

Step 2: Write a Scoring Function

For most use cases, a simple LLM-as-judge approach works well:

def score_response(input_text, actual_output, expected_output):
    prompt = f"""
    Rate this LLM response on a scale of 1-5.

    Input: {input_text}
    Expected: {expected_output}  
    Actual: {actual_output}

    Score based on: accuracy, completeness, tone.
    Return JSON: {{"score": X, "reason": "..."}}
    """
    result = openai.chat.completions.create(...)
    return json.loads(result.choices[0].message.content)

Cost per run: ~$0.002 per example with GPT-4o-mini. Running 50 examples costs $0.10. You can run this on every PR.

Step 3: GitHub Actions Integration

name: LLM Eval Suite
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run eval suite
        run: python eval/run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: python eval/check_threshold.py --min-score 3.8

Now every PR shows a score. If it drops below 3.8, the check fails. You've just built CI for your prompts.

What This Doesn't Cover

This approach works great for:

Summarization and extraction tasks
Classification (with expected labels)
RAG retrieval quality
Tone/style adherence

It's harder to apply to:

Open-ended creative tasks
Multi-turn conversations
Tasks where "correct" is deeply subjective

For those cases, you need human-in-the-loop evals — but you can still automate the collection of examples and use the human time only for scoring edge cases.

The Real Win: Regression Detection

The moment this system pays off is when you change your system prompt to improve summarization, run the eval suite, and see that your classification accuracy dropped from 4.2 to 3.1. Without this, you'd ship it and wonder why your churn ticked up next week.

The goal isn't perfect evals. The goal is catching regressions before your users do.

Going Deeper

If you want the full methodology — including golden dataset templates, rubric examples, multi-model comparison scripts, and a GitHub Actions workflow you can clone — I packaged everything into a playbook: The Indie Hacker's LLM Eval Playbook (£25, instant download).

But honestly, the approach above will get you 80% of the way there for free.

The main insight: treat your prompts like code. You wouldn't ship a function without tests. Don't ship a prompt without evals.

What eval setup are you running? Curious what others have found works at small scale — drop a comment below.

DEV Community

Evaluating LLMs in Production Without Paying $249/Month for Braintrust

Evaluating LLMs in Production Without Paying $249/Month for Braintrust

The Core Problem With Ad-Hoc Evals

A Lightweight Eval Stack That Actually Works

Step 1: Build a Golden Dataset

Step 2: Write a Scoring Function

Step 3: GitHub Actions Integration

What This Doesn't Cover

The Real Win: Regression Detection

Going Deeper

Top comments (0)