Evaluating LLMs in Production Without Paying $249/Month for Braintrust
If you're building an LLM-powered product as an indie hacker or small team, you've probably hit this wall: your prompts work great in the playground, but you have no idea if they're actually getting better (or worse) after each change.
The obvious solution is a dedicated eval platform: Braintrust, LangSmith, Humanloop. But at $249/month for meaningful usage, that's a lot of MRR to justify before you've found product-market fit.
Here's what I've been doing instead, using tools you already have.
The Core Problem With Ad-Hoc Evals
Most indie teams do one of three things:
- Vibe-check evals — you prompt it, it feels right, you ship
- One-shot spreadsheets — you run 20 examples once, never again
- Nothing — you just watch for complaints in Discord
None of these catch regressions. When you change a prompt to fix one thing, you can easily break two others, and you may not find out for a week.
A Lightweight Eval Stack That Actually Works
Here's the stack: Golden dataset + GitHub Actions + a simple scoring function.
Step 1: Build a Golden Dataset
A golden dataset is just a CSV with input/expected output pairs. Start with 20-50 examples that cover your edge cases:
input,expected_output,tags
"Summarize this legal clause: ...","The clause limits liability to...","legal,summarization"
"What is the capital of France?","Paris","factual,simple"
The key insight: you don't need perfect expected outputs. You need rubric-based scoring, not exact match. Define what "good" looks like as a checklist.
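To make the checklist idea concrete, here's a minimal sketch of loading the golden dataset and attaching rubric items by tag. The `RUBRICS` mapping, its wording, and both function names are my own placeholders for illustration, not a fixed format; swap in whatever "good" means for your product.

```python
import csv
import io

# Hypothetical rubric: each tag maps to a checklist of yes/no criteria.
RUBRICS = {
    "summarization": [
        "States the main obligation or limit",
        "Adds no facts that are absent from the input",
    ],
    "factual": [
        "Gives the correct answer",
        "Does not pad the answer with unrelated detail",
    ],
}

def load_golden_dataset(csv_text):
    """Parse golden-dataset CSV text into a list of dicts with a list of tags."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The tags column is itself comma-separated, so split it into a list.
        row["tags"] = [t.strip() for t in row["tags"].split(",") if t.strip()]
        rows.append(row)
    return rows

def checklist_for(row):
    """Collect the rubric items that apply to one example, based on its tags."""
    items = []
    for tag in row["tags"]:
        items.extend(RUBRICS.get(tag, []))  # tags without a rubric add nothing
    return items

# Inline sample so the sketch runs without a file on disk.
SAMPLE = '''input,expected_output,tags
"What is the capital of France?","Paris","factual,simple"
'''
```

Calling `checklist_for(load_golden_dataset(SAMPLE)[0])` returns the two "factual" items, which you can then paste into the judge prompt instead of a vague "accuracy, completeness, tone".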
Step 2: Write a Scoring Function
For most use cases, a simple LLM-as-judge approach works well:
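import json
import openai

def score_response(input_text, actual_output, expected_output):
    # Build the judge prompt; doubled braces keep the JSON example literal in the f-string.
    prompt = f"""
    Rate this LLM response on a scale of 1-5.
    Input: {input_text}
    Expected: {expected_output}
    Actual: {actual_output}
    Score based on: accuracy, completeness, tone.
    Return JSON: {{"score": X, "reason": "..."}}
    """
    # Call arguments are elided here; pass your judge model and the prompt as a message.
    result = openai.chat.completions.create(...)
    # The judge is asked to reply with JSON, so parse the reply into a dict.
    return json.loads(result.choices[0].message.content)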
Cost: roughly $0.002 per example with GPT-4o-mini, so running 50 examples costs about $0.10. That's cheap enough to run on every PR.
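One thing that tends to bite with LLM-as-judge: the judge doesn't always return clean JSON, and a bare `json.loads` will crash the whole run. Here's a rough sketch of a more forgiving parser. The function name `parse_judge_reply` and the choice to return `None` for unusable replies are my assumptions, not part of any library.

```python
import json

def parse_judge_reply(reply_text, min_score=1, max_score=5):
    """Return (score, reason) from a judge reply, or None if the reply is unusable."""
    # Judges sometimes wrap the JSON in prose or code fences,
    # so keep only the text between the first "{" and the last "}".
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    score = data.get("score")
    # Reject missing, non-numeric, or boolean scores (bool is a subclass of int).
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    # Reject scores outside the scale the prompt asked for.
    if not min_score <= score <= max_score:
        return None
    return float(score), str(data.get("reason", ""))
```

Treat a `None` result as "needs a retry or a human look" rather than as a score of zero, otherwise one malformed reply can drag the average down and fail the build for the wrong reason.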
Step 3: GitHub Actions Integration
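name: LLM Eval Suite
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumed dependency: the scoring function imports the openai package.
      - name: Install dependencies
        run: pip install openai
      - name: Run eval suite
        run: python eval/run_evals.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Check score threshold
        run: python eval/check_threshold.py --min-score 3.8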
Now every PR shows a score. If it drops below 3.8, the check fails. You've just built CI for your prompts.
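For reference, the threshold check can be very small. This is a sketch of what `eval/check_threshold.py` might look like, assuming `run_evals.py` writes a JSON list of records with a `score` field to `eval/results.json`. That path, the `--results` flag, and the record shape are my assumptions; only `--min-score` comes from the workflow above.

```python
import argparse
import json

def mean_score(results):
    """Average the 'score' field across result records."""
    if not results:
        raise ValueError("no eval results to check")
    return sum(r["score"] for r in results) / len(results)

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Fail CI if the mean eval score is below a threshold."
    )
    parser.add_argument("--min-score", type=float, required=True)
    # Hypothetical path: assumes run_evals.py saved its results here as a JSON list.
    parser.add_argument("--results", default="eval/results.json")
    args = parser.parse_args(argv)

    with open(args.results) as f:
        results = json.load(f)

    avg = mean_score(results)
    print(f"mean score: {avg:.2f} (threshold {args.min_score:.2f})")
    # Return 0 on pass and 1 on fail; the caller turns this into the exit code.
    return 0 if avg >= args.min_score else 1

# As a script, end the file with: import sys; sys.exit(main())
```

The non-zero exit code is the whole mechanism: GitHub Actions marks the step (and so the PR check) as failed whenever a command exits with anything other than 0.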
What This Doesn't Cover
This approach works great for:
- Summarization and extraction tasks
- Classification (with expected labels)
- RAG retrieval quality
- Tone/style adherence
It's harder to apply to:
- Open-ended creative tasks
- Multi-turn conversations
- Tasks where "correct" is deeply subjective
For those cases, you need human-in-the-loop evals — but you can still automate the collection of examples and use the human time only for scoring edge cases.
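Here's one way that split could look, as a sketch only. The ambiguous score band (2.5 to 3.5), the `always_review` tags, and the function name are all made-up defaults you'd tune for your own data.

```python
def split_for_review(results, low=2.5, high=3.5,
                     always_review=("creative", "multi-turn")):
    """Split results into (auto_scored, human_queue).

    A result goes to the human queue if the judge produced no score, if its
    score falls inside the ambiguous band [low, high], or if it carries a tag
    listed in always_review. Everything else keeps its automatic score.
    """
    auto_scored, human_queue = [], []
    for r in results:
        score = r.get("score")
        ambiguous = score is None or low <= score <= high
        flagged = any(tag in always_review for tag in r.get("tags", []))
        if ambiguous or flagged:
            human_queue.append(r)
        else:
            auto_scored.append(r)
    return auto_scored, human_queue
```

The point of the band is that the judge is most useful at the extremes. Clear passes and clear failures stay automated, and your limited review time goes to the cases in the middle.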
The Real Win: Regression Detection
The moment this system pays off is when you change your system prompt to improve summarization, run the eval suite, and see that your average classification score dropped from 4.2 to 3.1. Without this, you'd ship it and wonder why your churn ticked up next week.
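A single overall average can hide exactly this kind of drop, so it helps to compare scores per tag against a saved baseline run. A minimal sketch, assuming each result record carries a `score` and a `tags` list; the function names and the 0.5 default for `max_drop` are placeholders I picked for illustration.

```python
from collections import defaultdict

def mean_by_tag(results):
    """Average scores per tag; each result is {'score': number, 'tags': [str, ...]}."""
    buckets = defaultdict(list)
    for r in results:
        for tag in r["tags"]:
            buckets[tag].append(r["score"])
    return {tag: sum(scores) / len(scores) for tag, scores in buckets.items()}

def find_regressions(baseline, current, max_drop=0.5):
    """Return {tag: (old_mean, new_mean)} for tags whose mean fell by more than max_drop."""
    old, new = mean_by_tag(baseline), mean_by_tag(current)
    return {
        tag: (old[tag], new[tag])
        for tag in old
        # Only compare tags present in both runs.
        if tag in new and old[tag] - new[tag] > max_drop
    }
```

With this in place, a summarization improvement that quietly costs you a point on classification shows up as a named tag in the output instead of a slightly lower overall number.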
The goal isn't perfect evals. The goal is catching regressions before your users do.
Going Deeper
If you want the full methodology — including golden dataset templates, rubric examples, multi-model comparison scripts, and a GitHub Actions workflow you can clone — I packaged everything into a playbook: The Indie Hacker's LLM Eval Playbook (£25, instant download).
But honestly, the approach above will get you 80% of the way there for free.
The main insight: treat your prompts like code. You wouldn't ship a function without tests. Don't ship a prompt without evals.
What eval setup are you running? Curious what others have found works at small scale — drop a comment below.