We test our code. We test our APIs. We test our UIs.
But most teams ship LLM prompts based on... vibes.
"This one seems better" → push to prod → hope for the best.
Here's the thing: prompt engineering is experimental science. You need a way to measure, compare, and reproduce results.
## The Testing Gap
When you change a prompt, you need to know:
- Does it still work? (regression testing)
- Is it better? (A/B comparison)
- How much does it cost? (token economics)
- How fast is it? (latency)
Most teams check #1 manually and ignore #2-4 entirely.
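Regression checks (#1) don't need heavy tooling. A minimal sketch, assuming you keep a few "golden" inputs with keywords the output must contain — the `check_output` helper and the keyword list here are illustrative, not part of any library:

```python
def check_output(output: str, required_keywords: list[str]) -> list[str]:
    """Return the keywords missing from the model output (empty list = pass)."""
    lowered = output.lower()
    return [kw for kw in required_keywords if kw.lower() not in lowered]

# Golden case: a summary of an article about AI agents should mention these.
missing = check_output(
    "AI agents built three products in 48 hours.",
    ["AI agents", "products"],
)
assert missing == []  # the prompt change didn't break this case
```

Run this after every prompt edit; a non-empty list is a regression, caught before production.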
## A Simple Testing Framework
Here's the minimum viable prompt testing setup:
### Step 1: Define Your Prompts as Templates
```yaml
# templates/summarization.yaml
prompts:
  concise:
    name: "Concise Summary"
    system: "You are a summarization expert. Be extremely concise."
    template: "Summarize in 2-3 sentences: {{input}}"
  detailed:
    name: "Detailed Summary"
    system: "You are a thorough analyst."
    template: |
      Provide a detailed summary covering:
      1. Main topic
      2. Key points
      3. Conclusion
      Text: {{input}}
  bullet:
    name: "Bullet Points"
    system: "Extract key information as bullet points."
    template: "Extract 5 key bullet points from: {{input}}"
```
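Rendering the `{{input}}` placeholders doesn't require a templating library; a standard-library sketch is enough (the `render` helper below is illustrative, not a PromptLab API):

```python
import re

def render(template: str, variables: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; fail loudly if a variable is missing."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", replace, template)

print(render("Summarize in 2-3 sentences: {{input}}", {"input": "Some article text."}))
# → Summarize in 2-3 sentences: Some article text.
```

Raising on a missing variable matters: a silently unfilled `{{input}}` sent to the API is a wasted (and misleading) test run.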
### Step 2: Run All Prompts Against the Same Input
```python
import time

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost estimate. Rates in USD per 1M tokens — verify current pricing."""
    rates = {"gpt-4o-mini": (0.15, 0.60)}  # (input, output)
    input_rate, output_rate = rates[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

def test_prompt(client, prompt: str, system: str = "") -> dict:
    """Run a prompt and capture metrics."""
    start = time.time()
    messages = [{"role": "user", "content": prompt}]
    if system:
        messages.insert(0, {"role": "system", "content": system})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    elapsed = time.time() - start
    usage = response.usage
    return {
        "response": response.choices[0].message.content,
        "time_seconds": round(elapsed, 2),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "cost_usd": estimate_cost("gpt-4o-mini", usage.prompt_tokens, usage.completion_tokens),
    }
```
### Step 3: Compare Side by Side
```
┌─────────────────────────────────────────────────┐
│ Prompt: Concise Summary                         │
│ Time: 1.2s │ Tokens: 45 │ Cost: $0.00003        │
├─────────────────────────────────────────────────┤
│ The article discusses how AI agents can         │
│ autonomously build software products...         │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Prompt: Detailed Summary                        │
│ Time: 3.8s │ Tokens: 187 │ Cost: $0.00012       │
├─────────────────────────────────────────────────┤
│ 1. Main topic: The experiment explores...       │
│ 2. Key points: Three products were built...     │
│ 3. Conclusion: While revenue remains at $0...   │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Prompt: Bullet Points                           │
│ Time: 2.1s │ Tokens: 93 │ Cost: $0.00006        │
├─────────────────────────────────────────────────┤
│ • AI agents can autonomously write code         │
│ • Three products were built in 48 hours         │
│ • Zero-dependency philosophy reduces friction   │
│ • Distribution is the hardest problem           │
│ • Revenue requires traffic, not just products   │
└─────────────────────────────────────────────────┘
```
Now you can see which prompt gives you the best output for the lowest cost.
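Building that view is plain string formatting over the result dicts that `test_prompt` returns. A sketch, sorted cheapest-first (the column choices are mine, not a fixed PromptLab format):

```python
def comparison_table(results: dict[str, dict]) -> str:
    """Render {prompt_name: metrics} as an aligned text table, cheapest first."""
    header = f"{'Prompt':<20} {'Time (s)':>8} {'Tokens':>7} {'Cost ($)':>9}"
    lines = [header, "-" * len(header)]
    for name, r in sorted(results.items(), key=lambda kv: kv[1]["cost_usd"]):
        lines.append(
            f"{name:<20} {r['time_seconds']:>8.1f} "
            f"{r['total_tokens']:>7} {r['cost_usd']:>9.5f}"
        )
    return "\n".join(lines)

results = {
    "Concise Summary": {"time_seconds": 1.2, "total_tokens": 45, "cost_usd": 0.00003},
    "Detailed Summary": {"time_seconds": 3.8, "total_tokens": 187, "cost_usd": 0.00012},
}
print(comparison_table(results))
```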
## The Cost Math Matters
Quick example: you're running a summarization prompt 10,000 times/day.
| Prompt | Tokens/call | Cost/call | Daily cost |
|---|---|---|---|
| Concise | 45 | $0.00003 | $0.30 |
| Detailed | 187 | $0.00012 | $1.20 |
| Bullet | 93 | $0.00006 | $0.60 |
The "detailed" prompt costs 4x as much as "concise." Is it 4x better? Maybe. But now you can measure and decide.
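That table is just `cost per call × calls per day`, but it's worth scripting so the numbers update when your volume or the model's pricing changes. The per-call costs below mirror the table above; verify current pricing against your provider:

```python
def daily_cost(cost_per_call: float, calls_per_day: int) -> float:
    """Project per-call cost to a daily spend."""
    return cost_per_call * calls_per_day

for name, cost in [("concise", 0.00003), ("detailed", 0.00012), ("bullet", 0.00006)]:
    print(f"{name}: ${daily_cost(cost, 10_000):.2f}/day")
# → concise: $0.30/day, detailed: $1.20/day, bullet: $0.60/day
```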
## From Manual to Automated
I built this workflow into a CLI tool:
```shell
# Test a single prompt
python promptlab.py "Summarize: {{input}}" --var input="Your text here"

# Compare 3 prompts from a template
python promptlab.py templates/summarization.yaml --var input="Your text here"
```
PromptLab — test and compare LLM prompts from your terminal.
Free, open source, zero dependencies (except requests for API calls).
The Pro version ($24) adds multi-model comparison (OpenAI + Anthropic + Gemini + Ollama), batch testing against CSV datasets, auto-scoring with an LLM judge, and statistical significance testing.
## Start Testing Your Prompts
- Pick your 3 most important prompts
- Write 2-3 variations of each
- Test them against the same input
- Measure time, tokens, and output quality
- Ship the winner
It takes 10 minutes and saves you from shipping the wrong prompt to 100,000 users.
Part of the Vesper Developer Toolkit — open source CLI tools for developers.