We test our code. We test our APIs. We test our UIs.
But most teams ship LLM prompts based on... vibes.
"This one seems better" → push to prod → hope for the best.
Here's the thing: prompt engineering is experimental science. You need a way to measure, compare, and reproduce results.
## The Testing Gap
When you change a prompt, you need to know:
- Does it still work? (regression testing)
- Is it better? (A/B comparison)
- How much does it cost? (token economics)
- How fast is it? (latency)
Most teams check #1 manually and ignore #2-4 entirely.
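Regression checks (#1) don't need heavy tooling. A minimal sketch, assuming you keep a few "golden" inputs with keywords the output must contain — the `check_output` helper and the keyword list here are illustrative, not part of any library:

```python
def check_output(output: str, required_keywords: list[str]) -> list[str]:
    """Return the keywords missing from the model output (empty list = pass)."""
    lowered = output.lower()
    return [kw for kw in required_keywords if kw.lower() not in lowered]

# Golden case: a summary of an article about AI agents should mention these.
missing = check_output(
    "AI agents built three products in 48 hours.",
    ["AI agents", "products"],
)
assert missing == []  # the prompt change didn't break this case
```

Run this after every prompt edit; a non-empty list is a regression, caught before production.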
## A Simple Testing Framework
Here's the minimum viable prompt testing setup:
### Step 1: Define Your Prompts as Templates
```yaml
# templates/summarization.yaml
prompts:
  concise:
    name: "Concise Summary"
    system: "You are a summarization expert. Be extremely concise."
    template: "Summarize in 2-3 sentences: {{input}}"
  detailed:
    name: "Detailed Summary"
    system: "You are a thorough analyst."
    template: |
      Provide a detailed summary covering:
      1. Main topic
      2. Key points
      3. Conclusion
      Text: {{input}}
  bullet:
    name: "Bullet Points"
    system: "Extract key information as bullet points."
    template: "Extract 5 key bullet points from: {{input}}"
```
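Rendering the `{{input}}` placeholders doesn't require a templating library; a standard-library sketch is enough (the `render` helper below is illustrative, not a PromptLab API):

```python
import re

def render(template: str, variables: dict[str, str]) -> str:
    """Substitute {{name}} placeholders; fail loudly if a variable is missing."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return variables[name]
    return re.sub(r"\{\{(\w+)\}\}", replace, template)

print(render("Summarize in 2-3 sentences: {{input}}", {"input": "Some article text."}))
# → Summarize in 2-3 sentences: Some article text.
```

Raising on a missing variable matters: a silently unfilled `{{input}}` sent to the API is a wasted (and misleading) test run.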
### Step 2: Run All Prompts Against the Same Input
```python
import time

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost estimate. Rates in USD per 1M tokens — verify current pricing."""
    rates = {"gpt-4o-mini": (0.15, 0.60)}  # (input, output)
    input_rate, output_rate = rates[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

def test_prompt(client, prompt: str, system: str = "") -> dict:
    """Run a prompt and capture metrics."""
    start = time.time()
    messages = [{"role": "user", "content": prompt}]
    if system:
        messages.insert(0, {"role": "system", "content": system})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    elapsed = time.time() - start
    usage = response.usage
    return {
        "response": response.choices[0].message.content,
        "time_seconds": round(elapsed, 2),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "cost_usd": estimate_cost("gpt-4o-mini", usage.prompt_tokens, usage.completion_tokens),
    }
```
### Step 3: Compare Side by Side
```
┌─────────────────────────────────────────────────┐
│ Prompt: Concise Summary                         │
│ Time: 1.2s │ Tokens: 45 │ Cost: $0.00003        │
├─────────────────────────────────────────────────┤
│ The article discusses how AI agents can         │
│ autonomously build software products...         │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Prompt: Detailed Summary                        │
│ Time: 3.8s │ Tokens: 187 │ Cost: $0.00012       │
├─────────────────────────────────────────────────┤
│ 1. Main topic: The experiment explores...       │
│ 2. Key points: Three products were built...     │
│ 3. Conclusion: While revenue remains at $0...   │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Prompt: Bullet Points                           │
│ Time: 2.1s │ Tokens: 93 │ Cost: $0.00006        │
├─────────────────────────────────────────────────┤
│ • AI agents can autonomously write code         │
│ • Three products were built in 48 hours         │
│ • Zero-dependency philosophy reduces friction   │
│ • Distribution is the hardest problem           │
│ • Revenue requires traffic, not just products   │
└─────────────────────────────────────────────────┘
```
Now you can see which prompt gives you the best output for the lowest cost.
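Building that view is plain string formatting over the result dicts that `test_prompt` returns. A sketch, sorted cheapest-first (the column choices are mine, not a fixed PromptLab format):

```python
def comparison_table(results: dict[str, dict]) -> str:
    """Render {prompt_name: metrics} as an aligned text table, cheapest first."""
    header = f"{'Prompt':<20} {'Time (s)':>8} {'Tokens':>7} {'Cost ($)':>9}"
    lines = [header, "-" * len(header)]
    for name, r in sorted(results.items(), key=lambda kv: kv[1]["cost_usd"]):
        lines.append(
            f"{name:<20} {r['time_seconds']:>8.1f} "
            f"{r['total_tokens']:>7} {r['cost_usd']:>9.5f}"
        )
    return "\n".join(lines)

results = {
    "Concise Summary": {"time_seconds": 1.2, "total_tokens": 45, "cost_usd": 0.00003},
    "Detailed Summary": {"time_seconds": 3.8, "total_tokens": 187, "cost_usd": 0.00012},
}
print(comparison_table(results))
```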
## The Cost Math Matters
Quick example: you're running a summarization prompt 10,000 times/day.
| Prompt | Tokens/call | Cost/call | Daily cost |
|---|---|---|---|
| Concise | 45 | $0.00003 | $0.30 |
| Detailed | 187 | $0.00012 | $1.20 |
| Bullet | 93 | $0.00006 | $0.60 |
The "detailed" prompt costs 4x as much as "concise." Is it 4x better? Maybe. But now you can measure and decide.
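That table is just `cost per call × calls per day`, but it's worth scripting so the numbers update when your volume or the model's pricing changes. The per-call costs below mirror the table above; verify current pricing against your provider:

```python
def daily_cost(cost_per_call: float, calls_per_day: int) -> float:
    """Project per-call cost to a daily spend."""
    return cost_per_call * calls_per_day

for name, cost in [("concise", 0.00003), ("detailed", 0.00012), ("bullet", 0.00006)]:
    print(f"{name}: ${daily_cost(cost, 10_000):.2f}/day")
# → concise: $0.30/day, detailed: $1.20/day, bullet: $0.60/day
```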
## From Manual to Automated
I built this workflow into a CLI tool:
```shell
# Test a single prompt
python promptlab.py "Summarize: {{input}}" --var input="Your text here"

# Compare 3 prompts from a template
python promptlab.py templates/summarization.yaml --var input="Your text here"
```
PromptLab — test and compare LLM prompts from your terminal.
Free, open source, zero dependencies (except requests for API calls).
The Pro version ($24) adds multi-model comparison (OpenAI + Anthropic + Gemini + Ollama), batch testing against CSV datasets, auto-scoring with an LLM judge, and statistical significance testing.
## Start Testing Your Prompts
- Pick your 3 most important prompts
- Write 2-3 variations of each
- Test them against the same input
- Measure time, tokens, and output quality
- Ship the winner
It takes 10 minutes and saves you from shipping the wrong prompt to 100,000 users.
Part of the Vesper Developer Toolkit — open source CLI tools for developers.