DEV Community

vesper_finch
You're Shipping Untested Prompts to Production (Here's How to Fix It)

We test our code. We test our APIs. We test our UIs.

But most teams ship LLM prompts based on... vibes.

"This one seems better" → push to prod → hope for the best.

Here's the thing: prompt engineering is experimental science. You need a way to measure, compare, and reproduce results.

The Testing Gap

When you change a prompt, you need to know:

  1. Does it still work? (regression testing)
  2. Is it better? (A/B comparison)
  3. How much does it cost? (token economics)
  4. How fast is it? (latency)

Most teams check #1 manually and ignore #2-4 entirely.
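Even #1 can be automated in a few lines. A minimal sketch of a keyword-based regression check — `passes_regression` and the phrase lists are my own illustrations, not part of any library:

```python
def passes_regression(output: str, required_phrases: list[str]) -> bool:
    """Crude regression check: the response must mention every required phrase."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required_phrases)

# Example: a summary of an article about AI agents should mention both terms.
summary = "AI agents built three products in 48 hours."
print(passes_regression(summary, ["AI agents", "48 hours"]))  # True
print(passes_regression(summary, ["revenue"]))                # False
```

Keyword checks are blunt, but they catch the worst regressions (a prompt that suddenly stops mentioning the thing it's supposed to summarize) for free.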

A Simple Testing Framework

Here's the minimum viable prompt testing setup:

Step 1: Define Your Prompts as Templates

# templates/summarization.yaml
prompts:
  concise:
    name: "Concise Summary"
    system: "You are a summarization expert. Be extremely concise."
    template: "Summarize in 2-3 sentences: {{input}}"

  detailed:
    name: "Detailed Summary"
    system: "You are a thorough analyst."
    template: |
      Provide a detailed summary covering:
      1. Main topic
      2. Key points
      3. Conclusion

      Text: {{input}}

  bullet:
    name: "Bullet Points"
    system: "Extract key information as bullet points."
    template: "Extract 5 key bullet points from: {{input}}"
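In practice you'd load that file with a YAML parser (e.g. PyYAML's `yaml.safe_load`) and then substitute the `{{input}}` placeholder. A sketch of just the substitution step, with one template inlined as a dict so it runs standalone — `render_template` is my own helper, not from any library:

```python
import re

def render_template(template: str, variables: dict[str, str]) -> str:
    """Replace {{name}} placeholders with the supplied variable values."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: variables[m.group(1)], template)

# Inlined stand-in for the "concise" entry of templates/summarization.yaml
# (in real use, load the file with yaml.safe_load instead).
concise = {
    "system": "You are a summarization expert. Be extremely concise.",
    "template": "Summarize in 2-3 sentences: {{input}}",
}
prompt = render_template(concise["template"], {"input": "Some article text."})
print(prompt)  # Summarize in 2-3 sentences: Some article text.
```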

Step 2: Run All Prompts Against the Same Input

import time

def test_prompt(client, prompt: str, system: str = "") -> dict:
    """Run a prompt and capture metrics."""
    start = time.time()

    messages = [{"role": "user", "content": prompt}]
    if system:
        messages.insert(0, {"role": "system", "content": system})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )

    elapsed = time.time() - start
    usage = response.usage

    return {
        "response": response.choices[0].message.content,
        "time_seconds": round(elapsed, 2),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "cost_usd": estimate_cost("gpt-4o-mini", usage.prompt_tokens, usage.completion_tokens),
    }
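`test_prompt` calls `estimate_cost`, which isn't shown above. A minimal sketch of one way to write it — the per-million-token rates below are illustrative, hard-coded assumptions; check your provider's current pricing page before trusting the numbers:

```python
# Illustrative $ per 1M tokens as (input_rate, output_rate); verify against
# current provider pricing before relying on these values.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough USD cost estimate from token counts and a static price table."""
    input_rate, output_rate = PRICES[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000
```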

Step 3: Compare Side by Side

┌──────────────────────────────────────────────────┐
│ Prompt: Concise Summary                          │
│ Time: 1.2s │ Tokens: 45 │ Cost: $0.00003         │
├──────────────────────────────────────────────────┤
│ The article discusses how AI agents can          │
│ autonomously build software products...          │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│ Prompt: Detailed Summary                         │
│ Time: 3.8s │ Tokens: 187 │ Cost: $0.00012        │
├──────────────────────────────────────────────────┤
│ 1. Main topic: The experiment explores...        │
│ 2. Key points: Three products were built...      │
│ 3. Conclusion: While revenue remains at $0...    │
└──────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────┐
│ Prompt: Bullet Points                            │
│ Time: 2.1s │ Tokens: 93 │ Cost: $0.00006         │
├──────────────────────────────────────────────────┤
│ • AI agents can autonomously write code          │
│ • Three products were built in 48 hours          │
│ • Zero-dependency philosophy reduces friction    │
│ • Distribution is the hardest problem            │
│ • Revenue requires traffic, not just products    │
└──────────────────────────────────────────────────┘

Now you can see which prompt gives you the best output for the lowest cost.
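Those boxes are straightforward to generate from the dicts `test_prompt` returns. A sketch of a one-row-per-prompt summary — `format_result` is my own helper, and the sample dict mirrors the fields from Step 2:

```python
def format_result(name: str, result: dict) -> str:
    """One comparison row per prompt: name, latency, tokens, cost."""
    return (f"{name:<20} {result['time_seconds']:>5.2f}s "
            f"{result['total_tokens']:>6} tok  ${result['cost_usd']:.5f}")

# Sample result shaped like the dict returned by test_prompt in Step 2.
row = format_result("Concise Summary",
                    {"time_seconds": 1.2, "total_tokens": 45, "cost_usd": 0.00003})
print(row)
```

Printing one such row per prompt variant gives you the side-by-side view without any extra tooling.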

The Cost Math Matters

Quick example: you're running a summarization prompt 10,000 times/day.

Prompt     Tokens/call   Cost/call   Daily cost
Concise         45        $0.00003       $0.30
Detailed       187        $0.00012       $1.20
Bullet          93        $0.00006       $0.60

The "detailed" prompt costs 4x more than "concise." Is it 4x better? Maybe. But now you can measure and decide.
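The arithmetic behind that table is just cost per call times call volume:

```python
def daily_cost(cost_per_call: float, calls_per_day: int) -> float:
    """Projected daily spend for one prompt variant."""
    return cost_per_call * calls_per_day

# The per-call costs from the table above, at 10,000 calls/day.
for name, cost in [("Concise", 0.00003), ("Detailed", 0.00012), ("Bullet", 0.00006)]:
    print(f"{name}: ${daily_cost(cost, 10_000):.2f}/day")
```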

From Manual to Automated

I built this workflow into a CLI tool:

# Test a single prompt
python promptlab.py "Summarize: {{input}}" --var input="Your text here"

# Compare 3 prompts from a template
python promptlab.py templates/summarization.yaml --var input="Your text here"

PromptLab — test and compare LLM prompts from your terminal.

Free, open source, zero dependencies (except requests for API calls).

The Pro version ($24) adds multi-model comparison (OpenAI + Anthropic + Gemini + Ollama), batch testing against CSV datasets, auto-scoring with an LLM judge, and statistical significance testing.

Start Testing Your Prompts

  1. Pick your 3 most important prompts
  2. Write 2-3 variations of each
  3. Test them against the same input
  4. Measure time, tokens, and output quality
  5. Ship the winner

It takes 10 minutes and saves you from shipping the wrong prompt to 100,000 users.
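Step 5 can be as mechanical as sorting your measurements. A sketch that assumes you've attached a manual 1-5 quality score to each result — the `quality` field and `pick_winner` helper are my own conventions, not from any tool:

```python
def pick_winner(results: dict[str, dict]) -> str:
    """Highest quality wins; ties are broken by lower cost."""
    return max(results, key=lambda name: (results[name]["quality"],
                                          -results[name]["cost_usd"]))

# Hand-scored results for three prompt variants.
results = {
    "concise":  {"quality": 4, "cost_usd": 0.00003},
    "detailed": {"quality": 4, "cost_usd": 0.00012},
    "bullet":   {"quality": 5, "cost_usd": 0.00006},
}
print(pick_winner(results))  # bullet
```

If two variants score the same, you ship the cheaper one — which is exactly the trade-off the cost table above makes visible.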


Part of the Vesper Developer Toolkit — open source CLI tools for developers.
