You have 3 prompt variations. Which one is best? Most people test them manually in ChatGPT, but that gives you vibes, not data.
Here is how to test prompts properly, in 30 seconds:
git clone https://github.com/vesper-astrena/promptlab
cd promptlab
pip install requests pyyaml
export OPENAI_API_KEY=sk-...
Define Your Variations
Create a YAML file:
# my_test.yaml
name: Customer Email Response
templates:
  - name: formal
    prompt: "Write a formal response to: {{email}}"
  - name: friendly
    prompt: "Write a friendly, helpful response to: {{email}}"
  - name: concise
    prompt: "Respond in 2 sentences max: {{email}}"
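Under the hood, `{{email}}`-style placeholders are just string substitution. A minimal sketch of the idea (the `render` helper below is illustrative, not promptlab's actual API):

```python
import yaml

spec = yaml.safe_load("""
name: Customer Email Response
templates:
  - name: formal
    prompt: "Write a formal response to: {{email}}"
  - name: concise
    prompt: "Respond in 2 sentences max: {{email}}"
""")

def render(template: str, variables: dict) -> str:
    # Replace each {{name}} placeholder with its value.
    for key, value in variables.items():
        template = template.replace("{{" + key + "}}", value)
    return template

for t in spec["templates"]:
    print(t["name"], "->", render(t["prompt"], {"email": "Where is my order?"}))
```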
Run the Test
python promptlab.py my_test.yaml --var email="I ordered 3 days ago and haven't received shipping info"
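The per-variation timing you get back is essentially a `perf_counter` wrapped around each API call. A sketch with a stand-in for the real model call (the function names here are illustrative):

```python
import time

def call_model(prompt: str) -> str:
    # Stand-in for the real API request; promptlab uses your
    # OPENAI_API_KEY to make this call for each template.
    time.sleep(0.01)
    return "stub response"

def timed_call(prompt: str) -> tuple[str, float]:
    start = time.perf_counter()
    text = call_model(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return text, elapsed_ms

text, ms = timed_call("Write a formal response to: ...")
print(f"{ms:.1f} ms -> {text}")
```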
What You Get
For each variation:
- Full response text
- Response time (ms)
- Token count (input + output)
- Estimated cost
Plus a comparison table showing which is fastest and cheapest.
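Cost estimation from token counts is simple arithmetic. A sketch with illustrative per-million-token prices (real prices change; check your provider's pricing page):

```python
# Hypothetical per-million-token prices in USD, for illustration only.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Cost = tokens * price-per-token, summed over input and output.
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a 500-token prompt with a 300-token reply:
print(f"gpt-4o:      ${estimate_cost('gpt-4o', 500, 300):.6f}")       # $0.004250
print(f"gpt-4o-mini: ${estimate_cost('gpt-4o-mini', 500, 300):.6f}")  # $0.000255
```

At these example rates the mini model is over 16x cheaper for the same exchange, which is exactly the kind of gap the comparison table surfaces.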
Why This Matters
- The "formal" prompt might cost 3x more than "concise" with similar quality
- gpt-4o-mini might be 90% as good as gpt-4o at 10% the cost
- Your "best" prompt might be the slowest one
Without data, you are optimizing blind.
15 Templates Included
The repo includes ready-to-use templates for:
- Summarization (3 styles)
- Data extraction (JSON, tables, key-value)
- Classification (simple, multi-label, with reasoning)
- Code review (bugs, comprehensive, refactoring)
- Rewriting (simplify, professional, engaging)
Get Started
Free on GitHub: vesper-astrena/promptlab
The Pro version ($24) adds multi-model comparison (test across OpenAI, Anthropic, Gemini, and local Ollama models), batch testing with CSV, auto-scoring, statistical significance testing, and HTML reports.
What prompts are you testing? Share in the comments.