Nobody knows when a prompt got worse.
I mean that literally. A teammate tweaks a system prompt on Tuesday - removes a constraint, adds a sentence. By Thursday the AI assistant is giving shorter, vaguer answers. Support tickets go up. On Monday someone finally traces it back to that one edit.
No diff. No test. No review. The prompt lived in a Google Doc, and someone changed it the way you’d change a wiki page.
I keep hearing the same story from different teams, and it always plays out the same way.
Prompts drift. Nobody notices.
Once LLM systems move from prototype to production, a specific failure mode kicks in. Prompts drift - not dramatically, but incrementally. Someone adjusts wording. Someone switches the model. Someone adds a paragraph of context. Each change seems fine on its own. Nobody measures the cumulative effect.
This is the LLM equivalent of configuration drift. The prompt that worked last month might not work today, and without measurement, you won’t know until a user complains or your API bill spikes with no explanation.
The deeper issue: prompt quality is treated as vibes. An engineer writes a prompt, eyeballs the output, decides it’s “good enough,” and ships. No baseline. No regression check. No way to compare version A against version B.
For any other production component, we’d call that reckless.
How this started
I kept writing the same prompt structure by hand - persona, task decomposition, constraints, output format - and I kept doing it badly. Inconsistently. Forgetting sections. So I built a CLI that takes a rough intent and produces a structured prompt. No LLM call - the engine is rule-based and deterministic. It classifies your intent into one of eleven task types and applies the right structure.
Here’s what that looks like:
$ promptctl create "review auth middleware for security issues"
# Generates a structured prompt with:
# - Security engineer persona
# - OWASP-aligned review checklist
# - Specific output format (findings table, severity, remediation)
$ promptctl cost --compare
# Shows token cost across 10 models before you spend anything
# Claude Sonnet 4.5: ~$0.018
# GPT-5: ~$0.022
# DeepSeek V3: ~$0.003
# Llama 4 Maverick: ~$0.001
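The classification step behind that first command is just deterministic pattern matching. Here is a toy version of the idea; the task types and keywords are made up for illustration, not promptctl's actual rules:

```shell
# A toy, rule-based intent classifier in the same spirit as the engine
# described above. Task types and keywords are assumptions.
classify_intent() {
  case "$1" in
    *review*|*audit*)      echo "code-review" ;;
    *summarize*|*summary*) echo "summarization" ;;
    *build*|*implement*)   echo "implementation" ;;
    *)                     echo "general" ;;
  esac
}

classify_intent "review auth middleware for security issues"   # prints code-review
```

No LLM call, no randomness: the same intent always maps to the same task type, which is what makes the rest of the pipeline reproducible.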
Useful, but the real problem showed up when I looked at cost patterns.
Unstructured prompts waste money in ways that aren’t obvious. The model rambles because nothing constrains it. You send follow-ups because the first response missed the point. Each retry costs tokens. Structured prompts consistently cost 55-71% less across ten models - not because the model gets cheaper, but because you need fewer calls and get tighter output.
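The arithmetic behind that range is easy to sketch. The token counts and price below are illustrative assumptions, not measurements from the tool:

```shell
# Back-of-envelope cost comparison. All numbers are assumed for
# illustration; the unstructured count includes the retry it forces.
tok_unstructured=3200   # rambling first answer plus one follow-up
tok_structured=1100     # single, tightly constrained response
price_per_mtok=3.00     # assumed $ per 1M output tokens
summary=$(awk -v u="$tok_unstructured" -v s="$tok_structured" -v p="$price_per_mtok" \
  'BEGIN { printf "unstructured: $%.4f  structured: $%.4f  saving: %.0f%%", u*p/1e6, s*p/1e6, (1 - s/u) * 100 }')
echo "$summary"
```

At these assumed numbers the structured version comes out about 66% cheaper, squarely inside the 55-71% range: the saving comes from fewer tokens and fewer calls, not a cheaper model.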
What surprised me: showing engineers the cost before they hit send changed their behavior more than any other feature. People make different choices when the price tag is visible.
Treating prompts like code
The interesting shift happened when I started thinking about prompts the way I think about any other artifact that changes over time.
Here’s the pipeline:
Versioning. A prompt template is a YAML file with variables and metadata. When you change the body, you bump the version. Old versions stay. You can diff them, you can roll back. We do this with code. Almost nobody does it with prompts.
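A minimal sketch of what that looks like on disk. The schema here is a hypothetical stand-in, not promptctl's actual template format:

```shell
# Create a versioned prompt template as a plain YAML file.
# The field names and layout are assumptions for illustration.
mkdir -p prompts
cat > prompts/code-review.yaml <<'EOF'
name: code-review
version: 2            # bumped from 1 when the body changed
variables: [language, diff]
body: |
  You are a senior {language} reviewer. Review the diff below.
  Constraints: cite line numbers; flag security issues first.
  Diff: {diff}
EOF

# Because it's a file in the repo, the usual machinery applies:
#   git diff v1 -- prompts/code-review.yaml    # compare versions
#   git checkout v1 -- prompts/code-review.yaml  # roll back
```

Nothing exotic: the point is that a prompt stored this way gets diffs, history, and rollback for free.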
Quality scoring. The --score flag evaluates prompt structure on a 0-100 scale: does the prompt match the intent, does it have clear sections and constraints, are there common mistakes like duplicate instructions or vague asks. It’s not a replacement for judgment - but it catches obvious failures before you burn tokens.
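The flavor of a deterministic structural check is easy to sketch. The section names and weights below are assumptions, not promptctl's actual rubric:

```shell
# Toy structural scorer: award points for expected sections,
# penalize duplicate lines. Sections and weights are assumptions.
score_structure() {
  f=$1; score=0
  grep -qi '^persona:'       "$f" && score=$((score + 25))
  grep -qi '^constraints:'   "$f" && score=$((score + 25))
  grep -qi '^output format:' "$f" && score=$((score + 25))
  dups=$(sort "$f" | uniq -d | wc -l)   # duplicate instructions
  [ "$dups" -eq 0 ] && score=$((score + 25))
  echo "$score"
}

printf 'persona: security engineer\nconstraints: cite line numbers\noutput format: findings table\n' > sample-prompt.txt
score_structure sample-prompt.txt   # prints 100
```

Crude, but it illustrates why this kind of check is cheap and reproducible: it never calls a model, so the same prompt always gets the same score.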
Baseline comparison. This is where it starts to feel like actual engineering. Concrete example:
You have a code review template. Version 1 scores 82. You tweak the constraints and persona - that’s version 2. Before shipping v2, you evaluate it against the v1 baseline:
$ promptctl score --template=code-review --baseline=v1
Template: code-review
Current (v2): 78/100
Baseline (v1): 82/100
Delta: -4
⚠ Regression detected. v2 scores lower than baseline.
- Structural quality dropped (missing constraint section)
- Fidelity unchanged
$ promptctl record --template=code-review
# Saves v2 score to benchmark history
# Next comparison will use v2 as the new baseline
Wire that into CI and prompt changes get the same regression gate as code changes. A PR that degrades prompt quality fails the pipeline, same as a PR that breaks a test.
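A minimal version of that gate, assuming you can extract the current and baseline scores as bare 0-100 integers (the exact promptctl output format in a script context is an assumption here):

```shell
# Minimal CI regression gate for prompt scores.
# Assumes scores arrive as bare integers; promptctl's scriptable
# output format is an assumption, not documented behavior.
gate_prompt() {
  current=$1
  baseline=$2
  if [ "$current" -lt "$baseline" ]; then
    echo "FAIL: score $current below baseline $baseline" >&2
    return 1
  fi
  echo "OK: score $current meets baseline $baseline"
}

# In a real pipeline step you would feed live values, e.g.:
#   gate_prompt "$CURRENT_SCORE" "$BASELINE_SCORE" || exit 1
gate_prompt 78 82 2>/dev/null || echo "pipeline blocked"
```

The nonzero exit status is the whole mechanism: CI systems already know how to fail a build on it, so a prompt regression blocks a PR exactly the way a failing test does.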
The tradeoffs are real. Versioning adds overhead - you have to decide what counts as a “new version” versus a minor edit. Deterministic evaluation at temperature=0 gives reproducible results but won’t tell you how the prompt behaves at higher temperatures or across different models. CI gating creates friction - engineers push back on quality gates for something they consider subjective. My counter: if this prompt runs in production and affects user-facing output, it should meet a minimum bar. And that bar should be measurable.
What I actually learned building this
Determinism beats cleverness. Early versions tried multi-model scoring with fancy aggregation. Output was non-reproducible. I ripped it out. Simple, deterministic evaluation on one model. Less impressive on a slide deck, actually useful for catching regressions.
The generation engine has to be deep. First version produced skeleton prompts - a heading for “Expert role,” a heading for “Constraints,” some placeholder text. Useless. Anyone could write that in thirty seconds. I rewrote the engine with domain knowledge: when someone says “build a virtual football manager,” it needs to surface match simulation mechanics, player attribute models, transfer market economics, engagement loops. Not “Subject: virtual football manager.” The prompt has to know what the user needs but hasn’t thought to ask for yet.
Pipe-anywhere was the right call. Everything goes to stdout. Pipe to Claude CLI, OpenAI CLI, clipboard, file. Cost tracking and scores go to stderr so they don’t pollute the output stream. Basic Unix philosophy, but a lot of developer tools get this wrong.
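The payoff of that split is that capture and piping stay clean. A stand-in sketch of the pattern (not promptctl's code):

```shell
# Payload goes to stdout, diagnostics to stderr. Here stderr is
# discarded with 2>/dev/null to show only the payload survives capture.
emit_prompt() {
  echo "You are a security reviewer..."   # the prompt itself -> stdout
  echo "estimated cost: ~\$0.018" >&2     # cost/score info -> stderr
}

prompt=$(emit_prompt 2>/dev/null)   # capture gets the prompt, nothing else
echo "$prompt"
```

If the cost line went to stdout instead, every downstream consumer would have to strip it back out; keeping the streams separate is what makes "pipe to Claude CLI, OpenAI CLI, clipboard, file" work without cleanup.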
Feature creep kills CLI tools. I kept wanting to add dashboards, team management, analytics. None of it belonged in a CLI. The rule I settled on: if it doesn’t make the next prompt better or cheaper, it doesn’t ship.
Where this is going
The current version handles prompt creation, cost comparison, and quality scoring. The next step is persistent benchmarks - tracking how prompt versions perform over time, not just at the moment of creation.
But there’s a bigger shift happening that matters more than any feature.
Prompt engineering is becoming operational. Two years ago prompts were a novelty. Today they’re load-bearing. A chatbot resolves support tickets. An assistant generates code that ships to production. A summarizer produces reports that executives act on. The prompt is often the single biggest factor in output quality, and it’s the least tested part of the stack.
We’ve seen this pattern before. Early DevOps was deployment scripts in a shared folder. They worked - until they didn’t. The shift from “deployment scripts” to “deployment pipelines” wasn’t a tooling change. It was teams recognizing that something critical needed discipline, not ad-hoc management.
Prompt workflows are at that same point. The work happening now is useful but fragile. No regression protection. No quality history. No review process for the thing that most directly controls what your LLM produces.
That will change. Not because prompt engineering is exciting or trendy, but because the cost of not doing it keeps going up. Every production LLM system that ships without prompt quality tracking is accumulating risk the same way teams accumulated risk shipping without tests in 2008.
The question isn’t whether your prompts need regression protection. It’s how expensive the next silent failure has to be before you add it.
This is the first in a series on applying engineering discipline to LLM workflows. Next up: what happens when you actually run prompt regression testing on a real codebase.
promptctl is a free CLI. Install it via Homebrew:
brew install --cask oleg-koval/tap/promptctl
or grab it from prompt-ctl.com.