<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vaulter Prompt</title>
    <description>The latest articles on DEV Community by Vaulter Prompt (@prompt_vault).</description>
    <link>https://dev.to/prompt_vault</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3772654%2F0c353f35-6b2e-4cd4-b65a-43e164cbde31.png</url>
      <title>DEV Community: Vaulter Prompt</title>
      <link>https://dev.to/prompt_vault</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prompt_vault"/>
    <language>en</language>
    <item>
      <title>The easy way to stop screaming at AI with CAPS</title>
      <dc:creator>Vaulter Prompt</dc:creator>
      <pubDate>Thu, 19 Feb 2026 23:00:00 +0000</pubDate>
      <link>https://dev.to/prompt_vault/the-easy-way-to-stop-screaming-at-ai-with-caps-17in</link>
      <guid>https://dev.to/prompt_vault/the-easy-way-to-stop-screaming-at-ai-with-caps-17in</guid>
      <description>&lt;h2&gt;
  
  
  Some foundational prompt engineering techniques and patterns to know about.
&lt;/h2&gt;

&lt;p&gt;I think everybody has at some point caught themselves typing in all caps to ChatGPT or Claude: "I LITERALLY JUST TOLD YOU TO DO THIS." Rephrasing the same request just to get the same useless output. Getting progressively angrier at a language model that seems specifically designed to drive you crazy.&lt;/p&gt;

&lt;p&gt;Very often it's not a broken tool, but vague instructions to a system that does exactly what you ask - just not what you mean. And there's a gap between those two things that costs most people hours every single day.&lt;/p&gt;

&lt;p&gt;I hope this article gets you, quickly, to the point where that's no longer the case for you!&lt;/p&gt;

&lt;h2&gt;
  
  
  The invisible tax you're already paying
&lt;/h2&gt;

&lt;p&gt;The original promise of AI was that it would do the work for you. And in a way, it does. But it also quietly changes what the work actually is.&lt;/p&gt;

&lt;p&gt;Before AI, you spent time writing. Now you spend time reviewing. Checking if the AI got it right. Rephrasing when it didn't. Cleaning up hallucinated requirements. Making sure the output actually says what you meant and not what the model decided you probably meant.&lt;/p&gt;

&lt;p&gt;With good prompts, that review step is quick - a sanity check, maybe a small adjustment. With bad prompts, the review becomes the work. You're not using AI anymore. You're babysitting it.&lt;/p&gt;

&lt;p&gt;A January 2026 Zapier survey of 1,100 AI users puts a number on this: workers spend an average of 4.5 hours per week revising, correcting, and redoing AI outputs. That's more than half a workday - not writing, not thinking, just cleaning up after a tool that was supposed to save time.&lt;/p&gt;

&lt;p&gt;And untrained people are more likely to say AI makes them less productive - not because the tool is worse for them, but because they never learned how to direct it. Meanwhile, people with access to prompt training and libraries report productivity gains.&lt;/p&gt;

&lt;p&gt;It's like buying a professional DSLR camera and shooting everything in auto mode, then complaining the photos look the same as your phone. The capability is there. You just haven't learned to access it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model that changes everything
&lt;/h2&gt;

&lt;p&gt;Here's what nobody told us upfront: as a consumer, you can think of an LLM as a very sophisticated autocomplete. That's a serious oversimplification, but hear me out - the mental model really helps you get things right. The point is: it doesn't "understand" your request. It predicts the most probable next words (tokens, technically) based on everything it was trained on.&lt;/p&gt;

&lt;p&gt;That's it. Not intelligence. Pattern prediction - sophisticated, complicated, groundbreaking, but pattern prediction all the same.&lt;/p&gt;

&lt;p&gt;This line of thinking explains almost every frustration you've ever had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vague prompts get vague answers&lt;/strong&gt; - many probable continuations, the model picks one at random&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Examples work better than instructions&lt;/strong&gt; - you're showing it the pattern to continue, not hoping it interprets your intent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long conversations go off the rails&lt;/strong&gt; - the model has a finite "context window" (its working memory). Everything in your conversation takes up space, and when it fills up, older content gets dropped or compressed. The AI isn't being thick after 20 messages - it literally cannot see what you said earlier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It "hallucinates"&lt;/strong&gt; - it predicts plausible text, not true text (hallucinations is a feature, not a bug)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when you type "make this better" and get back something useless, the AI isn't being stupid. It's doing exactly what autocomplete does with ambiguous input: guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three rules follow from this.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Be explicit - ambiguity is the enemy.&lt;/li&gt;
&lt;li&gt;Show, don't tell - examples constrain the solution space better than descriptions.&lt;/li&gt;
&lt;li&gt;Start fresh conversations for fresh tasks - don't let context rot.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything below traces back to these three principles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four core techniques that actually work
&lt;/h2&gt;

&lt;p&gt;There are four basic prompting techniques that, together, cover pretty much every type of task you'd throw at an AI. They form a ladder - start with #1, escalate when needed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Technique&lt;/th&gt;
&lt;th&gt;One-liner&lt;/th&gt;
&lt;th&gt;Use when...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Zero-shot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Just ask&lt;/td&gt;
&lt;td&gt;Task has one obvious interpretation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Few-shot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Show examples&lt;/td&gt;
&lt;td&gt;Format or style matters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Chain-of-thought&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make it reason&lt;/td&gt;
&lt;td&gt;Task needs logic, not pattern matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Prompt chaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Break it apart&lt;/td&gt;
&lt;td&gt;Too complex for a single prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mistake most people make is either staying on step 1 forever (most people, most of the time) or jumping straight to step 4 when they don't need to. I'll walk through each one below - when it works, when it doesn't, and the signal that tells you it's time to move up.&lt;/p&gt;

&lt;p&gt;To show how these build on each other, I'll use one example throughout: &lt;strong&gt;"Analyse 50 customer feedback entries from last quarter and write a summary for the product team."&lt;/strong&gt; Same task, four different approaches, very different results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Start simple: just ask (zero-shot)
&lt;/h3&gt;

&lt;p&gt;"&lt;strong&gt;Zero-shot&lt;/strong&gt;" just means: give a direct instruction, no examples.&lt;/p&gt;

&lt;p&gt;For tasks with one obvious interpretation, this is all you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Translate this email to Spanish"&lt;/li&gt;
&lt;li&gt;"Extract all deadlines from this contract"&lt;/li&gt;
&lt;li&gt;"What are the three biggest risks in this plan?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI already "knows" what "translate" means and what "deadlines" look like. A clear instruction, a clear input, done.&lt;/p&gt;

&lt;p&gt;But watch what happens with our feedback analysis:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; "Summarise the key themes from this customer feedback for the product team."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you get:&lt;/strong&gt; A different structure every time. First try: a wall of text with no categories. Second try: bullet points, but random grouping and no prioritisation. Third try: nice categories, but completely different ones from last time. The AI extracts themes fine - but the format and depth change with every run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;"Customer feedback summary" has dozens of valid interpretations. The model picks one at random each time.&lt;/p&gt;

&lt;p&gt;So what can we say about this pattern?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task has one clear interpretation&lt;/li&gt;
&lt;li&gt;The AI already "knows" the task type (translation, extraction, summarisation)&lt;/li&gt;
&lt;li&gt;You don't care about exact format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signal to escalate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're rephrasing the same request 3 times and getting different structures&lt;/li&gt;
&lt;li&gt;Content is right but format/style is inconsistent&lt;/li&gt;
&lt;li&gt;You need a specific output shape every time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Show, don't tell (few-shot)
&lt;/h3&gt;

&lt;p&gt;So the feedback summary keeps coming back in a random format. You could try describing what you want: "Use a table, group by theme, include frequency count, add a severity column, include one example quote per theme..." But by the time you've written all that, you could have just made the table yourself. This is where &lt;strong&gt;few-shot prompting&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;"&lt;strong&gt;Few-shot&lt;/strong&gt;" means: instead of describing what you want, show 3 to 5 examples of it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same task - with few-shot:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyse the customer feedback below and summarise it
for the product team. Follow this format:

Example:
| Theme | Mentions | Severity | Example quote |
|-------|----------|----------|---------------|
| Slow page loads | 12 | High | "Dashboard takes 8s to load" |
| Missing export | 5 | Medium | "I need CSV export for reports" |

Top priority: Slow page loads - affects daily usage,
12 mentions in 30 days, multiple churn-risk accounts.

Now analyse this feedback:
[50 input entries here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;From the examples, the AI knows you want a table with those exact columns, followed by a top-priority callout with reasoning. Format, length, structure - communicated in seconds, more precisely than any paragraph of instructions could. This is called "in-context learning" - the model's ability to pick up a pattern from just a few demonstrations and apply it to new input.&lt;/p&gt;

&lt;p&gt;Good examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diverse&lt;/strong&gt; - different scenarios, not three variations of the same thing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Representative&lt;/strong&gt; - typical cases, not edge cases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent&lt;/strong&gt; - same format across all of them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal&lt;/strong&gt; - 3-5 is usually plenty&lt;/li&gt;
&lt;/ul&gt;
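&lt;p&gt;If you build few-shot prompts often, it can be worth assembling them in code. Below is a minimal Python sketch - the &lt;code&gt;build_few_shot&lt;/code&gt; function and the sample feedback entries are purely illustrative, and no LLM API is involved, just string assembly:&lt;/p&gt;

```python
# Illustrative sketch: assemble a few-shot prompt from (input, output)
# example pairs. No LLM API here - just string assembly.

def build_few_shot(instruction, examples, new_input):
    """Show the desired pattern a few times, then present the real task."""
    parts = [instruction, ""]
    for sample_in, sample_out in examples:
        parts.append(f"Input: {sample_in}")
        parts.append(f"Output: {sample_out}")
        parts.append("")
    # The prompt ends mid-pattern so the model continues it.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)

# Hypothetical sample data, mirroring the feedback-analysis example.
examples = [
    ("Dashboard takes 8s to load", "| Slow page loads | High |"),
    ("I need CSV export for reports", "| Missing export | Medium |"),
]
prompt = build_few_shot(
    "Classify each feedback entry into a theme row:",
    examples,
    "Checkout button does nothing on mobile",
)
print(prompt)
```

&lt;p&gt;The payoff is consistency: every run gets exactly the same examples in exactly the same order and format, which is the whole point of few-shot.&lt;/p&gt;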

&lt;p&gt;Now the format is perfect every time. But there's a problem: the AI lists "slow checkout" (30 mentions) and "button colour" (2 mentions) at the same severity level. It's mimicking the table structure beautifully, but it's not actually &lt;em&gt;thinking&lt;/em&gt; about what matters.&lt;/p&gt;

&lt;p&gt;Few-shot fails when the task needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actual calculation&lt;/strong&gt; - not pattern completion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; - not correlation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain knowledge&lt;/strong&gt; that isn't present in the examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-factor trade-off judgements&lt;/strong&gt; - weighing competing priorities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling novel constraints&lt;/strong&gt; the examples don't cover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short: if the answer requires thinking through the problem and not just matching a format, examples alone won't get you there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Format, tone, or style matters&lt;/li&gt;
&lt;li&gt;The task has many valid outputs but you need a specific one&lt;/li&gt;
&lt;li&gt;You want consistent results across multiple runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signal to escalate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Format is perfect but the &lt;em&gt;reasoning&lt;/em&gt; is wrong&lt;/li&gt;
&lt;li&gt;The AI mimics your examples but makes logical errors or misses nuance&lt;/li&gt;
&lt;li&gt;The task needs analysis, not pattern matching&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Make it think (chain-of-thought)
&lt;/h3&gt;

&lt;p&gt;Examples fix formatting and set expectations for content depth, sure. They don't fix thinking. When the task needs actual reasoning, the AI mimics the pattern and jumps to a plausible-looking answer without working through the problem.&lt;/p&gt;

&lt;p&gt;That feedback summary has the right columns now, but the priorities are shallow. The AI saw "High/Medium" in your example and just distributed those labels without weighing anything.&lt;/p&gt;

&lt;p&gt;Five words fix this: &lt;strong&gt;"Let's think step by step."&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same task - with chain-of-thought:&lt;/strong&gt;&lt;/p&gt;


&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyse this customer feedback for the product team.
Use the table format from the examples above.

Before filling in the severity column, think step by
step: consider how many users mentioned it, whether it
causes churn or just annoyance, and how it compares to
other themes.
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;strong&gt;What you get now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow checkout (30 mentions) → directly causes cart abandonment, mentioned by 3 enterprise accounts → &lt;strong&gt;High&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Confusing pricing page (8 mentions) → causes support tickets but users still convert → &lt;strong&gt;Medium&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Button colour (2 mentions) → cosmetic, no impact on conversion → &lt;strong&gt;Low&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top priority:&lt;/strong&gt; Slow checkout - 30 mentions, directly tied to revenue loss, affects highest-value accounts.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's not a gimmick. In the original zero-shot chain-of-thought research, this single phrase lifted accuracy on an arithmetic reasoning benchmark from 17.7% to 78.7%. By asking the model to show its reasoning, you force it to actually work through the problem instead of guessing.&lt;/p&gt;

&lt;p&gt;Same principle as showing your work at school - you catch errors you'd miss if you just wrote the final answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip: self-consistency.&lt;/strong&gt; For high-stakes decisions, run the same CoT prompt 3 times and compare the answers. If all three agree, you're probably right. If they disagree wildly, the problem needs more breakdown. Costs 3x the tokens but catches blind spots a single run misses.&lt;/p&gt;
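&lt;p&gt;The voting logic for self-consistency fits in a few lines. In this sketch, &lt;code&gt;ask&lt;/code&gt; stands in for whatever LLM call you use - here it's replaced by a stub returning canned answers so the example runs on its own:&lt;/p&gt;

```python
# Sketch of self-consistency: run the same chain-of-thought prompt several
# times and keep the majority answer. `ask` is a stand-in for any LLM call.
from collections import Counter

def self_consistent(ask, prompt, runs=3):
    answers = [ask(prompt) for _ in range(runs)]
    winner, count = Counter(answers).most_common(1)[0]
    # Low agreement is the signal that the problem needs more breakdown.
    return winner, count / runs

# Stub standing in for three model runs that mostly agree.
fake_runs = iter(["High", "High", "Medium"])
answer, agreement = self_consistent(lambda p: next(fake_runs),
                                    "Severity of slow checkout?")
print(answer, agreement)  # majority answer "High", agreement 2/3
```

&lt;p&gt;The agreement ratio is the useful part: unanimous runs you can trust, split runs tell you to decompose the task before believing any single answer.&lt;/p&gt;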

&lt;p&gt;So for this technique:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging, analysis, decisions, math&lt;/li&gt;
&lt;li&gt;Anything where "showing work" would help a human&lt;/li&gt;
&lt;li&gt;The task has a right answer that requires reasoning to reach&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signal to escalate:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reasoning per step is fine, but the task has too many moving parts&lt;/li&gt;
&lt;li&gt;Output is solid for the first half and falls apart after that&lt;/li&gt;
&lt;li&gt;The prompt is getting so long the AI starts ignoring parts of it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Break it down (prompt chaining)
&lt;/h3&gt;

&lt;p&gt;If you're asking for more than one distinct deliverable in a single prompt, you're probably going to be disappointed.&lt;/p&gt;

&lt;p&gt;With 50 feedback entries, trying to categorise, assess severity, AND write recommendations in one prompt usually means the categorisation is decent, the severity assessment is rushed, and the recommendations are generic. The model runs out of steam halfway through.&lt;/p&gt;

&lt;p&gt;"&lt;strong&gt;Prompt chaining&lt;/strong&gt;" means breaking it into steps, reviewing each one before moving to the next. You literally take the output of prompt 1 and paste it as input into prompt 2.&lt;/p&gt;

&lt;p&gt;Why this works better than one big prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Catch errors early&lt;/strong&gt; - spot problems before 5 steps of compounding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller context&lt;/strong&gt; - model focuses on one task, not juggling 10 instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier to debug&lt;/strong&gt; - you know exactly which step failed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusable pieces&lt;/strong&gt; - swap out step 2 without rewriting 1,200 lines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human in the loop&lt;/strong&gt; - review and adjust between steps&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Same task - as a chain:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt 1:&lt;/strong&gt; "Categorise all 50 feedback entries into themes with counts."&lt;br&gt;
→ &lt;em&gt;Output: table with 6 themes (slow checkout: 30, confusing pricing: 8, missing export: 5...)&lt;/em&gt;&lt;br&gt;
→ ✓ Review: do these categories make sense? Merge or split any?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt 2:&lt;/strong&gt; "Here are the themes: [paste output from step 1]. For each theme, assess severity and business impact. Think step by step."&lt;br&gt;
→ &lt;em&gt;Output: slow checkout = High (causes abandonment), confusing pricing = Medium (causes support tickets)...&lt;/em&gt;&lt;br&gt;
→ ✓ Review: does the reasoning hold up? Any wrong assumptions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt 3:&lt;/strong&gt; "Here's the full analysis: [paste output from step 2]. Write the summary for the product team with top 3 recommendations."&lt;br&gt;
→ &lt;em&gt;Final output: ready to send.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Each step is small enough to actually verify. You catch wrong categories at step 1 instead of discovering them baked into the final recommendations at step 3.&lt;/p&gt;
&lt;/blockquote&gt;
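&lt;p&gt;If you run the same chain repeatedly, the paste-output-into-the-next-prompt loop can be sketched in a few lines of Python. The &lt;code&gt;ask&lt;/code&gt; callable is a placeholder for any LLM call; the stub below just tags its input so the flow is visible. Note that an automated loop like this skips the human review between steps, which is half the value of chaining:&lt;/p&gt;

```python
# Sketch of a manual prompt chain: each step's output becomes the next
# step's input. `ask` is a placeholder for whatever LLM call you use.

def run_chain(steps, ask, first_input):
    """Feed each prompt template the previous step's output, in order."""
    result = first_input
    for template in steps:
        result = ask(template.format(previous=result))
    return result

# The three steps from the feedback-analysis example above.
steps = [
    "Categorise all 50 feedback entries into themes with counts: {previous}",
    "Here are the themes: {previous}. Assess severity. Think step by step.",
    "Here is the full analysis: {previous}. Write the product-team summary.",
]

# Stub: pretend the model just labels the start of each prompt it handled.
final = run_chain(steps, lambda prompt: f"step done ({prompt[:25]}...)",
                  "raw feedback")
print(final)
```

&lt;p&gt;In real use you'd pause after each &lt;code&gt;ask&lt;/code&gt; to review the intermediate output - that checkpoint is what keeps errors from compounding.&lt;/p&gt;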

&lt;p&gt;So where does this pattern fit?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task has multiple distinct deliverables&lt;/li&gt;
&lt;li&gt;Your prompt is getting so long the AI ignores parts of it&lt;/li&gt;
&lt;li&gt;You want to review intermediate results before continuing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Signal that something's off:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Individual steps produce bad reasoning → add chain-of-thought within each step&lt;/li&gt;
&lt;li&gt;Chain works but results feel generic → better context or examples needed in step 1&lt;/li&gt;
&lt;li&gt;Conversation has gone sideways after many messages → start a fresh one. Context rots.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those four techniques are the core of it. But knowing &lt;em&gt;which&lt;/em&gt; technique to use is only half the problem. The other half is how you structure the prompt itself - and that's where most people quietly lose hours without realising it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The patterns nobody teaches you
&lt;/h2&gt;

&lt;p&gt;Techniques tell you what to do. Patterns tell you how to do it well. If techniques are the bricks, these are the cement - and skipping them is why a lot of prompts that &lt;em&gt;should&lt;/em&gt; work still don't.&lt;/p&gt;

&lt;h3&gt;
  
  
  The prompt anatomy
&lt;/h3&gt;

&lt;p&gt;Every prompt has up to four elements. When something goes wrong, one is usually missing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;If missing...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instruction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The task to perform&lt;/td&gt;
&lt;td&gt;AI guesses what you want&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Background, constraints, role&lt;/td&gt;
&lt;td&gt;AI makes wrong assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Input data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content to process&lt;/td&gt;
&lt;td&gt;Nothing to work with&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output indicator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expected format&lt;/td&gt;
&lt;td&gt;You get 500 words when you needed 2 bullets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Remember our feedback analysis prompt from the few-shot section? Let's map the four elements onto it:&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INSTRUCTION] Analyse the customer feedback below and summarise it
              for the product team. Follow this format:
[OUTPUT]      Example:
              | Theme | Mentions | Severity | Example quote |
              | Slow page loads | 12 | High | "Dashboard takes 8s..." |
              Top priority: Slow page loads - affects daily usage...
[INPUT DATA]  Now analyse this feedback:
              [50 input entries here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;Three of the four elements are present; &lt;strong&gt;context&lt;/strong&gt; is missing (who's the analyst? what's the review for?), and on top of that there are no &lt;strong&gt;delimiters&lt;/strong&gt; between the instruction and the input data. It works OK because the few-shot examples carry most of the weight - but it could be better. Add what's missing:&lt;/p&gt;

&lt;blockquote&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CONTEXT]     You are a product analyst preparing a quarterly review.
[INSTRUCTION] Analyse the customer feedback below and summarise it
              for the product team. Follow this format:
[OUTPUT]      Example:
              | Theme | Mentions | Severity | Example quote |
              | Slow page loads | 12 | High | "Dashboard takes 8s..." |
              Top priority: Slow page loads - affects daily usage...
[INPUT DATA]  &amp;lt;feedback&amp;gt; [50 input entries here] &amp;lt;/feedback&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;Now the context shapes the analysis (product analyst prioritises by business impact), and the &lt;code&gt;&amp;lt;feedback&amp;gt;&lt;/code&gt; tags separate data from instruction. Same few-shot technique, better prompt structure.&lt;/p&gt;

&lt;p&gt;This is also a diagnostic tool. Prompt didn't work as expected? Check which element is vague or absent. Takes ten seconds, fixes most problems on the spot.&lt;/p&gt;
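&lt;p&gt;The four elements also make a natural template if you assemble prompts in code. Here's a hypothetical Python sketch that joins the elements with markdown headers as delimiters (triple quotes or XML-style tags work just as well) and simply drops any element you leave empty:&lt;/p&gt;

```python
# Sketch: assemble a prompt from the four anatomy elements, using markdown
# headers as delimiters. Function name and sample values are illustrative.

def assemble_prompt(context, instruction, output_spec, input_data):
    sections = [
        ("Context", context),
        ("Instruction", instruction),
        ("Expected output", output_spec),
        ("Input data", input_data),
    ]
    # Keep only the elements you actually filled in.
    blocks = [f"## {name}\n{body}" for name, body in sections if body]
    return "\n\n".join(blocks)

prompt = assemble_prompt(
    context="You are a product analyst preparing a quarterly review.",
    instruction="Analyse the customer feedback below for the product team.",
    output_spec="Markdown table: Theme, Mentions, Severity, Example quote.",
    input_data="[50 feedback entries here]",
)
print(prompt)
```

&lt;p&gt;A side benefit: when a prompt misbehaves, the template makes it obvious at a glance which of the four slots was left empty or vague.&lt;/p&gt;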

&lt;h3&gt;
  
  
  Four patterns that save the most time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Scope boundaries&lt;/strong&gt; - tell the AI what NOT to do.&lt;/p&gt;

&lt;p&gt;The "eager intern" problem - where the AI helpfully restructures your entire document when you asked it to fix one paragraph - is solved by explicit fences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONLY modify: [specific section]&lt;/li&gt;
&lt;li&gt;Do NOT: [things you don't want changed]&lt;/li&gt;
&lt;li&gt;Match: [existing conventions, tone, style]&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applied to our example: "Only analyse the feedback entries I provide. Do not add themes that aren't in the data. Do not invent quotes."&lt;/p&gt;

&lt;p&gt;When to skip: discovery phase, architecture discussions, brainstorming - boundaries kill creativity when you actually want broad thinking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Output specification&lt;/strong&gt; - define the container.&lt;/p&gt;

&lt;p&gt;"Summarise this" gets you 500 words. "Summarise in 3 bullet points, max 15 words each" gets you exactly what you asked for. If you specify the shape - format, length, sections, what to exclude - the AI can't invent things that don't fit.&lt;/p&gt;

&lt;p&gt;Applied to our example: this is exactly what the few-shot table format did - but you can also do it without examples, just by describing the container: "Return a markdown table with columns: Theme, Mentions, Severity, Example Quote. Then write exactly 3 recommendations, one sentence each."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to skip&lt;/strong&gt;: exploratory questions ("what should I consider?"), creative brainstorming, or one-off queries where you'll just read and act.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Delimiters&lt;/strong&gt; - separate sections clearly.&lt;/p&gt;

&lt;p&gt;Use XML tags, markdown headers, or triple quotes between your instruction, context, and input data. Without them, the AI sometimes confuses what's an instruction and what's content to process.&lt;/p&gt;

&lt;p&gt;Applied to our example: wrapping the feedback in &lt;code&gt;&amp;lt;feedback&amp;gt;&lt;/code&gt; tags tells the model "this is the data, not part of the instruction." Without it, if a customer wrote "ignore previous instructions" in their feedback (yes, this happens), the model might actually obey it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Role and audience&lt;/strong&gt; - narrow the model's behaviour.&lt;/p&gt;

&lt;p&gt;An LLM is a generalist by default - it draws on everything it was trained on, which is basically the entire internet. Setting a role and audience constrains that enormous solution space to a specific domain, expertise level, and communication style. "You are a senior engineer, I am also senior, skip basics" activates domain-specific knowledge, suppresses beginner-level explanations, and calibrates the output for a professional context. One line of context, completely different answer.&lt;/p&gt;

&lt;p&gt;Applied to our example: "You are a product analyst" tells the model to prioritise themes by business impact rather than just frequency - because that's how a product analyst thinks. Without it, you get a generic summary. With it, you get one that's shaped by domain expertise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The quick reference
&lt;/h2&gt;

&lt;p&gt;Two things worth bookmarking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to start for your task type:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Start with&lt;/th&gt;
&lt;th&gt;If that fails&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text classification&lt;/td&gt;
&lt;td&gt;Zero-shot&lt;/td&gt;
&lt;td&gt;Few-shot with examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarisation&lt;/td&gt;
&lt;td&gt;Zero-shot + output spec&lt;/td&gt;
&lt;td&gt;Few-shot with example summaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Information extraction&lt;/td&gt;
&lt;td&gt;Zero-shot + output format&lt;/td&gt;
&lt;td&gt;Few-shot with examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation&lt;/td&gt;
&lt;td&gt;Zero-shot + scope boundaries&lt;/td&gt;
&lt;td&gt;Few-shot + chaining&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code review&lt;/td&gt;
&lt;td&gt;Role + scope&lt;/td&gt;
&lt;td&gt;Few-shot with example reviews&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning / math&lt;/td&gt;
&lt;td&gt;Zero-shot CoT&lt;/td&gt;
&lt;td&gt;Few-shot CoT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex multi-step&lt;/td&gt;
&lt;td&gt;Prompt chaining&lt;/td&gt;
&lt;td&gt;Add CoT within each step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docs / reports&lt;/td&gt;
&lt;td&gt;Output specification&lt;/td&gt;
&lt;td&gt;Few-shot with example docs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;When your prompt doesn't work:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong format&lt;/td&gt;
&lt;td&gt;Specify output shape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent results&lt;/td&gt;
&lt;td&gt;Add 2-5 examples (few-shot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong reasoning&lt;/td&gt;
&lt;td&gt;"Let's think step by step" (CoT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI invents things you didn't ask for&lt;/td&gt;
&lt;td&gt;Add scope boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task too complex&lt;/td&gt;
&lt;td&gt;Break into smaller prompts (chaining)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversation went off track&lt;/td&gt;
&lt;td&gt;Start fresh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response too basic or too advanced&lt;/td&gt;
&lt;td&gt;Set role and audience&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output shared without being read&lt;/td&gt;
&lt;td&gt;Always verify it yourself first&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The prompt diagnostic:&lt;/strong&gt; instruction, context, input, output indicator. When something fails, one of these is missing or vague. Start there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The embarrassingly simple conclusion
&lt;/h2&gt;

&lt;p&gt;The AI does exactly what autocomplete would do with your input. Vague input, random output. Specific input, useful output.&lt;/p&gt;

&lt;p&gt;Four techniques, a handful of patterns, and a couple of hours of practice. That's the gap between babysitting and actually getting work done.&lt;/p&gt;

&lt;p&gt;The AI isn't the only one that needs training. We need it too, if we want to learn to use the tool right.&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>ai</category>
      <category>productivity</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI makes faster both engineer and chaos</title>
      <dc:creator>Vaulter Prompt</dc:creator>
      <pubDate>Sun, 15 Feb 2026 09:07:07 +0000</pubDate>
      <link>https://dev.to/prompt_vault/ai-makes-faster-both-engineer-and-chaos-1m93</link>
      <guid>https://dev.to/prompt_vault/ai-makes-faster-both-engineer-and-chaos-1m93</guid>
      <description>&lt;h2&gt;
  
  
  What industry data on AI for productivity actually tells us
&lt;/h2&gt;

&lt;p&gt;Somewhere around mid-2025, I started noticing something across our engineering org that felt off.&lt;/p&gt;

&lt;p&gt;Like everyone else, we were enthusiastically adopting AI coding tools. Team leads were excited. Throughput was up. Sprint delivery and speed numbers looked great on dashboards.&lt;/p&gt;

&lt;p&gt;And I was getting more and more nervous.&lt;/p&gt;

&lt;p&gt;Not because anyone was doing anything wrong - the enthusiasm was genuine and well-intentioned. But because the signals I was seeing didn't match the celebration. PR sizes growing. More code shipping without meaningful review. Stability metrics dipping. Rework going up. The kind of early warnings that, if you've been in engineering leadership long enough, you know are worth paying attention to.&lt;/p&gt;

&lt;p&gt;When I dug in, I realized the enthusiasm was simply outpacing the measurement. The focus was on throughput gains - and those were real - but the metrics for what those gains were doing to review quality, stability, and long-term architecture hadn't caught up yet.&lt;/p&gt;

&lt;p&gt;It's a pattern I've since heard about across the entire industry. And honestly, in some cases things are quite bad and look a lot like cargo cult engineering - slapping an "AI in our SDLC" badge on the org and celebrating throughput numbers, without actually thinking through long-term implications or developing a real transformation strategy. A headless chicken with an "AI-powered" label stuck to its side. Running faster than ever. No idea where it's going.&lt;/p&gt;

&lt;p&gt;In an application area where quality issues can bring severe reputational damage, that gap between enthusiasm and measurement is something you can't afford to leave open for long.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI is an amplifier, not (yet) a silver bullet
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA Report&lt;/a&gt; - based on nearly 5,000 technology professionals - frames it better than I can: &lt;strong&gt;AI's primary role in software development is that of an amplifier.&lt;/strong&gt; It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.&lt;/p&gt;

&lt;p&gt;That's the sentence I wish every CEO would read before forwarding the next "Microsoft writes 30% of code with AI" headline to their CTO.&lt;/p&gt;

&lt;p&gt;And by the way - &lt;a href="https://www.youtube.com/watch?v=xHHlhoRC8W4" rel="noopener noreferrer"&gt;in this interview&lt;/a&gt;, a CTO of a developer productivity measurement company traced that "30% of code" claim back to accepted autocomplete suggestions. As she puts it: your linter has been running on 100% of your PRs for years. Can you imagine the headline? "Acme Corp only ships code to production that's been read by robots." We could have said our code was "machine generated" back when IDE autocomplete was filling in class names.&lt;/p&gt;

&lt;p&gt;The data is more nuanced than the headlines suggest. Yes, 90% of developers now use AI as part of their work. Yes, AI adoption now improves software delivery throughput - a shift from 2024, where it didn't. But AI adoption still increases delivery instability. Teams are adapting for speed. Their underlying systems have not evolved to safely manage AI-accelerated development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.faros.ai/blog/ai-software-engineering" rel="noopener noreferrer"&gt;Faros AI's analysis&lt;/a&gt; of over 10,000 developers across 1,255 teams found the same pattern from a different angle: while team-level changes are measurable, there is &lt;strong&gt;no significant correlation between AI adoption and improvements at the company level.&lt;/strong&gt; Throughput, lead time, incident resolution - all flat when you zoom out.&lt;/p&gt;

&lt;p&gt;So here's the uncomfortable truth. If your org was already good at delivery - clean architecture, fast feedback loops, strong testing - AI probably makes you better. If your org was already struggling with those things, AI just makes you struggle faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers most teams aren't watching
&lt;/h2&gt;

&lt;p&gt;I want to share some specific numbers, because this is where it gets concrete and a bit alarming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PR size is inflating.&lt;/strong&gt; &lt;a href="https://www.faros.ai/blog/ai-software-engineering" rel="noopener noreferrer"&gt;Faros AI found&lt;/a&gt; a 154% increase in average PR size associated with high AI adoption. &lt;a href="https://jellyfish.co/blog/ai-assisted-pull-requests-are-18-larger" rel="noopener noreferrer"&gt;Jellyfish&lt;/a&gt;, analyzing data from 500+ companies, found that going from low to high AI adoption corresponded to an 18% increase in lines added per PR. And here's why that matters: PRs over 1,000 lines have a 70% lower defect detection rate compared to smaller ones. Detection drops from 87% for small PRs (under 100 lines) to just 28% for large ones.&lt;/p&gt;

&lt;p&gt;So we're generating bigger PRs that are harder to review. And predictably...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review time is ballooning.&lt;/strong&gt; PR review time increased by 91% in teams with high AI adoption, according to &lt;a href="https://www.faros.ai/blog/ai-software-engineering" rel="noopener noreferrer"&gt;Faros AI&lt;/a&gt;. Developers are merging 98% more pull requests - but the human review process hasn't gotten any faster. It's become the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bugs are going up.&lt;/strong&gt; A 9% increase in bugs per developer. Not dramatic in isolation. But combine it with bigger PRs and overwhelmed reviewers and you have a compounding quality problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code quality is eroding in ways that don't show up immediately.&lt;/strong&gt; &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear analyzed&lt;/a&gt; 211 million lines of code (2020-2024) and found that copy-pasted code surged from 8.3% to 12.3%, while code refactoring dropped from 24.1% to just 9.5%. Code churn - new code requiring revision within two weeks - nearly doubled, from 3.1% to 5.7%. Copy-paste exceeded refactored code for the first time in the dataset's history.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://www.youtube.com/watch?v=tbDDYKRFjhk" rel="noopener noreferrer"&gt;a Stanford-backed study&lt;/a&gt; of 100,000+ developers found that AI productivity gains ranged from 30-40% for simple greenfield tasks down to 0-10% for complex brownfield work. For complex tasks in less popular languages, AI actually &lt;strong&gt;decreased&lt;/strong&gt; productivity.&lt;/p&gt;

&lt;p&gt;These aren't opinions. This is data from multiple independent research groups, across hundreds of companies and tens of thousands of developers, all pointing in the same direction: AI generates more code, faster. But more code is not the same as better outcomes.&lt;/p&gt;

&lt;p&gt;One expert's take on this &lt;a href="https://www.youtube.com/watch?v=xHHlhoRC8W4" rel="noopener noreferrer"&gt;stuck with me&lt;/a&gt;: "Source code is a liability." We're now in a world where it's trivially easy to produce a tremendous amount of it. That should make us more careful, not less.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real shift: from writing code to reviewing it
&lt;/h2&gt;

&lt;p&gt;Here's the thing most people miss about AI in development, and it's what I've come to think of as the most important mental model for engineering leaders to internalize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI shifts the developer's cognitive focus from writing to reviewing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Typing speed was never the bottleneck. Even on a good day, developers spend maybe 20-25% of their time actually writing code (an AWS study put it at about 20% for their average engineer). AI makes that 20% faster - great - but it doesn't make the other 80% disappear.&lt;/p&gt;

&lt;p&gt;What actually happens: AI generates code at superhuman speed, but someone still has to review that code, understand it, verify it's correct, and make sure it fits the existing architecture. That "someone" is your developer, who now spends less time in the creative act of writing (which, by the way, is the part most developers enjoy) and more time in the cognitively demanding act of reviewing AI-generated output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA's research&lt;/a&gt; found that many developers actually feel &lt;em&gt;less&lt;/em&gt; satisfied after AI adoption because AI accelerates the parts they enjoy - and what's left is more toil, more meetings, more review work. Which, if you think about it, is kind of the opposite of the promise.&lt;/p&gt;

&lt;p&gt;And this isn't just a feeling. &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;The METR study&lt;/a&gt; found that experienced open-source developers took 19% longer to complete tasks with AI, despite believing they were 20% faster. The cognitive overhead of reviewing and correcting AI output ate the time savings and then some.&lt;/p&gt;

&lt;p&gt;So I formulated another rule for myself, one I try to follow: AI makes everything faster. Including the chaos. If your developers are generating 2x the code but your review process hasn't evolved, you're not being more productive. You're accumulating risk at 2x the rate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What data-driven AI adoption actually looks like
&lt;/h2&gt;

&lt;p&gt;So what should you actually do? I want to share a reasoning framework rather than a prescriptive checklist, because every org is different. But these are the questions I've learned (sometimes the hard way) to ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look honestly at your codebase.&lt;/strong&gt; Is your primary language popular and well-supported by AI models - Python, JavaScript, TypeScript, Java? Or are you working in something more niche? The &lt;a href="https://www.youtube.com/watch?v=tbDDYKRFjhk" rel="noopener noreferrer"&gt;Stanford data&lt;/a&gt; shows this matters enormously. Are you mostly greenfield or brownfield? If you're maintaining a large, mature codebase (which most enterprises are), set your expectations accordingly. AI won't deliver the 30-40% gain you saw in the demo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be especially careful with your critical path.&lt;/strong&gt; That 5% of your code that can break everything, the last 1% of performance optimization on your mobile app, the core domain logic your business depends on - these are exactly the areas where AI's gains are smallest and the cost of errors is highest. Use AI freely for boilerplate. Be very deliberate about using it for the stuff that really matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decompose your cycle time and watch the review phase.&lt;/strong&gt; This is probably the single most actionable metric. If coding time is going down but review time is going up - and overall cycle time isn't improving - AI is just moving the bottleneck, not eliminating it. That's a signal to invest in review process improvements, not to celebrate faster code generation.&lt;/p&gt;
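&lt;p&gt;To make the decomposition concrete, here's a minimal sketch. The field names are hypothetical - map them to whatever timestamps your Git provider actually exposes:&lt;/p&gt;

```python
from datetime import datetime

def phase_hours(pr: dict) -> dict:
    """Split one PR's cycle time into phases, in hours.

    `pr` holds ISO-8601 timestamps; the field names are illustrative,
    not any specific provider's API.
    """
    t = {k: datetime.fromisoformat(v) for k, v in pr.items()}
    return {
        "coding": (t["review_requested"] - t["first_commit"]).total_seconds() / 3600,
        "review": (t["approved"] - t["review_requested"]).total_seconds() / 3600,
        "deploy": (t["merged"] - t["approved"]).total_seconds() / 3600,
    }

pr = {
    "first_commit": "2026-02-01T09:00:00",
    "review_requested": "2026-02-01T15:00:00",
    "approved": "2026-02-03T11:00:00",
    "merged": "2026-02-03T12:00:00",
}
phases = phase_hours(pr)  # {'coding': 6.0, 'review': 44.0, 'deploy': 1.0}
# If "review" keeps growing sprint over sprint while "coding" shrinks and the
# total doesn't improve, AI has moved your bottleneck rather than removed it.
```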

&lt;p&gt;&lt;strong&gt;Watch your change failure rate and quality metrics.&lt;/strong&gt; If CFR is climbing alongside AI adoption, that's your canary. The &lt;a href="https://www.faros.ai/blog/ai-software-engineering" rel="noopener noreferrer"&gt;Faros AI data&lt;/a&gt; showed a 9% increase in bugs per developer - for some orgs, that's acceptable. For others (like mine), it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Track PR size and code review quality together.&lt;/strong&gt; If PRs are growing and meaningful review comments per PR are shrinking, your review process is being overwhelmed. The data on reviewer fatigue with large PRs is pretty clear - extra-large PRs receive fewer meaningful comments, not more.&lt;/p&gt;
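&lt;p&gt;A sketch of what tracking the two together might look like. The field names are made up - pull the real ones from your Git provider's API:&lt;/p&gt;

```python
def review_density(prs: list[dict]) -> float:
    """Meaningful review comments per 100 changed lines, across a batch of PRs.

    `lines_changed` and `substantive_comments` are illustrative field names.
    """
    lines = sum(pr["lines_changed"] for pr in prs)
    comments = sum(pr["substantive_comments"] for pr in prs)
    return 100 * comments / lines if lines else 0.0

# Two hypothetical quarters: PRs grew a lot, comment counts barely moved.
before = [{"lines_changed": 200, "substantive_comments": 8},
          {"lines_changed": 150, "substantive_comments": 6}]
after = [{"lines_changed": 900, "substantive_comments": 7},
         {"lines_changed": 1100, "substantive_comments": 5}]

print(review_density(before))  # 4.0 comments per 100 lines
print(review_density(after))   # 0.6 - the review process is being overwhelmed
```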

&lt;p&gt;&lt;strong&gt;Validate that planned architecture actually shipped.&lt;/strong&gt; This one is less about dashboards and more about discipline. AI is very good at generating code that works right now and very bad at maintaining long-term architectural coherence. If you have no way to verify that generated code actually follows your intended architecture, you'll discover the drift six months later when something breaks and nobody understands why.&lt;/p&gt;

&lt;p&gt;The organizations getting this right - and this is consistent across &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;DORA's research&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=xHHlhoRC8W4" rel="noopener noreferrer"&gt;industry case studies&lt;/a&gt; and other reports I've seen - share one trait: they treat AI adoption as an experiment, not a mandate. They set a baseline, define hypotheses and measure the results. They don't just hand out licenses and hope.&lt;/p&gt;

&lt;p&gt;To conclude: I don't have all of this figured out for myself yet - the long-term impact remains to be seen, and the industry changes fast. But I think I've seen enough to know that the teams measuring are the ones making progress, and the teams riding the hype are the ones who'll be untangling the mess six months from now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources referenced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 DORA Report - State of AI-assisted Software Development&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.faros.ai/blog/ai-software-engineering" rel="noopener noreferrer"&gt;Faros AI - The AI Productivity Paradox (July 2025)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=tbDDYKRFjhk" rel="noopener noreferrer"&gt;Stanford - AI Impact on Developer Productivity (100k+ devs)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=xHHlhoRC8W4" rel="noopener noreferrer"&gt;Laura Tacho interview - Measuring AI impact (Pragmatic Engineer)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;GitClear - AI Code Quality 2025 Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR - Early-2025 AI on Experienced Developer Productivity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jellyfish.co/blog/ai-assisted-pull-requests-are-18-larger" rel="noopener noreferrer"&gt;Jellyfish - AI-Assisted PRs Are 18% Larger&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>productivity</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Hope that prompt works...</title>
      <dc:creator>Vaulter Prompt</dc:creator>
      <pubDate>Sun, 15 Feb 2026 08:57:00 +0000</pubDate>
      <link>https://dev.to/prompt_vault/hope-that-prompt-works-5451</link>
      <guid>https://dev.to/prompt_vault/hope-that-prompt-works-5451</guid>
      <description>&lt;h2&gt;
  
  
  Test your prompts like you test your software if you want AI to actually help you.
&lt;/h2&gt;

&lt;p&gt;I was sitting with a senior engineer who was very happy about his team's "AI adoption." They went from five pull requests a week to eight, sometimes nine. He was showing me the sprint velocity chart and honestly, it looked great.&lt;/p&gt;

&lt;p&gt;Then I asked if I could look at their DORA metrics instead.&lt;/p&gt;

&lt;p&gt;Code review time had nearly doubled. The number of lines under review had grown as well (roughly +90%). And the rework - the percentage of code that gets deleted within 21 days of being written - had climbed from 5% to 25%.&lt;/p&gt;

&lt;p&gt;That's one in four lines his team wrote last month. Already gone.&lt;/p&gt;

&lt;p&gt;So the "speed" increase was actually just code that gets thrown away. More PRs, sure, but also more bugs, more time reviewing, and a lot of code that only existed long enough to create problems before someone deleted it.&lt;/p&gt;

&lt;p&gt;That changed the conversation pretty quickly.&lt;/p&gt;

&lt;p&gt;What happened there is actually not uncommon. The developers weren't doing anything wrong on purpose. They were sharing prompt snippets in Slack, copying Cursor rules from blog posts, using ChatGPT templates they found on Reddit. Nobody tested any of it. And by the time the metrics made the problem obvious, weeks of engineering time and money were already lost.&lt;/p&gt;

&lt;p&gt;It's actually sad how often teams realise this too late. So I hope this post helps you avoid the expensive version of the same lesson.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reality check: engineering or gambling?
&lt;/h2&gt;

&lt;p&gt;At this point you've probably written some prompts - maybe to generate unit tests, maybe as part of a customer support chatbot. They work (most of the time), but sometimes produce different results. That's not surprising, given there's a non-deterministic system under the hood. But it doesn't feel quite right either. We want to be able to trust our software, especially its critical paths.&lt;/p&gt;

&lt;p&gt;So for starters, let's look at some of the reasons why testing prompts matters. Apart from obvious ones like unpredictable regressions caused by API changes (OpenAI retired GPT-4o and three other models from ChatGPT in February 2026 alone) or the general consequences of baked-in non-determinism, there are things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hidden biases&lt;/strong&gt; that models carry over from training data (when the model "ignores" the instructions in your prompt). Your prompt says one thing, the model's priors say another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging difficulty&lt;/strong&gt;, because it's hard to isolate a root cause without tracking the full context (which is a lot: input, prompt version, model version, parameters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garbage responses that are hard to catch&lt;/strong&gt;, because ambiguous prompts still produce an output - it's just unreliable. Unlike, say, syntax errors, which break code immediately and surface right away. In April 2025, OpenAI pushed a system prompt update that made GPT-4o excessively flattering for its 500 million weekly users. They later admitted they "focused too much on short-term feedback." It took days and a social media firestorm before they rolled it back. That system prompt change wasn't treated as a release candidate. Nobody tested it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitClear's 2025 analysis of 211 million lines of code actually found that AI output creates what they called an "illusion of correctness" - the visual neatness and consistent style of generated code caused developers to trust it without thorough validation. Their data showed review participation fell nearly 30%.&lt;/p&gt;

&lt;p&gt;And well, if OpenAI doesn't always get this right, the "prompts" your team shares without any control probably need some curation too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompts are code. Yes, even that ChatGPT one.
&lt;/h2&gt;

&lt;p&gt;So I formulated a few simple principles that help me a lot, and hopefully they'll be useful for you too:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prompts are code.&lt;/strong&gt; Version them, review them and run automated evals on every change. Like you always do with your code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You haven't written all the instructions.&lt;/strong&gt; When working with a black-box system, your input and instructions (while critical) don't define 100% of the output. Hidden biases often surface on edge cases. Diverse test inputs catch what prompt tweaking alone doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Works on my machine" proves nothing.&lt;/strong&gt; A prompt that works today may silently break after a model update. Only automated, repeatable tests can give you confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Garbage in, garbage out.&lt;/strong&gt; Prompt quality is the most critical success factor. Test your inputs even more than your outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can't reproduce - can't debug.&lt;/strong&gt; You have to keep the full context - inputs, params, model version and more - under your control 100% of the time to be able to reproduce and catch an issue reliably.&lt;/li&gt;
&lt;/ol&gt;
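&lt;p&gt;Principle 5 is the easiest one to start acting on. Here's a minimal sketch of capturing the full context of every call - in practice you'd ship this to a tracing backend rather than print it, and all the names here are illustrative:&lt;/p&gt;

```python
import hashlib
import json
import time

def record_llm_call(prompt: str, model: str, params: dict, response: str) -> dict:
    """Capture everything needed to replay and debug this call later."""
    return {
        "ts": time.time(),
        "model": model,    # pin the exact model version, not a floating alias
        "params": params,  # temperature, max_tokens, ...
        # Hash the prompt so you can tell "same prompt" from "someone tweaked it"
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "response": response,
    }

entry = record_llm_call(
    prompt="Summarise this ticket in one sentence.",
    model="gpt-4o-2024-08-06",  # illustrative pinned version
    params={"temperature": 0, "max_tokens": 200},
    response="User cannot reset their password from the mobile app.",
)
print(json.dumps(entry, indent=2))  # one record per call, into your log store
```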

&lt;p&gt;And there's one important thing most people miss: these principles apply not just to product prompts.&lt;/p&gt;

&lt;p&gt;Think about it. The ChatGPT instruction you pasted into your team's wiki. The Cursor rules file your tech lead shared on Slack. The "system prompt for code review" your org adopted from a conference talk. If multiple people use it and nobody's actually measured whether it works - well, you have untested code running in production. You just don't call it that.&lt;/p&gt;

&lt;p&gt;That gap between perception and reality is the whole problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb I use:&lt;/strong&gt; if it lives in a file and runs more than once - it needs testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt testing techniques
&lt;/h2&gt;

&lt;p&gt;So if we need to start testing prompts, where do we begin?&lt;/p&gt;

&lt;p&gt;Unfortunately, direct assertions like &lt;code&gt;assertEqual(LM_response, expected)&lt;/code&gt; won't work. There may be many valid outputs, and even if you set &lt;code&gt;temperature=0&lt;/code&gt;, you're not guaranteed identical outputs across runs. Quality lives on a spectrum - relevance, coherence, accuracy - not a binary pass/fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  What still works from traditional testing?
&lt;/h3&gt;

&lt;p&gt;Good news - many "traditional" principles are still applicable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD integration.&lt;/strong&gt; Automated pipelines, shifting tests left, running evals on every PR are still your best friends.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test setup &amp;amp; structure.&lt;/strong&gt; This didn't change much either: fixtures, datasets, parameterised tests and pytest-style structure all still apply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure (code-based) assertions.&lt;/strong&gt; Validating output format (JSON/XML/CSV), mandatory fields and type constraints is fast, cheap and catches a lot of issues early. Don't underestimate it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heuristic (rule-based) checks.&lt;/strong&gt; In my experience these catch roughly 30-40% of issues. They're good for things like output length constraints (too short = incomplete, too long = verbose) and required elements (must contain XYZ, must have 3+ bullet points).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some quick examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Heuristic checks
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_response_basics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response too short - likely incomplete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response too verbose&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;disclaimer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required disclaimer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;•&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Must contain at least 3 bullet points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ...
&lt;/span&gt;
&lt;span class="c1"&gt;# Code-based (structure) assertions
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jsonschema&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;validate&lt;/span&gt;

&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minLength&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;array&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minItems&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;priority&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_output_structure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fails if not valid JSON
&lt;/span&gt;    &lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# fails if schema mismatch
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The "new" stuff
&lt;/h3&gt;

&lt;p&gt;So the gap that traditional testing can't cover gets addressed with other techniques. Here are the main ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic similarity&lt;/strong&gt; lets you validate whether the response conveys the same meaning as a reference (actual vs expected). Simplified: it converts the actual response and the expected output to vectors and measures the cosine similarity between them. BERTScore handles this locally - no API calls, no cost per evaluation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bert_score&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;bert_score&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_semantic_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;R&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bert_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;lang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microsoft/deberta-xlarge-mnli&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;F1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Semantic drift detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;F1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it when you have "golden sample" outputs and need to detect drift. A typical threshold sits around 0.7-0.85. Tuning that number is where the art comes in (I usually start at 0.75 and adjust based on what my golden samples actually score).&lt;/p&gt;
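&lt;p&gt;One way to make that tuning less of a guess: score your golden pairs first, then place the threshold just below the worst in-distribution score. A minimal sketch - &lt;code&gt;calibrate_threshold&lt;/code&gt; is a hypothetical helper of mine, and the scores are made-up numbers, not real BERTScore output:&lt;/p&gt;

```python
# Hypothetical calibration helper: given F1 scores measured on known-good
# (golden) pairs, set the pass/fail threshold just below the worst of them,
# so every known-good output still passes with a small safety margin.
def calibrate_threshold(golden_scores: list[float], margin: float = 0.05) -> float:
    if not golden_scores:
        raise ValueError("need at least one golden score")
    return min(golden_scores) - margin

# Made-up F1 scores observed on five golden samples
scores = [0.84, 0.81, 0.88, 0.79, 0.86]
threshold = calibrate_threshold(scores)  # just below the worst golden score
```

&lt;p&gt;Re-run the calibration whenever the golden set changes - a threshold tuned against an old sample set drifts just like the prompts do.&lt;/p&gt;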

&lt;p&gt;&lt;strong&gt;LLM-as-judge&lt;/strong&gt; is a method where another LLM scores output against certain criteria. It's mostly used to evaluate response-quality aspects like relevancy, correctness, and tone. The judge evaluates the response and returns a score in the [0, 1] range, which can be tracked and used as a quality gate (with a threshold).&lt;/p&gt;
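&lt;p&gt;The mechanics are simple enough to sketch without any framework. Everything here is hypothetical - the rubric, the JSON verdict format, and &lt;code&gt;call_llm&lt;/code&gt; (stubbed below) stand in for a real judge-model call:&lt;/p&gt;

```python
import json

# Hypothetical rubric: ask the judge for a structured verdict, not free text
RUBRIC = (
    "Score the RESPONSE for relevancy to the QUESTION on a 0-1 scale. "
    'Reply with JSON only: {"score": <float>, "reason": "<one sentence>"}'
)

def judge(question: str, response: str, call_llm) -> float:
    """Score a response with a judge model; call_llm is any text-in/text-out client."""
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}"
    verdict = json.loads(call_llm(prompt))
    return float(verdict["score"])

# Stub standing in for a real API call, just to show the flow
fake_llm = lambda prompt: '{"score": 0.82, "reason": "On topic and correct."}'

score = judge("What does HTTP 404 mean?", "The requested resource was not found.", fake_llm)
assert score > 0.7  # the threshold is what turns the score into a quality gate
```

&lt;p&gt;Asking for a structured verdict with a one-sentence reason also gives you something to read when a score looks wrong.&lt;/p&gt;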

&lt;p&gt;In my experience the other techniques are fairly intuitive, while LLM-as-judge raises more questions - which is why I want to go deeper into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM-as-judge: a closer look
&lt;/h2&gt;

&lt;p&gt;For this section I'll use the example of a dummy SDLC "agent" that does TDD (it's really just two small prompts). For demonstration I'm going to use DeepEval (no specific reason - I'm just used to it; other solutions are just as good).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Defining metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we define what we measure. Instead of asking "is this good?" we evaluate against specific criteria so "the judge" knows what to score.&lt;/p&gt;

&lt;p&gt;For example, for the code-gen prompt I'm checking whether the generated code satisfies the unit tests (as it's a TDD-style process):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GEval&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCaseParams&lt;/span&gt;

&lt;span class="n"&gt;code_satisfies_tests&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GEval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CodeSatisfiesTests&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implementation would make all provided test cases pass.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;LLMTestCaseParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACTUAL_OUTPUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;LLMTestCaseParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EXPECTED_OUTPUT&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GEval is a DeepEval built-in that lets you create custom metrics. The technique comes from a paper called G-Eval, which showed a 0.514 Spearman correlation with human judgments - the highest of any automated method at the time.&lt;/p&gt;

&lt;p&gt;Another metric worth showing is alignment - at a high level, it checks whether a prompt generates output that follows its instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptAlignmentMetric&lt;/span&gt;

&lt;span class="n"&gt;aaa_pattern&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptAlignmentMetric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;prompt_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each test follows Arrange-Act-Assert pattern&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Each test has a single assertion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The score here is calculated as &lt;code&gt;instructions_followed / total_instructions&lt;/code&gt;. That is, if the prompt says "one assertion per test" and "use AAA pattern," the metric checks each instruction explicitly. The &lt;code&gt;threshold&lt;/code&gt; is the pass/fail line below which an exception is thrown (I always start at 0.5-0.7 and tighten based on data).&lt;/p&gt;
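&lt;p&gt;The arithmetic itself is trivial - a sketch, where &lt;code&gt;alignment_score&lt;/code&gt; is my own helper rather than anything from DeepEval:&lt;/p&gt;

```python
# Alignment score = fraction of instructions the judge marked as followed
def alignment_score(followed: list[bool]) -> float:
    return sum(followed) / len(followed)

# e.g. "single assertion" followed, "AAA pattern" violated:
score = alignment_score([True, False])
assert score == 0.5  # below a 0.7 threshold, so the metric would fail
```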

&lt;p&gt;&lt;strong&gt;Writing the test&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This part might actually feel more familiar, as the structure closely mirrors pytest. &lt;code&gt;assert_test&lt;/code&gt; evaluates your LLM output against the metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_test&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.test_case&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMTestCase&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;src.metrics.code_gen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;code_satisfies_tests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;src.prompt_runner&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_prompt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets.tdd_prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_code_gen_dataset&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_code_generation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_code_gen_dataset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goldens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;actual_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-gen.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;additional_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;additional_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;test_case&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMTestCase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;actual_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;golden&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expected_output&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;assert_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_case&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code_satisfies_tests&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But here's the trade-off: LLM-as-judge is slower and more expensive than heuristics and code assertions. So don't reach for it first. Use heuristics and code assertions for the cheap wins. Add semantic similarity when you have golden samples. Reserve LLM-as-judge for the quality criteria you genuinely can't check with code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Component-level: when one prompt isn't enough&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While end-to-end evals work well for a single prompt, there are cases where they're not enough - multi-step pipelines, agents, RAG systems, etc. When one of those fails, you need to know &lt;em&gt;which exact&lt;/em&gt; component broke.&lt;/p&gt;

&lt;p&gt;So here I formulated for myself another important rule I try to follow:&lt;/p&gt;

&lt;p&gt;Monolithic prompts that do multiple things are hard to debug. Split them into focused steps (prompt chaining), each with clear inputs and outputs. Test the parts independently, so when something fails, you know exactly where.&lt;/p&gt;

&lt;p&gt;DeepEval's &lt;code&gt;@observe&lt;/code&gt; decorator creates individually testable steps - &lt;strong&gt;spans&lt;/strong&gt;, where each gets its own metrics. A full execution creates a &lt;strong&gt;trace&lt;/strong&gt; containing all spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;deepeval.tracing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;observe&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test-gen.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-gen.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nd"&gt;@observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code_satisfies_tests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aaa_pattern&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tdd_flow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_tests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;design&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Start here, not everywhere
&lt;/h2&gt;

&lt;p&gt;Don't try to boil the ocean. Pick one prompt - the one that matters most. Pick one or two metrics. Start with a threshold of 0.5-0.7 and tighten based on actual data, not gut feel.&lt;/p&gt;

&lt;p&gt;To avoid getting lost, here's the &lt;strong&gt;rule of thumb&lt;/strong&gt; I rely on: use code assertions and heuristics first, as they are fast, cheap, and catch ~30% of issues. Use embedding-based semantic similarity (e.g. BERTScore) if you have a "golden sample" data set plus edge cases. Rely on LLM-as-judge for quality criteria that you can't check with code - and make those criteria specific.&lt;/p&gt;
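&lt;p&gt;For that first layer, "code assertions and heuristics" can be as plain as a function of string checks. A sketch - the specific checks are illustrative rather than exhaustive, and the JSON check only makes sense when your prompt demands JSON output:&lt;/p&gt;

```python
import json

def cheap_checks(output: str) -> list[str]:
    """Fast, free assertions to run before any model-based eval. Returns failures."""
    failures = []
    if not output.strip():
        failures.append("empty output")
    if len(output) > 4000:
        failures.append("suspiciously long output")
    for phrase in ("as an ai language model", "i cannot help"):
        if phrase in output.lower():
            failures.append(f"refusal phrase: {phrase!r}")
    try:
        json.loads(output)  # only relevant when the prompt demands JSON
    except ValueError:
        failures.append("not valid JSON")
    return failures

assert cheap_checks('{"status": "ok"}') == []
```

&lt;p&gt;Run these on every output; escalate to semantic similarity or a judge only for what survives.&lt;/p&gt;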

&lt;p&gt;And avoid these mistakes, because I've made every single one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overfitting to test cases.&lt;/strong&gt; Your prompt works flawlessly with your five examples but falls apart on real inputs. Test data ≠ all possible inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thresholds too high.&lt;/strong&gt; Setting 0.95 as your pass/fail line and then wondering why good outputs keep failing. LLM-as-judge has variance. Start at 0.5-0.7, tighten based on data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing only happy paths.&lt;/strong&gt; Clean inputs pass. Production inputs break. Include edge cases: empty inputs, malformed data, languages you didn't plan for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skipping human review.&lt;/strong&gt; Automation catches regressions, not nuance. Use both: automated evals for CI, periodic expert review for calibration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing too late.&lt;/strong&gt; Finding issues after your prompt is deployed to users, not before. Shift left. Run evals on every PR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measuring everything at once.&lt;/strong&gt; Ten metrics from day one, none properly tuned. Start with one or two critical criteria. Add more when those are stable.&lt;/p&gt;

&lt;p&gt;The hard part isn't writing the test. It's defining what "good" means for your specific case. Once you have that, the tooling is the easy part.&lt;/p&gt;

&lt;p&gt;So here's the simple rule: test any prompt that is shared with someone or will be used more than once.&lt;/p&gt;

&lt;p&gt;That's it. Not just your product prompts. Your Cursor rules. Your ChatGPT templates. The "just paste this into Claude" messages on Slack.&lt;/p&gt;

&lt;p&gt;This is a new reality, and I'd probably call it AI hygiene. Just as you wouldn't share untested code with your team, you shouldn't share untested prompts either.&lt;/p&gt;

</description>
      <category>promptengineering</category>
      <category>ai</category>
      <category>softwareengineering</category>
      <category>software</category>
    </item>
  </channel>
</rss>
