At Google, our team (Google Cloud Samples) uses Gemini to produce thousands of samples in batches. In doing so, we've learned that the biggest hurdle isn't the AI, it's our own expectations about these tools. As developers, we are wired for deterministic systems: we call a function and it produces the same result for the same input every time. This predictability allows for standard unit tests.
Large Language Models (LLMs), however, are probabilistic and stochastic. They don't store facts; they store the likelihood of patterns and use a "sophisticated roll of the dice" to choose the next token. This is why the same prompt can yield a “sparkly” ✨ success one minute and a hallucination 🤪 the next. You aren't just testing code anymore; you are forecasting the weather of your system. To move to production, we must build containment structures (like quality gates and evaluators) that make the unpredictability manageable.
LLMs Can Make Mistakes
Producing samples in large batches is different from asking a tool like Gemini CLI for a single sample. When producing many samples at once, we see more mistakes because the statistics catch up with us: a small percentage of bad samples becomes a large absolute number as the total grows, not unlike defect rates in manufacturing. Here are some examples of mistakes.
Sometimes we detect code with syntax issues, like the def def snippet below. Python uses a single def keyword to mark the start of a function definition.
def def create_secret_with_expiration(
project_id: str, location: str, secret_id: str
):
Syntax issues like this can be detected with linting or other build tools. If we detect them in our pipeline, we can simply regenerate the sample. Other times the issues are more subtle, like how the JSDoc below sits 7 lines away from the function it documents, separated from it by a 'use strict' directive, imports, and a client instantiation.
/**
* Get secret metadata.
*
* @param projectId Google Cloud Project ID (such as 'example-project-id')
* @param secretId ID of the secret to retrieve (such as 'my-secret-id')
*/
'use strict';
const {SecretManagerServiceClient} = require('@google-cloud/secret-manager');
const {status} = require('@grpc/grpc-js');
const client = new SecretManagerServiceClient();
async function getSecretMetadata(projectId, secretId) {
Or other times the docstring is incorrect, like how the docstring below is missing parameters used by the function it documents.
def create_secret_with_notifications(
project_id: str, location: str, secret_id: str, topics: list[str]
) -> None:
"""Create Secret with Pub/Sub Notifications. Creates a new secret resource
configured to send notifications to Pub/Sub topics. This enables external
systems to react to secret lifecycle events.
Args:
project_id: The Google Cloud project ID. for example,
'example-project-id'
location: The location of the resource. for example, 'us-central1'
"""
Issues don’t always show up directly in code, either. We have Gemini generating build artifacts, like package.json. In the case below, it was so eager to include the gRPC package that it listed it three times in different ways, including one that has been deprecated.
{
"name": "example",
"private": true,
"description": "Google Cloud Platform Code Samples 🎒",
"dependencies": {
"@google-cloud/secret-manager": "latest",
"@grpc/grpc-js": "latest",
"@grpc/grpc-js": "^1.10.0",
"grpc": "latest"
},
"scripts": {
"test": "node --test"
}
}
We have other, more subtle issues as well. Sometimes the code is correct but saved with the wrong filename or in the wrong folder structure. Issues like these lead to more manual evaluation and testing. By iterating on prompts with evaluation, we have improved our results.
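To make "detect and regenerate" concrete, here is a sketch of the kind of deterministic gates such a pipeline might run. The function names and rules are illustrative, not our actual pipeline:

```python
import ast
import json


def check_python_syntax(source: str) -> list[str]:
    """Reject samples that do not even parse, like the 'def def' example."""
    try:
        ast.parse(source)
        return []
    except SyntaxError as err:
        return [f"syntax error: {err.msg} (line {err.lineno})"]


def check_package_json(text: str) -> list[str]:
    """Flag duplicate or deprecated dependencies, like the triple gRPC listing."""
    issues = []
    deps = json.loads(text).get("dependencies", {})
    # json.loads silently keeps only the last value for a duplicated key,
    # so scan the raw text for package names that appear more than once.
    for name in deps:
        if text.count(f'"{name}"') > 1:
            issues.append(f"dependency listed more than once: {name}")
    if "grpc" in deps:
        issues.append("deprecated package: grpc")
    return issues
```

Any sample that trips one of these checks goes back for regeneration instead of on to a human reviewer.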
Prompt Templates as Functional Interfaces
Quality responses are guided by three elements: the input data, a prompt template, and the LLM itself. As part of prompting for production, we’re evaluating prompt templates, like those created with the dotprompt format. Below is a very simple example of a prompt template in dotprompt. Using the prompt template, we can reuse the same prompt text over and over with different inputs. Prompt templates give us a functional interface for interacting with the LLM.
---
model: gemini-3-flash-preview
input:
schema:
need: string
language: string
output:
schema:
code: string
---
Generate code that satisfies the need of {{ need }} using language {{ language }}.
By using templates, we can run the same logic across hundreds of different inputs to see where the "weather" changes.
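As a simplified illustration, the template body above can be rendered against many inputs with plain string substitution (a stand-in for dotprompt's real Handlebars-style rendering, not its actual API):

```python
TEMPLATE = (
    "Generate code that satisfies the need of {{ need }} "
    "using language {{ language }}."
)


def render(template: str, values: dict[str, str]) -> str:
    """Substitute {{ key }} placeholders with input values."""
    for key, value in values.items():
        template = template.replace("{{ " + key + " }}", value)
    return template


# The same prompt logic, reused across different inputs.
inputs = [
    {"need": "access a secret version", "language": "Python"},
    {"need": "access a secret version", "language": "Node.js"},
]
prompts = [render(TEMPLATE, values) for values in inputs]
```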
We've found that a successful workflow follows these phases:
- Build a foundation with Ground Truth
- Finding Your Candidate Prompt (Vibe Check)
- Statistical Trials – Because Unit Tests Alone Don’t Work
Phase 1: Build a foundation with Ground Truth
In the prompt template world, the template is only part of the picture. We need the input values as well, along with the matching expected output values. You may say “But this sounds like unit testing!” and you would be right; it is a similar idea. The amount of testing data you need depends on what question you want to answer. If your question boils down to “Is the prompt template bad?” then 5-10 records of input/output test data are good enough. This will help you eliminate a bad prompt template quickly. If your question is more “Will my prompt template work well?” then you need 50-100. The more edge cases you can insert into your test data, the better.
Fortunately, we have a golden set of samples we can use as known good testing data. We continue to iterate on our test data while also adding more samples to it.
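A ground-truth record can be as simple as template inputs paired with a known-good output. The shape below is a hypothetical example, not our actual schema:

```python
# Each record pairs template inputs with a known-good expected output
# drawn from the golden set.
ground_truth = [
    {
        "input": {"need": "access a secret version", "language": "Python"},
        "expected": "def access_secret_version(project_id: str, secret_id: str, version_id: str) -> None: ...",
    },
    {
        "input": {"need": "delete a secret", "language": "Node.js"},
        "expected": "async function deleteSecret(projectId, secretId) { /* ... */ }",
    },
    # 5-10 records are enough to rule out a bad template quickly;
    # 50-100, with edge cases, build confidence that it works well.
]
```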
Phase 2: Finding Your Candidate Prompt (Vibe Check)
Before you share with your team, start experimenting by using a tool like Google AI Studio to develop some handmade prompts. Try them with different inputs and outputs. Build an intuition for what works and what doesn’t. Use Gemini to help in your evaluation.
AI Studio’s playground can be very helpful at this stage, including its support for structured outputs, which can help plan the output schema used in our dotprompt file. When you feel good about your results, you have anecdotal evidence that your prompt template might work, but not statistical evidence.
Phase 3: Statistical Trials – Because Unit Tests Alone Don’t Work
Does your candidate prompt template work with many different inputs? This is where things get more complex and we move from familiar deterministic unit testing to probabilistic testing. Because the LLM could answer differently each time, we need to run multiple trials for each input/output test record. But how many is enough? For recent academic work, my previous team ran as many as 128 trials per input/output pair for better statistical relevance, but this gets expensive fast. To balance cost, time, and effort, the community consensus is either four or five trials per input/output test record. The argument for five over four is that we need an odd number to “break ties.”
But how do you know if the output of your prompt is working well? Use deterministic metrics. In the case of samples, we build the code, lint it, and apply other static analysis tools, all of which provide deterministic review and feedback. Finally, once we have something that passes those quality gates, we perform manual testing and human review. With this many quality gates and a large number of samples, we can begin to rely on the Law of Large Numbers to determine if a prompt template is working, and not worry about four or five trials per sample.
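Put together, a trial runner might look like the sketch below, where `generate` stands in for a real model call and `passes_gates` for the build/lint/static-analysis checks; both are placeholders, not real APIs:

```python
def pass_rate(records, generate, passes_gates, trials: int = 5) -> float:
    """Fraction of trials, across all ground-truth records, that clear the gates."""
    passed = total = 0
    for record in records:
        # The LLM can answer differently each time, so sample repeatedly.
        for _ in range(trials):
            output = generate(record["input"])
            passed += passes_gates(output, record)
            total += 1
    return passed / total
```

A template whose pass rate stays high across many records and trials is earning its place; one that only passes occasionally needs another iteration.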
Embracing Statistical Techniques For The Best Performance
Beyond prompt templates, we can evaluate other parts of our workflow. The scenarios below show how we can change one element of the workflow while holding the others fixed (freeze). For each, we start by listing the question we want to answer, then list which elements to change and which to freeze.
- How well does my new prompt template work?
- Change: prompt template
- Freeze: model, hyperparameters, ground truth input and output
- How does a different model or model version affect the results?
- Change: model
- Freeze: hyperparameters, ground truth input and output, prompt template
- Is a new input value a useful addition to the ground truth?
- Change: input value
- Freeze: model, hyperparameters, ground truth output, prompt template
- Is a new output value a useful addition to the ground truth?
- Change: output value
- Freeze: model, hyperparameters, ground truth input, prompt template
- Will changing the hyperparameter values improve the results?
- Change: hyperparameter value
- Freeze: model, ground truth input and output, prompt template
Say a new model version is released and we have results from testing the previous model. We can keep the hyperparameters, ground truth, and prompt template the same as before. Then we change the model in the dotprompt file and rerun our evaluation. Now we have data to decide if we want to use the new model version. Likewise, we can alter the other items in the list above to answer other questions.
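One way to keep these experiments honest is to make the frozen elements explicit. The structure and file names below are illustrative:

```python
# Everything frozen is shared between runs; only the element
# under test differs between the baseline and the candidate.
frozen = {
    "prompt_template": "generate_sample.prompt",           # hypothetical file
    "hyperparameters": {"temperature": 0.7, "top_p": 0.95},
    "ground_truth": "golden_samples.json",                 # hypothetical file
}

experiments = [
    {**frozen, "model": "gemini-2.5-flash"},        # previous model
    {**frozen, "model": "gemini-3-flash-preview"},  # new model version
]
```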
We might be able to sidestep the statistical testing by forcing Gemini to behave more deterministically. We could set its hyperparameters to their most deterministic values – such as temperature at 0, top-k at 1, top-p at 0 – or use the same seed value every time. This creates its own issues, and it does not rid us of the need for testing. What if a given prompt’s deterministic response is incorrect every time? How do we automatically correct things for which there are no deterministic tools? We also want some degree of creativity and stochasticity in the responses, so we have the option of running the generation again with the probability of getting a better response. We embrace this power, but we also need to be more statistics-minded about our testing to make sure our prompts are there for us when we need them.
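For reference, pinning the model down this way would mean adding a config block to the dotprompt frontmatter. This is a sketch; the exact field names follow the dotprompt convention and are worth checking against the format's documentation:

```
---
model: gemini-3-flash-preview
config:
  temperature: 0
  topK: 1
  seed: 42
---
Generate code that satisfies the need of {{ need }} using language {{ language }}.
```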
Join the Conversation
I’m curious about what others are doing to help evaluate their prompts and prompt templates.
- Are you just starting out? How do you do your vibe checks? How do you test before shipping?
- Have you been evaluating prompts for a while? How many times do you evaluate a prompt template before putting it into production? How do you keep time and cost down?
- What recommendations do you follow when testing prompts? Do you have sources to share? Can we do this better?
- What workflows have you found to work?
Please share in the comments.
Read More
- A paper on the “Budget 5”: Confidence Improves Self-Consistency in LLMs
- Vertex AI’s advice on evaluation datasets
- Anthropic’s Writing Effective tools for AI agents
- Stanford’s and UC Santa Barbara’s With Little Power Comes Great Responsibility, about how many NLP studies are underpowered in terms of statistical testing
- 7 Technical Takeaways from Using Gemini to Generate Code Samples at Scale
- How My Team Aligns on Prompting for Production
Thanks to Jennifer Davis, Adam Ross, Nim Jayawardena, and Katie McLaughlin for feedback on this post.