<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shawn Jones</title>
    <description>The latest articles on DEV Community by Shawn Jones (@shawnmjones).</description>
    <link>https://dev.to/shawnmjones</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3355452%2F425a453f-5366-43b1-bb10-8548e1061220.jpeg</url>
      <title>DEV Community: Shawn Jones</title>
      <link>https://dev.to/shawnmjones</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shawnmjones"/>
    <language>en</language>
    <item>
      <title>Making Sure Your Prompt Will Be There For You When You Need It</title>
      <dc:creator>Shawn Jones</dc:creator>
      <pubDate>Tue, 10 Mar 2026 17:44:00 +0000</pubDate>
      <link>https://dev.to/googlecloud/making-sure-your-prompt-will-be-there-for-you-when-you-need-it-lk7</link>
      <guid>https://dev.to/googlecloud/making-sure-your-prompt-will-be-there-for-you-when-you-need-it-lk7</guid>
      <description>&lt;p&gt;At Google, our team (Google Cloud Samples) &lt;a href="https://adamross.dev/p/prompting-for-production/" rel="noopener noreferrer"&gt;uses Gemini to produce thousands of samples&lt;/a&gt; in batches. In doing so, we've learned that the biggest hurdle isn't the AI, it's our own expectations about these tools. As developers, we are wired for &lt;a href="https://en.wikipedia.org/wiki/Deterministic_system" rel="noopener noreferrer"&gt;deterministic&lt;/a&gt; systems: we call a function and it produces the same result for the same input every time. This predictability allows for standard unit tests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/ai/llms" rel="noopener noreferrer"&gt;Large Language Models&lt;/a&gt; (LLMs) however, are &lt;a href="https://medium.com/@raj-srivastava/the-great-llm-debate-are-they-probabilistic-or-stochastic-3d1cd975994b" rel="noopener noreferrer"&gt;probabilistic&lt;/a&gt; and stochastic. They don't store facts; they store the likelihood of patterns and use a "sophisticated roll of the dice" to choose the next token. This is why the same prompt can yield a “&lt;a href="https://www.wsj.com/tech/ai/how-the-sparkles-emoji-became-the-symbol-of-our-ai-future-e7786eef" rel="noopener noreferrer"&gt;sparkly&lt;/a&gt;” ✨ success one minute and a hallucination 🤪 the next. You aren't just testing code anymore; you are forecasting the weather of your system. To move to production, we must build containment structures (like quality gates and evaluators) that make the unpredictability manageable.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs Can Make Mistakes
&lt;/h2&gt;

&lt;p&gt;Trying to make samples in large batches is different from asking for a single sample from a tool like &lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;. When producing many samples at once, we see more mistakes because the statistics catch up with us: a small percentage of bad samples becomes a large absolute number of failures as the batch size grows, not unlike defect rates in manufacturing. Here are some examples of the mistakes we see.&lt;/p&gt;

&lt;p&gt;Sometimes we detect code with syntax issues, like the &lt;code&gt;def def&lt;/code&gt; snippet below. Python uses only one &lt;code&gt;def&lt;/code&gt; keyword to specify the start of a &lt;a href="https://docs.python.org/3/tutorial/controlflow.html#defining-functions" rel="noopener noreferrer"&gt;function definition&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_secret_with_expiration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
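&lt;p&gt;This class of failure is cheap to catch mechanically. As a minimal sketch (illustrative, not our production pipeline), Python’s built-in &lt;code&gt;ast&lt;/code&gt; module can act as a syntax gate; the doubled &lt;code&gt;def&lt;/code&gt; above fails to parse:&lt;/p&gt;

```python
import ast

def passes_syntax_gate(source: str) -> bool:
    """Return True if the generated Python source parses cleanly."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

good = "def create_secret(project_id: str): ..."
bad = "def def create_secret(project_id: str): ..."

print(passes_syntax_gate(good))  # True
print(passes_syntax_gate(bad))   # False
```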



&lt;p&gt;Syntax issues like this can be detected with linting or other build tools. If we detect them in our pipeline, we can just regenerate the sample. Other times the issues are more subtle, like the JSDoc below, which sits seven lines away from the function it documents, separated from it by a &lt;code&gt;'use strict'&lt;/code&gt; directive, imports, and an object instantiation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * Get secret metadata.
 *
 * @param projectId Google Cloud Project ID (such as 'example-project-id')
 * @param secretId ID of the secret to retrieve (such as 'my-secret-id')
 */&lt;/span&gt;
&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;use strict&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;SecretManagerServiceClient&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@google-cloud/secret-manager&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@grpc/grpc-js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SecretManagerServiceClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getSecretMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secretId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other times the docstring is incorrect, like the docstring below, which is missing parameters used by the function it documents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_secret_with_notifications&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secret_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create Secret with Pub/Sub Notifications. Creates a new secret resource
    configured to send notifications to Pub/Sub topics. This enables external
    systems to react to secret lifecycle events.

    Args:
        project_id: The Google Cloud project ID. for example,
            &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;example-project-id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
        location: The location of the resource. for example, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
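&lt;p&gt;Docstring drift like this can also be gated deterministically. A hedged sketch in Python (the function name and matching heuristic are hypothetical, not our actual tooling) that compares a function’s parameters against the names documented under &lt;code&gt;Args:&lt;/code&gt;:&lt;/p&gt;

```python
import ast

def missing_docstring_args(source: str) -> list:
    """Return signature parameters absent from the docstring's Args section."""
    func = ast.parse(source).body[0]
    params = [a.arg for a in func.args.args]
    doc = ast.get_docstring(func) or ""
    # Naive heuristic: a documented parameter appears as "name:" in the docstring.
    return [p for p in params if p + ":" not in doc]

sample = '''
def create_secret_with_notifications(project_id: str, location: str, topics: list):
    """Create Secret with Pub/Sub Notifications.

    Args:
        project_id: The Google Cloud project ID.
        location: The location of the resource.
    """
'''
print(missing_docstring_args(sample))  # ['topics']
```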



&lt;p&gt;Issues don’t always show up directly in code, either. We have Gemini generating build artifacts, like &lt;code&gt;package.json&lt;/code&gt;. In the case below, it was so eager to include the &lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt; package that it listed the package three times under different names, including one that has been deprecated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"private"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Google Cloud Platform Code Samples 🎒"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@google-cloud/secret-manager"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@grpc/grpc-js"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@grpc/grpc-js"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^1.10.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"grpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"latest"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scripts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node --test"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
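&lt;p&gt;A standard JSON parser will not flag this: &lt;code&gt;json.loads&lt;/code&gt; silently keeps only the last value for a duplicated key. A small detector, sketched here with Python’s &lt;code&gt;object_pairs_hook&lt;/code&gt;, can surface the duplicates before the manifest ships:&lt;/p&gt;

```python
import json

def find_duplicate_keys(text: str) -> list:
    """Return object keys that appear more than once in a JSON document."""
    dupes = []

    def check(pairs):
        # The hook sees every key/value pair, including duplicates that
        # a plain json.loads would silently collapse.
        keys = [k for k, _ in pairs]
        dupes.extend(k for k in keys if keys.count(k) > 1)
        return dict(pairs)

    json.loads(text, object_pairs_hook=check)
    return sorted(set(dupes))

manifest = '{"dependencies": {"@grpc/grpc-js": "latest", "@grpc/grpc-js": "^1.10.0"}}'
print(find_duplicate_keys(manifest))  # ['@grpc/grpc-js']
```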



&lt;p&gt;We have other, more subtle issues as well. Sometimes the code is correct but not saved with the correct filename or in the correct folder structure. Issues like these lead to more manual evaluation and testing. By iterating on prompts with evaluation, we have improved our results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt Templates as Functional Interfaces
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdgujtput47i31wr7kmw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdgujtput47i31wr7kmw.webp" alt="The LLM alone is not the function. The input data, the prompt template, and the LLM together create your response." width="515" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quality responses are guided by the three elements shown above: the input data, a &lt;strong&gt;prompt template&lt;/strong&gt;, and the LLM itself. As part of &lt;a href="https://adamross.dev/p/prompting-for-production/" rel="noopener noreferrer"&gt;prompting for production&lt;/a&gt;, we’re evaluating prompt templates, like those created with the &lt;a href="https://github.com/google/dotprompt" rel="noopener noreferrer"&gt;dotprompt&lt;/a&gt; format. Below is a very simple example of a prompt template in dotprompt. Using a prompt template, we can reuse the same prompt text over and over with different inputs. Prompt templates give us a functional interface for interacting with the LLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-3-flash-preview&lt;/span&gt;
&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;need&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
&lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="s"&gt;Generate code that satisfies the need of {{ need }} using language {{ language }}.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using templates, we can run the same logic across hundreds of different inputs to see where the "weather" changes.&lt;/p&gt;
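&lt;p&gt;To make the “functional interface” idea concrete, here is a minimal sketch of template rendering in Python. This is a hypothetical stand-in for the real dotprompt runtime; it only does plain &lt;code&gt;{{ name }}&lt;/code&gt; substitution:&lt;/p&gt;

```python
# Hypothetical stand-in for the dotprompt runtime: plain {{ name }} substitution.
TEMPLATE = "Generate code that satisfies the need of {{ need }} using language {{ language }}."

def render(template: str, **values: str) -> str:
    """Fill each {{ placeholder }} with its value."""
    for name, value in values.items():
        template = template.replace("{{ " + name + " }}", value)
    return template

# The same template text, reused across many different inputs.
inputs = [
    {"need": "create a secret", "language": "Python"},
    {"need": "get secret metadata", "language": "JavaScript"},
]
for record in inputs:
    print(render(TEMPLATE, **record))
```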

&lt;p&gt;We've found that a successful workflow follows these phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a foundation with Ground Truth
&lt;/li&gt;
&lt;li&gt;Finding Your Candidate Prompt (Vibe Check)
&lt;/li&gt;
&lt;li&gt;Statistical Trials – Because Unit Tests Alone Don’t Work&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: Build a foundation with Ground Truth&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the prompt template world, the template is only part of the picture. We need the input values as well, along with the matching expected output values. You may say “&lt;em&gt;But this sounds like unit testing!&lt;/em&gt;” and you would be right; it is a similar idea. The amount of testing data you need depends on what question you want to answer. If your question boils down to “&lt;em&gt;Is the prompt template bad?&lt;/em&gt;” then 5-10 records of input/output test data are enough. This will help you eliminate a bad prompt template quickly. If your question is more “&lt;em&gt;Will my prompt template work well?&lt;/em&gt;” then you need &lt;a href="https://developers.openai.com/api/docs/guides/supervised-fine-tuning" rel="noopener noreferrer"&gt;50&lt;/a&gt; - &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/eval-python-sdk/evaluation-dataset" rel="noopener noreferrer"&gt;100&lt;/a&gt; records. The more edge cases you can include in your test data, the better.&lt;/p&gt;

&lt;p&gt;Fortunately, we have a golden set of samples we can use as known good testing data. We continue to iterate on our test data while also adding more samples to it.&lt;/p&gt;
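&lt;p&gt;Concretely, a ground-truth set is just paired template inputs and expected outputs. The records below are invented for illustration; a real set would grow toward 50-100 records with as many edge cases as possible:&lt;/p&gt;

```python
# Illustrative ground-truth records for a code-generation prompt template.
# Each record pairs template inputs with a known-good expected output.
ground_truth = [
    {
        "input": {"need": "create a secret", "language": "Python"},
        "expected": "def create_secret(project_id: str, secret_id: str): ...",
    },
    {
        "input": {"need": "get secret metadata", "language": "JavaScript"},
        "expected": "async function getSecretMetadata(projectId, secretId) {}",
    },
]

print(len(ground_truth))  # 2
```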

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 2: Finding Your Candidate Prompt (Vibe Check)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before you share with your team, start experimenting by using a tool like &lt;a href="https://aistudio.google.com/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt; to develop some handmade prompts. Try them with different inputs and review the outputs. Build an intuition for what works and what doesn’t. Use Gemini to help in your evaluation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36v5coavg9h05tkugqca.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36v5coavg9h05tkugqca.webp" alt="AI Studio can be a useful tool for developing prompts." width="800" height="655"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI Studio’s playground can be very helpful at this stage, including its structured-output support, which can help plan the output schema used in our dotprompt file. When you feel good about your results, you have anecdotal evidence that your prompt template might work, but not statistical evidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 3: Statistical Trials – Because Unit Tests Alone Don’t Work&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Does your candidate prompt template work with many different inputs? This is where things get more complex and we move from the familiar deterministic unit testing to probabilistic testing. Because the LLM could answer differently each time, we need to run multiple trials for each input/output test record. But how many is enough? For &lt;a href="https://doi.org/10.18653/v1/2025.aisd-main.6" rel="noopener noreferrer"&gt;recent academic work&lt;/a&gt;, my previous team ran as many as 128 trials per input/output pair for better statistical relevance, but this gets expensive fast. To balance cost, time, and effort, the community consensus is either &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/configure-judge-model" rel="noopener noreferrer"&gt;four&lt;/a&gt; or &lt;a href="https://arxiv.org/html/2502.06233v1" rel="noopener noreferrer"&gt;five&lt;/a&gt; trials &lt;em&gt;per input/output test record&lt;/em&gt;. The argument for five over four is that we need an odd number to “break ties.”&lt;/p&gt;

&lt;p&gt;But how do you know if the output of your prompt is working well? Use a deterministic metric. In the case of samples, we build the code, lint it, and apply other static analysis tools, all of which provide deterministic review and feedback. Finally, once we have something that passes those quality gates, we perform manual testing and human review. With this many quality gates and a large number of samples, we can begin to rely on the &lt;a href="https://www.probabilitycourse.com/chapter7/7_1_1_law_of_large_numbers.php" rel="noopener noreferrer"&gt;Law of Large Numbers&lt;/a&gt; to determine if a prompt template is working and not worry about four or five trials per sample.&lt;/p&gt;
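&lt;p&gt;The trial loop itself is ordinary bookkeeping. In this hedged sketch, &lt;code&gt;generate&lt;/code&gt; and the gate functions are hypothetical stand-ins for a real LLM call and for build/lint checks; the point is the structure of running several trials per record and scoring the pass rate:&lt;/p&gt;

```python
TRIALS = 5  # community consensus: four or five trials per test record

def pass_rate(records, generate, gates):
    """Fraction of trials whose output clears every deterministic gate."""
    passes = total = 0
    for record in records:
        for _ in range(TRIALS):
            output = generate(record["input"])  # stand-in for an LLM call
            total += 1
            if all(gate(output) for gate in gates):
                passes += 1
    return passes / total

# Toy demonstration with a deterministic fake "model" and one gate.
records = [{"input": "need A"}, {"input": "need B"}]
fake_generate = lambda prompt: "def sample(): pass"
gates = [lambda code: code.startswith("def ")]
print(pass_rate(records, fake_generate, gates))  # 1.0
```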

&lt;h2&gt;
  
  
  Embracing Statistical Techniques For The Best Performance
&lt;/h2&gt;

&lt;p&gt;Beyond prompt templates, we can evaluate other parts of our workflow. The scenarios below show how we can change some elements of the workflow while holding others constant (freezing them). We start by listing the question we want to answer and then list which elements to change and which to freeze.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How well does my new prompt template work?

&lt;ol&gt;
&lt;li&gt;Change: prompt template
&lt;/li&gt;
&lt;li&gt;Freeze: model, hyperparameters, ground truth input and output
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;How well does a different model or model version affect the results?

&lt;ol&gt;
&lt;li&gt;Change: model
&lt;/li&gt;
&lt;li&gt;Freeze: hyperparameters, ground truth input and output, prompt template
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Is a new input value a useful addition to the ground truth?

&lt;ol&gt;
&lt;li&gt;Change: input value
&lt;/li&gt;
&lt;li&gt;Freeze: model, hyperparameters, ground truth output, prompt template
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Is a new output value a useful addition to the ground truth?

&lt;ol&gt;
&lt;li&gt;Change: output value
&lt;/li&gt;
&lt;li&gt;Freeze: model, hyperparameters, ground truth input, prompt template
&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;How will changing the &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/adjust-parameter-values" rel="noopener noreferrer"&gt;hyperparameter&lt;/a&gt; values improve the results?

&lt;ol&gt;
&lt;li&gt;Change: hyperparameter value
&lt;/li&gt;
&lt;li&gt;Freeze: model, ground truth input and output, prompt template&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;/ol&gt;

&lt;p&gt;Say a new model version is released and we have results from testing the previous model. We can keep the hyperparameters, ground truth, and prompt template the same as before. Then we change the model in the dotprompt file and rerun our evaluation. Now we have data to decide if we want to use the new model version. Likewise, we can alter the other items in the list above to answer other questions.&lt;/p&gt;

&lt;p&gt;We might be able to sidestep the statistical testing by forcing Gemini to behave more deterministically. We could set its hyperparameters to their most deterministic values, such as &lt;em&gt;temperature&lt;/em&gt; at 0, &lt;em&gt;top-k&lt;/em&gt; at 1, and &lt;em&gt;top-p&lt;/em&gt; at 0, or use the same &lt;em&gt;seed&lt;/em&gt; value every time. This creates its own issues, and it does not rid us of the need for testing. What if a given prompt’s deterministic response is incorrect every time? How do we automatically correct things for which there are no deterministic tools? We want some degree of creativity and stochasticity in the responses. We want the option of running the generation again with the probability of getting a better response. We embrace this power, but we also need to be more statistics-minded about our testing to make sure our prompts are there for us when we need them.&lt;/p&gt;
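&lt;p&gt;For reference, the most-deterministic settings described above might look like the sketch below. The field names follow common generation-config conventions and are illustrative rather than tied to a specific SDK:&lt;/p&gt;

```python
# Illustrative generation settings pushed toward determinism; exact field
# names vary by SDK, so treat this as a sketch rather than a recipe.
deterministic_config = {
    "temperature": 0.0,  # always favor the highest-probability token
    "top_k": 1,          # consider only the single most likely token
    "top_p": 0.0,        # shrink the sampling nucleus to the top token
    "seed": 42,          # reuse the same seed value on every call
}

print(sorted(deterministic_config))
```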

&lt;h2&gt;
  
  
  Join the Conversation
&lt;/h2&gt;

&lt;p&gt;I’m curious about what others are doing to help evaluate their prompts and prompt templates.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you just starting out? How do you do your vibe checks? How do you test before shipping?
&lt;/li&gt;
&lt;li&gt;Have you been evaluating prompts for a while? How many times do you evaluate a prompt template before putting it into production? How do you keep time and cost down?
&lt;/li&gt;
&lt;li&gt;What recommendations do you follow when testing prompts? Do you have sources to share? Can we do this better?
&lt;/li&gt;
&lt;li&gt;What workflows have you found to work?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please share in the comments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read More
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A paper on the “Budget 5”: &lt;a href="https://arxiv.org/html/2502.06233v1" rel="noopener noreferrer"&gt;Confidence Improves Self-Consistency in LLMs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/eval-python-sdk/evaluation-dataset#best-practices" rel="noopener noreferrer"&gt;Vertex AI’s advice on evaluation datasets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Anthropic’s &lt;a href="https://www.anthropic.com/engineering/writing-tools-for-agents" rel="noopener noreferrer"&gt;&lt;em&gt;Writing Effective Tools for AI Agents&lt;/em&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Stanford’s and UC Santa Barbara’s &lt;a href="https://web.stanford.edu/~jurafsky/pubs/2020.emnlp-main.745.pdf" rel="noopener noreferrer"&gt;&lt;em&gt;With Little Power Comes Great Responsibility&lt;/em&gt;&lt;/a&gt; about how many NLP studies are underpowered in terms of statistical testing
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/blog/topics/developers-practitioners/7-technical-takeaways-from-using-gemini-to-generate-code-samples-at-scale" rel="noopener noreferrer"&gt;7 Technical Takeaways from Using Gemini to Generate Code Samples at Scale&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://adamross.dev/p/prompting-for-production/" rel="noopener noreferrer"&gt;How My Team Aligns on Prompting for Production&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to &lt;a href="https://dev.to/sigje"&gt;Jennifer Davis&lt;/a&gt;, &lt;a href="https://adamross.dev/" rel="noopener noreferrer"&gt;Adam Ross&lt;/a&gt;, &lt;a href="https://nim.emuxo.com/" rel="noopener noreferrer"&gt;Nim Jayawardena&lt;/a&gt;, and &lt;a href="https://glasnt.com/" rel="noopener noreferrer"&gt;Katie McLaughlin&lt;/a&gt; for feedback on this post.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>softwareengineering</category>
      <category>promptengineering</category>
    </item>
  </channel>
</rss>
