<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manouk Draisma</title>
    <description>The latest articles on DEV Community by Manouk Draisma (@draismaaaa).</description>
    <link>https://dev.to/draismaaaa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2582129%2F8580463f-87b7-4ed2-a4f0-6cce235f0c8f.jpg</url>
      <title>DEV Community: Manouk Draisma</title>
      <link>https://dev.to/draismaaaa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/draismaaaa"/>
    <language>en</language>
    <item>
      <title>From zero evals to a working multimodal evaluation in 30 minutes using LangWatch Skills</title>
      <dc:creator>Manouk Draisma</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:04:40 +0000</pubDate>
      <link>https://dev.to/langwatch/from-zero-evals-to-a-working-multimodal-evaluation-in-30-minutes-using-langwatch-skills-50a9</link>
      <guid>https://dev.to/langwatch/from-zero-evals-to-a-working-multimodal-evaluation-in-30-minutes-using-langwatch-skills-50a9</guid>
      <description>&lt;p&gt;How I went from "it works on my machine" to measurable agent quality using LangWatch Skills, Jupyter notebooks, and a path to production on AWS.*&lt;/p&gt;

&lt;p&gt;The problem nobody talks about&lt;br&gt;
You built an agent. It uses tools, handles multimodal inputs, answers questions from a knowledge base. You demo it to your team and it works great. Ship it.&lt;/p&gt;

&lt;p&gt;Three days later: the satellite image analysis returns garbage NDVI estimates. The knowledge base tool stops getting called for calibration questions; the LLM just wings it. Nobody noticed because there were no tests.&lt;/p&gt;

&lt;p&gt;This is the gap between "I have an agent" and "I have a reliable agent." LangWatch fills it.&lt;/p&gt;

&lt;p&gt;What I built&lt;br&gt;
The InField Agent is a weather station advisory system built with Strands Agents SDK. It has three multimodal capabilities:&lt;/p&gt;

&lt;p&gt;Knowledge base — calibration procedures for Davis Instruments weather stations&lt;/p&gt;

&lt;p&gt;Station status — fleet inventory, battery health, reporting gaps&lt;/p&gt;

&lt;p&gt;Satellite imagery — NDVI estimation from satellite images using vision models&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_ADV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;check_station_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analyze_satellite_image&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The satellite tool sends images to a vision model and gets back structured NDVI data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_satellite_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze a satellite image to estimate NDVI.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;image_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_DATA_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# ... encode image as base64, send to gpt-5-mini with vision ...
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ndvi_estimate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vegetation_cover_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dominant_land_types&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cropland&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grassland&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Healthy vegetation with moderate crop coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a non-trivial agent to test. You have text-based retrieval, structured data queries, and multimodal vision analysis all behind the same prompt. Traditional unit tests cover maybe 10% of the failure surface.&lt;/p&gt;
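&lt;p&gt;To make that concrete, here is a minimal sketch (the agent call is a hypothetical stand-in, not the InField Agent): an exact-match assertion passes or fails on phrasing alone, and says nothing about tool usage or whether the number is plausible.&lt;/p&gt;

```python
# Hypothetical stand-in for the agent; real runs vary in wording between calls.
def fake_agent(question: str) -> str:
    return "NDVI is approximately 0.65, indicating healthy vegetation."

def test_exact_match_is_brittle():
    answer = fake_agent("Estimate NDVI for image 01.")
    # This assertion pins the exact phrasing. A semantically identical answer
    # ("Estimated NDVI: 0.65, vegetation looks healthy") would fail it, while
    # a confidently wrong 0.65 produced without calling any tool would pass.
    assert answer == "NDVI is approximately 0.65, indicating healthy vegetation."
```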

&lt;p&gt;Step 1: Add LangWatch skills&lt;br&gt;
LangWatch ships skills — curated Claude Code instructions that know how to wire up tracing, evaluations, scenarios, and prompt management in your project. Think of them as recipes that understand the LangWatch SDK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add langwatch/skills/evaluations
npx skills add langwatch/skills/scenarios
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This drops skill files into .claude/skills/ in your project. When you use Claude Code, it picks up these instructions and knows exactly how to scaffold evaluations and scenarios for your specific agent.&lt;/p&gt;

&lt;p&gt;The skills also write a skills-lock.json file to track versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"skills"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"evaluations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"langwatch/skills"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sourceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"computedHash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"170c4e99..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"scenarios"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"langwatch/skills"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sourceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"github"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"computedHash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"b3afbe5c..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Tracing — see what your agent actually does&lt;br&gt;
Before you evaluate anything, you need observability. LangWatch tracing captures every LLM call, tool invocation, and input/output pair.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;langwatch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;langwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@langwatch.trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;InField Agent Turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_turn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;langwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_current_trace&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two lines of setup, one decorator. Every agent turn now shows up in the LangWatch dashboard with the full tool chain visible.&lt;/p&gt;

&lt;p&gt;Step 3: Multimodal experiments in Jupyter&lt;br&gt;
This is where it gets interesting. The evaluations skill guided me toward using Jupyter notebooks with langwatch.experiment for batch testing. The key insight: satellite images can be embedded as markdown in the dataset, and LangWatch renders them inline in the UI.&lt;/p&gt;

&lt;p&gt;The dataset&lt;br&gt;
Each row targets one of the three capabilities. Satellite rows include the actual image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SATELLITE_BASE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://storage.googleapis.com/experiments_langwatch&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;image_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;![Satellite image &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;](&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SATELLITE_BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;# Knowledge base
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I calibrate the temperature reading on a Vantage Pro2?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the temperature calibration offset in the console setup menu.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knowledge_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Station status
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which stations have low battery levels?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A list of stations with battery voltage below 3.0V.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;station_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Satellite — multimodal
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this satellite image and estimate the NDVI.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;image_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An NDVI estimate between -1.0 and 1.0 with vegetation coverage.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satellite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What does this satellite image tell us about vegetation health?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;image_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;03&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An NDVI estimate with vegetation health description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satellite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimate the vegetation index for this field.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;image_to_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;07&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An NDVI estimate with vegetation cover and land classification.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satellite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The evaluators&lt;br&gt;
LangWatch supports platform-configured evaluators that you reference by slug. I set up three:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf0rak5cl0kvuk8xlmsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhf0rak5cl0kvuk8xlmsl.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tool usage check is critical. An agent that answers correctly without calling the tool is a hallucination risk; it just happened to get lucky this time.&lt;/p&gt;
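&lt;p&gt;The hosted evaluator runs on the platform, but the core idea can be sketched locally (the capability-to-tool mapping below mirrors this agent's tools; the list-of-tool-calls format is an assumption):&lt;/p&gt;

```python
# Map each capability to the tool that must have been invoked for the answer
# to count as grounded. The tool names match the InField Agent's tools.
EXPECTED_TOOL = {
    "knowledge_base": "search_knowledge_base_tool",
    "station_status": "check_station_status",
    "satellite": "analyze_satellite_image",
}

def tool_was_used(capability: str, tool_calls: list) -> bool:
    """Pass only if the expected tool appears in the turn's tool calls."""
    return EXPECTED_TOOL[capability] in tool_calls
```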

&lt;p&gt;The experiment loop&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;experiment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;langwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;infield-agent-multimodal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer-relevancy-nxwec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer-correctness-b5e6x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;experiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool-usage-check-aljvk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you see in LangWatch&lt;br&gt;
As the notebook runs, results appear in real time:&lt;/p&gt;

&lt;p&gt;Satellite images rendered inline next to scores&lt;/p&gt;

&lt;p&gt;Pass/fail per evaluator per row&lt;/p&gt;

&lt;p&gt;Side-by-side score comparisons across model versions and prompt changes&lt;/p&gt;

&lt;p&gt;That last point is the payoff. Change a prompt, run the experiment again, see exactly what moved. Not just for one input — across the whole dataset.&lt;/p&gt;
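&lt;p&gt;The same comparison works offline if you export the per-row scores. A rough sketch with pandas (the column names here are assumptions, not the LangWatch export schema):&lt;/p&gt;

```python
import pandas as pd

# Per-row answer-relevancy scores from two experiment runs (made-up numbers).
baseline = pd.DataFrame({"row": [0, 1, 2], "answer_relevancy": [0.9, 0.4, 0.8]})
candidate = pd.DataFrame({"row": [0, 1, 2], "answer_relevancy": [0.9, 0.7, 0.6]})

# Join the runs row-by-row and compute the per-row delta.
diff = baseline.merge(candidate, on="row", suffixes=("_base", "_new"))
diff["delta"] = diff["answer_relevancy_new"] - diff["answer_relevancy_base"]

# Rows where the prompt change made things worse.
regressions = diff[diff["delta"].lt(0)]
```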

&lt;p&gt;The @langwatch.trace decorator also means every evaluation run produces full traces. Drill into a failing row and see exactly which tool was called, what the LLM received, and where it went wrong.&lt;/p&gt;

&lt;p&gt;Step 4: Simulations — test the agent as a system&lt;br&gt;
Evaluations test isolated input-output pairs. Simulations test multi-turn conversations where the agent interacts with a simulated user.&lt;/p&gt;

&lt;p&gt;LangWatch Scenario is the framework. It has three actors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent Under Test — your agent&lt;/li&gt;
&lt;li&gt;User Simulator — an LLM that generates realistic user messages&lt;/li&gt;
&lt;li&gt;Judge — an LLM that evaluates the conversation and decides pass/fail&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4.1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.asyncio&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_calibration_workflow&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;InFieldAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentAdapter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentInput&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AgentReturnTypes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calibration guidance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A field technician needs to calibrate barometric pressure on a Vantage Pro2. They have a known reference pressure but aren&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t sure about the procedure.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;InFieldAdapter&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;UserSimulatorAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JudgeAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;script&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How do I calibrate the barometric pressure?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent used the knowledge base tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent provided step-by-step calibration instructions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;user&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;judge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent answered the follow-up using tool results, not general knowledge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The simulation loop runs automatically: the user simulator generates contextual follow-ups, the agent responds, the judge scores against your criteria. You define the scenario once and it tests the full conversation flow.&lt;/p&gt;
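&lt;p&gt;The control flow is easy to picture without the framework. A hand-rolled sketch of the loop, with stub functions standing in for Scenario's LLM-backed actors:&lt;/p&gt;

```python
def user_simulator(history):
    # stub: a real simulator generates a contextual follow-up with an LLM
    return "And what if the reference pressure is in inches of mercury?"

def agent(history):
    # stub for the agent under test
    return "Convert to hPa first, then enter the value in the calibration menu."

def judge(history, criteria):
    # stub: a real judge scores the transcript against the criteria with an LLM
    return True

def run_scenario(first_message, criteria, turns=2):
    """Alternate agent turns and simulated user turns; judge after each agent reply."""
    history = [("user", first_message)]
    for _ in range(turns):
        history.append(("agent", agent(history)))
        if not judge(history, criteria):
            return False
        history.append(("user", user_simulator(history)))
    return True

ok = run_scenario("How do I calibrate the barometric pressure?",
                  ["Agent used the knowledge base tool"])
print(ok)
```

&lt;p&gt;Scenario's value is that the simulator and judge are real LLMs, so the follow-ups and verdicts adapt to what the agent actually said.&lt;/p&gt;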

&lt;p&gt;For adversarial testing, swap UserSimulatorAgent for RedTeamAgent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nc"&gt;InFieldAdapter&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;RedTeamAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="c1"&gt;# Tries to make the agent hallucinate or go off-topic
&lt;/span&gt;    &lt;span class="n"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JudgeAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;criteria&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Agent stays within scope of weather stations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 5: Deploy to AWS&lt;br&gt;
With evaluations passing and simulations green, ship it.&lt;/p&gt;

&lt;p&gt;The InField Agent is a single-turn Q&amp;amp;A system — Lambda is the natural fit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4etvht96y452hn6h5lw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4etvht96y452hn6h5lw.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The knowledge base is bundled inside the Lambda package. LLM inference runs on OpenAI's servers. Lambda just orchestrates the agent loop.&lt;/p&gt;

&lt;p&gt;Lambda handler&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;strands.models.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIModel&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODEL_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SYSTEM_PROMPT_ADV&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_knowledge_base_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; in request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Package and deploy&lt;br&gt;
Two options, depending on your dependency size:&lt;/p&gt;

&lt;p&gt;ZIP + Strands Layer (simplest):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;aws lambda create-function \\\\
  --function-name infield-agent \\\\
  --runtime python3.12 \\\\
  --handler lambda_handler.handler \\\\
  --zip-file fileb://packaging/app.zip \\\\
  --architectures arm64 \\\\
  --memory-size 256 \\\\
  --timeout 30 \\\\
  --layers "arn:aws:lambda:us-east-1:856699698935:layer:strands-agents-py312-aarch64:1" \\\\
  --environment "Variables={OPENAI_API_KEY=your-key}" \\\\
  --role arn:aws:iam::YOUR_ACCOUNT:role/lambda-execution-role
Container Image (when dependencies exceed 250 MB):

&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; public.ecr.aws/lambda/python:3.12-arm64&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; requirements.txt .&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;LAMBDA_TASK_ROOT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; lambda_handler.py prompts.py tools.py ${LAMBDA_TASK_ROOT}/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; knowledge_base/ ${LAMBDA_TASK_ROOT}/knowledge_base/&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["lambda_handler.handler"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production checklist&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store API keys in Secrets Manager or SSM Parameter Store&lt;/li&gt;
&lt;li&gt;Enable CloudWatch logging&lt;/li&gt;
&lt;li&gt;Set up API Gateway with authentication&lt;/li&gt;
&lt;li&gt;Configure CloudWatch alarms on error rate and duration&lt;/li&gt;
&lt;li&gt;Run evaluations in CI before deploying new versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point closes the loop: your Jupyter notebook evaluations become a CI gate. A prompt change that drops answer relevancy below threshold blocks the deploy.&lt;/p&gt;
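&lt;p&gt;A minimal sketch of such a gate — the score values and threshold here are placeholders, not LangWatch's API; in CI the scores would come from the experiment run:&lt;/p&gt;

```python
# Minimum mean answer-relevancy score allowed to deploy (placeholder value).
THRESHOLD = 0.8

def mean_score(scores):
    return sum(scores) / len(scores)

def ci_gate(scores, threshold=THRESHOLD):
    """Return a process exit code: 0 lets the deploy proceed, 1 blocks it."""
    if mean_score(scores) >= threshold:
        return 0
    return 1

# Made-up scores from a hypothetical run: mean is about 0.79, below threshold.
exit_code = ci_gate([0.91, 0.85, 0.62])
print(exit_code)  # 1 -> deploy blocked
```

&lt;p&gt;Wire that exit code into your pipeline and a regressing prompt change never reaches production.&lt;/p&gt;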

&lt;p&gt;The evaluation lifecycle&lt;br&gt;
Here is what you end up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4fcl0u7nznl9zerz06q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4fcl0u7nznl9zerz06q.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LangWatch covers all four stages with the same evaluators. The answer-relevancy check you run in a notebook is the same one that scores production traces. Consistency across the lifecycle means no surprises.&lt;/p&gt;

&lt;p&gt;What this cost me&lt;br&gt;
Setup time: ~30 minutes. The skills did most of the scaffolding.&lt;/p&gt;

&lt;p&gt;Notebook evaluations: 3 satellite images x 3 evaluators = 9 evaluation calls per run. Under a minute.&lt;/p&gt;

&lt;p&gt;Lambda deployment: 256 MB, arm64, 30s timeout. Pennies at low volume.&lt;/p&gt;

&lt;p&gt;LangWatch traces: free tier covers experimentation. Platform evaluators included.&lt;/p&gt;

&lt;p&gt;Takeaways&lt;br&gt;
Evaluations are not optional for multimodal agents. A satellite image tool that returns plausible-sounding garbage is worse than one that throws an error. You need automated checks.&lt;/p&gt;

&lt;p&gt;Tool usage matters as much as answer quality. An agent that gives the right answer without calling the tool is a ticking time bomb. The tool-usage-check evaluator catches this.&lt;/p&gt;
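&lt;p&gt;The core of such a check is just inspecting the trace for a tool call before accepting the answer. A hand-rolled illustration (the span structure here is an assumption for the sketch, not the platform evaluator's real schema):&lt;/p&gt;

```python
def tool_usage_check(trace, required_tool):
    """Fail the row if the agent answered without calling the required tool."""
    called = [span["name"] for span in trace if span["type"] == "tool"]
    return required_tool in called

# A trace where the agent "winged it" without touching the knowledge base:
trace = [
    {"type": "llm", "name": "gpt-5-mini"},
]
print(tool_usage_check(trace, "search_knowledge_base"))  # False -> fail the row
```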

&lt;p&gt;Simulations find bugs that evaluations miss. Single-turn evaluations cannot test whether the agent stays grounded across a multi-turn conversation. Scenario simulations can.&lt;/p&gt;

&lt;p&gt;LangWatch skills bootstrap the hard part. &lt;code&gt;npx skills add langwatch/skills/evaluations&lt;/code&gt; gives Claude Code the context to scaffold everything — the notebook, the evaluators, the experiment loop. You focus on defining what "correct" means for your agent.&lt;/p&gt;

&lt;p&gt;Same evaluators, every stage. Run them in a notebook during development, in CI before deploy, and on live traces in production. One set of quality criteria, applied everywhere.&lt;/p&gt;

&lt;p&gt;All code is available at &lt;a href="https://github.com/langwatch/satellite-agent" rel="noopener noreferrer"&gt;https://github.com/langwatch/satellite-agent&lt;/a&gt;. The Jupyter notebook runs end to end if you have an OpenAI key and a LangWatch project.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>evals</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Your coding agent already knows how to test your AI agent (we just turned it into a Skill)</title>
      <dc:creator>Manouk Draisma</dc:creator>
      <pubDate>Mon, 23 Mar 2026 20:34:43 +0000</pubDate>
      <link>https://dev.to/draismaaaa/your-coding-agent-already-knows-how-to-test-your-ai-agent-we-just-turned-it-into-a-skill-5f0g</link>
      <guid>https://dev.to/draismaaaa/your-coding-agent-already-knows-how-to-test-your-ai-agent-we-just-turned-it-into-a-skill-5f0g</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e73far993u8lsvjuxjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e73far993u8lsvjuxjo.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’re adding something new at LangWatch: &lt;strong&gt;Skills&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And the idea is pretty simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;your coding agent already knows how to do a lot of the work you’re still doing manually&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You just haven’t packaged it properly yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The frustrating part of building AI agents
&lt;/h2&gt;

&lt;p&gt;If you’ve built an LLM agent recently, you probably recognize this loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you tweak something&lt;/li&gt;
&lt;li&gt;you run a few test conversations&lt;/li&gt;
&lt;li&gt;it &lt;em&gt;seems&lt;/em&gt; better&lt;/li&gt;
&lt;li&gt;you ship it&lt;/li&gt;
&lt;li&gt;something breaks in production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you repeat.&lt;/p&gt;

&lt;p&gt;We’ve been there too.&lt;/p&gt;

&lt;p&gt;It’s not that you don’t &lt;em&gt;know&lt;/em&gt; you need evals, testing, or simulations.&lt;/p&gt;

&lt;p&gt;It’s that doing all of that properly is… a lot.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real work isn’t building, it’s validating
&lt;/h2&gt;

&lt;p&gt;When we started LangWatch, we thought the main challenge was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;getting agents to behave correctly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But in practice, the bigger challenge was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;proving that they behave correctly&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setting up eval datasets&lt;/li&gt;
&lt;li&gt;writing tests&lt;/li&gt;
&lt;li&gt;simulating real user behavior&lt;/li&gt;
&lt;li&gt;instrumenting pipelines&lt;/li&gt;
&lt;li&gt;understanding failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most of this ends up being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual&lt;/li&gt;
&lt;li&gt;repetitive&lt;/li&gt;
&lt;li&gt;easy to skip&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The worst part: testing agents doesn’t look like testing code
&lt;/h2&gt;

&lt;p&gt;Traditional testing breaks down with LLMs.&lt;/p&gt;

&lt;p&gt;You can’t just say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because agents are non-deterministic. The same input can give different outputs, which makes rigid testing fragile (LangWatch).&lt;/p&gt;
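&lt;p&gt;A toy example of why: two correct answers can differ textually, so an exact-match assert is brittle where an outcome check is not (the answer strings and helper here are made up):&lt;/p&gt;

```python
answer_a = "Your flight is booked for Monday at 9:00."
answer_b = "Booked! You fly Monday at 9:00."

# Rigid check: only one exact phrasing would pass.
print(answer_a == answer_b)  # False

def confirms_booking(text):
    """Check the business outcome rather than the exact wording."""
    lowered = text.lower()
    return "book" in lowered and "monday" in lowered

# Outcome check: both phrasings pass.
print(confirms_booking(answer_a), confirms_booking(answer_b))  # True True
```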

&lt;p&gt;So what do people do instead?&lt;/p&gt;

&lt;p&gt;They “vibe test”.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;try a few examples&lt;/li&gt;
&lt;li&gt;eyeball the results&lt;/li&gt;
&lt;li&gt;hope nothing breaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It doesn’t scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  We already solved part of this (with agents testing agents)
&lt;/h2&gt;

&lt;p&gt;If you’ve seen our earlier work (Scenario), you know we took a different approach:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;use an agent to test your agent&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of fixed inputs/outputs, you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simulate real user behavior&lt;/li&gt;
&lt;li&gt;define success criteria&lt;/li&gt;
&lt;li&gt;let an agent explore and evaluate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes testing much closer to reality.&lt;/p&gt;

&lt;p&gt;But even then…&lt;/p&gt;

&lt;p&gt;You still had to set everything up yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  So we asked: why are we still doing this manually?
&lt;/h2&gt;

&lt;p&gt;At this point, most developers already have a coding agent open all day.&lt;/p&gt;

&lt;p&gt;And those agents are actually pretty good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;writing tests&lt;/li&gt;
&lt;li&gt;structuring code&lt;/li&gt;
&lt;li&gt;following instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we started asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what if we let the coding agent handle the “quality work” too?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just writing features.&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;setting up evals&lt;/li&gt;
&lt;li&gt;creating simulations&lt;/li&gt;
&lt;li&gt;instrumenting systems&lt;/li&gt;
&lt;li&gt;analyzing behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  That’s where Skills come in
&lt;/h2&gt;

&lt;p&gt;We built &lt;strong&gt;LangWatch Skills&lt;/strong&gt; as a way to give your coding agent reusable capabilities.&lt;/p&gt;

&lt;p&gt;A Skill is basically:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a structured way to get your coding agent to do something &lt;em&gt;correctly&lt;/em&gt;, every time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“generate some code”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“do this properly, following best practices, with full coverage”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What a Skill actually looks like
&lt;/h2&gt;

&lt;p&gt;Under the hood, Skills are closer to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;structured instructions&lt;/li&gt;
&lt;li&gt;workflows&lt;/li&gt;
&lt;li&gt;examples&lt;/li&gt;
&lt;li&gt;best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, agent skills are “instruction modules” that extend what an agent can do without retraining it (philschmid.de).&lt;/p&gt;

&lt;p&gt;They tell the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when to apply something&lt;/li&gt;
&lt;li&gt;how to do it&lt;/li&gt;
&lt;li&gt;what good looks like&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What you can do with LangWatch Skills
&lt;/h2&gt;

&lt;p&gt;With Skills, you can tell your coding agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;instrument your agent&lt;/li&gt;
&lt;li&gt;generate evaluation notebooks&lt;/li&gt;
&lt;li&gt;create simulation-based tests&lt;/li&gt;
&lt;li&gt;explore production performance&lt;/li&gt;
&lt;li&gt;red-team your system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And instead of figuring out &lt;em&gt;how&lt;/em&gt; to do it…&lt;/p&gt;

&lt;p&gt;…the agent just does it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The shift is subtle, but important
&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;you write eval code, tests, and infrastructure&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;you review and guide what your agent generates&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You move from implementation to coordination.&lt;/p&gt;

&lt;p&gt;And that’s actually where most of the value is.&lt;/p&gt;




&lt;h2&gt;
  
  
  This is part of a bigger shift: “harness engineering”
&lt;/h2&gt;

&lt;p&gt;There’s a growing idea in the ecosystem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the performance of your agent depends heavily on how you configure it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just the model.&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tools&lt;/li&gt;
&lt;li&gt;context&lt;/li&gt;
&lt;li&gt;memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;skills&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all part of what some people call the agent “harness” — the system around the model that shapes its behavior (humanlayer.dev).&lt;/p&gt;

&lt;p&gt;Skills are one of the most powerful (and underused) pieces of that.&lt;/p&gt;




&lt;h2&gt;
  
  
  But Skills aren’t magic
&lt;/h2&gt;

&lt;p&gt;One important thing we’ve learned:&lt;/p&gt;

&lt;p&gt;Skills don’t automatically fix everything.&lt;/p&gt;

&lt;p&gt;In fact, a lot of skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;don’t improve performance&lt;/li&gt;
&lt;li&gt;or only help in specific contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recent research shows many skills have limited impact unless they’re well-designed and properly evaluated (arXiv).&lt;/p&gt;

&lt;p&gt;So the goal isn’t:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“add more skills”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“add the right skills, and make them actually useful”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;We’re entering a phase where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;building agents is easy&lt;/li&gt;
&lt;li&gt;making them reliable is not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bottleneck has shifted.&lt;/p&gt;

&lt;p&gt;And the teams that win won’t just be the ones who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;build faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the ones who:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;validate better&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iterate faster with confidence&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What we’re aiming for
&lt;/h2&gt;

&lt;p&gt;With Skills, the goal is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;reduce the amount of manual work required to build reliable AI systems&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;wiring pipelines&lt;/li&gt;
&lt;li&gt;writing eval scaffolding&lt;/li&gt;
&lt;li&gt;guessing what broke&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;delegate&lt;/li&gt;
&lt;li&gt;review&lt;/li&gt;
&lt;li&gt;improve&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Would love feedback
&lt;/h2&gt;

&lt;p&gt;This is a new direction for us, and we’re still figuring out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What makes a “good” Skill?&lt;/li&gt;
&lt;li&gt;Where do Skills break down?&lt;/li&gt;
&lt;li&gt;What should be automated vs controlled?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re working on LLM agents, I’d love to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how you’re handling evals today&lt;/li&gt;
&lt;li&gt;what’s still painful&lt;/li&gt;
&lt;li&gt;what you’ve tried that didn’t work&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try it out
&lt;/h2&gt;

&lt;p&gt;If this resonates, you can check out what we’re building here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://langwatch.ai/blog/langwatch-skills-your-coding-agent-already-knows-how-to-test-your-agent" rel="noopener noreferrer"&gt;LangWatch Skills&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;Your coding agent is already capable of doing much more than we typically ask of it.&lt;/p&gt;

&lt;p&gt;Skills are just a way to unlock that.&lt;/p&gt;

&lt;p&gt;The interesting question now is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what else are we still doing manually that agents could handle better?&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>agents</category>
      <category>agentskills</category>
      <category>evals</category>
      <category>simulations</category>
    </item>
    <item>
      <title>Testing AI agents with domain-driven TDD</title>
      <dc:creator>Manouk Draisma</dc:creator>
      <pubDate>Thu, 02 Oct 2025 16:13:49 +0000</pubDate>
      <link>https://dev.to/draismaaaa/testing-ai-agents-with-domain-driven-tdd-4o18</link>
      <guid>https://dev.to/draismaaaa/testing-ai-agents-with-domain-driven-tdd-4o18</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;br&gt;
Traditional testing doesn’t work well for AI agents (LLMs are nondeterministic, brittle to assert). I built a flight booking agent from scratch using Scenario, a framework we built for running agent simulations + LLM evaluations. Writing scenario tests first (domain-driven TDD) gave me a way to discover domain rules, evolve the model, and ship an agent with confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why testing AI agents is hard&lt;/strong&gt;&lt;br&gt;
Normal unit/integration tests fall apart with AI systems.&lt;br&gt;
Same input, different outputs&lt;br&gt;
Hard to assert strings reliably&lt;br&gt;
You don’t just care about text, you care about business outcomes&lt;/p&gt;

&lt;p&gt;Without testing, you’re basically flying blind.&lt;/p&gt;

&lt;p&gt;That’s why I tried a scenario-driven approach: define business capabilities first, then let the failures tell me what’s missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 –&lt;/strong&gt; Write a scenario test&lt;/p&gt;

&lt;p&gt;Start with a business journey, not code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const result = await scenario.run({
  setId: "booking-agent-demo",
  name: "Basic greeting test",
  description: "User wants to book a flight and expects polite interaction",
  maxTurns: 5,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "The agent should greet politely",
        "The agent should understand the user wants to book a flight",
      ],
    }),
  ],
  script: [scenario.proceed(5)],
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First run = red, as expected. That failure became my to-do list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt; – Let failures guide the build&lt;/p&gt;

&lt;p&gt;No endpoint? → Build a basic HTTP route.&lt;/p&gt;

&lt;p&gt;Agent replies static text? → Hook up an LLM.&lt;/p&gt;

&lt;p&gt;LLM forgets context? → Add conversation memory.&lt;/p&gt;

&lt;p&gt;Each test failure = missing piece of the domain.&lt;/p&gt;
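&lt;p&gt;For instance, the “conversation memory” fix can be as simple as keeping the message history per session and replaying it on every LLM call. This is an illustrative sketch, not LangWatch code:&lt;/p&gt;

```typescript
// Minimal per-session memory: store every turn and hand the full history
// to the LLM, so it can resolve follow-ups like "March 3rd".
type Msg = { role: "user" | "assistant"; content: string };

const sessions = new Map(); // sessionId -> Msg[]

function remember(sessionId: string, msg: Msg): Msg[] {
  const history = sessions.get(sessionId) ?? [];
  history.push(msg);
  sessions.set(sessionId, history);
  return history; // pass this whole array to the LLM, not just the last message
}

remember("s1", { role: "user", content: "I want to fly to London" });
remember("s1", { role: "assistant", content: "Great, which dates?" });
const history = remember("s1", { role: "user", content: "March 3rd" });
console.log(history.length); // 3
```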

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt; – Scale up to a full booking journey&lt;/p&gt;

&lt;p&gt;Once greetings + memory worked, I moved to full booking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const result = await scenario.run({
  setId: "booking-agent-scenario-demo",
  name: "Complete flight booking",
  description: "User books NY → London with all required details",
  maxTurns: 100,
  agents: [
    scenario.userSimulatorAgent(),
    agentAdapter,
    scenario.judgeAgent({
      criteria: [
        "Collect passenger info",
        "Collect dates + airports",
        "Create booking in system",
        "Confirm booking to user",
      ],
    }),
  ],
  script: [scenario.proceed(100)],
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ran it → failed. Checked DB → no bookings (because I hadn’t built the tools). Wrote tools → ran again → bookings created, but airport codes didn’t match. Another hidden domain rule uncovered.&lt;/p&gt;
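&lt;p&gt;The airport-code rule is the kind of thing that ends up as a small normalization step in front of the booking tool. The lookup table and helper below are hypothetical, just to illustrate the shape of the fix:&lt;/p&gt;

```typescript
// Hypothetical sketch: normalize free-text place names to IATA codes
// before the booking tool writes to the DB. The table is illustrative.
const iata = new Map([
  ["new york", "JFK"],
  ["london", "LHR"],
  ["paris", "CDG"],
]);

function toIata(place: string): string | null {
  const key = place.trim().toLowerCase();
  const code = iata.get(key);
  if (code) return code;                       // "New York" -> "JFK"
  if (/^[A-Z]{3}$/.test(place.trim())) return place.trim(); // already a code
  return null; // unknown: ask the user instead of guessing
}

console.log(toIata("New York")); // "JFK"
console.log(toIata("Gotham"));   // null
```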

&lt;p&gt;&lt;strong&gt;Why this worked&lt;/strong&gt;&lt;br&gt;
Scenarios = living documentation of the domain&lt;br&gt;
Failing tests = backlog of missing business rules&lt;br&gt;
Confidence = I can change prompts, swap models, or go multi-agent without fear of silent regressions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scenario = domain-driven TDD for AI agents&lt;/p&gt;

&lt;p&gt;You don’t just test outputs, you validate business outcomes&lt;/p&gt;

&lt;p&gt;Each failure teaches you something new about your domain&lt;/p&gt;

&lt;p&gt;Scenarios double as specs + tests + onboarding docs&lt;/p&gt;

&lt;p&gt;If you’re curious: &lt;a href="https://github.com/langwatch/scenario" rel="noopener noreferrer"&gt;Scenario is open source&lt;br&gt;
&lt;/a&gt; (works with any LLM/agent framework).&lt;/p&gt;

</description>
      <category>testing</category>
      <category>softwaredevelopment</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Agent Simulations are the new Unit Tests for AI</title>
      <dc:creator>Manouk Draisma</dc:creator>
      <pubDate>Wed, 09 Jul 2025 11:42:12 +0000</pubDate>
      <link>https://dev.to/langwatch/why-agent-simulations-are-the-new-unit-tests-for-ai-1n29</link>
      <guid>https://dev.to/langwatch/why-agent-simulations-are-the-new-unit-tests-for-ai-1n29</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxp7m9eiirjtit2995js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxp7m9eiirjtit2995js.png" alt=" " width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From Self-Driving Cars to AI Agents&lt;/p&gt;

&lt;p&gt;If you've followed the development of autonomous vehicles (AVs), you know that simulation is a non-negotiable part of the process. Companies like Waymo and Cruise don't just rack up millions of miles on physical roads; they drive billions of miles in virtual worlds. This isn't just for fun. It's a core engineering solution to a fundamental machine learning problem: the long tail.&lt;/p&gt;

&lt;p&gt;The real world is messy and unpredictable. The most dangerous driving scenarios are, thankfully, also the rarest. You can drive for years without encountering a tire blowout on a crowded highway or a pedestrian chasing a ball from between two parked cars. Relying on real-world data alone to train an AV would mean you'd have almost no data on the most critical events. As a result, your model would be unprepared.&lt;/p&gt;

&lt;p&gt;This is exactly the kind of challenge we now face with the new generation of AI agents. To build agents that can reliably operate software, browse the web, or manage workflows, we need to adopt the same playbook: agent simulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is AI Agent Simulation?&lt;/strong&gt;&lt;br&gt;
AI agent simulation refers to creating controlled, repeatable environments to test how an autonomous AI agent handles complex or rare scenarios — before deploying it in the real world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulations in Autonomous Vehicles (AVs): A Blueprint for AI Agent Testing&lt;/strong&gt;&lt;br&gt;
Simulations are key to making AVs succeed. In places like San Francisco, a large share of the population takes a Waymo, a fully autonomous car, instead of an Uber. Last week, a Tesla Model Y completed its first fully autonomous delivery, driving itself from the factory to the customer's location without any human intervention.&lt;/p&gt;

&lt;p&gt;Now, how does this work? A simulation platform for AVs has three core components:&lt;br&gt;
&lt;strong&gt;Sensor Simulation&lt;/strong&gt; : The system generates realistic data for all the car's sensors. This includes simulating the precise patterns of light from a LiDAR sensor bouncing off a wet road, the noise in a camera feed at dusk, or the signal degradation of RADAR in a snowstorm.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physics Engine&lt;/strong&gt; : This governs the rules of the virtual world. It ensures vehicles have realistic acceleration and braking, that weather affects traction, and that lighting changes accurately with the time of day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario Generation&lt;/strong&gt; : This is the most crucial part. Engineers can programmatically create and vary critical scenarios an infinite number of times. They can test what happens if a cyclist swerves a little earlier, if a traffic light is partially obscured by a tree branch, or if another car runs a red light.&lt;/p&gt;

&lt;p&gt;The key insight here is that simulation allows you to control the data distribution. You don’t need to drive hours on a boring highway, because, after a while, the model has learned how to do this. You can focus on the events that are most important for safety and robustness, generating millions of permutations of these "long-tail" events. This is how you build a model that doesn't just work 99% of the time, but is prepared for the critical 1%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How AI Agent Simulations help tackle the long tail problem&lt;/strong&gt;&lt;br&gt;
Before diving into how simulations in AI agents work, let’s briefly talk about AlphaGo. AlphaGo mastered the game ‘Go’ (a much harder game than chess). It achieved this remarkable feat not by just studying human games, but by playing millions of games against itself in a simulated world. This process, known as self-play, allowed AlphaGo to explore a vast number of strategies and counter-strategies, far beyond what any human could ever play in a lifetime. It learned the rules of the game and then, through relentless, simulated practice, discovered novel tactics and achieved a superhuman level of proficiency. The key was the simulated environment, which provided a perfect, repeatable, and scalable training ground.&lt;/p&gt;

&lt;p&gt;That same approach is now essential for AI agent testing. &lt;/p&gt;

&lt;p&gt;To be truly reliable, these AI agents need to be exposed to a massive and diverse range of situations, especially the tricky "long-tail" events that are uncommon in day-to-day use but are critical to handle correctly.&lt;/p&gt;

&lt;p&gt;Let's take the example of a customer support agent for an e-commerce company. In a simulated environment, we can test this AI agent against a vast array of edge cases that would be impractical to replicate with human testers alone. We could, for instance, simulate a scenario where a customer has a legitimate complaint but is using sarcastic and angry language. The simulation could vary the intensity of the language, the specific nature of the complaint (e.g., a damaged product, a late delivery, a billing error), and the customer's history with the company.&lt;/p&gt;

&lt;p&gt;Running agent simulations of these edge cases at scale is how your system learns to go beyond correct responses and toward robustness, safety, and nuance.  It can learn to de-escalate tense situations, understand implied intent, and navigate complex, multi-step problems. Just as AlphaGo became a master of Go by playing against itself in a simulated world, AI agents can master the art of customer service, and countless other tasks, by being rigorously tested and (eventually) trained in their own virtual worlds. This is how we move from agents that are merely functional to agents that are truly intelligent and dependable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Simple:&lt;/strong&gt; Applying agent simulation testing without AV-scale complexity&lt;br&gt;
Okay, building a simulation engine that mirrors the complexity of Waymo's sounds like a massive engineering project in itself. And it can be. But the good news is you don’t have to boil the ocean to get started. The principles of automated agent simulations can be applied today, even without a billion-dollar R&amp;amp;D budget.&lt;/p&gt;

&lt;p&gt;The shift starts with a change in mindset: treating the evaluation of your AI agent not as a final, manual QA step, but as a core part of the development loop, just like unit tests or integration tests for traditional software.&lt;/p&gt;

&lt;p&gt;Instead of building a whole virtual world, you start by defining a library of critical scenarios. A scenario is the AI agent equivalent of a test case. It’s a specific, repeatable challenge you want your agent to overcome.&lt;/p&gt;

&lt;p&gt;Scenario 1 (Happy Path): "A user wants to book a flight and provides all the necessary information clearly."&lt;/p&gt;

&lt;p&gt;Scenario 2 (Edge Case): "A user asks to book a flight but provides a nonsensical date, like February 30th."&lt;/p&gt;

&lt;p&gt;Scenario 3 (Robustness Test): "During the booking process, the airline's API times out. The agent must inform the user and suggest trying again later."&lt;/p&gt;

&lt;p&gt;Scenario 4 (Safety Test): "A user expresses extreme frustration after a failed booking. The agent must recognize the sentiment and correctly escalate to a human support agent."&lt;/p&gt;
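&lt;p&gt;A scenario library like the one above can start as plain data. The shape below is illustrative (not the LangWatch Scenario schema): each entry is a repeatable challenge plus the judge criteria that define success:&lt;/p&gt;

```typescript
// Illustrative scenario library: a typed list your CI job can iterate over.
type ScenarioDef = {
  name: string;
  kind: "happy-path" | "edge-case" | "robustness" | "safety";
  userGoal: string;
  criteria: string[];
};

const scenarios: ScenarioDef[] = [
  {
    name: "clear booking request",
    kind: "happy-path",
    userGoal: "Book a flight, providing all details clearly",
    criteria: ["Booking is created", "User gets a confirmation"],
  },
  {
    name: "nonsensical date",
    kind: "edge-case",
    userGoal: "Book a flight on February 30th",
    criteria: ["Agent points out the invalid date", "No booking is created"],
  },
  {
    name: "API timeout",
    kind: "robustness",
    userGoal: "Book a flight while the airline API times out",
    criteria: ["Agent informs the user", "Agent suggests trying again later"],
  },
  {
    name: "frustrated user",
    kind: "safety",
    userGoal: "Vent extreme frustration after a failed booking",
    criteria: ["Sentiment is recognized", "Conversation is escalated to a human"],
  },
];

console.log(scenarios.length); // 4
```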

&lt;p&gt;Once you have these scenarios, you need a way to run them against your agent automatically every time you push a change, right inside your development environment.&lt;/p&gt;

&lt;p&gt;This is the philosophy behind what we're building with LangWatch Scenario: a framework for scenario-based evaluation and automated agent testing.&lt;/p&gt;

&lt;p&gt;The goal is to provide the framework for this new kind of testing discipline. It lets you write these agent-centric test cases and integrate them directly into the workflows your team already uses, e.g. pytest and CI/CD pipelines. This turns agent evaluation from a slow, manual bottleneck into an automated, continuous check. It allows subject-matter experts (like your best support agents) to help define what "good" looks like, closing the loop between the real world and your training data.&lt;/p&gt;

&lt;p&gt;The leap from clever demo to production-grade AI has always required one thing: testing discipline.&lt;/p&gt;

&lt;p&gt;For AI agents, that discipline is now rooted in agent simulation, just as it is for AVs and Go-playing agents. With LangWatch Scenario, you’re not just hoping your agent behaves correctly—you’re proving it can, across the long tail of real-world messiness. It’s how we move from "it works on my machine" to "it works reliably, safely, and effectively for our users."&lt;/p&gt;

&lt;p&gt;Learn more about how LangWatch Scenario integrates with CI/CD pipelines: &lt;a href="https://github.com/langwatch/scenario" rel="noopener noreferrer"&gt;https://github.com/langwatch/scenario&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llmops</category>
      <category>aiops</category>
    </item>
    <item>
      <title>The power of MIPROv2 - using DSPy optimizers for your LLM-pipelines</title>
      <dc:creator>Manouk Draisma</dc:creator>
      <pubDate>Thu, 19 Dec 2024 10:15:42 +0000</pubDate>
      <link>https://dev.to/draismaaaa/the-power-of-miprov2-using-dspy-optimizers-for-your-llm-pipelines-2m47</link>
      <guid>https://dev.to/draismaaaa/the-power-of-miprov2-using-dspy-optimizers-for-your-llm-pipelines-2m47</guid>
      <description>&lt;p&gt;Fine-tuning prompts for consistent, high-quality output is a game-changer. Yet, until now, optimizing prompts has often required deep technical knowledge and coding skills—especially when using advanced frameworks like DSPy or a lot of manual trial &amp;amp; error work. But what if there was a way to leverage the power of DSPy’s MIPROv2 without diving into complex code? Enter LangWatch’s Optimization Studio, where MIPROv2 lives in a low-code environment designed to make prompt optimization more accessible than ever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is MIPROv2, and why does it matter?&lt;/strong&gt;&lt;br&gt;
Let’s take a look at the magic behind MIPROv2. Part of the DSPy library, MIPROv2 (Multiprompt Instruction Proposal Optimizer Version 2) is a state-of-the-art optimizer developed from Stanford’s DSP research. At its core, DSP (Demonstrate - Search - Predict) offers a unique approach to prompt optimization, prioritizing "prompt optimization" over traditional "prompt engineering." By using DSPy, you can find the best prompt variations that align with your model's needs, maximizing accuracy and relevance in output.&lt;/p&gt;

&lt;p&gt;MIPROv2 is designed to automate this process, finding the optimal combination of prompt demonstrations and instructions that result in accurate, useful model responses. And while DSPy is powerful, navigating its open-source code can be challenging for those without technical expertise. That’s where the Optimization Studio’s low-code environment comes in, making it possible to run advanced optimizations without diving into complex coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Optimization Studio’s low-code environment for MIPROv2&lt;/strong&gt;&lt;br&gt;
LangWatch’s Optimization Studio is built to bring the best of DSPy’s capabilities to users who want results fast, without the need to understand every line of underlying code. With a user-friendly, low-code interface, the Optimization Studio allows you to:&lt;/p&gt;

&lt;p&gt;Use MIPROv2 to optimize prompts without any technical hurdles. Access the same high-quality results that DSPy offers, but with an intuitive, guided process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save Time and Resources:&lt;/strong&gt; Optimization Studio automates the most labor-intensive parts of prompt refinement, like generating demonstrations and running evaluation trials, so you can focus on strategy, not code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Achieve Consistent Quality&lt;/strong&gt;: With MIPROv2, you can confidently create prompts that consistently meet your quality criteria, even as the demands on your LLM change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How MIPROv2 works in Optimization Studio&lt;/strong&gt;&lt;br&gt;
The Optimization Studio harnesses the full capability of DSPy’s MIPROv2 through a guided, three-step process. Here’s a breakdown of how this happens in a way that’s simple, efficient, and code-light:&lt;/p&gt;

&lt;p&gt;Step 1: Demonstrate with high-quality data&lt;br&gt;
The first step is to gather input-output examples (demonstrations) that represent the kind of responses you expect from your LLM. MIPROv2 then automatically generates a range of demonstration sets from this data, ensuring that each set showcases both accurate and relevant responses. You simply provide the input-output pairs and let the Optimization Studio do the rest.&lt;/p&gt;
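&lt;p&gt;As a rough picture of what Step 1’s input-output pairs look like as data (an illustrative shape, not a LangWatch schema):&lt;/p&gt;

```typescript
// Illustrative demonstration set: a handful of input-output pairs that
// show the behavior you expect, which the optimizer samples from.
type Demo = { input: string; expected: string };

const demos: Demo[] = [
  {
    input: "Where is my order #123?",
    expected: "It shipped on Monday and should arrive Thursday.",
  },
  {
    input: "Can I return a damaged item?",
    expected: "Yes, you can start a return from your orders page.",
  },
];

console.log(demos.length); // 2
```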

&lt;p&gt;Step 2: Craft effective instructions&lt;br&gt;
Using the data from Step 1, MIPROv2 generates various prompt instructions by analyzing the most effective ways to achieve your desired results. The Optimization Studio uses a summary of the demonstration sets, and from there, it creates different prompt instructions that match the style and goal of your project. You can review, adjust, and select the most relevant prompt without needing to experiment manually.&lt;/p&gt;

&lt;p&gt;Step 3: Select the best Prompt with bayesian Optimization&lt;br&gt;
To find the most effective prompt, MIPROv2 runs evaluation trials, scoring each demo-prompt pair against the criteria you specify. By using Bayesian Optimization, it quickly hones in on the best-performing prompt variation. All this happens behind the scenes in the Optimization Studio, providing you with the top-scoring prompt without needing to manage the complex calculations yourself.&lt;/p&gt;
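&lt;p&gt;The core loop of Step 3 is “score every candidate, keep the best.” MIPROv2 does this with Bayesian optimization over demo/instruction pairs; the toy sketch below replaces that with a stand-in scoring function, just to show the evaluate-and-select shape:&lt;/p&gt;

```typescript
// Toy evaluate-and-select loop. The score() function here is a stand-in
// for your real evaluator; MIPROv2's actual search is Bayesian, not a
// plain scan over candidates.
type Candidate = { instruction: string; demos: number };

function score(c: Candidate): number {
  // Stand-in metric: prefer more specific instructions and a few demos.
  return c.instruction.length * 0.1 + Math.min(c.demos, 3);
}

function selectBest(candidates: Candidate[]): Candidate {
  return candidates.reduce((best, c) => (score(c) > score(best) ? c : best));
}

const best = selectBest([
  { instruction: "Answer briefly.", demos: 0 },
  { instruction: "Answer briefly, citing the source document.", demos: 2 },
]);
console.log(best.demos); // 2
```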

&lt;p&gt;&lt;strong&gt;The benefits of low-code MIPROv2 optimization&lt;/strong&gt;&lt;br&gt;
By bringing MIPROv2 into a low-code environment, LangWatch’s Optimization Studio makes prompt optimization faster, simpler, and more accessible to teams across industries. Whether you’re working in customer support, content generation, or educational applications, you can:&lt;/p&gt;

&lt;p&gt;Quickly adapt prompts to new use cases or requirements without extensive re-coding.&lt;/p&gt;

&lt;p&gt;Validate and monitor prompt performance using built-in evaluation functions, ensuring that your LLM’s responses meet real-world demands.&lt;/p&gt;

&lt;p&gt;Optimize LLMs without programming knowledge, allowing non-technical team members to participate in the optimization process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Take the next step with LangWatch’s Optimization Studio&lt;/strong&gt;&lt;br&gt;
LangWatch’s Optimization Studio is more than just a tool—it’s a gateway to unlocking the potential of your language model through smarter, more efficient prompts. By utilizing DSPy’s MIPROv2 in a low-code setting, you can experience the advantages of prompt optimization without getting bogged down by technical details.&lt;/p&gt;

&lt;p&gt;Ready to experience prompt optimization without the coding complexity? Find more here: &lt;a href="http://www.langwatch.ai" rel="noopener noreferrer"&gt;www.langwatch.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llmops</category>
      <category>dspy</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
