<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Scarlett Attensil</title>
    <description>The latest articles on DEV Community by Scarlett Attensil (@sattensil888).</description>
    <link>https://dev.to/sattensil888</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3443244%2Fcaef40f2-e953-4f43-954d-018fdc1832e7.png</url>
      <title>DEV Community: Scarlett Attensil</title>
      <link>https://dev.to/sattensil888</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sattensil888"/>
    <language>en</language>
    <item>
      <title>Benchmark LangGraph, Strands, OpenAI Agents, and Google ADK on the same agent graph</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Tue, 23 Jun 2026 19:03:00 +0000</pubDate>
      <link>https://dev.to/launchdarkly/benchmark-langgraph-strands-openai-agents-and-google-adk-on-the-same-agent-graph-11me</link>
      <guid>https://dev.to/launchdarkly/benchmark-langgraph-strands-openai-agents-and-google-adk-on-the-same-agent-graph-11me</guid>
      <description>&lt;p&gt;Agent framework debates are mostly vibes. One engineer swears LangGraph is faster, another prefers the OpenAI Agents SDK, someone wants Google ADK because it feels future-proof. The team picks one, wires the workflow into its SDK, and the choice is welded in. Changing frameworks later means tearing out the wiring for one SDK and rebuilding the workflow on another, an expensive rewrite few teams take on.&lt;/p&gt;

&lt;p&gt;This tutorial makes that decision reversible and then settles it with data. You put the &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/agent-graphs" rel="noopener noreferrer"&gt;agent graph&lt;/a&gt; in LaunchDarkly and run four frameworks (&lt;a href="https://langchain-ai.github.io/langgraph/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;, &lt;a href="https://strandsagents.com/" rel="noopener noreferrer"&gt;Strands&lt;/a&gt;, &lt;a href="https://openai.github.io/openai-agents-python/" rel="noopener noreferrer"&gt;OpenAI Agents SDK&lt;/a&gt;, and &lt;a href="https://google.github.io/adk-docs/" rel="noopener noreferrer"&gt;Google ADK&lt;/a&gt;) over the same topology, with the model pinned so the framework is the only variable. A LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; ranks them on graph latency and token use, with an LLM judge guarding quality. The results table tells you which framework runs your graph fastest without degrading it.&lt;/p&gt;

&lt;p&gt;This tutorial is the sequel to &lt;a href="https://launchdarkly.com/docs/tutorials/ai-orchestrators" rel="noopener noreferrer"&gt;Compare AI orchestrators&lt;/a&gt;, which ran the same workflow across frameworks but kept the topology in each framework's code. Here, the topology, routing, models, prompts, tools, and judge all live in LaunchDarkly, and each framework supplies only two functions.&lt;/p&gt;

&lt;p&gt;The experiment results do more than set a benchmark. The flag that splits experiment traffic also routes production. When one framework wins, you don't rewrite the app; you change the flag to serve the winner. In a single loop, LaunchDarkly does three jobs: the graph definition, the experiment split, and the runtime control that ships the winner.&lt;/p&gt;

&lt;p&gt;The workload is a research-gap analysis over a set of &lt;a href="https://arxiv.org/" rel="noopener noreferrer"&gt;arXiv&lt;/a&gt; papers. Two readers, &lt;code&gt;approach-analyzer&lt;/code&gt; and &lt;code&gt;contradiction-detector&lt;/code&gt;, read the same papers in parallel and fan in to &lt;code&gt;gap-synthesizer&lt;/code&gt;, which writes the report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A LaunchDarkly account with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol" rel="noopener noreferrer"&gt;AgentControl&lt;/a&gt; access, and your environment's &lt;a href="https://launchdarkly.com/docs/home/account/environment/keys#view-or-copy-sdk-credentials" rel="noopener noreferrer"&gt;SDK key&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Python 3.11+ and &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;uv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; for the pinned model. &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt; are only needed if you run the optional native-model bake-off in Step 9&lt;/li&gt;
&lt;li&gt;The companion repo: &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators" rel="noopener noreferrer"&gt;&lt;code&gt;ai-orchestrators&lt;/code&gt;&lt;/a&gt; on branch &lt;code&gt;tutorial/graph-experiments&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The experiment design
&lt;/h2&gt;

&lt;p&gt;The comparison is controlled: same graph, same model, same papers, same judge, with the framework as the only variable. Mechanically it runs in four stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrap.&lt;/strong&gt; &lt;code&gt;manifest.yaml&lt;/code&gt; creates the node configs, graph, &lt;code&gt;orchestrator&lt;/code&gt; flag, and judge in LaunchDarkly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route.&lt;/strong&gt; On each request, the app evaluates the &lt;code&gt;orchestrator&lt;/code&gt; flag to pick a framework: &lt;code&gt;langgraph&lt;/code&gt;, &lt;code&gt;strands&lt;/code&gt;, &lt;code&gt;openai-agents&lt;/code&gt;, or &lt;code&gt;google-adk&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run.&lt;/strong&gt; The dispatcher runs the shared graph as a directed acyclic graph (DAG). The two readers run concurrently and fan in to the synthesizer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure.&lt;/strong&gt; Each run records how long the graph took, how many tokens it used, and whether the report passed the quality judge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The shape looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        ┌──▶  approach-analyzer  ───────┐
   intake (papers) ─────┤                               ├──▶  gap-synthesizer  ──▶  report
                        └──▶  contradiction-detector  ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1: Create the graph, flag, and judge
&lt;/h2&gt;

&lt;p&gt;Everything starts from one file, &lt;code&gt;config/graph_experiment_manifest.yaml&lt;/code&gt;. It declares the &lt;code&gt;fetch_paper&lt;/code&gt; tool, four node configs (&lt;code&gt;intake&lt;/code&gt; plus the three agents, pinned to &lt;code&gt;claude-sonnet-4-5&lt;/code&gt;), the graph, the &lt;code&gt;orchestrator&lt;/code&gt; flag, and the judge.&lt;/p&gt;

&lt;p&gt;First, clone the companion repo and install its dependencies with uv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/launchdarkly-labs/ai-orchestrators
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-orchestrators
git checkout tutorial/graph-experiments
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, set up a LaunchDarkly project. The bootstrap doesn't create one, so create it with the &lt;a href="https://launchdarkly.com/docs/home/getting-started/mcp" rel="noopener noreferrer"&gt;LaunchDarkly MCP server&lt;/a&gt;, the &lt;a href="https://github.com/launchdarkly/ai-tooling/tree/main/skills/agentcontrol/projects" rel="noopener noreferrer"&gt;&lt;code&gt;projects&lt;/code&gt; agent skill&lt;/a&gt;, or &lt;a href="https://launchdarkly.com/docs/home/account/project#create-projects" rel="noopener noreferrer"&gt;the UI&lt;/a&gt;. Name it &lt;code&gt;graph-experiments&lt;/code&gt; to match the value in &lt;code&gt;.env.example&lt;/code&gt;, so the defaults work without edits. When it exists, copy its key into &lt;code&gt;LD_PROJECT_KEY&lt;/code&gt; and its production environment SDK key into &lt;code&gt;LD_SDK_KEY&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt;. The runners and experiment harness use that SDK key to evaluate the flag and graph. The bootstrap also reads &lt;code&gt;LD_API_KEY&lt;/code&gt; from &lt;code&gt;.env&lt;/code&gt; to create the resources.&lt;/p&gt;

&lt;p&gt;Copy the example file to create your &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# then set LD_PROJECT_KEY, LD_SDK_KEY, and LD_API_KEY in .env&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the keys in place, run the bootstrap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python scripts/launchdarkly/bootstrap.py config/graph_experiment_manifest.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates all four node configs, the &lt;code&gt;research-gap-graph&lt;/code&gt;, the &lt;code&gt;orchestrator&lt;/code&gt; flag (created &lt;strong&gt;off&lt;/strong&gt;), and the &lt;code&gt;gap-quality-judge&lt;/code&gt; attached to the &lt;code&gt;gap-synthesizer&lt;/code&gt; node (its &lt;code&gt;synthesizer-claude&lt;/code&gt; variation, set to 100% sampling). The judge scores the final report against the source papers, so it can verify grounding and citations. A judge can only check based on the information it has, so we give it the papers, not only an upstream agent's analysis.&lt;/p&gt;

&lt;p&gt;When the graph ships, it is incomplete by design. The bootstrap creates the &lt;code&gt;contradiction-detector&lt;/code&gt; config but wires only &lt;code&gt;intake&lt;/code&gt; to &lt;code&gt;approach-analyzer&lt;/code&gt; to &lt;code&gt;gap-synthesizer&lt;/code&gt;, leaving the detector out. You'll add it in Step 5 to complete the parallel fan-in.&lt;/p&gt;

&lt;p&gt;When it finishes, the bootstrap prints a link to your new agent graph. Open it and review the topology before moving on. The graph shows a straight line from &lt;code&gt;intake&lt;/code&gt; to &lt;code&gt;approach-analyzer&lt;/code&gt; to &lt;code&gt;gap-synthesizer&lt;/code&gt;, with &lt;code&gt;contradiction-detector&lt;/code&gt; created but not yet wired in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhkgx3zu1i4l7za5onm6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhkgx3zu1i4l7za5onm6t.png" alt="The bootstrapped graph in the agent graph builder, showing the incomplete graph with contradiction-detector created but not yet wired in." width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The flag is intentionally set to off&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Until the experiment is live, &lt;code&gt;ld.variation("orchestrator", …)&lt;/code&gt; falls back to the code default &lt;code&gt;"langgraph"&lt;/code&gt;, so every request routes to LangGraph. That behavior is correct for this stage. You'll force specific frameworks in Step 4, and the experiment takes over in Step 7.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 2: The dispatcher runs the graph
&lt;/h2&gt;

&lt;p&gt;The dispatcher is the heart of the project, and it's the same code for every framework. It reads the graph as a DAG, runs the entry nodes concurrently, hands every node the papers as ground truth, and connects the readers at the fan-in node. The only framework-specific pieces are &lt;code&gt;build_agent&lt;/code&gt; and &lt;code&gt;invoke&lt;/code&gt;, which are passed in as arguments.&lt;/p&gt;

&lt;p&gt;The whole process is about 100 lines, built on the agent graph traversal methods in the SDK. The &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/tutorial/graph-experiments/orchestrators/dispatcher.py" rel="noopener noreferrer"&gt;complete &lt;code&gt;dispatcher.py&lt;/code&gt;&lt;/a&gt; is in the companion repo.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Traversal methods or the managed run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SDK gives you two ways to run an agent graph. This dispatcher uses the &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/agent-graphs#agent-graphs-in-the-sdk" rel="noopener noreferrer"&gt;traversal methods&lt;/a&gt;, the lower-level API you walk yourself: &lt;code&gt;agent_graph()&lt;/code&gt;, &lt;code&gt;reverse_traverse()&lt;/code&gt;, the node and edge accessors, and the graph tracker. The SDK also offers a fully managed &lt;code&gt;create_agent_graph(...).run()&lt;/code&gt; that handles orchestration and collects metrics for you with no traversal code, which we recommend when you're on a supported framework, such as LangGraph or the OpenAI Agents SDK, and don't need to inspect the run.&lt;/p&gt;

&lt;p&gt;We use the traversal methods here because in a bake-off the walk itself is a controlled variable: one traversal with identical semantics for every framework (the managed runner covers two of the four today), concurrent execution of the independent readers under our own scheduling, and control over the exact input each agent and the judge receive. Same SDK, lower-level surface.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The dispatcher carries the design in four parts: it builds the execution plan from the graph's edges, composes each node's input, runs every ready node concurrently each round, and records the graph's metrics once per run.&lt;/p&gt;

&lt;p&gt;First, the dispatcher builds the execution plan from the graph's edges, so the topology you draw in LaunchDarkly runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_edges&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;edge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_config&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;succ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, every node receives the source papers and any upstream analyses, so each agent and the judge work directly from the source material rather than a summary handed down a chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compose_input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predecessor_outputs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== SOURCE PAPERS ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;predecessor_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ===&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then each round runs every node whose predecessors have finished, concurrently, so the two readers fan out and fan in with no special casing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ready&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pending&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;preds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ready&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the dispatcher records the graph's metrics on each run, including the end-to-end latency the experiment ranks on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;graph_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;graph_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_total_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TokenUsage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;graph_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph_tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_invocation_success&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dispatcher reads the topology at runtime, so reshaping the workflow in the UI, adding a node, or redrawing an edge takes effect on the next request with no code change. You'll do exactly that in Step 5.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Each framework is a thin adapter
&lt;/h2&gt;

&lt;p&gt;Each framework implements &lt;code&gt;build_agent(node_key, config, instructions)&lt;/code&gt; and &lt;code&gt;async invoke(agent, input_text, tracker)&lt;/code&gt;. Everything dynamic still comes from the LaunchDarkly node config: the model, the attached tools, and the instructions.&lt;/p&gt;

&lt;p&gt;LangGraph has a LaunchDarkly companion package, so its runner is only a few lines. The companion handles model creation, tool binding, and token tracking, so the adapter holds no framework plumbing of its own:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_langchain_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TOOL_REGISTRY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# binds only this node's attached tools
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_metrics_of_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;LDAIMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;sum_token_usage_from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))),&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;}]}),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;get_tool_calls_from_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_content_to_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sum_token_usage_from_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strands has no companion package, so its runner builds the model with a small provider-aware factory and binds tools with Strands' native &lt;code&gt;@tool&lt;/code&gt;. The contract is identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;node_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_create_strands_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Process the input and respond.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;callback_handler&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenAI Agents and Google ADK round out the four. For the comparison to stay fair, all four have to run the same model, but these two SDKs default to their own vendors' models. LiteLLM, a thin adapter, lets them call any provider, so we point both at the pinned &lt;code&gt;claude-sonnet-4-5&lt;/code&gt; and keep the model identical across all four orchestrators. No OpenAI or Google servers are involved. Instead, LiteLLM translates the request format in process and the call goes straight to Anthropic with your key.&lt;/p&gt;

&lt;p&gt;Google ADK is fully companion-free, and OpenAI Agents uses the &lt;code&gt;ldai_openai&lt;/code&gt; companion for token and tool-call telemetry even though it builds the model through LiteLLM. This experiment pins one model across all four frameworks, so every framework here runs Claude. Pointing each framework at its own vendor's default model instead is a separate, optional exercise, the native-model bake-off in Step 9. The tool callables live in &lt;code&gt;TOOL_REGISTRY&lt;/code&gt;, a plain &lt;code&gt;{name: callable}&lt;/code&gt; map that each framework binds its own way.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One tracking API, any framework&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LaunchDarkly records tokens and latency through one framework-agnostic tracker. You provide a &lt;code&gt;TokenUsage&lt;/code&gt; and call &lt;code&gt;track_*&lt;/code&gt;, and the metrics flow the same way regardless of orchestrator. For LangGraph and OpenAI Agents, the companion helpers (&lt;code&gt;ldai_langchain&lt;/code&gt;, &lt;code&gt;ldai_openai&lt;/code&gt;) populate it automatically. For anything else, you read the framework's own usage and pass it along. Every orchestrator emits identical metrics, so you can compare them directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can also use framework-specific tutorials&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want a framework-specific starting point, &lt;a href="https://launchdarkly.com/docs/tutorials/agents-langgraph" rel="noopener noreferrer"&gt;Build a LangGraph multi-agent system&lt;/a&gt; walks the LangGraph path from scratch, and &lt;a href="https://launchdarkly.com/docs/tutorials/ld4a-langgraph-migration" rel="noopener noreferrer"&gt;Migrate a hardcoded LangGraph agent to AgentControl&lt;/a&gt; shows how to externalize an existing agent's config and prompts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 4: Smoke test the graph
&lt;/h2&gt;

&lt;p&gt;Before you run any experiment, confirm the bootstrapped graph runs end to end. First, run one framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python orchestrators/verify_run.py langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It prints the path it took and the first part of the report. On the graph as it shipped, the path is &lt;code&gt;intake -&amp;gt; approach-analyzer -&amp;gt; gap-synthesizer&lt;/code&gt;: &lt;code&gt;intake&lt;/code&gt; runs its short pass, &lt;code&gt;approach-analyzer&lt;/code&gt; reads the papers, and &lt;code&gt;gap-synthesizer&lt;/code&gt; writes the report. There's no &lt;code&gt;contradiction-detector&lt;/code&gt; yet, and no error. The metrics land in the AgentControl UI under the graph you created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Add the parallel fan-in in the UI
&lt;/h2&gt;

&lt;p&gt;Here's the payoff of keeping the topology in LaunchDarkly: you finish building the workflow in the UI, with no redeploy, and the running app picks up the new shape on its next request. The &lt;code&gt;contradiction-detector&lt;/code&gt; config already exists, with its &lt;code&gt;fetch_paper&lt;/code&gt; tool attached. You wire it into the graph to add the second reader and form the parallel fan-in.&lt;/p&gt;

&lt;p&gt;To complete the graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Agents&lt;/strong&gt; in the LaunchDarkly sidebar.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Agent graphs&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select &lt;code&gt;research-gap-graph&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add the &lt;code&gt;contradiction-detector&lt;/code&gt; node.&lt;/li&gt;
&lt;li&gt;Draw an edge from &lt;code&gt;intake&lt;/code&gt; to &lt;code&gt;contradiction-detector&lt;/code&gt;, then another from &lt;code&gt;contradiction-detector&lt;/code&gt; to &lt;code&gt;gap-synthesizer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You add no routing logic: the edge itself is the route, because routing is structural.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9ny7h1sz9sjcp9eqodhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9ny7h1sz9sjcp9eqodhc.png" alt="The completed graph after adding the contradiction-detector node and its two edges in the UI." width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Re-run the smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python orchestrators/verify_run.py langgraph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The path now includes &lt;code&gt;contradiction-detector&lt;/code&gt;, and because &lt;code&gt;approach-analyzer&lt;/code&gt; and &lt;code&gt;contradiction-detector&lt;/code&gt; run concurrently, their order can vary. You completed a multi-agent workflow from the UI, and the config you wired in already had its tool attached.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fngfo016yshvab3gc02vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fngfo016yshvab3gc02vc.png" alt="The smoke test after completing the graph, with the two readers running concurrently and the path routing through contradiction-detector." width="797" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You finished a multi-agent workflow from the UI, mid-development, and the dispatcher ran the new shape on the next request. No redeploy, no code change: the graph you draw is the graph that runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Smoke test all four frameworks
&lt;/h2&gt;

&lt;p&gt;Before you collect experiment data, make sure all four frameworks can run the completed graph. One command runs all four in sequence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python orchestrators/verify_run.py all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It runs each framework against the completed graph and ends with a pass/fail summary, one line per framework, exiting non-zero if any framework failed, so it works as a gate. Each framework prints the path it took and a preview of its report, then a final summary collects the results. A successful run looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;▶ Running 'langgraph' over 2 papers on graph 'research-gap-graph'...
  ✓ PATH : intake -&amp;gt; contradiction-detector -&amp;gt; approach-analyzer -&amp;gt; gap-synthesizer
▶ Running 'strands' over 2 papers on graph 'research-gap-graph'...
  ✓ PATH : intake -&amp;gt; contradiction-detector -&amp;gt; approach-analyzer -&amp;gt; gap-synthesizer
▶ Running 'openai-agents' over 2 papers on graph 'research-gap-graph'...
  ✓ PATH : intake -&amp;gt; contradiction-detector -&amp;gt; approach-analyzer -&amp;gt; gap-synthesizer
▶ Running 'google-adk' over 2 papers on graph 'research-gap-graph'...
  ✓ PATH : intake -&amp;gt; contradiction-detector -&amp;gt; approach-analyzer -&amp;gt; gap-synthesizer
=== smoke summary ===
  ✓ langgraph
  ✓ strands
  ✓ openai-agents
  ✓ google-adk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a framework fails, its line shows an ✗ instead of a ✓ and the command exits non-zero. All four smoke-test against the pinned Claude model. &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; is the only model key you need, because OpenAI Agents and Google ADK reach Claude through LiteLLM. The OpenAI Agents SDK turns on tracing by default and looks for &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; to export traces, so the &lt;code&gt;openai-agents&lt;/code&gt; run may print a harmless tracing warning when that key is absent. It doesn't affect the run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Run it through the experiment
&lt;/h2&gt;

&lt;p&gt;Now you can use a LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; to rank the four frameworks on real traffic, on the same graph, with the model held constant. Because the model is fixed, the comparison is operational: which orchestrator delivers the model's quality fastest, with the least token overhead. The bootstrap already created the flag, the judge, and the graph.&lt;/p&gt;

&lt;p&gt;These metrics are measured on each request, so do a one-time setup first:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make the &lt;code&gt;request&lt;/code&gt; context kind available for experiments.&lt;/li&gt;
&lt;li&gt;Set the analysis unit of graph latency, tokens, and the judge metric to &lt;code&gt;request&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then create the experiment in the UI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an experiment with the &lt;code&gt;orchestrator&lt;/code&gt; flag as the treatment.&lt;/li&gt;
&lt;li&gt;Set the primary metric to &lt;strong&gt;Graph latency&lt;/strong&gt; (&lt;code&gt;$ld:ai:graph:duration:total&lt;/code&gt;, the time for a complete graph execution).&lt;/li&gt;
&lt;li&gt;Add tokens and &lt;code&gt;$ld:ai:judge:gap-quality&lt;/code&gt; as secondary metrics.&lt;/li&gt;
&lt;li&gt;Set the audience to 100% and the randomization unit to &lt;strong&gt;request&lt;/strong&gt;. Each run is a single request, there are no users in this workflow, and request is the unit LaunchDarkly measures AI and graph metrics by.&lt;/li&gt;
&lt;li&gt;Turn on the &lt;code&gt;orchestrator&lt;/code&gt; flag, which the bootstrap created set to off, so it serves the experiment's variations.&lt;/li&gt;
&lt;li&gt;Start an experiment iteration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkvdtsndy88syxed886ah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkvdtsndy88syxed886ah.png" alt="The experiment in the LaunchDarkly UI with the orchestrator flag as the treatment, the metrics you chose, and an even split across the four frameworks." width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We rank on latency and tokens because, with the model and the graph held constant, those are the things that genuinely differ: a framework can move quality only by degrading the plumbing, like a truncated report or a broken tool call. So &lt;code&gt;$ld:ai:judge:gap-quality&lt;/code&gt; stays a guardrail that catches a framework "winning" by cutting corners, not part of the ranking. Swap the model, prompt, or tools later instead of the framework, and that same judge becomes your primary metric.&lt;/p&gt;

&lt;p&gt;Then drive traffic. The flag assigns each run one framework at random:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python scripts/run_experiment.py &lt;span class="nt"&gt;--runs-per-category&lt;/span&gt; 6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's six runs over each of the six shipped topics, 36 in total. Assignment is random, so it usually fills all four variations, though it isn't guaranteed. Each run analyzes the topic's entire paper set, because gap analysis needs every paper to find real gaps.&lt;/p&gt;

&lt;p&gt;Open the experiment in LaunchDarkly: latency per variation, with tokens and &lt;code&gt;$ld:ai:judge:gap-quality&lt;/code&gt; alongside. The winner is the framework with the best latency and lowest token use that doesn't let quality slip. Because the model is pinned, cost is a fixed multiple of tokens, so the token column is also the cost ranking; for actual dollar figures, read them from &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/insights" rel="noopener noreferrer"&gt;Insights&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because the experiment holds everything but the framework constant, most of these bars land close, often within a few percent, which is by design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F66lx4a2uj4hq6viq0x83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F66lx4a2uj4hq6viq0x83.png" alt="The experiment results in the LaunchDarkly UI: graph latency, completion time, and tokens for each framework variation, side by side." width="799" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our run, Strands won on speed: it ran the graph fastest, with quality holding at the guardrail. If you optimize for speed and quality holds, that makes Strands the orchestrator to ship for this workload. Six topics and one randomized split isn't a large sample, so confirm the lead with more topics before you standardize on it. You can do that in Step 9.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 8: Ship the winner with runtime control
&lt;/h2&gt;

&lt;p&gt;The experiment gave you data. The reason to run it in LaunchDarkly, rather than a one-off script, is that acting on that data takes no deploy: the &lt;code&gt;orchestrator&lt;/code&gt; flag that was the experiment treatment is also your production router.&lt;/p&gt;

&lt;p&gt;When a variation wins, stop the iteration and set the flag's default to that framework. Every request routes to it on the next evaluation, with no redeploy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flfozhebv041gmh6tgd5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flfozhebv041gmh6tgd5t.png" alt="The orchestrator flag with a variation per framework, serving the default to production as the runtime router." width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then automate what you don't want to babysit. An &lt;a href="https://launchdarkly.com/blog/agentcontrol-adaptive-triggers/" rel="noopener noreferrer"&gt;adaptive trigger&lt;/a&gt; watches a guardrail and changes a flag on its own when production drifts past it. The orchestrator you shipped is operational and won't degrade by itself, so point the trigger at the model flag from Step 9: it fails over to a backup model when your primary provider has a bad day, the same guardrail driving a different flag. That closes the loop: experiment to find the winner, runtime control to ship it, and automation to keep it healthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Extend the experiment
&lt;/h2&gt;

&lt;p&gt;Tighten the bands by adding more topics. Confidence comes from more distinct topics, not more runs over the same few. Download one with a title-phrase (&lt;code&gt;ti:&lt;/code&gt;) query, and the harness picks it up automatically on the next run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python scripts/download_papers.py &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'ti:"LLM-as-a-judge"'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make quality the headline by flipping a config, not a flag. The framework lives in the &lt;code&gt;orchestrator&lt;/code&gt; flag because it is app-level routing, not a property of any agent. The model, the prompt, and the tool set are different: they live in the node configs, so you experiment on the config itself. Add a second variation to a node, such as &lt;code&gt;gap-synthesizer&lt;/code&gt; with a stronger model or a tightened prompt, and run an &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; with that config as the treatment and its variations as the arms. Pin the framework by setting the &lt;code&gt;orchestrator&lt;/code&gt; flag to one value and leave the graph alone, so the config is the only thing moving. The judge attached to the synthesizer already emits &lt;code&gt;$ld:ai:judge:gap-quality&lt;/code&gt;, so quality is the primary metric with no new instrumentation. Now it genuinely moves, because a different model or prompt reasons differently about the same papers.&lt;/p&gt;

&lt;p&gt;Experiment on the graph shape with a graph-key flag. The dispatcher takes the graph key as an argument, so the shape is another value you can put behind a flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;graph_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;graph_shape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research-gap-graph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;execute_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;graph_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;build_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build two graphs with different keys: for example, a linear &lt;code&gt;research-gap-graph-linear&lt;/code&gt; (&lt;code&gt;intake&lt;/code&gt; to &lt;code&gt;approach-analyzer&lt;/code&gt; to &lt;code&gt;gap-synthesizer&lt;/code&gt;) against the parallel &lt;code&gt;research-gap-graph&lt;/code&gt;, or one with an added critic node against one without. Make a multivariate &lt;code&gt;graph_shape&lt;/code&gt; flag whose variations are those graph keys, evaluate it exactly as the app evaluates &lt;code&gt;orchestrator&lt;/code&gt;, and set it as the experiment treatment with the framework and model held constant. You are measuring whether the extra structure earns its latency and quality, and because the dispatcher runs whatever shape the key resolves to, no runner or dispatcher code changes. You build the judge once, and it is the guardrail for the framework bake-off and the headline metric for every model, prompt, tool, and shape you test next.&lt;/p&gt;

&lt;p&gt;Run a native-model bake-off. This experiment holds the model constant so the framework is the only variable. To compare each framework on its own default model instead, build separate node configs per framework. This is the optional bake-off the prerequisites mention. It's a follow-up beyond this walkthrough, and the only part that needs &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; and &lt;code&gt;GOOGLE_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Whatever you flip, follow three rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Change one variable at a time (the framework, the model, or the shape), never two. If you change more than one, you can't attribute the win.&lt;/li&gt;
&lt;li&gt;Keep the quality guardrail on every run, because the fastest variant is often the one that quietly truncated its report or dropped a tool call.&lt;/li&gt;
&lt;li&gt;Earn confidence with distinct inputs, not repeats: a tight band around three repeated topics is still a tight band around the wrong number.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To learn more about judge design, read &lt;a href="https://launchdarkly.com/docs/tutorials/when-to-add-online-evals" rel="noopener noreferrer"&gt;When to add online evals&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/tutorials/custom-evals-claude-code" rel="noopener noreferrer"&gt;Evaluating with LLM-as-judge evaluators&lt;/a&gt;. To add a pre-production regression layer, read &lt;a href="https://launchdarkly.com/docs/tutorials/offline-evals" rel="noopener noreferrer"&gt;Offline evaluation of RAG-grounded answers&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recap and next steps
&lt;/h2&gt;

&lt;p&gt;Framework choice doesn't have to be a one-way door. Put the topology in a &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/agent-graphs" rel="noopener noreferrer"&gt;LaunchDarkly agent graph&lt;/a&gt;, have each framework supply only &lt;code&gt;build_agent&lt;/code&gt; and &lt;code&gt;invoke&lt;/code&gt;, and let one &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;experiment&lt;/a&gt; settle a question that usually gets answered by whoever argues hardest: pin the model, let the judge guard quality, and pick the orchestrator that delivers it fastest, with evidence in hand.&lt;/p&gt;

&lt;p&gt;Then keep going, because the framework is only the first swappable component. The same flag, experiment, and judge machinery compares models, prompts, tools, and whole graph shapes the same way, so "which is better" stops being a debate and becomes a measurement. And because the experiment and the runtime control are one flag, you never stop at a finding: you ship it, ramp it with a progressive rollout, and let an &lt;a href="https://launchdarkly.com/blog/agentcontrol-adaptive-triggers/" rel="noopener noreferrer"&gt;adaptive trigger&lt;/a&gt; hold the line in production while &lt;a href="https://launchdarkly.com/docs/tutorials/ai-iteration-loop-for-reliable-agents" rel="noopener noreferrer"&gt;the AI iteration loop for reliable agents&lt;/a&gt; keeps the next change shipping behind eval gates.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/tree/tutorial/graph-experiments" rel="noopener noreferrer"&gt;complete code&lt;/a&gt; is in the sample repo. &lt;a href="https://app.launchdarkly.com/signup" rel="noopener noreferrer"&gt;Get started with AgentControl&lt;/a&gt;, point the four frameworks at a graph your team actually runs, and settle the next framework argument with a number instead of a hunch.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>orchestrators</category>
      <category>experimentation</category>
      <category>evals</category>
    </item>
    <item>
      <title>AI Experimentation Best Practices: From Evaluation to Safe Production Rollouts</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:09:35 +0000</pubDate>
      <link>https://dev.to/launchdarkly/ai-experimentation-best-practices-from-evaluation-to-safe-production-rollouts-4536</link>
      <guid>https://dev.to/launchdarkly/ai-experimentation-best-practices-from-evaluation-to-safe-production-rollouts-4536</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Artificial intelligence tools, particularly large language models (LLMs), are not like traditional software. AI is probabilistic, so the same instructions and inputs can produce different results, especially when using non-zero temperature or other sampling methods, and those results can shift as your context changes. That unpredictability brings real risks because models can miss the mark, invent facts, or generate unfair or unsafe outputs. They can also incur unexpected costs and slow down under heavy loads, and they must constantly adapt to evolving policies and ethical guidelines.&lt;/p&gt;

&lt;p&gt;AI experimentation means iteratively testing data, algorithms, prompts, models, and parameters to optimize model performance and validate hypotheses. You need a clear, repeatable way to try ideas, compare prompts and models, validate how your system finds and uses information, and do safety checks before changes reach real users. Experimentation is not just a nice-to-have; it is essential for shipping AI responsibly, optimizing resource efficiency, reducing costs, and accelerating innovation through rapid, evidence-based iteration cycles.&lt;/p&gt;

&lt;p&gt;Throughout this guide, we distinguish &lt;strong&gt;evaluation&lt;/strong&gt; from &lt;strong&gt;experimentation&lt;/strong&gt;. Evaluation means offline benchmarking and scoring, including test sets, human or AI judges, and quality metrics. Experimentation means controlled production changes that affect real users through A/B tests, staged rollouts, or other release strategies. Evaluation tells you whether a variant clears a quality bar; experimentation tells you whether it beats the baseline in production, with statistical confidence and guardrails.&lt;/p&gt;

&lt;p&gt;In this article, we cover the core ideas and practical steps for AI experimentation: how to plan a test, evaluate changes, run controlled trials with real users, choose metrics that actually matter to your product, and roll out changes safely. By the end, you will have a process that moves from initial concept to monitored, controlled production release.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Experimentation Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Best Practice&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Use experimentation to manage uncertainty&lt;/td&gt;
&lt;td&gt;AI outputs can shift over time. Structured experimentation helps teams measure, compare, and validate changes before they reach users. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;AgentControl experiments&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/releases/releasing" rel="noopener noreferrer"&gt;release options&lt;/a&gt; help turn unpredictability into a controlled process for improvement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build trust through evidence, not intuition&lt;/td&gt;
&lt;td&gt;Without experimentation, teams rely on gut feeling. Controlled tests provide measurable evidence of what works. Use &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;LaunchDarkly Experimentation&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt; to make confident, data-driven decisions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detect and reduce hidden risks early&lt;/td&gt;
&lt;td&gt;Experimentation surfaces hallucinations, bias, latency regressions, and safety failures before they affect broad audiences. &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;Online evaluations&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; help teams detect regressions and pause or roll back unsafe changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enable continuous improvement&lt;/td&gt;
&lt;td&gt;AI systems evolve as data, models, and contexts change. &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;Config variations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;config targeting&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;progressive rollouts&lt;/a&gt; give teams a repeatable way to adapt while controlling exposure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Design experiments with statistical power and variance in mind&lt;/td&gt;
&lt;td&gt;Collect multiple observations per variant to account for nondeterminism. Use confidence intervals and statistical significance tests rather than single-run comparisons. Define a minimum detectable effect (MDE), guardrail metrics, and a decision rule before launch. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;experiments&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; support this evidence-based workflow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Support responsible and compliant AI&lt;/td&gt;
&lt;td&gt;Experimentation frameworks help teams evaluate whether updates align with ethical standards, privacy requirements, and evolving policies. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/account/role-based-access-control" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/account/approvals" rel="noopener noreferrer"&gt;approvals&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/account/audit-log" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt; help make responsible AI development a built-in process.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Keep track of cost and latency&lt;/td&gt;
&lt;td&gt;Track per-session spend and speed, set budgets and max token limits, optimize prompts and context, use caching or streaming where appropriate, and monitor TTFT, p95/p99 latency, retries, and spend. &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/metrics/autogen/ai" rel="noopener noreferrer"&gt;autogenerated AI metrics&lt;/a&gt; help surface cost, latency, and token usage by variation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conduct controlled testing with real users&lt;/td&gt;
&lt;td&gt;Run A/B tests, sticky cohorts, staged rollouts, or interleaving strategies. Measure satisfaction, task completion, latency, cost, and business impact. Use &lt;a href="https://launchdarkly.com/docs/home/flags/target-rules" rel="noopener noreferrer"&gt;targeting rules&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; to control exposure and rollback thresholds.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perform evaluation&lt;/td&gt;
&lt;td&gt;Define metrics for truthfulness, user experience, reliability, safety, cost, and speed. Test in layers and expand only when stable. Evaluation tells you whether a system meets a bar, while experimentation determines which variant should be trusted in production. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/datasets" rel="noopener noreferrer"&gt;datasets&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/judges" rel="noopener noreferrer"&gt;judges&lt;/a&gt; support layered AI evaluation workflows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use retrieval evaluation for RAG&lt;/td&gt;
&lt;td&gt;Evaluate model quality by measuring recall@k, precision@k, citation accuracy, unsupported claim rate, cost, and latency. After offline quality assessment, use live or shadow traffic for controlled experiments that optimize retrievers, chunking, ranking, or reranking. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;AgentControl experiments&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt; help compare these changes safely.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ensure proper governance and safety for AI experimentation&lt;/td&gt;
&lt;td&gt;Pre-register your experiment plan, including hypothesis, primary metric, MDE, guardrails, and rollback rules. Version prompts, models, and configurations. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/manage" rel="noopener noreferrer"&gt;config management&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/account/approvals" rel="noopener noreferrer"&gt;approvals&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/account/audit-log" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt; help preserve compliance, safety, and auditability.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Testing different chunking or embedding models usually requires building and validating separate vector indexes, and sometimes separate databases, because embeddings are tied to the index schema. Swapping these at inference time requires architectural planning, reindexing, and migration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why AI Needs Experimentation
&lt;/h2&gt;

&lt;p&gt;Traditional software works like a calculator: same input, same output. AI is more like a conversational assistant that can be helpful and creative but sometimes surprising. Since AI is not fully predictable and small changes in wording can shift results, you cannot judge the quality of an AI feature from a single right answer.&lt;/p&gt;

&lt;p&gt;AI features are pipelines with many moving parts: models that may update, prompts that steer behavior, tools and APIs that can fail, and knowledge sources that drift as content changes. All of these can affect accuracy, safety, speed, and cost. A one-time test will not catch issues that appear under real traffic.&lt;/p&gt;

&lt;p&gt;That is why experimentation is essential. It gives teams a structured way to observe, measure, and improve AI behavior as conditions change. Through continuous testing, you can detect drift, uncover hidden risks, and build confidence that your system performs reliably and responsibly.&lt;/p&gt;

&lt;p&gt;LaunchDarkly helps teams operationalize this workflow with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol" rel="noopener noreferrer"&gt;AgentControl&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;configs&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;config variations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;config targeting&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;monitoring&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hierarchy of Levers: Where to Focus Your Optimization Efforts
&lt;/h2&gt;

&lt;p&gt;In practice, AI experimentation levers should be optimized in order of impact and reversibility:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;System message&lt;/li&gt;
&lt;li&gt;Examples&lt;/li&gt;
&lt;li&gt;Output format&lt;/li&gt;
&lt;li&gt;Context&lt;/li&gt;
&lt;li&gt;Retries and fallbacks&lt;/li&gt;
&lt;li&gt;Models and parameters&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This order matters because many high-impact changes can be made without retraining or rebuilding your system. With &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;AgentControl config variations&lt;/a&gt;, teams can version and compare these changes while controlling exposure through &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;targeting&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  System Message Variations
&lt;/h3&gt;

&lt;p&gt;The system message is one of the most powerful levers in shaping an AI model’s behavior. It defines the model’s role, tone, and boundaries, setting the personality and guardrails for how it responds.&lt;/p&gt;

&lt;p&gt;Small changes here can dramatically affect safety and reliability. Tightening tone or adding an out-of-scope clause can prevent speculative or unsafe content. However, overly rigid instructions can make responses sound robotic or unhelpful.&lt;/p&gt;

&lt;p&gt;Experiment with several system-message variations and test how they perform across normal, edge, and adversarial scenarios. The goal is not only to find one prompt that works, but to understand how tone and framing influence quality, cost, safety, and latency. Store and compare these variants with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;AgentControl config variations&lt;/a&gt; and monitor results with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;config performance monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choosing the Right Number of Examples
&lt;/h3&gt;

&lt;p&gt;Compare zero-shot, one-shot, and few-shot examples, typically 3-5 examples. Mix common cases and edge cases, include “do” and “don’t” examples, and show the exact output format. Short examples teach patterns, but they also add tokens and delay. Measure accuracy, format adherence, generalization, latency, and cost with &lt;a href="https://launchdarkly.com/docs/home/metrics/autogen/ai" rel="noopener noreferrer"&gt;autogenerated AI metrics&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Format
&lt;/h3&gt;

&lt;p&gt;Choose between free text, structured templates, or native structured outputs. Structured outputs are easier to parse and validate but can constrain creativity or break on truncation. Always validate responses, handle partial outputs gracefully, and keep templates simple. During testing, a temporary explain field can help diagnose why one variation performs better than another.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window Size
&lt;/h3&gt;

&lt;p&gt;Your experiment should test the cost-benefit tradeoff between precise context and extended context. Increasing context often increases cost and latency without improving output quality. Use &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt; to compare variation-level latency and token usage before promoting a longer-context variant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retries With Backoff
&lt;/h3&gt;

&lt;p&gt;Use one or two attempts for temporary errors such as rate limits, timeouts, or server overload. Add exponential backoff and jitter. Log error rates, latency, and cost. Ensure idempotency, cap retries, enforce timeouts, and offer a polite fallback when limits are hit. For production rollout, pair retry changes with &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; so latency and error regressions can halt expansion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fallback Chain
&lt;/h3&gt;

&lt;p&gt;Route to a backup model or provider in the event of failures or slowness. Keep prompts and formats aligned so the backup model understands the same prompt structure and returns responses in the same format. Preserve conversation state, verify required features on the fallback, and log reasons for routing. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;config targeting&lt;/a&gt; can help route different cohorts to different model or provider variations.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Expansion Rule
&lt;/h3&gt;

&lt;p&gt;Experimentation should scale based on evidence, not enthusiasm. Once your pilot shows strong performance, expand the rollout to broader audiences. Scale only when metrics justify it: success rates are high, failure rates are low, safety checks pass, and time or cost remains acceptable. Use &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;progressive rollouts&lt;/a&gt;, or &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; to expand with controlled risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  Models and Parameters
&lt;/h2&gt;

&lt;p&gt;Models and parameters are the tuning panel for an AI system: the set of dials you use when you want more accuracy, fewer hallucinations, faster responses, or lower cost.&lt;/p&gt;

&lt;p&gt;Start with the right model for the job. Use a more capable model for complex reasoning or planning and a smaller, faster model for routine tasks. Match the model’s strength to the complexity and stakes of the task rather than defaulting to the largest model. Lock down the exact model version when possible so results stay reproducible as the model evolves. Version pinning reduces variability, but it does not eliminate drift. Upstream model behavior and real-world inputs can still change, so production experiments and ongoing holdbacks remain necessary.&lt;/p&gt;

&lt;p&gt;AgentControl lets teams manage model selection, prompt content, provider configuration, and generation parameters with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;configs&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;variations&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-model-config" rel="noopener noreferrer"&gt;AI model configurations&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature
&lt;/h3&gt;

&lt;p&gt;Temperature controls how adventurous or conservative a model’s output is. It is the primary generation setting most users adjust.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep it low, around 0-0.3, for code, structured formats, or safety-critical tasks.&lt;/li&gt;
&lt;li&gt;Use higher values, around 0.7-1.0, for creativity or brainstorming.&lt;/li&gt;
&lt;li&gt;Stay in the middle for everyday conversations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other sampling parameters, such as top_p or top_k, also influence output diversity, but temperature usually has the largest and most predictable effect, so it is often the first parameter worth tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval and Search
&lt;/h3&gt;

&lt;p&gt;Do not rely only on keywords because meaning matters. Semantic search helps the model understand intent. Hybrid search, combining semantic and keyword search, often works best for short queries or exact names. Choose an embedding model that fits your language and domain, and keep its version fixed.&lt;/p&gt;

&lt;p&gt;A graph database models relationships and traversals, such as “how is X connected to Y?” A vector database or vector-enabled datastore is optimized for similarity search over embeddings to support retrieval in RAG pipelines. When testing retrieval changes, use &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;AgentControl experiments&lt;/a&gt; to compare quality, latency, and cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking and Metadata
&lt;/h3&gt;

&lt;p&gt;Split documents into natural sections with slight overlaps. Sliding windows help for long text. Add metadata to improve filtering and relevance. When experimenting, start with a baseline and change one variable at a time: temperature, chunk size, top_k, reranking, or search type. Evaluate offline using a labeled dataset from your domain, then use controlled rollout strategies such as &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt; or &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; before broad exposure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool and Function Management
&lt;/h2&gt;

&lt;p&gt;Tools are the hands and eyes of your AI. They turn abstract intelligence into real-world action. However, giving an AI system too many tools at once can create reliability, safety, and cost problems. A focused, well-defined toolset keeps the system efficient and predictable.&lt;/p&gt;

&lt;p&gt;When experimenting with tools, start small. Give the AI only the tools it truly needs, then expand based on evidence. Simulate tool behavior with mock or historical data before allowing live writes or sensitive operations. Monitor error rates, latency, and cost. Use circuit breakers, fallback paths, and kill switches to keep the system stable when a tool fails.&lt;/p&gt;

&lt;p&gt;LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/tools" rel="noopener noreferrer"&gt;AgentControl tools&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/agents" rel="noopener noreferrer"&gt;agents&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/releasing" rel="noopener noreferrer"&gt;release controls&lt;/a&gt; can help teams expose new tool behavior gradually and roll back unsafe changes quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and Latency
&lt;/h2&gt;

&lt;p&gt;Managing cost and latency in AI systems is like tuning a race car: you want speed and performance, but you cannot afford to burn all your fuel in one lap. The trick is knowing where your money and time actually go: input tokens, output tokens, model rates, tool usage, retries, retrieval, and post-processing.&lt;/p&gt;

&lt;p&gt;Experiment design also affects cost. Multi-armed bandit approaches can reduce spend by shifting traffic away from losing variants early, while long, fixed-horizon A/B tests can waste budget after a clear loser emerges. Track cost per successful answer rather than cost per call so you know which variants are efficient and useful.&lt;/p&gt;

&lt;p&gt;Several habits help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Match the model to the job:&lt;/strong&gt; Use smaller models for routine tasks and larger models for complex reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set clear budgets:&lt;/strong&gt; Cap tokens, cost, and retries per session.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache and reuse:&lt;/strong&gt; Avoid paying twice for the same retrieval or generated output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry wisely:&lt;/strong&gt; Validate inputs early and use exponential backoff to avoid waste.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure what matters:&lt;/strong&gt; Track cost per successful answer, not just cost per request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch latency signals:&lt;/strong&gt; Monitor time to first token, p95/p99 latency, and error rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;Monitoring&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/metrics/autogen/ai" rel="noopener noreferrer"&gt;autogenerated AgentControl metrics&lt;/a&gt; help teams compare token usage, duration, and variation-level performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimentation Before User Exposure
&lt;/h2&gt;

&lt;p&gt;Before any major AI update reaches real users, it deserves a proper dress rehearsal. Catching issues early prevents bad experiences, unnecessary costs, and reputational damage.&lt;/p&gt;

&lt;p&gt;Start by building a test set that mirrors real-world scenarios: genuine examples, synthetic edge cases, and adversarial prompts. If you are working with RAG, make sure answers link back to sources so you can evaluate grounding and citation quality. Use an AI judge or evaluation rubric to score correctness, completeness, clarity, safety, and faithfulness. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/datasets" rel="noopener noreferrer"&gt;datasets&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/judges" rel="noopener noreferrer"&gt;judges&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/offline-evaluations" rel="noopener noreferrer"&gt;offline evaluations&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt; support this progression from offline testing to production measurement.&lt;/p&gt;

&lt;p&gt;Best practices include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Set clear thresholds:&lt;/strong&gt; Define what “good enough” means before the test begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow test safely:&lt;/strong&gt; Run the new model alongside the current one on real traffic while hiding results from users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control costs:&lt;/strong&gt; Sample requests, cache results, and limit verbosity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protect fairness and privacy:&lt;/strong&gt; Compare variants across quality, reliability, cost, and speed while respecting data boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the new model shows stable performance, no quality drops, no latency spikes, and no cost overruns, move to a canary rollout with rollback ready. Use &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt; for fixed exposure, &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;progressive rollouts&lt;/a&gt; for scheduled expansion, and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; when you want metric-based monitoring and rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  Controlled Testing With Real Users
&lt;/h2&gt;

&lt;p&gt;Testing with real users is where theory meets reality. The goal is to gather insight while keeping risk low and user experience intact.&lt;/p&gt;

&lt;p&gt;A practical way to do this is A/B testing. By assigning users to consistent cohorts, you can compare different versions of your AI system under real conditions. This supports statistical decision-making, such as confidence intervals and significance testing, rather than anecdotal wins.&lt;/p&gt;

&lt;p&gt;To make tests meaningful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep traffic splits representative:&lt;/strong&gt; Cover different user segments, regions, and use cases with &lt;a href="https://launchdarkly.com/docs/home/flags/target-rules" rel="noopener noreferrer"&gt;targeting rules&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag everything:&lt;/strong&gt; Include version, prompt, model, parameters, and settings in every request so outcomes are traceable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure real impact:&lt;/strong&gt; Track satisfaction, edits, retries, task completion, conversion, revenue lift, latency, and cost with &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;LaunchDarkly metrics&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When rolling out updates, start with an internal beta, then gradually expand to 1%, 5%, 10%, and beyond. Watch quality, latency, safety, and failure rates closely. If something goes wrong, roll back and investigate. &lt;a href="https://launchdarkly.com/docs/home/releases/creating-guarded-rollouts" rel="noopener noreferrer"&gt;Creating guarded rollouts&lt;/a&gt; gives teams a structured way to tie rollout expansion to live metrics.&lt;/p&gt;

&lt;p&gt;Not all AI experiments have a fixed end date. Many teams run ongoing control groups, holdbacks, or adaptive allocation strategies that monitor performance as models, data, and user behavior change. Even then, explicit guardrails and rollback thresholds are essential so optimization never trades off safety, latency, or cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation
&lt;/h2&gt;

&lt;p&gt;Evaluation is not just checking whether the model runs. It is understanding how well the system serves users, how reliable it is under real conditions, and whether it delivers value within operational limits. A strong evaluation framework balances quality, cost, safety, and performance.&lt;/p&gt;

&lt;p&gt;Layer evaluations in stages: offline tests, shadow testing, limited rollout, and broader production experiments. Set clear targets for quality, reliability, and cost. Instrument everything so you can explain wins and diagnose regressions. Expand only when metrics hold steady.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality and Accuracy
&lt;/h3&gt;

&lt;p&gt;Start with the basics: Does the model tell the truth? Validate answers against known ground truth using offline tests and side-by-side reviews. AI judges provide scalable signals, but they should be calibrated against human review and used primarily for relative comparison between variants rather than absolute truth. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/judges" rel="noopener noreferrer"&gt;judges&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt; help automate this scoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Experience
&lt;/h3&gt;

&lt;p&gt;Even a technically accurate model fails if it frustrates users. Focus on fast, helpful first responses and fewer handoffs to humans. Measure satisfaction, task completion, rewrite rates, time to first token, and time to useful answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;Reliability means tools behave predictably. Check that outputs match expected formats and that retries or timeouts are rare. Track error rates, schema validity, and success ratios. Define service-level objectives and trigger rollback if failures exceed limits. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; can connect metric regressions to automated release decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost and Speed
&lt;/h3&gt;

&lt;p&gt;Every token, retrieval, and retry has a price. Break down latency and cost by stage to identify where resources are spent. Use smaller or cached models for routine tasks, stream responses where appropriate, and tighten prompts to reduce waste.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;You cannot improve what you cannot see. Log prompts, parameters, model versions, config versions, and tool calls while masking personal data. Feed this data into dashboards that track cost, speed, quality, and safety. Use &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/observability" rel="noopener noreferrer"&gt;observability integrations&lt;/a&gt; to detect drift and regressions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating Retrieval Quality
&lt;/h3&gt;

&lt;p&gt;Great answers depend on great context. Assess retrievers, rerankers, and generators separately and together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recall@k shows whether the right documents appear.&lt;/li&gt;
&lt;li&gt;Precision@k shows whether retrieved documents are relevant.&lt;/li&gt;
&lt;li&gt;nDCG and MRR show how well relevant documents are ranked.&lt;/li&gt;
&lt;li&gt;Attributable accuracy connects correct answers to supporting evidence.&lt;/li&gt;
&lt;li&gt;Unsupported claim rate flags hallucinations.&lt;/li&gt;
&lt;li&gt;Citation correctness, freshness, cost, and latency show whether retrieval adds value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use offline QA sets with labeled passages and slice results by topic, query type, and language. Add confidence gating so the system can admit uncertainty instead of fabricating answers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability for Retrieval
&lt;/h3&gt;

&lt;p&gt;Instrument retrieval just like generation. Log query details, index versions, retrieved document IDs, ranking scores, and latency. Use dashboards to visualize recall, accuracy, and latency percentiles. Before rolling out a new index, use canary or shadow testing and control exposure with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;AgentControl targeting&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance and Safety in AI Experimentation
&lt;/h2&gt;

&lt;p&gt;Governance and safety keep AI experimentation trustworthy. The goal is to find measurable improvement while protecting users, respecting constraints, and keeping experiments reproducible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and Access Control
&lt;/h3&gt;

&lt;p&gt;Before any experiment touches real data or users, define who can change what and how. Limit who can modify prompts, deploy models, access production logs, or adjust rollout rules. Use separate environments for development, staging, and production. LaunchDarkly supports these practices through &lt;a href="https://launchdarkly.com/docs/home/account/role-based-access-control" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/account/approvals" rel="noopener noreferrer"&gt;approvals&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/account/audit-log" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/account/environment" rel="noopener noreferrer"&gt;environments&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Safety Guardrails
&lt;/h3&gt;

&lt;p&gt;Set hard limits that experiments cannot violate. Use content filters, rate limits, token budgets, circuit breakers, and quality thresholds. Define rollback conditions for error rates, latency spikes, toxicity, unsupported claims, or cost overruns. &lt;a href="https://launchdarkly.com/docs/home/releases/release-policies" rel="noopener noreferrer"&gt;Release policies&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; help standardize these controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducibility and Compliance
&lt;/h3&gt;

&lt;p&gt;Strong governance means being able to prove what happened. Fix random seeds or sampling settings where supported. Version dataset snapshots, model IDs, prompt templates, guardrails, and configuration files. Store experiment plans, analysis rules, inputs, outputs, parameters, and results. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/manage" rel="noopener noreferrer"&gt;config management&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/compare-config-versions" rel="noopener noreferrer"&gt;config version comparison&lt;/a&gt; help preserve reproducibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rollback and Kill Switches
&lt;/h3&gt;

&lt;p&gt;No matter how careful you are, things can go wrong. Keep the previous version ready. Test rollback procedures regularly. Use kill switches that can immediately halt an experiment if safety or quality issues emerge. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/managing-guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollout management&lt;/a&gt; support fast mitigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ongoing Monitoring
&lt;/h3&gt;

&lt;p&gt;Governance does not stop at launch. Continue tracking model performance, user behavior, data distributions, latency, cost, and safety. Periodically rerun safety and quality checks as the system evolves. Maintain a documented process for investigating failures, notifying stakeholders, and implementing fixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How LaunchDarkly Helps With AI Experimentation
&lt;/h2&gt;

&lt;p&gt;Where many AI tools stop at evaluation, LaunchDarkly helps enable production experimentation with traffic allocation, metrics, statistical comparison, and controlled release workflows. AI experimentation needs an operational layer that manages prompts, models, parameters, cohorts, traffic allocation, and rollouts safely. Building that layer yourself can quickly become complex.&lt;/p&gt;

&lt;p&gt;LaunchDarkly provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant updates without deployments:&lt;/strong&gt; Change prompts, swap models, or adjust parameters through &lt;a href="https://launchdarkly.com/docs/home/agentcontrol" rel="noopener noreferrer"&gt;AgentControl&lt;/a&gt; without redeploying application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe, gradual rollouts:&lt;/strong&gt; Test new models on a small percentage of users with &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;progressive rollouts&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized control with governance:&lt;/strong&gt; Use &lt;a href="https://launchdarkly.com/docs/home/account/approvals" rel="noopener noreferrer"&gt;approvals&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/account/audit-log" rel="noopener noreferrer"&gt;audit logs&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/account/role-based-access-control" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in experimentation framework:&lt;/strong&gt; Run &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;experiments&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;AgentControl experiments&lt;/a&gt; comparing models, prompts, or parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of concerns:&lt;/strong&gt; Developers can focus on building features while cross-functional teams safely participate in experimentation workflows through controlled configuration changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LaunchDarkly feature flags and AgentControl let you treat AI components as dynamic configurations rather than static code, giving you the speed and safety needed for continuous experimentation at scale. The following example shows how to switch between two model configurations with AgentControl.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Switching Between AI Model Variations With AgentControl
&lt;/h3&gt;

&lt;p&gt;In the LaunchDarkly dashboard, open AI, select AgentControl, create a config for the AI workflow, and define variations for each model you want to compare. For implementation details, start with the &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/quickstart" rel="noopener noreferrer"&gt;AgentControl quickstart&lt;/a&gt;, then review &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;Create configs&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;Create and manage config variations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;Config targeting&lt;/a&gt;, and the &lt;a href="https://launchdarkly.com/docs/sdk/ai/python" rel="noopener noreferrer"&gt;Python AI SDK reference&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After setting up config variations, use targeting to control which model variation is served and define a safe default.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This example is simplified for illustration. Production implementations should externalize secrets, define explicit fallbacks, enforce timeouts, and include error handling and guardrails.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Install Dependencies
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Install required dependencies.
# In a notebook, you can run these commands with a leading !.
# In a terminal, run them without the leading !.
&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;launchdarkly&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;launchdarkly&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ai&lt;/span&gt;
&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Import Dependencies
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Set Up Clients
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;openai_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;openai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;openai_api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Create Evaluation Contexts
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Context 1: control group user.
&lt;/span&gt;&lt;span class="n"&gt;context_user_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-alpha-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firstName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lastName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anderson&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experimentGroup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Context 2: treatment group user.
&lt;/span&gt;&lt;span class="n"&gt;context_user_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-beta-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;firstName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lastName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Baker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bob@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experimentGroup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;treatment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Run the Same Query Against Two Config Variations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fallback_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a detailed essay on NASA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_configured_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ai-experimentation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fallback_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI config is disabled for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_openai_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;


&lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_configured_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_user_a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User A: Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_configured_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_user_b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User B: Treatment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Comparison&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User A got: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User B got: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the code above, one user can receive a baseline model variation while another receives an experimental model variation, without requiring a redeploy. This makes it easier to compare quality, latency, and cost under controlled conditions. The AI SDK can also report metrics to LaunchDarkly, which you can review in &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt; and use in &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;AgentControl experiments&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Experimentation should be part of everyday AI work: a habit, not a one-off project. Keep iterating, version your data, and let real numbers guide decisions instead of hunches. Treat every AI change like a hypothesis. Every hypothesis should map to a traffic allocation strategy, decision rule, and rollback condition.&lt;/p&gt;

&lt;p&gt;Change one thing at a time. Start offline, move to shadow testing, then gradually expand through controlled rollouts while tracking quality, cost, latency, safety, and user outcomes. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol" rel="noopener noreferrer"&gt;AgentControl&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;config variations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt;, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/experimentation" rel="noopener noreferrer"&gt;experiments&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; make this process practical by keeping prompts, models, parameters, metrics, and release controls versioned, targetable, measurable, and reversible.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
      <category>mlops</category>
    </item>
    <item>
      <title>MLOps Lifecycle: Stages, Workflow, and Best Practices</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Tue, 02 Jun 2026 16:48:26 +0000</pubDate>
      <link>https://dev.to/launchdarkly/mlops-lifecycle-stages-workflow-and-best-practices-227d</link>
      <guid>https://dev.to/launchdarkly/mlops-lifecycle-stages-workflow-and-best-practices-227d</guid>
      <description>&lt;p&gt;A machine learning model that performs well on day one will not remain stable by default. Performance can degrade over time due to data drift, changes in user behavior, evolving feature sets, or updates to upstream systems. These changes rarely cause immediate failure, but they reduce reliability and make model behavior harder to understand.&lt;/p&gt;

&lt;p&gt;The core issue is not model quality, but a lack of coordination across the lifecycle. Decisions made early in the lifecycle affect every stage that follows. When stages operate in isolation, traceability breaks down. For example, code versioning may capture model changes, but not dataset lineage, feature definitions, or runtime behavior.&lt;/p&gt;

&lt;p&gt;MLOps addresses this by treating machine learning as a continuous, end-to-end lifecycle. It connects data, features, training, deployment, monitoring, and governance into a single operating model. Each stage introduces its own assumptions and dependencies, from training and validation to deployment, monitoring, and governance.&lt;/p&gt;

&lt;h5&gt;
  
  
  Summary of key MLOps lifecycle concepts
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Activities and Outputs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Ingestion and Labeling&lt;/td&gt;
&lt;td&gt;Collect raw data (&lt;a href="https://launchdarkly.com/docs/home/observability/logs" rel="noopener noreferrer"&gt;logs&lt;/a&gt;, databases, APIs, and sensors), annotate or label it if necessary, and clean it. The output will be versioned datasets or snapshots.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature Engineering&lt;/td&gt;
&lt;td&gt;Take raw data and transform it into features (e.g., normalization, encoding, and aggregation) and register these features in a feature store.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Training and &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;Experimentation&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Perform training jobs and hyperparameter tuning. The output of this stage will be trained model artifacts like weights and checkpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Validation and Testing&lt;/td&gt;
&lt;td&gt;Test new models against holdout or test data. The output will be accuracy, loss, fairness &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt;, and validation reports.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packaging and CI/CD&lt;/td&gt;
&lt;td&gt;Package the model into a deployable artifact or container and push it to a model registry or a container registry.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment and Rollout&lt;/td&gt;
&lt;td&gt;Deploy the model to production (REST endpoint, batch service, etc.). Manage traffic with &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;canary releases&lt;/a&gt; and/or blue-green deployments. For LLM applications, &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;Configs&lt;/a&gt; extends these capabilities to prompt versioning and &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-model-config" rel="noopener noreferrer"&gt;model provider management&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring and &lt;a href="https://launchdarkly.com/docs/home/observability" rel="noopener noreferrer"&gt;Observability&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Monitor system health: latency, error rates, etc. Monitor machine learning health, including elements like prediction quality and data drift.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback and Retraining&lt;/td&gt;
&lt;td&gt;Collect new labeled data and initiate the process of retraining the model. Schedule retraining runs using the newly collected data.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance and Approval&lt;/td&gt;
&lt;td&gt;Conduct human-in-the-loop reviews and compliance checks before deploying the model. Maintain documentation of the models (e.g., model cards and data sheets), and implement automated policy checks.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The following diagram shows how the major MLOps lifecycle stages connect in practice, from data ingestion through deployment, monitoring, and retraining, along with the operational outputs produced at each step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ibqagnw0fclooyb0bnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ibqagnw0fclooyb0bnp.png" alt="MLOps lifecycle stages from data ingestion through governance" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data ingestion and preparation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data as a first-class production artifact
&lt;/h3&gt;

&lt;p&gt;Most ML systems do not make data ingestion a control boundary, instead treating it as a background process. Initially, everything looks good, but then issues creep in, such as missing columns, silent null propagation, schema changes, late arrival of upstream data, or unknown outliers. There is no catastrophic failure, just gradual degradation of model performance, making it hard to debug and identify exactly what original data was used.&lt;/p&gt;

&lt;p&gt;Data ingestion should be a first-class citizen in the MLOps workflow. It is essential to establish reproducibility, compliance, and reliability for models. Determinism and measurable data quality should be achieved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion as a control layer
&lt;/h3&gt;

&lt;p&gt;Data ingestion must control the entry of all data being routed and validated. Data should be collected either in batches or streams before undergoing deterministic data cleansing transformations. Before any data is saved, schema requirements should be validated. In addition, each ingestion point should create a snapshot or version for future reference. Data lineage and quality &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; should be recorded at every stage along the processing route, so if validation fails, training on that data stops completely.&lt;/p&gt;

&lt;p&gt;In MLOps, one key operational choice is whether a system should be fail-closed or fail-open. Fail-closed systems stop processing as soon as an anomaly is detected, maximizing safety. Fail-open systems continue processing with fallback logic, maximizing availability. The decision should depend on business risk, not the default implementation.&lt;/p&gt;

&lt;p&gt;The pseudocode below shows a simplified ingestion control flow: load raw data, validate its schema, apply deterministic transformations, measure drift, and then store the resulting dataset version and metadata for downstream training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_from_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="nf"&gt;validate_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_transformations&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;null_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;null_handling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;outlier_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outlier_policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;drift_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_drift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;drift_score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drift_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Distribution shift detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dataset_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;snapshot_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;store_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drift_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For high-risk ML workflows such as regulated decisions, fraud detection, or safety-sensitive systems, ingestion pipelines should usually fail closed. In lower-risk cases, teams may choose fail-open behavior with explicit fallback logic, but that should be a conscious business decision rather than an implicit default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic validation signals
&lt;/h3&gt;

&lt;p&gt;Deterministic validation means data checks that always produce the same pass/fail outcome for the same data based on predefined rules. If a required column disappears, a null rate exceeds an allowed threshold, or a distribution shift crosses a defined limit, the pipeline should respond predictably every time. These checks are often the first reliable sign of upstream data problems, such as schema changes, silent null propagation, or newly introduced categorical values.&lt;/p&gt;

&lt;p&gt;In addition to checking whether columns exist, validating data effectively should include the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determining null counts and validating other attribute values&lt;/li&gt;
&lt;li&gt;Validating that attribute values fall into the correct range&lt;/li&gt;
&lt;li&gt;Limiting the number of categories available for categorical attributes&lt;/li&gt;
&lt;li&gt;Measuring distributional shifts in an attribute through either a PSI or KS test&lt;/li&gt;
&lt;li&gt;Measuring the number of duplicate records before any data goes into your model at all&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operational validation heuristics
&lt;/h3&gt;

&lt;p&gt;In practice, ingestion validation is implemented as a set of operational heuristics that help teams interpret failures quickly. The signal itself matters, but so does what it usually implies operationally, because that determines whether the right response is to stop the pipeline, investigate upstream systems, or trigger a fallback path.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Missing required column&lt;/td&gt;
&lt;td&gt;Usually indicates that an upstream schema or API contract changed and downstream transformations may no longer be valid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Null rate &amp;gt; threshold&lt;/td&gt;
&lt;td&gt;Often suggests corrupted source records, partial extraction failures, or broken joins in the upstream pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distribution drift &amp;gt; threshold&lt;/td&gt;
&lt;td&gt;May indicate a change in user behavior, source population, collection logic, or rollout conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High duplicate rate&lt;/td&gt;
&lt;td&gt;Often points to replayed ingestion jobs, duplicate event delivery, or broken deduplication logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unseen categories&lt;/td&gt;
&lt;td&gt;Can break encoders or produce invalid feature mappings if serving logic was built against a fixed category set&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Data versioning and lineage
&lt;/h3&gt;

&lt;p&gt;Having immutable dataset snapshots is critically important to ensure reproducible results. To allow reproducible training runs, each training run must reference the dataset version ID, schema hash, transformation configuration, and associated quality &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt;. Without versioning, retraining becomes non-deterministic.&lt;/p&gt;

&lt;p&gt;In regulated &lt;a href="https://launchdarkly.com/docs/home/account/environment" rel="noopener noreferrer"&gt;environments&lt;/a&gt;, ingestion needs to automatically enforce PII masking, field-level anonymization, and retention tagging. These controls should be enforced automatically as part of the ingestion pipeline rather than handled through ad hoc manual review because manual compliance steps are hard to audit and easy to bypass under delivery pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration and &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flag&lt;/a&gt; controls
&lt;/h3&gt;

&lt;p&gt;In mature ML systems, ingestion rules should be controlled through external configuration rather than hard-coded into pipeline logic. This allows teams to adjust schema strictness, null-handling rules, drift thresholds, and anonymization behavior without redeploying the pipeline. The YAML below shows one way to define those ingestion policies declaratively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://raw/customer_data"&lt;/span&gt;
  &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schemas/customer_v3.yaml"&lt;/span&gt;
  &lt;span class="na"&gt;null_handling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute_median"&lt;/span&gt;
  &lt;span class="na"&gt;outlier_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip_99_percentile"&lt;/span&gt;
  &lt;span class="na"&gt;drift_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1&lt;/span&gt;

&lt;span class="na"&gt;validation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enforce_strict_schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;max_null_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feature flags can control behaviors such as strict schema validation, drift blocking, and auto-anonymization. This enables the gradual introduction of more stringent validation, with instant rollback if the rules block production unexpectedly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion-level operating &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;The ingestion stage should expose a small set of operating &lt;a href="https://launchdarkly.com/docs/home/metrics" rel="noopener noreferrer"&gt;metrics&lt;/a&gt; so teams can tell whether data is arriving on time, passing validation, and staying within expected quality bounds. These are stage-specific signals used to manage data intake, not a replacement for the broader production monitoring discussed later in the article.&lt;/p&gt;

&lt;p&gt;Data intake needs to be measurable.&lt;/p&gt;

&lt;p&gt;Key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch success rate&lt;/li&gt;
&lt;li&gt;Ingestion latency&lt;/li&gt;
&lt;li&gt;Drift score per batch&lt;/li&gt;
&lt;li&gt;Null rate per critical feature&lt;/li&gt;
&lt;li&gt;Rejected batch percentage&lt;/li&gt;
&lt;li&gt;Schema violation count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because ingestion is the first control boundary in the lifecycle, failures and drift detected here often surface before model-level symptoms appear in production. When ingestion is declarative, versioned, validated, and measurable, downstream training and deployment become far more reproducible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Versioned dataset snapshots&lt;/li&gt;
&lt;li&gt;Validation reports and schema versions&lt;/li&gt;
&lt;li&gt;Recorded data quality metrics&lt;/li&gt;
&lt;li&gt;Metadata required for reproducibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature engineering
&lt;/h2&gt;

&lt;p&gt;Feature engineering is the lifecycle stage where raw, validated data is converted into the model inputs used during training and inference. In MLOps, this stage matters because feature definitions must remain consistent across offline training and online serving. If the transformation logic differs between those &lt;a href="https://launchdarkly.com/docs/home/account/environment" rel="noopener noreferrer"&gt;environments&lt;/a&gt;, the model may behave well in evaluation but degrade in production due to training-serving skew.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c0hk4dvf4gjg9t284qc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c0hk4dvf4gjg9t284qc.png" alt="Feature engineering transforms raw data into model-ready features" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining the feature contract before transformation
&lt;/h3&gt;

&lt;p&gt;With robust ML systems, feature definitions serve as the single source of truth; the transformation code simply implements them. The use of a feature-first approach helps make transformations deterministic, reducing the risk of training-serving skew. This consistency must extend across both offline feature stores used for training and backtesting, and online feature stores used for real-time inference. Aligning these environments helps prevent silent feature drift, invalid values, or data corruption in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic feature transformations
&lt;/h3&gt;

&lt;p&gt;Feature transformations should be deterministic: The same input should produce the same output when the same feature definition and configuration are applied. This is what allows training, backtesting, and live inference to remain aligned. Tools such as &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;Pandas&lt;/a&gt;, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt;, or feature platforms such as &lt;a href="https://feast.dev/" rel="noopener noreferrer"&gt;Feast&lt;/a&gt; can be used to implement that logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;

&lt;span class="c1"&gt;# Example: Scaling numeric features
&lt;/span&gt;&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;scaled_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;income&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;

&lt;span class="c1"&gt;# Example: Encoding categorical features
&lt;/span&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparse_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# scikit-learn &amp;lt; 1.2 uses the sparse parameter name.
&lt;/span&gt;    &lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sparse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;encoded_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unit tests and train-serving consistency
&lt;/h3&gt;

&lt;p&gt;Unit tests help verify both transformation correctness and train-serving consistency. In practice, that means confirming that the same feature logic used during training is also used when live requests are processed in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_feature_scaling&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
    &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Scaling check
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure that the same transformation logic is applied during both training and serving to prevent training-serving skew. Automate feature value validation before training, which can include range and null checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring feature distributions
&lt;/h3&gt;

&lt;p&gt;Teams usually encode feature-level validation rules separately from transformation code so they can check whether important features remain within expected bounds over time. The example below shows a simple configuration for monitoring a few feature ranges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;feature_monitoring&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;features&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;age&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;income&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;purchase_count&lt;/span&gt;
  &lt;span class="na"&gt;validations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;120&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;income&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;1000000&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;purchase_count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;1000&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature registry and versioning
&lt;/h3&gt;

&lt;p&gt;Store feature definitions and pipelines in a feature registry to ensure consistency.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feature_set"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer_features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"income"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"purchase_count"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"validation_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"passed"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;a href="https://git-scm.com/" rel="noopener noreferrer"&gt;Git&lt;/a&gt; or a feature registry to track all changes. Versioned feature pipelines support reproducibility across both training and production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Feature transformation pipelines&lt;/li&gt;
&lt;li&gt;Generated feature tables or vectors&lt;/li&gt;
&lt;li&gt;Versioned feature definitions in a registry&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Model training and &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;experimentation&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Once feature sets are available, the next stage is to train candidate models and record the &lt;a href="https://launchdarkly.com/docs/home/flags/contexts" rel="noopener noreferrer"&gt;context&lt;/a&gt; needed to reproduce and compare those runs later.&lt;/p&gt;

&lt;p&gt;Careful automation of training and experiment tracking helps improve reproducibility, consistency, and the ability to compare different models with each other at different times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automating model training
&lt;/h3&gt;

&lt;p&gt;Whenever possible, the training process should be automated. This includes scheduling regular training runs, running hyperparameter sweeps, and retraining models when new data becomes available. Automated pipelines save time and reduce human error, especially when managing multiple models or experimenting with different parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tracking experiments
&lt;/h3&gt;

&lt;p&gt;Every model training run should be tracked to ensure reproducibility and facilitate later comparisons. This means logging the hyperparameters used, such as learning rate and number of trees, dataset snapshots, code versions, and training and validation metrics.&lt;/p&gt;

&lt;p&gt;For example, this can be done using &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;learning_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;num_trees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Training code goes here
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Log evaluation metrics
&lt;/span&gt;    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_val&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;val_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save the trained model
&lt;/span&gt;    &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_artifact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method tracks all of an experiment, and you can repeat the model or compare it with any other run later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controls and best practices
&lt;/h3&gt;

&lt;p&gt;To prevent problems during training, configure an early stopping rule and define a limit for the total number of training epochs to avoid runaway training. You should also perform integration tests after loading the trained model using sample inputs. Each trained model should be saved as a versioned artifact in your chosen artifact service, such as S3 or the MLflow Model Registry. Finally, seed random number generators to ensure deterministic training and log the seed. These practices help maintain consistency, reproducibility, and reliability across training runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trained model artifacts (pickle, ONNX, TensorFlow SavedModel)&lt;/li&gt;
&lt;li&gt;Training &lt;a href="https://launchdarkly.com/docs/home/observability/logs" rel="noopener noreferrer"&gt;logs&lt;/a&gt; and experiment metadata&lt;/li&gt;
&lt;li&gt;Hyperparameter and dataset configuration snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Validation, testing, and evaluation
&lt;/h2&gt;

&lt;p&gt;Model evaluation starts with offline assessments using a holdout test dataset. In this stage, model performance is measured using task-appropriate measures. For classification, those measures may include accuracy, precision, recall, F1 score, ROC curve, and confusion matrix. For regression, common measures include RMSE, MAE, or R-squared. It is also necessary to evaluate domain-specific business metrics, such as conversion lift, cost of errors, or revenue impact, to ensure the deployed model provides business value in addition to statistical performance.&lt;/p&gt;

&lt;p&gt;While offline assessment provides important deployment guidance, automated checks against predetermined thresholds or recorded baselines should be part of gated validation before promotion. These checks should validate fairness or bias issues and use unit tests to confirm that known inputs return expected outputs. If a required threshold is violated, the pipeline should fail and prevent the model from being promoted to production.&lt;/p&gt;

&lt;p&gt;To maintain reliability, automate checks that compare metrics against defined thresholds or baselines. For example, the pipeline should fail if a model's accuracy falls below the previous version. The pipeline should also fail if a fairness metric for a protected group is violated. Include unit tests to confirm that the model produces correct predictions on known inputs. Only models that pass all validation checks should advance to deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Validation reports and evaluation metrics&lt;/li&gt;
&lt;li&gt;Metric visualizations (confusion matrices, ROC curves)&lt;/li&gt;
&lt;li&gt;Automated test logs and validation summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Packaging and CI/CD
&lt;/h2&gt;

&lt;p&gt;Once a machine learning model is validated, it should be packaged for deployment. This usually includes creating a container image, such as a Docker image, that includes the model and all code required to execute it. You can upload your model to a managed service like MLflow or Amazon S3.&lt;/p&gt;

&lt;p&gt;Packaging is about more than reproducing something; it is also about controlling its promotion. When a model artifact has been validated, it should have proper versioning, registry storage, and associated promotion paths, such as staging and production, supported by defined approval and traceability workflows. The purpose of packaging is to ensure that the deployable unit is exactly the one that was validated, with its runtime dependencies, metadata, and configuration captured in a controlled and versioned form. Following a promotion path lowers release risk and makes rollback to a previous version easier if a problem occurs.&lt;/p&gt;

&lt;p&gt;If you are using a continuous integration / continuous deployment (CI/CD) system like Jenkins, GitHub Actions, or Azure DevOps, deployment can usually be automated through the CI/CD pipeline. Typical steps involve retrieving the model from its storage location, building the Docker container image, running basic tests, and pushing the model image to a registry. Each image should contain a version tag that identifies which model version was deployed.&lt;/p&gt;

&lt;p&gt;To maintain safety and reliability, the CI/CD pipeline should run automated checks, including code validation, test requests to the container, and Docker image security scans. Always use fixed version tags rather than latest to avoid accidental overwrites. If any test fails, the pipeline should stop immediately to prevent a faulty model from being deployed.&lt;/p&gt;

&lt;p&gt;Proper packaging combined with automated CI/CD makes model deployment easier, safer, and more consistent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm036av3jh4llv6grkrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm036av3jh4llv6grkrt.png" alt="CI/CD and packaging workflow for model artifacts" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment and runtime controls
&lt;/h2&gt;

&lt;p&gt;Deployment is the stage where a validated model is exposed to production traffic through an endpoint, batch workflow, or embedded application path. The operational goal is not just to make the model reachable but to release it in a way that limits user risk, supports rollback, and preserves &lt;a href="https://launchdarkly.com/docs/home/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt; during change.&lt;/p&gt;

&lt;p&gt;One common runtime control is a &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flag&lt;/a&gt;, which is a configurable switch that changes application behavior without requiring a redeploy. In ML systems, &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt; can be used to route users between model versions, limit exposure to selected cohorts, or revert quickly to a known-safe model when problems appear. Tools such as &lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt;LaunchDarkly&lt;/a&gt; provide this kind of runtime control.&lt;/p&gt;

&lt;p&gt;Deployment strategies are designed to minimize exposure of new models, whereas guardrails are designed to minimize risk. You can also control which users see the new model by using &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt; in tools like &lt;a href="https://launchdarkly.com/" rel="noopener noreferrer"&gt;LaunchDarkly&lt;/a&gt;. One way to implement &lt;a href="https://launchdarkly.com/docs/home/flags" rel="noopener noreferrer"&gt;feature flags&lt;/a&gt; is by wrapping your inference code with a toggle that allows you to use the new model or fall back to the old model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new-model-enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;old_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;features&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach supports gradual rollouts; you can start by directing a small percentage of real traffic to the new model and increasing exposure only if metrics remain strong.&lt;/p&gt;

&lt;p&gt;Always have a rollback plan. Monitor the canary release closely, and if errors rise or latency spikes, revert the feature flag and redeploy the previous model.&lt;/p&gt;

&lt;p&gt;To maintain reliability, track latency and error rates for unusual patterns. Conduct integration tests in a staging environment before promoting a model to production. Log every deployment event, and to prevent user impact, trigger alerts or automated rollbacks if any service-level agreements (SLAs) are breached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Running model endpoints (Kubernetes deployments or cloud inference services)&lt;/li&gt;
&lt;li&gt;Feature flag configurations controlling rollout&lt;/li&gt;
&lt;li&gt;Traffic routing and rollout policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monitoring and &lt;a href="https://launchdarkly.com/docs/home/observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Unlike the ingestion-level operating metrics discussed earlier, this stage focuses on production-wide monitoring of the live ML system after deployment, including both infrastructure behavior and model behavior under real traffic.&lt;/p&gt;

&lt;p&gt;Once a model is deployed in production, it is essential to continuously monitor both the system and the model, which allows for early detection of issues and ensures that the model continues to perform as expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observing system and model metrics
&lt;/h3&gt;

&lt;p&gt;Monitoring should include both infrastructure and model metrics.&lt;/p&gt;

&lt;p&gt;Infrastructure metrics monitor the system's health and performance. Here are some examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU and GPU usage&lt;/td&gt;
&lt;td&gt;Ensure that compute resources are not overloaded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory consumption&lt;/td&gt;
&lt;td&gt;Avoid memory bottlenecks that could slow down inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;Track the number of requests the system handles per second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;Monitor response times to maintain consistent performance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Model metrics track the model's performance in production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prediction distributions&lt;/td&gt;
&lt;td&gt;Detect unusual patterns or shifts in model outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live accuracy&lt;/td&gt;
&lt;td&gt;Measure accuracy on recently labeled data to catch performance drops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error rates&lt;/td&gt;
&lt;td&gt;Monitor mispredictions or failures to quickly identify anomalies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparing these metrics against training baselines helps you detect data drift. For example, changes in input feature distributions can be measured using KL divergence or the population stability index.&lt;/p&gt;

&lt;p&gt;Concept drift should also be tracked; this occurs when a model's performance declines over time without code changes. Unexpected shifts in feature correlations or drops in model quality are strong indicators that something in the data or environment has changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-time &lt;a href="https://launchdarkly.com/docs/home/observability/dashboards" rel="noopener noreferrer"&gt;dashboards&lt;/a&gt; and alerts
&lt;/h3&gt;

&lt;p&gt;A key tool for monitoring is a real-time dashboard that displays prediction histograms, feature drift charts, and alert counts. Dashboards facilitate quick problem detection as issues arise, providing automated alerts when thresholds are exceeded and sending alerts through channels such as email or text for different severity levels. Minor drifts may generate a helpdesk ticket, while major anomalies may page the on-call technician.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explainability and logging
&lt;/h3&gt;

&lt;p&gt;For business-critical models, explainability tools can help users understand predictions and investigate why a model may be failing or drifting. All logs and metrics should be preserved and correlated, ideally within &lt;a href="https://launchdarkly.com/docs/home/observability/dashboards" rel="noopener noreferrer"&gt;dashboards&lt;/a&gt; or monitoring systems, so that any issue can be quickly traced, diagnosed, and made actionable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback loop and retraining
&lt;/h2&gt;

&lt;p&gt;A well-developed machine learning system continues to evolve after deployment. Production usage generates feedback in the form of new data, user corrections, and observed model performance, which can be used to retrain and improve the model over time. Examples include user corrections, newly added labeled examples for retraining, or additional incoming data generated through real usage.&lt;/p&gt;

&lt;p&gt;There are numerous options for initiating retraining. Some teams use a scheduled approach, such as retraining every month. Others use automated triggers when data drift exceeds an established threshold or when model performance drops below acceptable levels.&lt;/p&gt;

&lt;p&gt;Once retraining is triggered, the same data processing pathway used to develop the original model should be used with the newly input data: develop a new model, conduct validation, and deploy the new model to replace the original only if it passes all relevant validation. Before any model replacement, compare the new model and existing model using a common dataset.&lt;/p&gt;

&lt;p&gt;All retraining activities should be carefully documented, including the dataset version, model configuration, and performance metrics. This helps ensure full traceability and reproducibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controlled retraining workflow
&lt;/h3&gt;

&lt;p&gt;Retraining should be triggered by explicit conditions, such as scheduled cadence, measured drift, or degraded production performance. Each run should record the dataset version, feature set version, model configuration, evaluation results, and release decision. Before fully switching to a new model, deploy it in shadow mode. In this setup, both the old and new models run side by side on the same inputs, and their outputs are compared without impacting real users. This helps identify unexpected differences early.&lt;/p&gt;

&lt;p&gt;Business metrics should also be evaluated. For example, a small A/B test can confirm whether the new model improves conversion rates, reduces errors, or lowers operational costs. If the new model performs worse than the current one, immediately revert to the old model and investigate the issue. Deployment should not proceed if performance declines.&lt;/p&gt;

&lt;p&gt;All retraining cycles should be recorded clearly with the following information: what changed, the reason for retraining, how improvements were measured, and who authorized the release. Maintaining this record makes audits easier and improves transparency.&lt;/p&gt;

&lt;p&gt;Each retraining cycle produces important outputs, including updated training datasets, newly trained model artifacts, and retraining and evaluation reports. All of these artifacts should be securely stored and versioned so they can be reviewed, audited, or reproduced in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Closed-loop learning
&lt;/h3&gt;

&lt;p&gt;A well-integrated feedback loop links monitoring, validation, deployment, and retraining together. If a negative trend occurs or performance deviates from expectations, data retrieval can be triggered automatically. Once recent data has been processed, the updated model can replace the existing deployed model with confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Output
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Updated training datasets&lt;/li&gt;
&lt;li&gt;Newly trained model artifacts&lt;/li&gt;
&lt;li&gt;Retraining and evaluation reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Governance and approval
&lt;/h2&gt;

&lt;p&gt;When machine learning systems operate at scale, governance becomes essential. It is not enough for a model to function correctly from a technical standpoint; it must also be reviewed, documented, and formally approved before reaching users.&lt;/p&gt;

&lt;p&gt;Strong governance frameworks establish a clear delineation of role expectations. For example, a data scientist may develop and train a model, while an ML engineer is responsible for deployment. A governance or compliance officer may check documentation and approve the release. After passing technical testing, models must complete formal review processes that include reviewing the model card, data documentation, bias analysis, and performance reports before receiving final approval.&lt;/p&gt;

&lt;p&gt;Many organizations separate their environments into development, testing, and production. Models are promoted step by step, with each stage requiring sign-off from the appropriate team. This structured process helps ensure that no model reaches production without proper oversight and review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Policy and compliance controls
&lt;/h3&gt;

&lt;p&gt;Governance should not depend solely on manual reviews. Wherever possible, it should be reinforced through automation.&lt;/p&gt;

&lt;p&gt;Policy as code involves defining governance rules directly in code. For example, the pipeline can automatically verify that the model card includes all required fields, performance metrics meet predefined thresholds, and bias evaluations have been completed.&lt;/p&gt;

&lt;p&gt;The following YAML snippet defines policies for a model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;required_model_card_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;model_owner&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;intended_use&lt;/span&gt;
  &lt;span class="na"&gt;min_auc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.85&lt;/span&gt;
  &lt;span class="na"&gt;max_bias_diff&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;

&lt;span class="na"&gt;on_failure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;block_promotion&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any of these requirements are not satisfied, the pipeline should fail automatically.&lt;/p&gt;

&lt;p&gt;All &lt;a href="https://launchdarkly.com/docs/home/releases/approvals" rel="noopener noreferrer"&gt;approvals&lt;/a&gt; and deployments must be recorded in an &lt;a href="https://launchdarkly.com/docs/home/releases/change-history" rel="noopener noreferrer"&gt;audit log&lt;/a&gt;. Model artifacts should be securely stored and protected with signatures or checksums to prevent tampering. In regulated environments, compliance reviews must occur before a model is allowed to serve real users. Only models that pass every governance check should be permitted to reach end users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outputs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Model approval records&lt;/li&gt;
&lt;li&gt;Audit logs of model releases&lt;/li&gt;
&lt;li&gt;Governance and compliance reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How LaunchDarkly supports the MLOps lifecycle
&lt;/h2&gt;

&lt;p&gt;LaunchDarkly can act as a runtime control plane for MLOps, helping to enable safer releases, faster iteration, and measurable improvements in production. Its capabilities map directly to several lifecycle stages covered in this article.&lt;/p&gt;

&lt;p&gt;A key practice in ML systems is the separation between deployment and release. With LaunchDarkly, teams can ship models or prompt changes behind feature flags and release them only when confidence is established. This means a new model version can be deployed to production infrastructure without any user seeing it until the flag is toggled on.&lt;/p&gt;

&lt;p&gt;For safe model rollouts, LaunchDarkly supports &lt;a href="https://launchdarkly.com/docs/home/releases/releasing" rel="noopener noreferrer"&gt;progressive delivery&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;canary releases&lt;/a&gt;. It allows teams to expose a new model version to as little as 1% of traffic and scale up gradually to 100%. Rollouts can also be targeted to specific cohorts such as internal users, particular regions, or individual tenants, giving teams fine-grained control over who experiences the new behavior.&lt;/p&gt;

&lt;p&gt;Feature flags enable dynamic control of ML functionality at runtime. A single flag can switch between Model A and Model B without redeployment, and it can provide the ability to revert to an earlier version when issues arise. Multivariate flags also allow teams to live-tune parameters such as confidence thresholds, temperature settings, top-p settings, and scoring cutoffs without changing code.&lt;/p&gt;

&lt;p&gt;The ability to quickly roll back is crucial for minimizing risk when something goes wrong. LaunchDarkly includes kill switches, which stop access to a risky model or prompt immediately without redeploying. This capability can be important during time-critical incidents.&lt;/p&gt;

&lt;p&gt;For online experimentation, LaunchDarkly supports A/B testing on real production traffic. Teams can compare model or prompt variants and measure their impact on quality metrics, latency, and cost before committing to a full rollout. The example below walks through this in detail.&lt;/p&gt;

&lt;p&gt;For GenAI and LLM applications, LaunchDarkly offers &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;Configs&lt;/a&gt;, which manage prompts, model selection, temperature, and other parameters as versioned configurations. &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;Configs&lt;/a&gt; provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;Prompt and model updates&lt;/a&gt; without redeployment&lt;/li&gt;
&lt;li&gt;Built-in metrics tracking (tokens, latency, cost per variation)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;Online Evaluations&lt;/a&gt; for automated &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;quality scoring&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Variable substitution for dynamic prompts (&lt;code&gt;{{user_tier}}&lt;/code&gt;, &lt;code&gt;{{[context](https://launchdarkly.com/docs/home/flags/contexts)}}&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more, read the &lt;a href="https://docs.launchdarkly.com/home/ai-[configs](https://launchdarkly.com/docs/home/agentcontrol/create)" rel="noopener noreferrer"&gt;AgentControl documentation&lt;/a&gt;. &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;Prompt and model updates&lt;/a&gt; can be rolled out progressively and safely, just like any other feature change.&lt;/p&gt;

&lt;p&gt;Guardrails and governance are also built in. &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;Guarded rollouts&lt;/a&gt; automatically pause or roll back changes when monitored metrics regress. Approval workflows, &lt;a href="https://launchdarkly.com/docs/home/account/roles" rel="noopener noreferrer"&gt;role-based access control&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/releases/change-history" rel="noopener noreferrer"&gt;audit logging&lt;/a&gt; can support compliance and traceability practices often required by regulated environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: A/B testing ML models with LaunchDarkly
&lt;/h3&gt;

&lt;p&gt;The following example demonstrates how LaunchDarkly feature flags can be used to A/B test two ML model versions during deployment. A &lt;a href="https://launchdarkly.com/docs/home/flags/types" rel="noopener noreferrer"&gt;string-type feature flag&lt;/a&gt; named model-version is created with two variations, model-a and model-b, and a 50%/50% rollout. Each incoming inference request is routed to one of two models based on the flag evaluation for that user.&lt;/p&gt;

&lt;p&gt;In the LaunchDarkly dashboard, create a feature flag with the following configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Name: model-version&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://launchdarkly.com/docs/home/flags/types" rel="noopener noreferrer"&gt;Flag type&lt;/a&gt;: string&lt;/li&gt;
&lt;li&gt;Variation 1: model-a (Logistic Regression)&lt;/li&gt;
&lt;li&gt;Variation 2: model-b (Random Forest)&lt;/li&gt;
&lt;li&gt;Default rule: 50%/50% rollout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both models are trained on the same dataset (Iris), so the only variable in the A/B test is the model architecture. The code is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_iris&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;iris&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_iris&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;iris&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Model A: Logistic Regression (baseline)
&lt;/span&gt;&lt;span class="n"&gt;model_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Model B: Random Forest (challenger)
&lt;/span&gt;&lt;span class="n"&gt;model_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each incoming request, the LaunchDarkly SDK evaluates the flag and returns the assigned variant. The application routes the request to the corresponding model, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sdk-YOUR-KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# For each inference request
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After simulating 200 inference requests split across both variants, you can compare accuracy and latency to determine which model to promote.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodlu4k6ugtgjrd9tsjlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodlu4k6ugtgjrd9tsjlv.png" alt="Model performance comparison for accuracy and latency by variant" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After routing requests between variants, you need to aggregate the resulting outcomes so the two models can be compared on shared evaluation metrics such as accuracy and latency. To analyze A/B test results, navigate to the experiment's &lt;a href="https://launchdarkly.com/docs/home/experimentation/results-data" rel="noopener noreferrer"&gt;Results tab&lt;/a&gt; in LaunchDarkly. The &lt;a href="https://launchdarkly.com/docs/home/experimentation/results-data" rel="noopener noreferrer"&gt;Results tab&lt;/a&gt; provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualization options: Probability density, relative difference, and arm averages graphs&lt;/li&gt;
&lt;li&gt;Statistical analysis: Probability to be best, &lt;a href="https://launchdarkly.com/docs/home/experimentation/bayesian-results" rel="noopener noreferrer"&gt;expected loss&lt;/a&gt;, and &lt;a href="https://launchdarkly.com/docs/home/experimentation/bayesian-results" rel="noopener noreferrer"&gt;confidence intervals&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Filtering: Slice results by metric, variation, or user attributes&lt;/li&gt;
&lt;li&gt;PDF export: Download results for stakeholder review
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;total_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;correct_predictions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;correct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;avg_latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mean&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;p95_latency_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percentile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Based on the results, an automated decision-making process determines whether the challenger should replace the baseline. If Model B outperforms Model A by a defined threshold, the flag default rule is updated to serve Model B to all users. If it underperforms, traffic stays on Model A.&lt;/p&gt;

&lt;p&gt;LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;Guarded Rollouts&lt;/a&gt; automate this decision-making. Configure a &lt;a href="https://launchdarkly.com/docs/home/metrics/manage-metrics" rel="noopener noreferrer"&gt;metric threshold&lt;/a&gt;, such as accuracy must not regress by more than 1%, and LaunchDarkly automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pauses the rollout if the metric degrades&lt;/li&gt;
&lt;li&gt;Rolls back to the baseline if the threshold is breached&lt;/li&gt;
&lt;li&gt;Continues the progressive rollout if metrics remain healthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No custom code is required for promotion or rollback logic.&lt;/p&gt;

&lt;p&gt;To learn more, read the &lt;a href="https://docs.launchdarkly.com/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;Guarded Rollouts documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ACCURACY_THRESHOLD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;  &lt;span class="c1"&gt;# Model B must beat Model A by at least 1%
&lt;/span&gt;
&lt;span class="n"&gt;acc_a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;acc_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;lift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;acc_b&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;acc_a&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lift&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;ACCURACY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Promote: update LaunchDarkly flag to serve model-b to 100%
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Promote Model B to full traffic.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;lift&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ACCURACY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No significant difference. Collect more data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Rollback: keep model-a as default
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Keep Model A. Model B underperforms.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Taken together, this example shows how runtime flags can separate model deployment from model release, support controlled experimentation on live traffic, and shorten rollback time when a challenger underperforms. In that sense, LaunchDarkly fits into the deployment and release-control layer of the broader MLOps lifecycle.&lt;/p&gt;

&lt;p&gt;In a live environment, automatic rollback of flag changes is possible with LaunchDarkly Guarded Rollouts. This enables automatic rollback when one of the monitored metrics regresses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extending to LLM applications
&lt;/h3&gt;

&lt;p&gt;The same progressive rollout principles apply to LLM applications, but with additional configuration dimensions. While traditional ML models require only version routing, LLMs need prompt management, temperature tuning, and provider selection. LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create" rel="noopener noreferrer"&gt;Configs&lt;/a&gt; can handle these requirements through &lt;a href="https://launchdarkly.com/docs/home/releases/percentage-rollouts" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt;, instant rollback, and automated monitoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best practices for managing the MLOps lifecycle
&lt;/h2&gt;

&lt;p&gt;A mature MLOps lifecycle connects data ingestion, feature engineering, training, deployment, and monitoring into a continuous operational loop. The objective is to make machine learning systems reliable and repeatable rather than experimental.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version and track everything: Data, code, models, and configurations should all be managed as versioned artifacts, enabling clear traceability and reproducibility of results.&lt;/li&gt;
&lt;li&gt;Automate validation and testing: Data schemas, feature transformations, and model outputs should all be validated through automated checks. Any change in code or configuration should automatically trigger validation within the CI/CD pipeline.&lt;/li&gt;
&lt;li&gt;Use feature flags and gradual rollouts: Teams should use feature flags and gradual rollouts to release new models incrementally rather than all at once. By toggling a new model behind a feature flag and gradually moving traffic over, you can monitor performance and make adjustments before fully replacing the previous version.&lt;/li&gt;
&lt;li&gt;Implement continuous monitoring: Continuous monitoring of system performance is crucial. Track key metrics such as data drift, model accuracy, system health, and infrastructure in real time. Establish alerts so issues are identified and addressed early, preventing negative impact on users.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build governance into the pipeline: Design the system with governance built in by adding &lt;a href="https://launchdarkly.com/docs/home/releases/approvals" rel="noopener noreferrer"&gt;approval workflows&lt;/a&gt;, documentation requirements, and audit logging directly into the pipeline. Include model cards and model lineage records so you have traceable evidence of the decisions made.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Structuring machine learning systems in this fashion can improve reliability, transparency, and compliance. Each step of a model's lifecycle yields clearly defined artifacts, and quality standards are enforced through automation via your pipelines. Over time, consistency between the original intent of the model, user data, and production performance creates a strong feedback loop.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mlops</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>AI Pipeline: Preventing Drift in Production Systems</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Tue, 02 Jun 2026 16:32:09 +0000</pubDate>
      <link>https://dev.to/launchdarkly/ai-pipeline-preventing-drift-in-production-systems-3k1g</link>
      <guid>https://dev.to/launchdarkly/ai-pipeline-preventing-drift-in-production-systems-3k1g</guid>
      <description>&lt;p&gt;A common failure pattern in a retrieval-augmented generation (RAG) system is a progressive decline in performance. This decline, which can be difficult for users to detect initially, often begins with a reduction in retrieval relevance. Over time, it may lead to longer response times and increasingly inaccurate, incomplete, or less helpful responses. This gradual degradation of the system's performance creates a challenging user experience.&lt;/p&gt;

&lt;p&gt;Production failures often stem from uncoordinated changes, with operators adjusting retrieval settings, reranking methods, or model routing without a shared change process. Without explicit versioning and ownership, it becomes difficult to trace which change caused a regression or who made it.&lt;/p&gt;

&lt;p&gt;This article argues that production AI pipelines, particularly RAG systems, must be designed around explicit control of change. The system must treat retrieval and prompting, evaluation, and model selection as controllable elements that people running the system must be able to modify through visible changes during active system use. The goal is not to introduce new techniques but to show how existing, well-understood methods can be composed into a production system that remains stable, measurable, and adaptable over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of core AI pipeline design considerations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Core Focus&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;th&gt;Relevant AgentControl Config Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Problem Definition&lt;/td&gt;
&lt;td&gt;Defining use case, retrieval scope, and measurable KPIs&lt;/td&gt;
&lt;td&gt;Unenforceable or missing baselines make it impossible to detect degradation later.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/quickstart" rel="noopener noreferrer"&gt;AgentControl config&lt;/a&gt; variations and tools for retrieval depth, reranker selection, and instant switching without redeployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge Grounding and Retrieval&lt;/td&gt;
&lt;td&gt;Chunking, embeddings, retrieval, GraphRAG, and reranking&lt;/td&gt;
&lt;td&gt;Uncontrolled changes to retrieval parameters are a primary source of grounding failures.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;AgentControl config variations&lt;/a&gt; for retrieval parameters (top-k, graph hops); model-specific index routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Selection and Orchestration&lt;/td&gt;
&lt;td&gt;Routing across embedding models, rerankers, and LLMs&lt;/td&gt;
&lt;td&gt;Hard-coded models make every experiment a redeployment and every failure a production incident.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;AgentControl config variations&lt;/a&gt; bundle model, prompt, and parameters atomically; &lt;a href="https://launchdarkly.com/docs/home/ai-configs/target" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt; for A/B testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Engineering and Configuration Management&lt;/td&gt;
&lt;td&gt;Versioned, parameterized prompt templates&lt;/td&gt;
&lt;td&gt;Ungoverned prompt changes are one of the fastest ways to break a functioning pipeline.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/quickstart" rel="noopener noreferrer"&gt;AgentControl config&lt;/a&gt; for prompt versioning with &lt;a href="https://docs.launchdarkly.com/sdk/features/ai-config" rel="noopener noreferrer"&gt;variable substitution&lt;/a&gt; and &lt;a href="https://docs.launchdarkly.com/home/ai-configs/target" rel="noopener noreferrer"&gt;segment-based targeting&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation and Guardrails&lt;/td&gt;
&lt;td&gt;Grounding accuracy, safety filters, and gating logic&lt;/td&gt;
&lt;td&gt;Changes without evaluation gates allow regressions to reach users undetected.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/online-evaluations" rel="noopener noreferrer"&gt;AgentControl Online Evaluations&lt;/a&gt; for accuracy, relevance, and toxicity scoring; &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; for automatic rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experimentation and Feature Flags&lt;/td&gt;
&lt;td&gt;Controlled variant testing under live traffic&lt;/td&gt;
&lt;td&gt;Without bounded exposure, pipeline variables interact in ways that are difficult to diagnose or reverse.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/create-variation" rel="noopener noreferrer"&gt;AgentControl config variations&lt;/a&gt; with &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;percentage rollouts&lt;/a&gt;; traffic allocation by segment or context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment and Rollout&lt;/td&gt;
&lt;td&gt;Separating code deployment from behavioral rollout&lt;/td&gt;
&lt;td&gt;Releasing behavior changes to all users at once amplifies the impacted scope of any regression.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;AgentControl config targeting&lt;/a&gt; with percentage rollouts; progressive exposure with instant rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost and Latency Optimization&lt;/td&gt;
&lt;td&gt;Complexity-based routing, caching, and batching&lt;/td&gt;
&lt;td&gt;Routing all requests through the highest-capability path increases cost without proportional quality gain.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://docs.launchdarkly.com/home/ai-configs/target" rel="noopener noreferrer"&gt;AgentControl config targeting rules&lt;/a&gt; for model tier routing; context-based cost optimization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring and Observability&lt;/td&gt;
&lt;td&gt;Tracking retrieval drift, grounding accuracy, and latency&lt;/td&gt;
&lt;td&gt;Early detection of issues. RAG systems degrade gradually: Regressions often only become visible after affecting end users.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl monitoring&lt;/a&gt; for per-variation metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feedback and Iteration&lt;/td&gt;
&lt;td&gt;Structured collection of user signals and error traces&lt;/td&gt;
&lt;td&gt;Continuous improvement loops. Ad hoc iteration based on intuition rather than signals leads to unpredictable system behavior.&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://launchdarkly.com/docs/home/agentcontrol/monitor" rel="noopener noreferrer"&gt;AgentControl configs with monitoring&lt;/a&gt; signals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note that “hit rate” refers to the proportion of user queries for which the retrieval layer successfully returns at least one relevant document that is subsequently used in the generated response.&lt;/p&gt;

&lt;p&gt;In practice, this architecture benefits multiple roles across the AI team. Engineers can test retrieval and model changes safely under controlled exposure, product teams can iterate on prompts through versioned configurations, and operations teams gain faster response to production regressions through automated rollback and monitoring signals.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; For demo purposes, you can set up a reference implementation of a configuration-driven RAG pipeline in a single executable environment, while in practice, production usually operates with numerous services.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The following diagram shows how these stages from the above table connect in a production RAG pipeline, each independently configurable under live traffic conditions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltfxze754hxwal3non4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fltfxze754hxwal3non4c.png" alt="Production RAG pipeline stages" width="768" height="1376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Changes such as modifying retrieval depth or enabling a reranker can be exposed to a subset of users, evaluated against grounding and latency thresholds, and automatically rolled back if regressions are detected, essentially keeping iteration safe without freezing the system.&lt;/p&gt;

&lt;p&gt;Production reliability depends on three disciplines: explicit versioning of prompts and models, continuous evaluation signals, and enforced rollback logic. Together, these prevent uncontrolled drift while enabling safe iteration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem definition: Making quality measurable before you optimize
&lt;/h2&gt;

&lt;p&gt;The quality of a production RAG pipeline is strongly influenced by how clearly the problem is defined before implementation. The problem definition directly constrains downstream design choices such as retrieval scope, evaluation metrics, latency budgets, and acceptable trade-offs across the system.&lt;/p&gt;

&lt;p&gt;Start by identifying the primary use case and pairing it with measurable KPIs: retrieval hit rate, reranker lift, citation accuracy, latency budgets, and hallucination or grounding error rates. These should be treated as configurable ranges rather than fixed standards, since acceptable thresholds differ across domains and use cases.&lt;/p&gt;

&lt;p&gt;While initial requirements may live in planning tools like Jira or Confluence, AgentControl configs elevate key parameters to operational controls, making retrieval thresholds, quality gates, and rollback triggers runtime-configurable rather than static specifications. Unlike hardcoded thresholds buried in application code, AgentControl configs surface these parameters in a dashboard where they can be adjusted, monitored, and rolled back by anyone with access, not just engineers with deployment permissions.&lt;/p&gt;

&lt;p&gt;For example, teams may externalize the thresholds for retrieval quality, reranker effectiveness, and latency as configuration flags that control enforcement and rollback. In Python, these can be combined as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NOTE:
# Teams can externalize evaluation thresholds as runtime
# configuration so enforcement logic is adjustable without
# redeployment.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fallback_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-eval-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_custom&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_custom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="n"&gt;min_retrieval_hit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_retrieval_hit_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;min_reranker_lift&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_reranker_lift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;max_latency_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_retrieval_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;measured_hit_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;min_retrieval_hit&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;measured_latency&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_latency_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Threshold violation detected; using fallback path.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fallback_response&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Continue normal pipeline execution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this model, thresholds are active policies integrated into evaluation gates and rollout controls. Changes can be exposed incrementally, measured against live metrics, and automatically rolled back when performance degrades. By treating configuration as an operational control surface rather than static settings, the pipeline remains adaptable without sacrificing production stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowledge grounding: Designing retrieval as a configurable system
&lt;/h2&gt;

&lt;p&gt;In production RAG workflows, unregulated changes to retrieval methods and performance parameters—such as chunking strategies, embedding models, graph traversal depth, or reranker settings—often degrade retrieval quality. This degradation then propagates downstream, manifesting as grounding failures during generation. To minimize this risk, the retrieval layer should be designed as a thoroughly parameterized system, encompassing chunking methods (fixed or semantic), embedding model selection, retrieval depth (top-k), GraphRAG traversal depth, reranker configuration, and context window limits.&lt;/p&gt;

&lt;p&gt;A layered retrieval approach might consist of vector-based retrieval of unstructured data, optional graph-based expansion for the improvement of relational context, and reranking for precision. A control-plane system governs the parameters exposed by each pipeline layer, making them observable, configurable, and safe to experiment with without modifying application code. In LaunchDarkly, AgentControl configs provide this control layer, storing retrieval configuration as versioned variations that can be tested incrementally and rolled back instantly. Retrieval quality remains adjustable at runtime, and the retrieval quality is not dependent on the speed of the assessment of the variants (e.g., chunk size or hop depth), since the assessment can be done with live traffic, provided that changes are gated and evaluated incrementally.&lt;/p&gt;

&lt;p&gt;Parameters such as top-k, graph-hop depth, reranker toggles, embedding model selection, and fallback behavior are governed through configuration, enabling safe iteration without redeployment. AgentControl config targeting enables instant fallback by switching which variation is served with no redeployment required. If a new retrieval strategy degrades quality, revert to the baseline variation in seconds. This means existing vector stores such as Pinecone, Weaviate, FAISS for embeddings, and Neo4j for knowledge graphs are able to continue being used.&lt;/p&gt;

&lt;p&gt;For instance, it is possible to construct a multi-layer pipeline: RAG on internal documents, GraphRAG via Neo4j for structured data, and a cross-encoder reranker. The transitions between these layers can be made configurable, allowing reranking to be enabled or disabled, embedding strategies to be adjusted, and routing logic to evolve while preserving production quality.&lt;/p&gt;

&lt;p&gt;In practice, retrieval parameters can also be managed as part of a versioned AI configuration, allowing chunking, retrieval depth, graph expansion, reranking, and index selection to evolve together under controlled rollout.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;premium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-retrieval-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_custom&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_custom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="n"&gt;chunk_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;350&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;retrieval_top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval_top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;enable_graph_rag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_graph_rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph_hops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;graph_hops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;enable_reranker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_reranker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;e5-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reranker_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reranker_model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross-encoder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Note: Switching embedding models requires separate precomputed indexes
&lt;/span&gt;    &lt;span class="c1"&gt;# per model. The configuration should control both the embedding model
&lt;/span&gt;    &lt;span class="c1"&gt;# and the index being queried.
&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: Increasing retrieval_top_k sends more retrieved content downstream,
&lt;/span&gt;    &lt;span class="c1"&gt;# which can increase token usage, cost, and context-window pressure.
&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;vec_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieval_top_k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;graph_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;enable_graph_rag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;graph_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;neo4j_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;graph_hops&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# In production, graph expansion should apply relevance filtering
&lt;/span&gt;    &lt;span class="c1"&gt;# or weighted merging to avoid flooding the context window.
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec_results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;graph_results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;enable_reranker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reranker_model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply context limits/fallbacks as needed
&lt;/span&gt;&lt;span class="n"&gt;final_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;trim_to_context_budget&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this model, retrieval behavior becomes a controlled surface rather than a static implementation detail. Teams can enable or disable GraphRAG, adjust traversal depth, swap embedding models, or toggle rerankers safely while monitoring grounding accuracy and latency. This configuration-driven approach keeps retrieval flexible without sacrificing production stability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model selection and orchestration
&lt;/h2&gt;

&lt;p&gt;Hard-coding embedding models, rerankers, or LLMs directly into orchestration logic is a common anti-pattern in production AI systems that is easy to trace. The case becomes even more difficult if the embedding model, reranker, or chat model is hard-coded, for then every experiment becomes a redeployment. This will not only slow down the learning process but also increase the risk at the same time.&lt;/p&gt;

&lt;p&gt;Model selection should be treated as a routing process rather than a one-time selection decision. AgentControl configs bundle model, prompt, temperature, and max_tokens as a single versioned configuration. When you switch variations, all parameters change atomically, reducing the risk of mismatched model/prompt combinations that can occur when using separate flags for each parameter. In practice, this means that each request is dynamically routed to a model variant based on configuration, traffic allocation, or runtime signals rather than binding the pipeline to a single hard-coded model. The pipeline is asking for “an embedding model” or “a chat model” all the time. Model selection is governed through configuration rather than hard-coded API calls.&lt;/p&gt;

&lt;p&gt;In LaunchDarkly, AgentControl config bundles the model, prompt, temperature, and max_tokens as a single versioned variation. When a variation changes, these parameters update atomically, eliminating the risk of mismatched configurations and allowing traffic allocation and fallback behavior to be controlled safely. Fallbacks can be triggered by concrete conditions such as degradation in grounding accuracy, violations of latency budgets, elevated error rates, or failed evaluation checks, allowing the pipeline to revert to a known-stable model automatically.&lt;/p&gt;

&lt;p&gt;The following simplified example illustrates how model routing can be externalized through configuration. Rather than binding the pipeline to a specific chat model, the active variant is selected at runtime based on a configuration flag, enabling controlled experimentation and safe fallback behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Illustrative example showing configuration-driven model routing
# using LaunchDarkly AgentControl config.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluation context used for targeting and experiments
&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Retrieve the full AI configuration (model, prompt, parameters)
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_openai_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxTokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AgentControl configs separate model selection from application code entirely. The pipeline requests a configuration, and AgentControl config returns the complete model setup based on targeting rules, enabling A/B tests, gradual rollouts, and instant rollback without code changes.&lt;/p&gt;

&lt;p&gt;Controlled experiments can be conducted behind this one banner. For example, a Mistral or LLaMA-based deployment can be given just 5% of the total traffic while the baseline continues to be unaffected. Experimentation primitives such as traffic allocation, targeting, and instant kill switches support safer operation in production when combined with proper evaluation signals, monitoring, and rollback discipline, but they do not replace sound system design or operational oversight.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt engineering and configuration management
&lt;/h2&gt;

&lt;p&gt;Prompts are not constant resources; they move with changes in requirements, the evolution of data, and the appearance of edge cases. A lack of governance, coupled with changing prompts, is one of the quickest methods to cause the uprooting of a perfectly functioning pipeline.&lt;/p&gt;

&lt;p&gt;AgentControl configs store prompts as versioned configurations in LaunchDarkly rather than in application code. Prompts support variable substitution such as &lt;code&gt;{{context}}&lt;/code&gt; and &lt;code&gt;{{user_tier}}&lt;/code&gt;, and the template structure, variable slots, and active prompt variants can all be versioned and selected at runtime. This allows teams to test prompt variants, compare outcomes, and restore previous versions when needed. The following simplified example shows how a prompt variant might be selected through configuration at runtime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-assistant-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;grounded_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;premium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# Variable substitution
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prompt is stored in LaunchDarkly, not in code.
# Variables like {{context}} and {{user_tier}} are substituted automatically.
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method allows for structured testing, selecting specific users to expose to the new feature, and quickly going back to the previous version, thus harmonizing prompt iteration with the deployment discipline already established for code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation and guardrails
&lt;/h2&gt;

&lt;p&gt;During evaluation, configuration values remain in effect, but the focus shifts from the configuration itself to measurable attributes of system behavior, such as grounding quality, latency, and safety-related metrics. Changes in retrieval, prompts, or models should be governed by both objective metrics and qualitative evaluation signals.&lt;/p&gt;

&lt;p&gt;Objectively speaking, the correctness of grounding, the time taken, and the accuracy of citations are among the measures applied. Relevance and helpfulness are typically assessed through LLM-as-judge patterns, an approach popularized by tools such as OpenAI Evals and Patronus.&lt;/p&gt;

&lt;p&gt;AgentControl includes built-in Online Evaluations that allow teams to attach judges for metrics such as accuracy, relevance, and toxicity to any variation. Sampling rates can be configured, and the resulting scores appear in the Monitoring dashboard alongside operational metrics such as latency and cost. These signals should be regarded as indicators of relative change rather than absolute truths. AgentControl displays them per variation, making it easy to compare whether a variant actually outperforms the baseline without building custom analytics. When used together through Guarded releases, they drive gating decisions automatically, pausing rollout exposure or triggering rollback when quality thresholds are violated without requiring manual intervention.&lt;/p&gt;

&lt;p&gt;Safety evaluation typically focuses on detecting risks related to personally identifiable information (PII), toxicity, and compliance violations. Deterministic detectors such as Presidio are often used alongside probabilistic classifiers and cloud DLP services to reduce false negatives. In addition, evaluation systems can attach automated judges to monitor safety signals. For example, AgentControl Online Evaluations can apply toxicity judges to sampled responses and surface the results in monitoring dashboards.&lt;/p&gt;

&lt;p&gt;The following simplified example illustrates how evaluation signals can be computed by the application and emitted as events to support configuration-driven gating decisions. In this pattern, scoring logic remains inside the application, while promotion or rollback behavior is governed through configurable rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_and_gate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;variant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compute_eval_scores&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pii_risk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_pii_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# AI SDK provides automatic tracking of tokens, duration, and errors.
&lt;/span&gt;    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_openai_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Optional: track additional custom metrics.
&lt;/span&gt;    &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag_quality_metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;variant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;variant_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pii_risk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pii_risk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Configuration-driven gating.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mf"&gt;0.88&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_rollback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fallback_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Evaluation signals are computed by the application and emitted as events to support configuration-driven gating decisions. When grounding accuracy falls below the configured threshold, guarded rollouts automatically pause the variant and restore the baseline, without requiring manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experimentation and feature flags
&lt;/h2&gt;

&lt;p&gt;As soon as evaluation and guardrails are implemented, experimentation ceases to be treated as such and is instead fully integrated within the system's daily cycle. The state of the pipeline at this stage is not “trying out methods and praying for the best” but an incessant, subtle, and well-managed learning process.&lt;/p&gt;

&lt;p&gt;In practice, RAG-based experimentation is rarely isolated to a single variable. Adjustments in one area often influence others. For example, increasing retrieval depth changes the volume of context supplied to the model, graph traversal affects which documents are visible, rerankers modify relevance ordering, prompt changes alter tone and structure, and switching models impacts latency and cost. These dimensions interact, which makes controlled experimentation and careful gating essential.&lt;/p&gt;

&lt;p&gt;AgentControl configs make these interactions explicit and controllable. Each variation represents a complete configuration—model, prompt, parameters, and tools—that can be tested against others under controlled traffic allocation, allowing multiple variables to evolve under bounded exposure. Traffic allocation and evaluation thresholds are managed through AgentControl configs, while Guardian-guarded rollouts enforce rollback conditions automatically when metrics indicate regression. The pipeline decides at runtime which choices to make instead of sending out a new deployment every time there is an idea to be tested. The code remains unchanged; only the behavior changes.&lt;/p&gt;

&lt;p&gt;Each configuration change becomes a small, bounded experiment with a clearly defined blast radius and rollback path. Every meaningful decision in the pipeline is externalized to AgentControl configs. Multiple variations evolve safely under controlled exposure, with built-in metrics showing which performs better, no custom instrumentation required. From the application's perspective, this process is straightforward: On each request, it simply retrieves the active configuration and executes accordingly.&lt;/p&gt;

&lt;p&gt;The following simplified example demonstrates how multiple pipeline parameters can be externalized as configuration variables. Rather than hard-coding retrieval depth, graph traversal limits, reranker activation, prompt versions, or model variants, these values are resolved at runtime, enabling controlled experimentation and gradual rollout. In this pattern, retrieval depth, graph expansion, reranking behavior, and model selection are resolved together as part of a versioned AI configuration rather than managed as unrelated flags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rag-pipeline-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;custom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_custom&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_custom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieval_top_k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph_hops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;graph_hops&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;enable_reranker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;custom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_reranker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;model_variant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this pattern, experimentation occurs by adjusting configuration values and traffic allocation rather than modifying orchestration logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; No redeployment is required to adjust retrieval depth, switch prompt variants, or test new models.&lt;/p&gt;

&lt;p&gt;Configuration determines runtime behavior, while evaluation metrics determine whether those changes persist. For example, a config with two variations might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variation A (Baseline):&lt;/strong&gt; GPT-4o-mini, temperature 0.3, concise prompt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variation B (Experimental):&lt;/strong&gt; Claude 3 Haiku, temperature 0.5, detailed prompt with citations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use percentage rollouts to send 10% of traffic to Variation B, then compare token cost, latency, and quality metrics in the LaunchDarkly dashboard before promoting.&lt;/p&gt;

&lt;p&gt;The diagram below illustrates how an experiment progresses from limited exposure to promotion or rollback based on measurable thresholds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazsrl9ts8kyyuysufo66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazsrl9ts8kyyuysufo66.png" alt="Decision tree for experiment promotion or rollback" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, experimentation becomes a routine, low-risk operational activity rather than an ad hoc process with uncertain production impact. Variants are promoted only when metrics validate improvement; otherwise, rollback restores the baseline automatically. Decisions are governed by thresholds and enforced by configuration logic, not by informal coordination or manual caution.&lt;/p&gt;

&lt;p&gt;In practice, this shifts how teams manage risk. Engineers can test ideas earlier and under real traffic, while product teams receive measurable feedback instead of speculation. When regressions occur, as they inevitably will, the system absorbs them predictably through rollback mechanisms rather than escalating into production incidents.&lt;/p&gt;

&lt;p&gt;This is not experimentation for its own sake. It is controlled exposure, enforced by configuration and measurable thresholds rather than personal discipline alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment and rollout
&lt;/h2&gt;

&lt;p&gt;In AI pipelines, notably RAG systems, deployment and rollout are the key operational milestones that support controlled execution and minimize the risk of production disruptions. Deployment refers to changes in code or infrastructure. Rollout, by contrast, refers to the controlled exposure of new behavior in production, such as introducing new models or configuration variants gradually to reduce risk.&lt;/p&gt;

&lt;p&gt;Separating deployment from rollout allows teams to validate behavior changes incrementally under real traffic conditions. Take, for instance, a new LLM version rollout: Start with a small percentage of traffic and track grounding correctness and latency. If metrics remain healthy, exposure can be increased. If issues emerge, rollback is immediate and configuration-driven.&lt;/p&gt;

&lt;p&gt;LaunchDarkly controls these rollouts through &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/target" rel="noopener noreferrer"&gt;AgentControl Configs percentage-based targeting&lt;/a&gt; and segment rules, with &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;Guardian guarded rollouts&lt;/a&gt; that automatically pause or roll back based on quality signals. Exposure is increased only when metrics confirm the new variation is safe. Actual deployment is handled by infrastructure tools such as Kubernetes or Docker, while LaunchDarkly acts as the rollout control layer, managing fallback routes and ensuring configuration consistency across services. For example, retrieval config updates might be first shown to beta users and later on rolled out based on performance.&lt;/p&gt;

&lt;p&gt;Here is a Python example that uses the LaunchDarkly SDK to manage the rollouts dynamically in your pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;

&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_rag_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;segment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;completion_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm-rollout-config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;variables&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Model, prompt, and parameters are versioned together.
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_openai_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;log_performance_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fallback_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Close ld_client in your application's shutdown hook, not here.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup allows rollout percentages, targeting rules, and cohort segmentation to be adjusted directly through configuration. Exposure can be increased incrementally under live traffic, while evaluation metrics determine whether promotion continues or rollback is triggered.&lt;/p&gt;

&lt;p&gt;Separating deployment from rollout ensures that behavioral changes are introduced gradually and reversibly, reducing production risk while maintaining iteration speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost and latency optimization
&lt;/h2&gt;

&lt;p&gt;Complex queries usually call for the use of models of higher capability to satisfy the quality requirement, which usually means longer latency and more cost. On the other hand, many requests can be handled with lower-cost paths when quality signals remain within acceptable limits.&lt;/p&gt;

&lt;p&gt;This is fundamentally about dynamic optimization. Possible strategies include routing queries based on complexity, caching frequently accessed results, adjusting retrieval depth, and batching requests where appropriate. These techniques aim to balance accuracy, latency, and cost rather than optimizing any one dimension in isolation.&lt;/p&gt;

&lt;p&gt;In practice, routing policies and cost controls can be governed through AgentControl configs targeting rules. Requests can be routed based on attributes such as user tier, region, or query characteristics. For example, complex queries may be sent to a larger model such as GPT-4 while simple lookups are routed to a smaller model, and different retrieval or reranking strategies can be applied to different traffic segments, all without modifying application code.&lt;/p&gt;

&lt;p&gt;A complexity classifier provides a practical example of cost-aware routing in production AI systems. Rather than routing all requests to the most expensive model and deepest retrieval path, the system evaluates incoming queries and dynamically selects an appropriate tier.&lt;/p&gt;

&lt;p&gt;Routing decisions are not hard-coded. Instead, configuration flags determine tier selection and threshold limits. This allows routing behavior to evolve safely under live traffic without redeployment. In practice, complexity classifiers should be treated as heuristics and continuously calibrated using production metrics.&lt;/p&gt;

&lt;p&gt;Below is an illustrative example of cost- and latency-aware routing controlled through configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NOTE:
# Illustrative example showing cost- and latency-aware routing via configuration.
# The complexity heuristic below is a placeholder and must be adapted
# to your domain, metrics, and production requirements.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Placeholder heuristic for illustration only.
&lt;/span&gt;    &lt;span class="c1"&gt;# Production systems usually consider richer signals such as:
&lt;/span&gt;    &lt;span class="c1"&gt;# - keyword density or semantic difficulty
&lt;/span&gt;    &lt;span class="c1"&gt;# - question structure and reasoning depth
&lt;/span&gt;    &lt;span class="c1"&gt;# - historical user interaction patterns
&lt;/span&gt;    &lt;span class="c1"&gt;# - observed quality or latency metrics
&lt;/span&gt;    &lt;span class="c1"&gt;# Any routing heuristic should be validated against production
&lt;/span&gt;    &lt;span class="c1"&gt;# evaluation metrics before broad rollout.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_optimized_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_context_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Configuration-controlled routing decision.
&lt;/span&gt;    &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;complexity_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complexity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;query_complexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;complexity_threshold&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Select model based on resolved tier.
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;small_llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tier&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_llm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Optional escalation path if quality signals fall below tolerance
&lt;/span&gt;    &lt;span class="c1"&gt;# (for example retrying with a larger model if grounding or confidence checks fail).
&lt;/span&gt;
    &lt;span class="nf"&gt;log_optimization_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In many production deployments, routing decisions can also be managed directly through AgentControl config targeting rules. For example, different model tiers may be served based on user subscription level, geographic region, or environment without requiring custom routing logic in application code.&lt;/p&gt;

&lt;p&gt;This pattern allows teams to adjust routing tiers, thresholds, and fallback behavior dynamically. Lightweight requests can be served at lower cost and latency, while complex queries are automatically escalated to higher-capability paths. Performance and quality remain observable and controllable through configuration.&lt;/p&gt;

&lt;p&gt;The diagram below illustrates how routing decisions move between lightweight and high-capability paths based on configuration and runtime signals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezf02b2txkii9zec1ez9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezf02b2txkii9zec1ez9.png" alt="Workflow for lightweight and high-capability routing paths" width="768" height="1376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lightweight path (small model with shallow retrieval) and the high-capability path (large model with deep retrieval) converge before caching, batching, and final response generation. This ensures that cost optimization does not fragment the delivery pipeline and that observability remains consistent across tiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring and observability
&lt;/h2&gt;

&lt;p&gt;In every production RAG pipeline, performance monitoring is a must-do process that helps you discover issues before they get out of control. To do this, it is necessary to monitor the most important metrics, which include retrieval drift, grounding accuracy, hallucination rates, and related reliability signals. Latency spikes and the overall health of the API will be monitored as well. In practice, these metrics should be tracked over time and analyzed across percentiles (e.g., p95, p99) rather than relying solely on averages.&lt;/p&gt;

&lt;p&gt;These observability signals will then be incorporated into your control plane, for instance, LaunchDarkly, to automate actions such as rollbacks or switching to safe-mode configurations whenever things go wrong. Observability provides the signals; AgentControl configs enforce the decisions. When grounding accuracy drops below the threshold, AgentControl configs can automatically shift traffic back to the stable variation without waiting for manual intervention.&lt;/p&gt;

&lt;p&gt;Why invest in continuous monitoring? Because RAG systems are inherently dynamic: models evolve, underlying data shifts, and user behavior changes across contexts. These factors can gradually degrade performance in ways that are not immediately visible. Continuous monitoring is necessary because regressions often become apparent only after affecting end users. Early detection of hallucinations or accuracy drops allows issues to be addressed proactively, improving reliability without constant manual intervention.&lt;/p&gt;

&lt;p&gt;Telemetry data can be collected through vendor-neutral observability frameworks such as OpenTelemetry or through an organization's internal monitoring infrastructure. AgentControl configs also provide a built-in monitoring dashboard that surfaces variation-level metrics automatically, allowing teams to compare model and prompt performance across experiments without additional instrumentation. The dashboard reports metrics such as token usage, cost, latency, and quality scores for each configuration variant. These metrics can then be used by rollout controls to suspend, promote, or roll back configurations when thresholds fall outside acceptable ranges.&lt;/p&gt;

&lt;p&gt;For example, if the performance of the grounding deteriorates, LaunchDarkly will immediately unmask the experimental model and pull the reranker back to reliable settings, depending solely on the live data. The intervention should be driven by predefined thresholds and rules defined per metric rather than by a single global criterion, and not by unpredictable human interference.&lt;/p&gt;

&lt;p&gt;Shown below is a Python snippet that demonstrates feeding metrics into LaunchDarkly for decision-making purposes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NOTE:
# Illustrative example showing how observability signals can be
# used to drive configuration-based rollback decisions.
# Telemetry is collected by external monitoring systems
# and evaluated against configuration thresholds.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_current_metrics&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Placeholder for real telemetry collected through
&lt;/span&gt;    &lt;span class="c1"&gt;# observability systems such as OpenTelemetry.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hallucination_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;monitor_and_adjust&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_current_metrics&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;accuracy_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grounding_accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;accuracy_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Configuration-driven rollback decision.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_rollback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;revert_to_stable_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Metrics are typically emitted to observability systems
&lt;/span&gt;    &lt;span class="c1"&gt;# and evaluated alongside AgentControl monitoring dashboards.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When AI requests are executed through the AI SDK, operational metrics such as token usage, cost, latency, and success rates are captured automatically using the track_openai_metrics() instrumentation. These signals appear in the AgentControl monitoring dashboard alongside evaluation scores and configuration variations, enabling teams to compare model and prompt performance without building custom analytics pipelines.&lt;/p&gt;

&lt;p&gt;Using observability data to inform LaunchDarkly controls makes your pipeline adaptive. Teams get traceability, faster incident response, and safer production iteration by clearly separating metric computation from configuration enforcement.&lt;/p&gt;

&lt;p&gt;To help you visualize, here's a flowchart of the observability pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82rmkyaf4oef1thioef5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82rmkyaf4oef1thioef5.png" alt="Flowchart of the observability pipeline" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observability metrics feed directly into configuration-driven gating rules, enabling automatic rollback or promotion without manual intervention. The control plane does not compute metrics itself; it enforces decisions based on thresholds defined in configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feedback and iteration
&lt;/h2&gt;

&lt;p&gt;Closing the RAG pipeline loop involves activating structured user feedback, such as ratings, error logs, and usage traces, and feeding these signals back into retrieval, prompt, or model adjustments. AgentControl configs then control the release of refined configurations. New prompt variations can be tested on a small percentage of traffic, evaluated against satisfaction metrics, and promoted only when signals confirm improvement, ensuring that they are tested under limited exposure before broader rollout. This promotes a never-ending process of evolution with closely-knit feedback loops, allowing you to quickly iterate while still having a solid production environment, without requiring broad, unmanaged production changes.&lt;/p&gt;

&lt;p&gt;In practice, these feedback signals can be sourced from RLHF-derived signals, user feedback systems, satisfaction metrics, or built-in telemetry. LaunchDarkly coordinates the rollout of the updates, managing the exposure or reversions depending on the feedback received.&lt;/p&gt;

&lt;p&gt;Feedback signals such as lower satisfaction scores, increased clarification requests, or higher error rates indicate that users struggle with the new prompt format. The new prompt variant can be kept in evaluation-only mode, allowing time to refine and retest through online evaluations, automated or semi-automated assessments run on live or shadow traffic, before any wider rollout.&lt;/p&gt;

&lt;p&gt;Here's a Python snippet to integrate feedback signals with LaunchDarkly for adaptive rollouts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NOTE:
# Illustrative example showing how aggregated feedback signals
# can influence configuration-driven rollout decisions.
# Feedback signals are computed by the application and evaluated
# against thresholds managed through configuration.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_feedback&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Placeholder for aggregated feedback signals.
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satisfaction_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_sdk_key&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_feedback_and_iterate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;feedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user_feedback&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;satisfaction_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satisfaction_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mf"&gt;0.80&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satisfaction_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;satisfaction_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Promote variant only when feedback trends meet acceptance criteria.
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;variation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;promote_variant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;rollout_updated_variant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Keep variant in evaluation mode or rollback.
&lt;/span&gt;        &lt;span class="nf"&gt;revert_to_baseline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Track user feedback with AI SDK.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;satisfaction_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;satisfaction_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_feedback&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_feedback&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;negative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method used here makes the iterations both data-driven and lower-risk. Teams blend the feedback with the same setup and launch discipline as other pipeline modifications, thus preventing ad hoc decision-making and ensuring that the system behaves predictably even while it is being continuously evolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last thoughts
&lt;/h2&gt;

&lt;p&gt;Production AI systems rarely fail because a model is imperfect; they fail because change is unmanaged. In RAG pipelines, even small adjustments to retrieval depth, reranking logic, prompt structure, or model routing can compound quickly under real traffic. When those decisions are embedded directly in code, iteration becomes slow, risky, and difficult to reverse.&lt;/p&gt;

&lt;p&gt;The goal is not to freeze behavior but to externalize it. AgentControl configs provide that external control surface: versioned configurations, percentage rollouts, automatic metrics, and instant rollback. When configuration, experimentation, evaluation, and rollout are treated as first-class architectural concerns, change becomes measurable and reversible. Teams can introduce improvements incrementally, observe their impact under real conditions, and roll back regressions without destabilizing the system.&lt;/p&gt;

&lt;p&gt;Unlike general-purpose feature flag workflows, AgentControl configs are designed specifically for runtime AI configuration, combining prompt versioning, automatic metrics tracking, and online evaluations in a single control surface. Unlike broader MLOps platforms, it focuses on operational behavior in production rather than training pipeline management.&lt;/p&gt;

&lt;p&gt;The more dynamic and compositional AI systems become, the more valuable controlled change becomes. Production RAG is not about finding a perfect configuration. It is about building a system that can evolve safely, intentionally, and continuously. AgentControl configs make this practical, giving teams a single place to manage model selection, prompt engineering, and quality evaluation, with the safety nets needed for production AI systems. Get started with the &lt;a href="https://launchdarkly.com/docs/home/agentcontrol/quickstart" rel="noopener noreferrer"&gt;AgentControl Quickstart&lt;/a&gt; or explore the &lt;a href="https://launchdarkly.com/docs/sdk/ai/python" rel="noopener noreferrer"&gt;Python AI SDK&lt;/a&gt; for implementation examples.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The standard advice on handling bias in tech is wrong</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Tue, 19 May 2026 18:19:19 +0000</pubDate>
      <link>https://dev.to/sattensil888/the-standard-advice-on-handling-bias-in-tech-is-wrong-359l</link>
      <guid>https://dev.to/sattensil888/the-standard-advice-on-handling-bias-in-tech-is-wrong-359l</guid>
      <description>&lt;p&gt;Every woman who has worked in tech long enough has gotten the same three pieces of advice. Confront the behavior in the moment. Build an allyship network. Escalate to HR if it crosses a line.&lt;/p&gt;

&lt;p&gt;I have tried all three. The first one cost me a working relationship I needed. The second one created a debt I never asked for. The third one put my name on a list and changed nothing about the person who did the thing.&lt;/p&gt;

&lt;p&gt;This post is about what I do instead.&lt;/p&gt;

&lt;p&gt;To be clear about scope: I am a Senior Developer Educator who has worked across data science, AI/ML, and developer education at multiple companies. My experience covers technical IC work and content-focused roles. If you are a manager of a 200-person org, the calculus may be different. If you are early career and still building a reputation, parts of this will not apply yet.&lt;/p&gt;

&lt;p&gt;Here is what I have learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Confronting in the moment makes you the problem
&lt;/h2&gt;

&lt;p&gt;The advice goes like this. Someone interrupts you, takes credit for your work, or makes a comment that lands wrong. You name it on the spot. "I was still speaking." "That was my idea from the sync last Tuesday." Direct, calm, professional.&lt;/p&gt;

&lt;p&gt;The intent is good. The math does not work.&lt;/p&gt;

&lt;p&gt;Confrontation in the moment costs you the room. Whatever you were trying to accomplish in that meeting is now derailed. You become the person who made the meeting awkward. The other party almost never says "you are right, I apologize." They explain, they soften, they reframe. The meeting moves on. You have spent your credibility to teach a lesson nobody asked to learn.&lt;/p&gt;

&lt;p&gt;Worse, it lodges in the calibration discussion. Six months later, when someone is evaluating you for a stretch project, the data point that surfaces is "she can be difficult in meetings." Not the original behavior. Your response to it.&lt;/p&gt;

&lt;p&gt;I am not saying never push back. I am saying the price is real and it falls on you, not on the person who caused the problem. Spend that currency on the things that matter. The interruption in a five-person room where the work product is at stake is worth a response. The interruption in a 30-person all-hands is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Allyship asks create the wrong dependency
&lt;/h2&gt;

&lt;p&gt;The advice here goes like this. Find a senior man who will amplify your voice in meetings, vouch for your ideas, interrupt the interrupter on your behalf. Allies are powerful. Find them.&lt;/p&gt;

&lt;p&gt;I have had good allies. They have made real differences. The problem is the structure.&lt;/p&gt;

&lt;p&gt;When your visibility in a room depends on a man being in the room and choosing to use his voice for you, you have outsourced your authority. The ally goes on vacation. The ally moves to a different team. The ally has his own promotion cycle and is suddenly less available. Your visibility goes with him.&lt;/p&gt;

&lt;p&gt;There is also a quieter cost. Performative allyship is its own ecosystem. Some allies advocate in ways that benefit their own brand more than they shift the dynamic. They make a visible show of speaking up, the room notices the ally, and you become a supporting character in someone else's narrative arc. Your idea becomes the thing the ally championed.&lt;/p&gt;

&lt;p&gt;The version of this that works is mutual. You have a peer or near-peer who knows your work cold, and you do the same for them, and you happen to mention each other's contributions because you have actually been in the trenches together. That is not allyship as it is sold. That is normal professional credit, exchanged between people who respect each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  HR works for the company
&lt;/h2&gt;

&lt;p&gt;This one is the hardest to say out loud, because it sounds cynical and people want to believe HR is on their side.&lt;/p&gt;

&lt;p&gt;HR exists to manage legal and reputational risk for the company. When that risk aligns with protecting you, they will protect you. When it does not, they will manage you. Both things can be true of the same HR person on the same day.&lt;/p&gt;

&lt;p&gt;When you escalate to HR, three things happen at once. You go on a list. The other party gets coached, usually lightly. Your manager finds out, even if you were told it would be confidential, because your manager has to be looped in for any next-round decisions that involve you. The dynamic with your manager is permanently different from that moment on. They are now managing both you and a known issue, and they are evaluating which one is easier to make go away.&lt;/p&gt;

&lt;p&gt;This does not mean never go to HR. It means go to HR with clear eyes. Use it when the behavior is severe enough that you would leave over it anyway, or when you need a paper trail because you have already decided to sue. Do not use it as the first or second line of response. The cost-benefit only works at the extremes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually do
&lt;/h2&gt;

&lt;p&gt;The replacement playbook is less dramatic. It is also more effective, at least for me.&lt;/p&gt;

&lt;h3&gt;
  
  
  Document everything for yourself
&lt;/h3&gt;

&lt;p&gt;I keep a private file. Date, what happened, who was there, my response if any. I do not write it for HR or for a future case. I write it so that I can see patterns.&lt;/p&gt;

&lt;p&gt;Gender dynamics are gaslighting-prone. The behavior is often plausibly deniable in any single instance. You start to wonder if you are imagining it. The file makes that question answerable. 12 interruptions across four meetings with the same person is a pattern. One is a moment.&lt;/p&gt;

&lt;p&gt;The file also helps you decide. When I review it monthly, I notice which incidents still feel hot and which ones I have moved past. The ones still hot get attention. The ones I have moved past taught me they were not worth the calories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build influence outside the orbit
&lt;/h3&gt;

&lt;p&gt;The most powerful move I know is to make your reputation depend on people who are not in the dynamic. Skip-levels. Adjacent teams. External community. Public work.&lt;/p&gt;

&lt;p&gt;If your visibility comes from your manager and your immediate peers, then any bias in that small group has outsized weight. If your visibility also comes from a talk you gave at a conference, a tutorial that is the top result for a search query, a working relationship with a VP two levels up, that small group's opinion is one data point among many.&lt;/p&gt;

&lt;p&gt;This is also the highest-value move you can make for your career independent of any bias. The bias problem and the career problem have the same solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose what to ignore on purpose
&lt;/h3&gt;

&lt;p&gt;Most micro-incidents are not worth a response. Picking your battles is not surrender. It is portfolio management.&lt;/p&gt;

&lt;p&gt;The principle I use: respond to things that have ongoing operational cost, ignore things that are purely social. Someone taking credit for your work in front of your skip-level has operational cost. Someone explaining your own field to you at a happy hour does not. The first shapes future decisions about your work. The second is a bad time you can walk away from.&lt;/p&gt;

&lt;p&gt;The energy you save here is what you spend on the work that compounds. Shipping the thing. Writing the post. Giving the talk. The work is what creates room to maneuver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leave well, not loud
&lt;/h3&gt;

&lt;p&gt;Sometimes the answer is to go.&lt;/p&gt;

&lt;p&gt;When that is the answer, leaving well matters more than leaving loud. A loud exit feels righteous. It also makes every future hiring manager nervous. Leaving well means quiet, on your timeline, with the relationships you want to keep intact. Tell the people who matter to you that you are going. Do not tell the people who caused the problem anything they have not earned.&lt;/p&gt;

&lt;p&gt;The best revenge, if you want to think of it that way, is to show up two years later with a better title at a better company and have your work cited in the room you left. The loud exit forecloses on that future. The quiet exit keeps it open.&lt;/p&gt;

&lt;h2&gt;
  
  
  The underlying trade
&lt;/h2&gt;

&lt;p&gt;The standard advice positions the woman experiencing bias as the agent of correction. Confront him. Recruit allies for yourself. Escalate up the chain. All of it is work, all of it is risky, all of it is uncompensated.&lt;/p&gt;

&lt;p&gt;The replacement is to stop trying to fix the dynamic and start building standing that makes the dynamic less relevant. None of this fixes the structural problem. The structural problem will outlast any individual woman's career strategy. What this does is keep you operating at full capacity inside an imperfect system, while the slow work of fixing the system happens at the pace it happens.&lt;/p&gt;

&lt;p&gt;That is the trade I have made. Your trade may look different.&lt;/p&gt;

</description>
      <category>career</category>
      <category>careerdevelopment</category>
      <category>hr</category>
      <category>wecoded</category>
    </item>
    <item>
      <title>The second-order cost of a(nother) layoff year.</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Mon, 18 May 2026 23:14:46 +0000</pubDate>
      <link>https://dev.to/sattensil888/the-second-order-cost-of-a-layoff-year-15hl</link>
      <guid>https://dev.to/sattensil888/the-second-order-cost-of-a-layoff-year-15hl</guid>
      <description>&lt;p&gt;Layoff years compress organizations. Scopes overlap. Lanes that were clear during a growth year start running into each other. Most of the resulting friction works itself out. Some of it doesn't.&lt;/p&gt;

&lt;p&gt;than 300 tech layoff events. Meta is reportedly preparing to cut around 8,000 employees, Oracle reportedly sent 6 a.m. layoff notices in a restructuring that could affect up to 30,000 roles, and Microsoft has launched what GeekWire reports is the first voluntary retirement program in its 51-year history. Coinbase, Block, Cisco, and others have framed recent cuts around AI, flatter teams, and reallocating resources toward higher-growth work.&lt;/p&gt;

&lt;p&gt;A market this tight raises the stakes on every working relationship. In a year where leadership is openly asking who can do whose job, that overlap becomes a resource question. When one party works to resolve it collaboratively and the other works to win it, the cost lands on whoever is doing the accommodating.&lt;/p&gt;

&lt;p&gt;The thing to know up front is that the asymmetry doesn't close by trying harder on your side. The other party isn't trying to find common ground. They're building a case. Engaging more just gives them better material.&lt;/p&gt;

&lt;h3&gt;
  
  
  What works is borrowing from infrastructure.
&lt;/h3&gt;

&lt;p&gt;In systems work, there's a concept called blast radius. It describes how far damage spreads when something breaks. A blast radius of one means a single service goes down. A blast radius of everything means a misconfigured deploy takes the whole company offline. The job is to design things so failures stay small.&lt;/p&gt;

&lt;p&gt;The same logic applies when someone in your org has set you in their sights. You aren't trying to repair anything. You're trying to keep the damage local.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shrink the surface
&lt;/h3&gt;

&lt;p&gt;In security, attack surface is the sum of every place a system can be reached, queried, or pulled into. The first move is to make yours smaller without making the change obvious.&lt;/p&gt;

&lt;p&gt;This can mean cutting things you used to think were good for your career. Impromptu coffees. The drop-in DM that turns into a thirty-minute therapy session about middle management. Pulled-aside-after-the-meeting conversations. None of it should be bad on its own but all of it can be edited and reframed to fit their negative narrative.&lt;/p&gt;

&lt;p&gt;Your early drafts need to stop going into shared docs. Personal stuff should stop showing up in team channels. The hobbies, the weekend plans, the half-joke about a manager. All of it is data, and the safer working assumption is that anything shared at work could end up in front of someone you haven't met.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make everything written
&lt;/h3&gt;

&lt;p&gt;This single move does more work than anything else on this list.&lt;/p&gt;

&lt;p&gt;Slack threads. Doc comments. Ticket updates. Email when it has to be. Anything but the hallway conversation and the unscheduled call.&lt;/p&gt;

&lt;p&gt;The first reason is the obvious one. If things escalate later, you'll need a record, and the record will not exist if you talked it out. &lt;/p&gt;

&lt;p&gt;The second reason matters more. A lot of what gets said in person doesn't survive being typed. You can almost feel the person editing themselves once they know they're on the record.&lt;/p&gt;

&lt;p&gt;When someone pushes for a quick call or wants to grab you for five minutes, redirect politely. "Happy to look once you send it over." "Can you drop the question in the ticket so I can give it real attention." Then go back to whatever you were doing.&lt;/p&gt;

&lt;p&gt;This isn't being difficult. It's insisting on the baseline professional communication you'd ask of any other cross-team colleague. The fact that this one finds it inconvenient is sort of the point.&lt;/p&gt;

&lt;p&gt;Side effect worth knowing: if they refuse to put anything in writing, that's information too. Anyone who only wants to engage off the record is telling you what kind of conversation they want to be having.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stop arguing with the framing
&lt;/h3&gt;

&lt;p&gt;Problematic colleagues like to offer you stories about who you are. Sometimes flattering, sometimes critical. The thread is that the framing always shrinks your scope, or anchors your work to theirs.&lt;/p&gt;

&lt;p&gt;You don't have to argue with it. You also don't have to accept it. A useful script for these moments: acknowledge what you heard, restate what you're actually doing, move on.&lt;/p&gt;

&lt;p&gt;"Got it. I'm still planning to ship the launch post next week. Happy to coordinate if there's overlap."&lt;/p&gt;

&lt;p&gt;That's the whole thing. No defense. No correction. No explanation of why their version of your role is wrong. The argument isn't winnable anyway. They aren't actually offering a view of your job. They're offering a smaller version of you, and the only correct response to a smaller version of you is to keep being the size you are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Protect what compounds
&lt;/h3&gt;

&lt;p&gt;The real cost of one of these dynamics isn't the conflicts. It's what they do to your attention. The counter is to protect the parts of your work that compound. &lt;/p&gt;

&lt;p&gt;This matters more right now than it did two years ago. Coinbase cut roughly 14 percent of its workforce and called the resulting team "lean, fast, and AI-native." Block cut nearly half its staff, and Jack Dorsey told the company that smaller, flatter teams were the future. When that's the prevailing logic across the industry, the people who survive are the ones whose work has obvious, legible, compounding leverage. You want to be expensive to lose, and you want the case for keeping you to be visible without anyone having to fight for you in a room you aren't in.&lt;/p&gt;

&lt;p&gt;A useful weekly check: would the next thing you're shipping still matter if this person weren't in your org. When the answer is no for too many weeks in a row, the dynamic has captured you, and it's time to reset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keep the file
&lt;/h3&gt;

&lt;p&gt;Somewhere on a personal machine, not anything synced to work, keep a plain text document with dates, what happened, who was there, what was said or written. No commentary. No theories. Just the receipts.&lt;/p&gt;

&lt;p&gt;You'll likely never need it. You'll be glad you have it.&lt;/p&gt;

&lt;p&gt;The file isn't a place for venting. Venting goes to friends, a therapist, or a journal you'd be okay losing. The file's job is to make a clear factual case if it ever has to be made to a manager or HR. Trying to reconstruct six months of incidents from memory while stressed and being asked pointed questions is a particular kind of awful. The file is insurance against that. It costs maybe four minutes a week.&lt;/p&gt;

&lt;h3&gt;
  
  
  Escalate carefully or not at all
&lt;/h3&gt;

&lt;p&gt;Most of these dynamics don't need escalation. They burn out. The colleague gets bored, the team reorgs, or more likely, the company has bigger problems. The default move is to outlast the thing, not confront it.&lt;/p&gt;

&lt;p&gt;The threshold for taking it up the chain is concrete harm. Credit getting taken in a way that hits a perf review. Material misrepresentation of you to leadership. Sustained interference with shipping.&lt;/p&gt;

&lt;p&gt;When escalation does happen, lead with impact on the work. "This is blocking shipping X by Y" gets traction. "This is making me uncomfortable" doesn't, even though both can be true at the same time. Managers will move on the first. Many will quietly let the second die.&lt;/p&gt;

&lt;h3&gt;
  
  
  What to actually do this week
&lt;/h3&gt;

&lt;p&gt;Three things, if any of this is hitting.&lt;/p&gt;

&lt;p&gt;Pull the receipts on the last two months. Even if nothing is currently on fire. Especially if nothing is currently on fire, because the moment you'll wish you had this information is the moment you're least able to gather it.&lt;/p&gt;

&lt;p&gt;Close one informal channel. The unscheduled call. The "can I grab you" Slack. The hallway side-meeting. Just one. The point isn't to wall off the relationship in a day. The point is to start changing the geometry.&lt;/p&gt;

&lt;p&gt;And look at what's shipping next. If it's a thing decided in reaction to something happening internally, swap it. Ship the thing that grows your own visibility instead. The colleague has a finite amount of energy for this. So do you. Spend yours on the work that compounds.&lt;/p&gt;

&lt;p&gt;There's no clean ending here. The breakthrough conversation doesn't happen. The colleague doesn't suddenly see what you've been seeing. What happens, with your attention back on the work, is that the work gets better. Doors open in rooms no one inside your team controls. Turns out, it was always running on access to you. By the time you notice it's gone quiet, you're somewhere else entirely.&lt;/p&gt;

&lt;p&gt;That's the win.&lt;/p&gt;

</description>
      <category>career</category>
      <category>layoff</category>
      <category>womenintech</category>
      <category>management</category>
    </item>
    <item>
      <title>Claude Code is the engine, Cursor is the cockpit</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Sat, 16 May 2026 22:30:39 +0000</pubDate>
      <link>https://dev.to/sattensil888/claude-code-is-the-engine-cursor-is-the-cockpit-7kh</link>
      <guid>https://dev.to/sattensil888/claude-code-is-the-engine-cursor-is-the-cockpit-7kh</guid>
      <description>&lt;p&gt;My daily workflow looks nothing like it did a year ago. A lot has landed in Claude Code recently. Skills replaced custom slash commands, subagents and plugins followed, and four days ago &lt;code&gt;/goal&lt;/code&gt; shipped, which lets an agent run autonomously for hours or days against a completion condition. Cursor 3 came out with an agents-first interface and a cheap in-house model. MCP went from "interesting protocol" to how every tool plugs into every other tool.&lt;/p&gt;

&lt;p&gt;The net effect is that I now run Claude Code as the primary engine and treat Cursor as the cockpit. Cursor is where I drop in when I want a GUI, design mode, or a real-time view of a diff landing. Most everything else, including the long-running stuff, lives in Claude Code.&lt;/p&gt;

&lt;p&gt;Five patterns that stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 1: Claude Code is the engine, Cursor is the cockpit
&lt;/h2&gt;

&lt;p&gt;The biggest mindset shift was treating the two tools as different categories of thing rather than competitors.&lt;/p&gt;

&lt;p&gt;Claude Code runs in the terminal with Opus 4.7 behind it, and it's built around the idea that you can describe an outcome and let it work. &lt;code&gt;/goal&lt;/code&gt; and skills make it possible to run multi-hour tasks against verifiable end conditions, then codify the workflows you do over and over. That's the engine.&lt;/p&gt;

&lt;p&gt;Cursor is where I go when I specifically need to see something. The new agents window is where I manage parallel work, but for the tight inline loops I still flip back to the editor window, where Tab predictions and Composer beat anything else. Design mode lives there too, which is where I end up when I'm chasing a UI tweak. With Composer 2 sitting at 50 cents per million input tokens, Cursor 3 also makes a cheap surface for parallel cloud agents on smaller tasks. The cloud agents are the ones I'll fire off before a meeting, expecting a PR to be waiting when I get back. That's the cockpit.&lt;/p&gt;

&lt;p&gt;My rough heuristic now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-sentence tasks I can walk away from for an hour go to Claude Code, usually with &lt;code&gt;/goal&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Anything I want to watch happen in my editor goes to Cursor.&lt;/li&gt;
&lt;li&gt;Repeatable workflows I've done more than twice get codified as a Claude Code skill, so it doesn't matter which surface I call them from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A bonus trick that applies in both tools: plan with a smart slow model, execute with a fast cheap one. In Cursor 3, plan in plan mode with Opus 4.7, then switch to Composer 2 Fast to build. In Claude Code, draft the plan with the default Opus model and hand the execution off via &lt;code&gt;/goal&lt;/code&gt; so a smaller verifier model does the per-turn checking. Saves real money and almost never costs you quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 2: One conventions file, two readers
&lt;/h2&gt;

&lt;p&gt;Both tools support project-level instruction files, and as of Cursor 3 they both support skills as well. Claude Code reads &lt;code&gt;CLAUDE.md&lt;/code&gt; at the repo root plus nested ones for subdirectories, with skills in &lt;code&gt;.claude/skills/&lt;/code&gt;. Cursor reads from &lt;code&gt;.cursor/rules/&lt;/code&gt; and pulls skills from &lt;code&gt;.cursor/skills/&lt;/code&gt;. Same primitives, two surfaces.&lt;/p&gt;

&lt;p&gt;The mistake I made early was treating these as separate. Different rules in each file, drift between them, and inevitably one tool would produce code that violated rules the other knew about.&lt;/p&gt;

&lt;p&gt;The fix is to write conventions once in a canonical file, then have both &lt;code&gt;CLAUDE.md&lt;/code&gt; and your top-level Cursor rule reference it. Mine live at &lt;code&gt;docs/conventions.md&lt;/code&gt;. One source of truth, two readers.&lt;/p&gt;

&lt;p&gt;What goes in there: naming conventions, error handling patterns, the testing approach you actually use, libraries you prefer alongside libraries you've banned, PR structure expectations. And the one most people forget, feature flag conventions. If your team uses LaunchDarkly, this is the place to write down which flag types map to which use cases, your naming conventions, and the standard rollout cadence. Both agents will then default to writing flag-gated code correctly without you reminding them every time.&lt;/p&gt;

&lt;p&gt;You can use these files for tone, too. A line still earning its keep at the bottom of my &lt;code&gt;CLAUDE.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When you finish a non-trivial task, summarize what you did in the voice of a tired senior engineer who has seen it all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Commit the file. Your teammates' agents inherit your conventions, and your sense of humor, for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 3: &lt;code&gt;/goal&lt;/code&gt; for anything bigger than one conversation
&lt;/h2&gt;

&lt;p&gt;This is the new one. Claude Code shipped &lt;code&gt;/goal&lt;/code&gt; on May 12, 2026. Codex shipped its equivalent the same week. If you're not using it yet, this is the upgrade with the biggest leverage in years.&lt;/p&gt;

&lt;p&gt;The premise is that you give Claude Code a completion condition instead of a list of steps. After each turn, a small verifier model checks whether the condition holds. If it doesn't, Claude takes another turn. If it does, the goal clears and control comes back to you. In practice this means handing off tasks that would have eaten a whole afternoon of back-and-forth, and getting a PR when you wake up.&lt;/p&gt;

&lt;p&gt;What makes a good goal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One measurable end state. "All TypeScript errors resolved, all tests passing, PR opened against &lt;code&gt;main&lt;/code&gt;" is a goal. "Make the auth flow better" is a wish.&lt;/li&gt;
&lt;li&gt;A roadmap or PRD to drive against. The pattern working best for me right now is to hand it a &lt;code&gt;docs/roadmap.md&lt;/code&gt; with a checklist of tasks, then write the goal as "every task on the roadmap is checked off and verified." The verifier has something concrete to check.&lt;/li&gt;
&lt;li&gt;Explicit constraints for anything that must not change. &lt;code&gt;/goal&lt;/code&gt; accepts up to 4,000 characters in the condition, so there's room to write "do not touch &lt;code&gt;payments/&lt;/code&gt;" and "do not modify the public API surface."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A LaunchDarkly-shaped example that has earned its keep on my machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/goal Walk the release-checkout-redesign flag from 0% to 100% using
our standard rollout cadence in docs/conventions.md. After each ramp,
check error rate from the metrics MCP server. Stop and return to me
if any ramp shows error rate above baseline + 10%, or when the flag
is at 100% with three consecutive green ramps. Do not modify any code,
only the flag targeting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a task I used to babysit for half a day. Now I run it while writing the next feature in Cursor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 4: Skills for the workflows you do more than twice
&lt;/h2&gt;

&lt;p&gt;The single most underused feature in Claude Code is skills. They replaced what used to be custom slash commands, they auto-invoke when the description matches, and they cost almost nothing in tokens until they actually load.&lt;/p&gt;

&lt;p&gt;A skill is a folder at &lt;code&gt;.claude/skills/&amp;lt;name&amp;gt;/&lt;/code&gt; with a &lt;code&gt;SKILL.md&lt;/code&gt; inside. The file has YAML frontmatter with a &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt;, and a markdown body underneath. That's it. Supporting scripts, example templates, helper files are optional and all live in the same folder.&lt;/p&gt;

&lt;p&gt;The mental flip that helped me: anything I've copy-pasted into Claude Code twice should be a skill. Not eventually. Today.&lt;/p&gt;

&lt;p&gt;Skills that have earned their place in my repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pr-prep&lt;/code&gt;. Runs the linter, runs the test suite, drafts a PR description against the convention in &lt;code&gt;CLAUDE.md&lt;/code&gt;, and stops short of pushing.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flag-rollout&lt;/code&gt;. The LaunchDarkly rollout playbook. Reads the cadence from &lt;code&gt;docs/conventions.md&lt;/code&gt;, has access to the LaunchDarkly MCP server and the metrics MCP server, and walks a flag through staged ramps with verification between each. Same shape as the &lt;code&gt;/goal&lt;/code&gt; example above, but invocable as &lt;code&gt;/flag-rollout&lt;/code&gt; without retyping the condition.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test-writer&lt;/code&gt;. Writes tests strictly using the conventions in the repo's testing doc. Vitest for unit and integration, Playwright for E2E, no Jest, no surprise mocks.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;project-context&lt;/code&gt;. A tiny per-repo skill that tells the agent which Jira board, which LaunchDarkly project, and which feature flag prefix this codebase maps to. Sounds trivial. It killed the entire class of "wait, which board are we looking at" questions every agent used to ask me.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one is maybe 30 lines of Markdown. The compounding effect shows up within a week.&lt;/p&gt;

&lt;p&gt;One more distinction worth pinning down, because it tripped me up early. Rules versus skills. Rules are the what. They go into every conversation as ambient context. "We use TypeScript. Vitest for tests. No new ORMs without a discussion." Skills are the how. They're loaded on demand when the description matches. "Walk a LaunchDarkly flag through staged ramps with verification" is a skill. "Always wrap network calls in our retry helper" is a rule. Keep rules brief and ambient. Push the heavy procedural stuff into skills, where it costs nothing in tokens until the agent actually pulls it in.&lt;/p&gt;

&lt;p&gt;A quick note on skills versus subagents, because the difference matters. A skill is inline. Its instructions load into the current conversation and shape how the main agent behaves. A subagent is a separate, isolated context that the main agent delegates to and gets a summary back from. Use skills for "do it this way." Use subagents for "go do this heavy thing and come back when you're done." Most people want a skill 90 percent of the time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pattern 5: MCP servers, configured once, available everywhere
&lt;/h2&gt;

&lt;p&gt;MCP went from "interesting protocol" in 2025 to the standard way every tool talks to every other tool by May 2026. Both Claude Code and Cursor have first-class MCP support with effectively identical configuration semantics, so a server you set up in one is a five-minute job to mirror in the other.&lt;/p&gt;

&lt;p&gt;The LaunchDarkly MCP server is a good example of why this matters. Same server. Same credentials. Same flag set. Accessible from either surface. In practice that means I can ask the agent to "wrap this checkout component in a flag called &lt;code&gt;release-checkout-redesign&lt;/code&gt;, default off" while I'm in Cursor writing the component, then switch to Claude Code in the terminal and run the &lt;code&gt;/flag-rollout&lt;/code&gt; skill against that same flag. One mental model across two tools.&lt;/p&gt;

&lt;p&gt;A few things I've learned that aren't obvious from the docs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resist the urge to connect every server you've heard of. Both tools handle large MCP rosters better than they used to, but every active tool eats a slice of your context budget. Keep four to six servers active and rotate as projects change.&lt;/li&gt;
&lt;li&gt;Configure the same set in both tools. It removes the "wait, which one has the GitHub MCP" friction and lets you move work between them without thinking.&lt;/li&gt;
&lt;li&gt;MCP servers compose with skills. A skill can specify which MCP tools it's allowed to use via &lt;code&gt;allowed-tools&lt;/code&gt; in the frontmatter. Scoping a skill to only the LaunchDarkly MCP server makes it physically incapable of touching anything else, which is much safer than relying on the model to behave.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The May 2026 stack, as I run it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Claude Code as the engine, for &lt;code&gt;/goal&lt;/code&gt;-driven long tasks, skills, and anything else that should run without supervision.&lt;/li&gt;
&lt;li&gt;Cursor as the cockpit, for visual work, design mode, tight inline edit loops, and cheap parallel cloud agents.&lt;/li&gt;
&lt;li&gt;A shared conventions file both tools read from, so they code the same way.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/goal&lt;/code&gt; for anything bigger than one conversation.&lt;/li&gt;
&lt;li&gt;Skills for anything you've done more than twice.&lt;/li&gt;
&lt;li&gt;MCP servers mirrored across both tools, kept to a tight set you actually use.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The shift this year is that the highest-leverage work isn't editor tricks anymore. It's setting up the agents to run without you and getting them to do the same thing the same way every time. The editor matters less every quarter. The skill library and the goals you can hand off are what compound.&lt;/p&gt;

&lt;p&gt;If you've got skills or MCP setups I should steal, send them my way. You can find me on LinkedIn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentskills</category>
      <category>developer</category>
    </item>
    <item>
      <title>If You Can Survive a Toddler, You Can Ship LLMs in Production</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Thu, 14 May 2026 17:43:09 +0000</pubDate>
      <link>https://dev.to/sattensil888/if-you-can-survive-a-toddler-you-can-ship-llms-in-production-2389</link>
      <guid>https://dev.to/sattensil888/if-you-can-survive-a-toddler-you-can-ship-llms-in-production-2389</guid>
      <description>&lt;p&gt;A few years back I was running a time-series pipeline that scored incoming product reviews on a 1-10 scale. The scorer was an LLM. Reviews rolled in continuously, ratings flowed into a dashboard the product team checked every Monday morning. Everything ran clean for months. Then one Monday the chart had a step in it.&lt;/p&gt;

&lt;p&gt;Reviews from the prior week averaged 6.4. The current week averaged 7.6. Same product. Same customers. The reviews themselves, when I went back to read them, looked indistinguishable from what we had been getting all year.&lt;/p&gt;

&lt;p&gt;The model had changed. The provider had pushed a quiet update to the weights, and the LLM that gave us 6.4-equivalent scores last week was now giving 7.6-equivalent scores for the same content. Every historical comparison in that dashboard was silently invalid. The cleanup took a week. The harder conversation was about how much of our reporting had been real in the first place.&lt;/p&gt;

&lt;p&gt;That kind of failure is the default behavior of LLMs in production. Trying to engineer it away with tighter parameters or pinned versions is a losing fight. The job is to design for it. I learned the lesson twice. Once from the reviews pipeline. Once from raising two kids.&lt;/p&gt;

&lt;h2&gt;
  
  
  What parents of small children already know about non-determinism
&lt;/h2&gt;

&lt;p&gt;If you have lived through the toddler years, you have run this experiment a few hundred times without calling it one. The lunch you packed all last week, the one that came home empty every day, suddenly gets pushed off the table on Tuesday with full commitment. The bedtime story that worked for six straight nights stops working on the seventh. The nap routine the babysitter swore was solid breaks the moment you start calling it a "rule."&lt;/p&gt;

&lt;p&gt;Experienced parents eventually stop trying to force determinism on the kid. Patterns and trends still matter. But you stop expecting any individual input to produce any individual output, and you build a system that absorbs the variance instead of fighting it. This is the same shift production AI engineers make, usually after their first calibration regression.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LLM-as-judge can drift too
&lt;/h2&gt;

&lt;p&gt;The reviews pipeline taught me that the judge can be the most fragile thing in the system. The model being evaluated can drift. The model doing the evaluating can drift too. Without something stable to anchor against, you cannot tell which one moved.&lt;/p&gt;

&lt;p&gt;The pattern that works is a small held-out set of inputs with known, human-validated scores, and the habit of re-running it on a regular cadence. Call it the calibration set. 20 to 50 examples is plenty. You re-score the calibration set first. If the average jumps from 6.4 to 7.6 with no other changes, you know the judge moved, not the data. Without that anchor, the same diagnosis takes weeks of reading individual reviews and arguing about what changed.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://launchdarkly.com/docs/home/ai-configs/offline-evaluations" rel="noopener noreferrer"&gt;offline evaluations in AgentControl&lt;/a&gt; earn their keep. You upload your calibration set as a &lt;a href="https://launchdarkly.com/docs/home/ai-configs/datasets" rel="noopener noreferrer"&gt;dataset&lt;/a&gt;, point a judge at it, and re-run on a cadence or before any variation change. The discipline I had to learn the hard way: keep the judge anchored, keep its inputs comparable, watch the distribution rather than any single response, becomes a property of the configuration instead of a script someone has to remember to run.&lt;/p&gt;

&lt;p&gt;The parenting version is the pencil marks on a doorframe. The doorframe does not move. Every few months you put the kid against it, shoes off and back to the wall. If the line jumps three inches and you realize the kid is wearing sneakers, you take the shoes off and measure again before believing any of it. The doorframe is your held-out set. The shoes-off rule is the discipline that keeps re-runs comparable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temperature zero is a comfort blanket
&lt;/h2&gt;

&lt;p&gt;Setting temperature to zero feels like it should make a model deterministic. Within a single model version it mostly does. The catch is that determinism within a version buys you nothing across versions.&lt;/p&gt;

&lt;p&gt;My reviews pipeline had been running at temperature zero the whole time. The provider's swap underneath did not care. The judge changed, and greedy sampling kept producing the new, shifted scores with the same false confidence as before. Temperature zero compresses the variance you can see during testing, which makes you feel safer. It does nothing about the variance that actually breaks production. Design as if the model can produce a different valid output every time, because eventually it will.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build the fallback before the happy path
&lt;/h2&gt;

&lt;p&gt;The last move gets cut for time most often, which is why it matters. Before you ship the model that does the new thing well, ship the path that runs when the model does the new thing badly, slowly, or not at all.&lt;/p&gt;

&lt;p&gt;For an LLM endpoint that usually looks like a cached response for known-bad inputs, a secondary model behind the primary that can take traffic when the primary fails an inline check, a circuit breaker that routes around the model entirely if error rates cross a threshold, and a logging path that captures failure cases instead of returning a stack trace to the user. None of this is exotic. All of it assumes the model will misbehave at some point and defines what good behavior looks like when it does.&lt;/p&gt;

&lt;p&gt;The stronger version of the same idea is making the fallback adaptive instead of static. A static fallback still needs a human to notice something is wrong and pull the lever. An adaptive system watches the production signal itself and switches over without anyone in the loop. This is what configuration-driven LLM tooling is built for. With &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;AgentControl by LaunchDarkly&lt;/a&gt;, model &lt;a href="https://launchdarkly.com/docs/home/ai-configs/create-variation" rel="noopener noreferrer"&gt;variations&lt;/a&gt; live as configuration rather than code, &lt;a href="https://launchdarkly.com/docs/home/ai-configs/target" rel="noopener noreferrer"&gt;traffic shifts between them&lt;/a&gt; without a deploy, and a &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollout&lt;/a&gt; can tie an &lt;a href="https://launchdarkly.com/docs/home/ai-configs/online-evaluations" rel="noopener noreferrer"&gt;online evaluation&lt;/a&gt; score, or any &lt;a href="https://launchdarkly.com/docs/home/metrics/autogen/ai" rel="noopener noreferrer"&gt;AgentControl metric&lt;/a&gt; you care about, directly to whether a variation advances, pauses, or reverts. When the judge sees scores regress past a threshold, the rollout reverses itself. The fallback stops being a piece of code someone wrote in case of trouble. It becomes the architecture, watching itself.&lt;/p&gt;

&lt;p&gt;Parents already operate this way. The grocery store meltdown will happen. The school will call at 11am about a "low-grade fever." You have a snack pre-stuffed in your bag and a backup babysitter on text. The fallback is the architecture. The happy path is the bonus.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed for me
&lt;/h2&gt;

&lt;p&gt;After the reviews pipeline incident, my work changed. I started logging the model version returned in every API response. I built a calibration set. I stopped trusting any single eval run as a verdict. The boring fallback path now ships before the impressive demo path.&lt;/p&gt;

&lt;p&gt;None of this is harder than the alternative. It is mostly a matter of accepting at the architecture stage what every parent of a small kid already knows. The interesting unit of measurement is the distribution, not the sample.`&lt;/p&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
    </item>
    <item>
      <title>Offline Evaluation of RAG-Grounded Answers in LaunchDarkly AI Configs</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:22:50 +0000</pubDate>
      <link>https://dev.to/launchdarkly/offline-evaluation-of-rag-grounded-answers-in-launchdarkly-ai-configs-1i5j</link>
      <guid>https://dev.to/launchdarkly/offline-evaluation-of-rag-grounded-answers-in-launchdarkly-ai-configs-1i5j</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This tutorial shows you how to run an &lt;strong&gt;offline LLM evaluation&lt;/strong&gt; on the RAG-grounded support agent you built in the &lt;a href="https://launchdarkly.com/docs/tutorials/agent-graphs" rel="noopener noreferrer"&gt;Agent Graphs tutorial&lt;/a&gt;, using LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;AI Configs&lt;/a&gt;, the &lt;a href="https://launchdarkly.com/docs/home/ai-configs/datasets" rel="noopener noreferrer"&gt;Datasets feature&lt;/a&gt;, and built-in &lt;a href="https://launchdarkly.com/docs/home/ai-configs/offline-evaluations" rel="noopener noreferrer"&gt;LLM-as-a-judge&lt;/a&gt; scoring. You'll build a RAG-grounded test dataset, run it through the Playground with a cross-family judge, and learn how to read each failing row as a dataset issue, an agent issue, or judge calibration noise.&lt;/p&gt;

&lt;p&gt;Here's how it works. The LaunchDarkly Playground evaluates a single model call against a prompt and dataset you configure. By pre-computing your RAG retrieval offline and baking the chunks directly into each dataset row, you turn that call into a high-value generation test: the model in the Playground receives the same documentation context it would in production, so the eval measures how well your agent reasons over real grounded input.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structure a RAG-grounded test dataset&lt;/strong&gt; by pre-computing retrieval offline and bundling chunks into each row&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick the right LLM judge&lt;/strong&gt; for your agent's output shape (Accuracy for natural-language answers, Likeness for structured labels)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid same-model bias&lt;/strong&gt; by running the judge on a different model family than the agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnose failing rows&lt;/strong&gt; as dataset issues, agent issues, or judge calibration noise&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this tutorial covers, and what it doesn't&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Covers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generation quality over RAG context: does the model produce a correct answer when the right documentation is in the prompt?&lt;/li&gt;
&lt;li&gt;Regression detection: catching unexpected score drops when you change a prompt or model&lt;/li&gt;
&lt;li&gt;Variation selection: comparing candidate prompts and models before committing to a new AI Config variation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Does not cover:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval correctness. Whether your vector store is returning the best chunks is tested by your own RAG pipeline, outside LaunchDarkly.&lt;/li&gt;
&lt;li&gt;End-to-end agent graph behavior. Tool execution, multi-turn conversations, handoffs, and multi-step routing require online evals against real production traffic.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You've completed the &lt;a href="https://launchdarkly.com/docs/tutorials/agent-graphs" rel="noopener noreferrer"&gt;Agent Graphs tutorial&lt;/a&gt; or have equivalent familiarity with LaunchDarkly &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;AI Configs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;You have the &lt;a href="https://github.com/launchdarkly-labs/devrel-agents-tutorial" rel="noopener noreferrer"&gt;devrel-agents-tutorial repo&lt;/a&gt; cloned&lt;/li&gt;
&lt;li&gt;You have API keys for &lt;strong&gt;two&lt;/strong&gt; model providers, one for the agent under test and one for the judge (the examples use OpenAI and Anthropic)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Get the Branch Running
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;About the branch and the Umbra knowledge base.&lt;/strong&gt; The &lt;code&gt;feature/offline-evals&lt;/code&gt; branch builds on the same &lt;a href="https://launchdarkly.com/docs/tutorials/agent-graphs" rel="noopener noreferrer"&gt;Agent Graphs tutorial&lt;/a&gt; codebase and the routing, tool, and graph work done in earlier branches — none of that goes away. What this branch adds is a more realistic RAG assessment target: &lt;strong&gt;Umbra&lt;/strong&gt;, a fictional serverless-functions product with an invented knowledge base (refund windows, deployment regions, function timeout limits, rate-limit tiers, and so on). Because Umbra doesn't exist outside this tutorial, the model under test has no pre-training knowledge to fall back on — a correct answer has to come from the retrieved chunks, which is the only way to honestly measure whether your RAG pipeline is doing its job. The branch also ships a pre-built RAG-grounded test dataset (&lt;code&gt;datasets/answer-tests.csv&lt;/code&gt;) and a helper script that regenerates it from your vector store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;devrel-agents-tutorial
git checkout feature/offline-evals
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Add LD_SDK_KEY, LD_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY to .env&lt;/span&gt;

uv &lt;span class="nb"&gt;sync
&lt;/span&gt;uv run python bootstrap/create_configs.py
uv run python initialize_embeddings.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the API and UI in two terminals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1&lt;/span&gt;
uv run uvicorn api.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;span class="c"&gt;# Terminal 2&lt;/span&gt;
uv run streamlit run ui/chat_interface.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8501&lt;/code&gt; and ask a question grounded in the Umbra docs (refund policy, deployment regions, function timeout). The agent pulls answers from the knowledge base.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn47r7c07lw9jd4024tj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn47r7c07lw9jd4024tj.png" alt="The Umbra support chat UI answering a question grounded in the Umbra knowledge base." width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Understand the Test Dataset
&lt;/h2&gt;

&lt;p&gt;Open &lt;code&gt;datasets/answer-tests.csv&lt;/code&gt;. Every row has three fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input,expected_output,original_question
"Documentation context: --- We offer a 30-day refund policy for first-time subscribers... --- Annual subscriptions receive a prorated refund within... --- Question: What is the refund policy?","30-day refund policy for first-time subscribers who haven't deployed production traffic. Usage charges are non-refundable.","What is the refund policy?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;input&lt;/code&gt;&lt;/strong&gt; bundles documentation chunks and the question into a single structured prompt, separated by &lt;code&gt;---&lt;/code&gt; dividers. The chunks were retrieved from your production vector store ahead of time by &lt;code&gt;tools/build_rag_dataset.py&lt;/code&gt;, so the model in the Playground sees the same grounding the production agent would, even though the Playground never executes your retrieval tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;expected_output&lt;/code&gt;&lt;/strong&gt; is the correct answer, written by a human who read the source docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;original_question&lt;/code&gt;&lt;/strong&gt; is a plain-text copy of the question so you can scan the dataset without parsing the bundled prompt. No judge uses this field.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regenerate the dataset when your knowledge base changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv run python tools/build_rag_dataset.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the full reference on dataset format and limits, see &lt;a href="https://launchdarkly.com/docs/home/ai-configs/datasets" rel="noopener noreferrer"&gt;Datasets for offline evaluations&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Upload the Dataset
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Use synthetic data only&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never upload real customer tickets, PII, secrets, or credentials. Replace anything sensitive with synthetic placeholders before upload. See the Playground &lt;a href="https://launchdarkly.com/docs/home/ai-configs/playground#privacy" rel="noopener noreferrer"&gt;privacy section&lt;/a&gt; for what gets forwarded to model providers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Navigate to &lt;strong&gt;AI&lt;/strong&gt; &amp;gt; &lt;strong&gt;Library&lt;/strong&gt; in LaunchDarkly, select the &lt;strong&gt;Datasets&lt;/strong&gt; tab, and click &lt;strong&gt;Upload dataset&lt;/strong&gt;. Upload &lt;code&gt;datasets/answer-tests.csv&lt;/code&gt; and name it &lt;code&gt;answer-tests&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcckfkrgk57rzz1tt4xlv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcckfkrgk57rzz1tt4xlv.png" alt="The LaunchDarkly Datasets tab showing the answer-tests dataset uploaded." width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Add Your Model API Keys
&lt;/h2&gt;

&lt;p&gt;The Playground calls model providers directly, so it needs API keys for both the model running your agent &lt;em&gt;and&lt;/em&gt; the model running your judge. These keys live in LaunchDarkly's "AI Config Test Run" integration, not in your AI Config.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Playground, click &lt;strong&gt;Manage API keys&lt;/strong&gt; in the upper-right corner.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add integration&lt;/strong&gt;, pick a provider (e.g. OpenAI), paste your API key, accept the terms, and save.&lt;/li&gt;
&lt;li&gt;Repeat for the second provider (Anthropic) so you can run a cross-family judge in Step 5.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See the &lt;a href="https://launchdarkly.com/docs/home/ai-configs/playground#manage-api-keys" rel="noopener noreferrer"&gt;Playground reference doc&lt;/a&gt; for the canonical instructions. API keys are stored per-session, so you may need to re-paste them when you return.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Run the Evaluation
&lt;/h2&gt;

&lt;p&gt;From the Datasets list, click into &lt;strong&gt;answer-tests&lt;/strong&gt; to open it in a Playground bound to that dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure the test
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System prompt&lt;/strong&gt;: paste your &lt;code&gt;support-agent&lt;/code&gt; instructions verbatim from the AI Config. Do not edit or simplify them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent model&lt;/strong&gt;: pick the model your support-agent variation uses (or a candidate you're considering swapping to). To compare two candidates, run the eval twice with different agent models and compare scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acceptance criteria&lt;/strong&gt;: attach an &lt;strong&gt;Accuracy&lt;/strong&gt; judge with threshold &lt;code&gt;0.85&lt;/code&gt;. Accuracy scores whether the response correctly addresses the input question, which fits grounded natural-language answers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation model&lt;/strong&gt;: uncheck &lt;strong&gt;Use same model for evaluation&lt;/strong&gt; and set the judge to a &lt;em&gt;different&lt;/em&gt; model family from the agent. Same-family judging tends to reward output patterns the judge itself produces. A cross-family judge gives you an independent read.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkbh5lziq6lt0gx02w1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkbh5lziq6lt0gx02w1w.png" alt="The Playground configured with the support-agent prompt, OpenAI as the agent, Anthropic as the evaluation model, and an Accuracy judge at 0.85 threshold." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the eval.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading the results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj5v00yxa526lllyneeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj5v00yxa526lllyneeb.png" alt="The Playground configured with the support-agent prompt, OpenAI as the agent, Anthropic as the evaluation model, and an Accuracy judge at 0.85 threshold." width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example run above had 18 passes and 2 failures. When a row fails, the failure comes from one of three places, and each one sends you in a different direction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The dataset's chunks don't contain the answer.&lt;/strong&gt; This is a retrieval problem, not a generation problem. Rebuild the dataset with higher &lt;code&gt;top_k&lt;/code&gt;, a reranker, or a different chunker, or verify the answer is indexed at all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The chunks contain the answer but the model ignored them.&lt;/strong&gt; This is the agent-side failure offline evals are designed to catch. Tighten the system prompt to insist on grounding, or switch to a more obedient model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The chunks and the model are both fine but the judge disagreed.&lt;/strong&gt; This is judge calibration noise. Lower the threshold, try a different judge, or accept it as noise. Don't change your agent based on it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sort by score. For each failing row, open the bundled chunks in the &lt;code&gt;input&lt;/code&gt; field and ask: &lt;em&gt;was the right answer in there?&lt;/em&gt; Yes → fix the prompt or model. No → rebuild the dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  What failed in this run
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Row 11: "What integrations are available?"&lt;/strong&gt; (&lt;em&gt;chunks missed the answer&lt;/em&gt;). The expected output mentioned monitoring integrations (Datadog, Sentry, LogRocket), but the retrieved chunks only covered databases, storage, and billing. The model correctly listed what it had and said &lt;em&gt;"the documentation does not provide additional information regarding more integrations"&lt;/em&gt;, which is the correct behavior for an ungrounded claim. &lt;strong&gt;Fix&lt;/strong&gt;: higher &lt;code&gt;top_k&lt;/code&gt; or a reranker in &lt;code&gt;build_rag_dataset.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row 12: "Can I get a refund on bandwidth overages?"&lt;/strong&gt; (&lt;em&gt;judge calibration&lt;/em&gt;). The model correctly said bandwidth overages are non-refundable, citing the docs, but omitted a secondary "Review your Usage Dashboard" recommendation from the expected output. Semantically right, lexically short one clause. &lt;strong&gt;Fix&lt;/strong&gt;: lower the threshold or trim the expected output.&lt;/p&gt;

&lt;p&gt;Two failures, two different fixes. Without reading the per-row results you'd conflate them and spend time tightening the model when the actual problem lives in the retriever or the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to Go From a Single Run
&lt;/h2&gt;

&lt;p&gt;This tutorial walked you through one run. In practice, a single eval isn't where offline evaluation earns its keep. The real payoff comes from re-running the same dataset against a new prompt, a new model, or a fresh RAG chunker and comparing scores to your last known-good run. A small prompt edit that quietly drops your Accuracy from 0.83 to 0.71 is exactly the kind of regression this pattern is meant to catch, but only if you save the run and compare against it next time.&lt;/p&gt;

&lt;p&gt;A reasonable next loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Save the run from Step 5 as your reference.&lt;/li&gt;
&lt;li&gt;When you change something (prompt, model, chunker, &lt;code&gt;top_k&lt;/code&gt;), re-run the same dataset and compare scores.&lt;/li&gt;
&lt;li&gt;Add new rows to the dataset as you find failure modes in staging or production.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For &lt;strong&gt;end-to-end behavior that offline tests can't capture&lt;/strong&gt; (tool execution, multi-turn conversations, the tail of real production inputs), see &lt;a href="https://launchdarkly.com/docs/home/ai-configs/online-evaluations" rel="noopener noreferrer"&gt;online evaluations&lt;/a&gt; and the &lt;a href="https://launchdarkly.com/docs/tutorials/when-to-add-online-evals" rel="noopener noreferrer"&gt;When to add online evals&lt;/a&gt; tutorial. Online evaluations are not currently supported for agent-based AI Configs; for agent workflows, the documented path is programmatic judge evaluation via the AI SDK.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Track Evaluation History
&lt;/h2&gt;

&lt;p&gt;View saved runs at &lt;strong&gt;AI&lt;/strong&gt; &amp;gt; &lt;strong&gt;Evaluations&lt;/strong&gt;. Toggle &lt;strong&gt;Group by dataset&lt;/strong&gt; to collapse runs under each dataset name so you can see the history for &lt;code&gt;umbra-rag-eval&lt;/code&gt; alongside any other datasets in the project. Compare pass and fail counts across runs, and distinguish saved runs (indefinite retention) from one-off runs (60-day expiry). For metric definitions, see &lt;a href="https://launchdarkly.com/docs/home/ai-configs/monitor" rel="noopener noreferrer"&gt;Monitor AI Configs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://launchdarkly.com/docs/home/releases/progressive-rollouts" rel="noopener noreferrer"&gt;Progressive rollouts&lt;/a&gt;&lt;/strong&gt;: release your winning variation to 5% of traffic, then 25%, then 100%, watching production metrics before expanding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://launchdarkly.com/docs/tutorials/when-to-add-online-evals" rel="noopener noreferrer"&gt;When to add online evals&lt;/a&gt;&lt;/strong&gt;: decide what to score on live production traffic once you have an offline baseline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a deeper look at the multi-agent RAG system this tutorial builds on, see the &lt;a href="https://launchdarkly.com/docs/tutorials/agent-graphs" rel="noopener noreferrer"&gt;Agent Graphs&lt;/a&gt; tutorial.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building Framework-Agnostic AI Swarms: Compare LangGraph, Strands, and OpenAI Swarm</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Thu, 26 Mar 2026 21:05:21 +0000</pubDate>
      <link>https://dev.to/launchdarkly/building-framework-agnostic-ai-swarms-compare-langgraph-strands-and-openai-swarm-14ip</link>
      <guid>https://dev.to/launchdarkly/building-framework-agnostic-ai-swarms-compare-langgraph-strands-and-openai-swarm-14ip</guid>
      <description>&lt;p&gt;If you've ever run the same app in multiple environments, you know the pain of duplicated configuration. &lt;a href="https://www.onyxgs.com/blog/swarm-intelligence-collective-behavior-ai" rel="noopener noreferrer"&gt;Agent swarms&lt;/a&gt; have the same problem: the moment you try multiple orchestrators (LangGraph, Strands, OpenAI Swarm), your agent definitions start living in different formats. Prompts drift. Model settings drift. A "small behavior tweak" turns into archaeology across repos.&lt;/p&gt;

&lt;p&gt;AI behavior isn't code. Prompts aren't functions. They change too often, and too experimentally, to be hard-wired into orchestrator code. &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;LaunchDarkly AI Configs&lt;/a&gt; lets you treat agent definitions like shared configuration instead. Define them once, store them centrally, and let any orchestrator fetch them. Update a prompt or model setting in the LaunchDarkly UI, and the new version rolls out without a redeploy.&lt;/p&gt;



&lt;p&gt;Ready to build framework-agnostic AI swarms? Start your 14-day free trial of LaunchDarkly to follow along with this tutorial. No credit card required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://launchdarkly.com/start-trial/?utm_source=docs&amp;amp;utm_medium=tutorial&amp;amp;utm_campaign=ai-orchestrators" rel="noopener noreferrer"&gt;Start free trial&lt;/a&gt; →&lt;/p&gt;



&lt;h2&gt;
  
  
  The problem: Research gap analysis across multiple papers
&lt;/h2&gt;

&lt;p&gt;When analyzing academic literature, researchers face a daunting task: reading dozens of papers to identify patterns, spot contradictions, and find unexplored opportunities. A single LLM call can summarize papers, but it produces a monolithic analysis you can't trace, refine, or trust for critical decisions.&lt;/p&gt;

&lt;p&gt;The challenge compounds when you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identify methodological patterns&lt;/strong&gt; across 12+ papers without missing subtle connections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect contradictory findings&lt;/strong&gt; that might invalidate assumptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discover research gaps&lt;/strong&gt; that represent genuine opportunities, not just oversight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where specialized agents excel - each focused on one aspect of the analysis, building on each other's work.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll build a 3-agent research analysis swarm that solves this problem by dividing the work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Agent&lt;/th&gt;
    &lt;th&gt;Role&lt;/th&gt;
    &lt;th&gt;Output&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Approach Analyzer&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Clusters methodological themes across papers&lt;/td&gt;
    &lt;td&gt;"Papers 1, 4, 7 use reinforcement learning; Papers 2, 5 use symbolic methods"&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Contradiction Detector&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Finds conflicting claims between papers&lt;/td&gt;
    &lt;td&gt;"Paper 3 claims X improves performance; Paper 8 shows X degrades it"&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Gap Synthesizer&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Identifies unexplored research directions&lt;/td&gt;
    &lt;td&gt;"No papers combine approach A with dataset B; potential opportunity"&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We'll implement this swarm across three different orchestrators (LangGraph, Strands, and OpenAI Swarm), demonstrating how LaunchDarkly AI Configs enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Framework-agnostic agent definitions&lt;/strong&gt;: Define agents once in LaunchDarkly, use them everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-agent observability&lt;/strong&gt;: Track tokens, latency, and costs for each agent individually - catch silent failures when agents skip execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic swarm composition&lt;/strong&gt;: Add/remove agents from the swarm or switch models without touching code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why use a swarm?
&lt;/h2&gt;

&lt;p&gt;Research gap analysis requires different skills: clustering methodological patterns, detecting contradictions, and synthesizing opportunities. With a swarm, each agent handles one aspect and produces artifacts the next agent builds on. You can track tokens, latency, and cost per agent. You can catch silent failures when an agent skips execution. And when something goes wrong, you know exactly where.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical requirements
&lt;/h2&gt;

&lt;p&gt;Before implementing the swarm, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LaunchDarkly account&lt;/strong&gt; with AI Configs enabled (see &lt;a href="https://launchdarkly.com/docs/home/ai-configs/quickstart" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API keys&lt;/strong&gt; for Anthropic Claude or OpenAI GPT-4 (check &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;supported models&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.11+&lt;/strong&gt; for running orchestrators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic understanding&lt;/strong&gt; of agent systems (review &lt;a href="https://launchdarkly.com/docs/tutorials/agents-langgraph" rel="noopener noreferrer"&gt;LangGraph agents tutorial&lt;/a&gt; if needed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete implementation is available at &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators" rel="noopener noreferrer"&gt;GitHub - AI Orchestrators&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: how LaunchDarkly powers framework-agnostic swarms
&lt;/h2&gt;

&lt;p&gt;The swarm architecture has three layers: dynamic agent configuration, per-agent tracking, and custom metrics for cost attribution. Here's how they work together.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir3nonhko1k6th3du75j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir3nonhko1k6th3du75j.png" alt="LangGraph swarm architecture showing LaunchDarkly configuration fetch, agent interactions with Command-based handoffs, and dual metrics tracking to AI Config Trends" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The diagram shows LangGraph's implementation, but Strands and OpenAI Swarm follow the same pattern with their own handoff mechanisms. The key components are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Fetch&lt;/strong&gt;: The orchestrator queries LaunchDarkly's API to dynamically discover all agent configurations, avoiding hardcoded agent definitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Graph&lt;/strong&gt;: Three specialized agents (Approach Analyzer, Contradiction Detector, Gap Synthesizer) connected through explicit handoff mechanisms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics Collection&lt;/strong&gt;: Each agent execution captures tokens, duration, and cost metrics through both the AI Config tracker and custom metrics API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual Dashboard Views&lt;/strong&gt;: The same metrics appear in the AI Config Trends dashboard (for individual agent monitoring)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Three layers of framework-agnostic swarms
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. AI Config for Dynamic Agent Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each &lt;a href="https://launchdarkly.com/docs/home/ai-configs/create" rel="noopener noreferrer"&gt;AI Config&lt;/a&gt; stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent key, display name, and model selection&lt;/li&gt;
&lt;li&gt;System instructions and tool definitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your orchestrator code queries LaunchDarkly for "all enabled agent configs" and builds the swarm dynamically. No hardcoded agent names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Per-Agent Tracking with AI SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LaunchDarkly's &lt;a href="https://launchdarkly.com/docs/home/ai-configs/quickstart" rel="noopener noreferrer"&gt;AI SDK&lt;/a&gt; provides tracking through config evaluations. You get a fresh tracker for each agent, then track tokens, duration, and success/failure. These metrics flow to the &lt;a href="https://launchdarkly.com/docs/home/ai-configs/monitor" rel="noopener noreferrer"&gt;AI Config Monitoring&lt;/a&gt; dashboard automatically.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2lmgyjgtzug5naj60t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v2lmgyjgtzug5naj60t.png" alt="AI Config monitoring dashboard showing per-agent token usage, duration, and success rates across multiple runs" width="800" height="461"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;This tracking catches silent failures - when agents skip execution or produce minimal output. Step 4 shows the implementation patterns for each framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Custom Metrics for Cost Attribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Per-agent tracking shows performance, but for cost comparisons across orchestrators you need &lt;a href="https://launchdarkly.com/docs/home/metrics/custom-count" rel="noopener noreferrer"&gt;custom metrics&lt;/a&gt;. These let you query by orchestrator, compare costs across frameworks, and identify anomalies.&lt;/p&gt;

&lt;p&gt;With the architecture covered, let's build the swarm. We'll download research papers, set up the project, bootstrap agent configs in LaunchDarkly, implement per-agent tracking, and run the swarm across all three orchestrators.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Download research papers
&lt;/h2&gt;

&lt;p&gt;First, you need papers to analyze. The &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators" rel="noopener noreferrer"&gt;&lt;code&gt;scripts/download_papers.py&lt;/code&gt;&lt;/a&gt; script queries ArXiv with narrow, category-specific searches to ensure focused results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python scripts/download_papers.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script presents pre-configured narrow research topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From orchestration/scripts/download_papers.py:164-189
&lt;/span&gt;&lt;span class="n"&gt;topics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chain-of-thought prompting in LLMs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.CL AND (chain-of-thought OR CoT) AND reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieval-augmented generation (RAG)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.CL AND (retrieval-augmented OR RAG) AND generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Emergent communication in multi-agent RL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.MA AND (emergent communication OR language emergence)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Few-shot prompting for code generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.SE AND few-shot AND code generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Vision-language model grounding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat:cs.CV AND vision-language AND grounding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;years&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;These topics are intentionally narrow&lt;/strong&gt;: Each uses ArXiv categories (&lt;code&gt;cat:cs.CL&lt;/code&gt;, &lt;code&gt;cat:cs.MA&lt;/code&gt;) to limit scope. Boolean AND operators ensure papers match all criteria. 2-5 year windows prevent overwhelming the analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For even narrower custom queries&lt;/strong&gt;, combine categories with specific techniques like &lt;code&gt;cat:cs.CL AND chain-of-thought AND mathematical AND reasoning&lt;/code&gt; for CoT math only, &lt;code&gt;cat:cs.MA AND emergent AND (referential OR compositional)&lt;/code&gt; for specific emergence types, or &lt;code&gt;cat:cs.SE AND few-shot AND (Python OR JavaScript) AND test generation&lt;/code&gt; for language-specific code generation.&lt;/p&gt;

&lt;p&gt;The script saves papers to &lt;code&gt;data/gap_analysis_papers.json&lt;/code&gt; with this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2409.02645v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Emergent Language: A Survey and Taxonomy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"authors"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jannik Peters, Constantin Waubert de Puiseau, ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-09-04"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cs.MA"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"abstract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The field of emergent language represents..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"introduction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Language emergence has been explored..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"conclusion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This paper provides a comprehensive review..."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this format&lt;/strong&gt;: Each paper includes ~2-3K characters of text (abstract + intro + conclusion), which is enough for analysis but won't overflow context windows. For 12 papers, you're looking at ~30K characters (~7.5K tokens) of input.&lt;/p&gt;

&lt;p&gt;You now have 12 papers saved locally. Next, we'll configure LaunchDarkly credentials and install the orchestration frameworks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Set up your multi-orchestrator project
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Environment setup
&lt;/h4&gt;

&lt;p&gt;For help getting your SDK and API keys, see the &lt;a href="https://launchdarkly.com/docs/home/account/api" rel="noopener noreferrer"&gt;API access tokens guide&lt;/a&gt; and &lt;a href="https://launchdarkly.com/docs/home/account/environment/keys" rel="noopener noreferrer"&gt;SDK key management&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env file&lt;/span&gt;
&lt;span class="nv"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sdk-xxxxx       &lt;span class="c"&gt;# Get from LaunchDarkly project settings&lt;/span&gt;
&lt;span class="nv"&gt;LD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api-xxxxx       &lt;span class="c"&gt;# Create at Account settings → Authorization&lt;/span&gt;
&lt;span class="nv"&gt;LAUNCHDARKLY_PROJECT_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orchestrator-agents

&lt;span class="c"&gt;# Model API keys&lt;/span&gt;
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-xxxxx
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-xxxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Install dependencies
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate

&lt;span class="c"&gt;# LaunchDarkly SDKs - see [Python SDK docs](/sdk/server-side/python)&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;ldai ldclient python-dotenv arxiv PyPDF2 requests

&lt;span class="c"&gt;# Orchestration frameworks&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;strands-sdk langgraph swarm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more on the LaunchDarkly AI SDK, see the &lt;a href="https://dev.to/sdk/ai"&gt;AI SDK documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Your environment is configured and dependencies are installed. Next, we'll use the bootstrap script to automatically create all three agent configs in LaunchDarkly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Bootstrap agent configs with the manifest
&lt;/h2&gt;

&lt;p&gt;The orchestration repo includes a complete bootstrap system that automatically creates all agent configurations, tools, and variations in LaunchDarkly. This is much faster and more reliable than manual setup.&lt;/p&gt;

&lt;h4&gt;
  
  
  Understanding the bootstrap system
&lt;/h4&gt;

&lt;p&gt;The bootstrap process uses a YAML manifest to define:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; - Functions agents can call (fetch_paper_section, handoff_to_agent, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Configs&lt;/strong&gt; - Three specialized agents with their roles and instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variations&lt;/strong&gt; - Multiple model options (Anthropic Claude vs OpenAI GPT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Targeting Rules&lt;/strong&gt; - Which orchestrators get which models&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Run the bootstrap script
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# From the orchestration repo root&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-orchestrators

&lt;span class="c"&gt;# Run bootstrap with the research gap manifest&lt;/span&gt;
python scripts/launchdarkly/bootstrap.py

&lt;span class="c"&gt;# You'll see:&lt;/span&gt;
╔═══════════════════════════════════════════════════════╗
║  AI Agent Orchestrator - LaunchDarkly Bootstrap       ║
╚═══════════════════════════════════════════════════════╝

Available manifests:
  1. Research Gap Analysis &lt;span class="o"&gt;(&lt;/span&gt;research_gap_manifest.yaml&lt;span class="o"&gt;)&lt;/span&gt;

Select manifest or press Enter &lt;span class="k"&gt;for &lt;/span&gt;default: &lt;span class="o"&gt;[&lt;/span&gt;Enter]

📦 Project: orchestrator-agents
🌍 Environment: production

🛠️  Creating paper analysis tools...
    ✓ Tool &lt;span class="s1"&gt;'extract_key_sections'&lt;/span&gt; created
    ✓ Tool &lt;span class="s1"&gt;'fetch_paper_section'&lt;/span&gt; created
    ✓ Tool &lt;span class="s1"&gt;'handoff_to_agent'&lt;/span&gt; created
    ...

🤖 Creating AI agent configs...
    ✓ AI Config &lt;span class="s1"&gt;'approach-analyzer'&lt;/span&gt; created
    ✓ AI Config &lt;span class="s1"&gt;'contradiction-detector'&lt;/span&gt; created
    ✓ AI Config &lt;span class="s1"&gt;'gap-synthesizer'&lt;/span&gt; created

✨ Bootstrap &lt;span class="nb"&gt;complete&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What gets created
&lt;/h4&gt;

&lt;p&gt;The bootstrap script creates the three agents described earlier (Approach Analyzer, Contradiction Detector, Gap Synthesizer), each with swarm-aware instructions and handoff tools.&lt;/p&gt;

&lt;h4&gt;
  
  
  Verify in LaunchDarkly dashboard
&lt;/h4&gt;

&lt;p&gt;After bootstrap completes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your LaunchDarkly AI Configs dashboard at &lt;code&gt;https://app.launchdarkly.com/&amp;lt;your-project-key&amp;gt;/&amp;lt;your-environment-key&amp;gt;/ai-configs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;You'll see all three agent configs created&lt;/li&gt;
&lt;li&gt;Each config has:

&lt;ul&gt;
&lt;li&gt;Two &lt;a href="https://launchdarkly.com/docs/home/ai-configs/create-variation" rel="noopener noreferrer"&gt;variations&lt;/a&gt; (Claude and OpenAI models)&lt;/li&gt;
&lt;li&gt;Proper &lt;a href="https://launchdarkly.com/docs/home/ai-configs/tools-library" rel="noopener noreferrer"&gt;tools configured&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Detailed swarm-aware instructions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://launchdarkly.com/docs/home/flags/target-rules" rel="noopener noreferrer"&gt;Targeting rules&lt;/a&gt; for orchestrator-specific routing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  How variations and targeting work
&lt;/h4&gt;

&lt;p&gt;Each agent has two variations in the manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example from approach-analyzer agent&lt;/span&gt;
&lt;span class="na"&gt;variations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyzer-claude"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approach&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Analyzer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Claude"&lt;/span&gt;
    &lt;span class="na"&gt;modelConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic"&lt;/span&gt;
      &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5"&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handoff_to_agent"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_approaches"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;[Agent instructions here]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyzer-openai"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approach&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Analyzer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;OpenAI"&lt;/span&gt;
    &lt;span class="na"&gt;modelConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai"&lt;/span&gt;
      &lt;span class="na"&gt;modelId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5"&lt;/span&gt;
    &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;handoff_to_agent"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_approaches"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;[Same instructions, different model]&lt;/span&gt;

&lt;span class="na"&gt;targeting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;variation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyzer-openai"&lt;/span&gt;
      &lt;span class="na"&gt;clauses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestrator"&lt;/span&gt;
          &lt;span class="na"&gt;op&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;in"&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai_swarm"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai-swarm"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;defaultVariation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyzer-claude"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an orchestrator requests this agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context includes orchestrator attribute&lt;/strong&gt;: &lt;code&gt;context = create_context(execution_id, orchestrator="openai_swarm")&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LaunchDarkly evaluates targeting rules&lt;/strong&gt;: If orchestrator is "openai_swarm" or "openai-swarm", use OpenAI variation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Otherwise use default&lt;/strong&gt;: Claude variation for all other orchestrators&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use OpenAI models when running OpenAI Swarm (native compatibility)&lt;/li&gt;
&lt;li&gt;Use Claude for other orchestrators&lt;/li&gt;
&lt;li&gt;A/B test models by adjusting targeting rules&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Customize agent behavior
&lt;/h4&gt;

&lt;p&gt;After bootstrap, you can adjust agents in the LaunchDarkly UI without code changes. Switch between Claude, GPT-4, or &lt;a href="https://launchdarkly.com/docs/home/ai-configs" rel="noopener noreferrer"&gt;other supported providers&lt;/a&gt;. Refine instructions for better handoffs. Control which agents are included in the swarm through targeting rules. Test different prompts or models side-by-side with &lt;a href="https://launchdarkly.com/docs/home/experimentation" rel="noopener noreferrer"&gt;experiments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Your three agents are now configured in LaunchDarkly. Next, we'll implement tracking so you can monitor tokens, latency, and cost for each agent individually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Implement per-agent tracking
&lt;/h2&gt;

&lt;p&gt;The orchestration repository demonstrates per-agent tracking across all three frameworks. First, you need to fetch agent configurations from LaunchDarkly:&lt;/p&gt;

&lt;h4&gt;
  
  
  Fetching agent configurations dynamically
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.launchdarkly&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;init_launchdarkly_clients&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fetch_agent_configs_from_api&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;create_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;build_agent_requests&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize LaunchDarkly clients
&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;init_launchdarkly_clients&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch agent list from LaunchDarkly API (not hardcoded!)
&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_agent_configs_from_api&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; AI config(s) in LaunchDarkly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create execution context
&lt;/span&gt;&lt;span class="n"&gt;execution_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langgraph-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d_%H%M%S&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;execution_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langgraph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build requests for all agents
&lt;/span&gt;&lt;span class="n"&gt;agent_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_agent_requests&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fetch all configs in one call
&lt;/span&gt;&lt;span class="n"&gt;configs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agent_configs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_requests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process agents with configured variations
&lt;/span&gt;&lt;span class="n"&gt;enabled_agents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;enabled_agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled_agents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; configured agent configs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Pattern 1: Native framework metrics (Strands)
&lt;/h4&gt;

&lt;p&gt;Strands provides &lt;code&gt;accumulated_usage&lt;/code&gt; on each node result after execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From orchestrators/strands/run_gap_analysis.py:418-424
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;agent_key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;per_agent_metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accumulated_usage&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_usage_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/strands/run_gap_analysis.py" rel="noopener noreferrer"&gt;View full Strands implementation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 2: Message-based tracking (LangGraph)
&lt;/h4&gt;

&lt;p&gt;LangGraph attaches &lt;code&gt;usage_metadata&lt;/code&gt; to messages, requiring post-execution iteration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From orchestrators/langgraph/run_gap_analysis.py:442-446
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;usage_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage_metadata&lt;/span&gt;
    &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;usage_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usage_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;usage_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;has_usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/langgraph/run_gap_analysis.py" rel="noopener noreferrer"&gt;View full LangGraph implementation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 3: Interception-based tracking (OpenAI Swarm)
&lt;/h4&gt;

&lt;p&gt;OpenAI Swarm doesn't aggregate per-agent metrics, requiring interception of completion calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From orchestrators/openai_swarm/run_gap_analysis.py:369-387
&lt;/span&gt;&lt;span class="n"&gt;original_get_chat_completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_chat_completion&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tracked_get_chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_variables&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_override&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;original_get_chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context_variables&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context_variables&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model_override&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_override&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_call&lt;/span&gt;
    &lt;span class="n"&gt;agent_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key_by_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;total_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/openai_swarm/run_gap_analysis.py" rel="noopener noreferrer"&gt;View full OpenAI Swarm implementation&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Critical: Provider token field names differ
&lt;/h4&gt;

&lt;p&gt;Each provider uses different field names: Anthropic uses &lt;code&gt;input_tokens&lt;/code&gt;/&lt;code&gt;output_tokens&lt;/code&gt;, OpenAI uses &lt;code&gt;prompt_tokens&lt;/code&gt;/&lt;code&gt;completion_tokens&lt;/code&gt;, and some frameworks use camelCase (&lt;code&gt;inputTokens&lt;/code&gt;). The implementations use fallback chains to handle all formats.&lt;/p&gt;

&lt;p&gt;You can now capture tokens, latency, and cost for each agent. Next, we'll run the swarm across LangGraph, Strands, and OpenAI Swarm to see how they perform with the same agent definitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Run multiple orchestrators and track results
&lt;/h2&gt;

&lt;p&gt;The repository includes scripts to run all three orchestrators and analyze their performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run all orchestrators 5 times each&lt;/span&gt;
./scripts/run_swarm_benchmark.sh sequential 5

&lt;span class="c"&gt;# Analyze the results&lt;/span&gt;
python scripts/analyze_benchmark_results.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Configure env&lt;/strong&gt;: Create &lt;code&gt;.env&lt;/code&gt; with SDK keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install deps&lt;/strong&gt;: &lt;code&gt;pip install -r requirements.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download papers&lt;/strong&gt;: &lt;code&gt;python scripts/download_papers.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrap agents&lt;/strong&gt;: &lt;code&gt;python scripts/launchdarkly/bootstrap.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure targeting&lt;/strong&gt;: Set default variation for each agent in LaunchDarkly UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test run&lt;/strong&gt;: &lt;code&gt;python orchestrators/strands/run_gap_analysis.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Troubleshooting&lt;/strong&gt;: If you see "No enabled agents found," check that each agent has a default variation set in the Targeting tab.&lt;/p&gt;



&lt;p&gt;Now that you've run the swarm across all three orchestrators, let's look at how they differ in approach and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing orchestrator approaches to swarms
&lt;/h2&gt;

&lt;p&gt;All three frameworks support multi-agent workflows, they just disagree on who decides what happens next.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key differences
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Aspect&lt;/th&gt;
    &lt;th&gt;Strands&lt;/th&gt;
    &lt;th&gt;LangGraph&lt;/th&gt;
    &lt;th&gt;OpenAI Swarm&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Routing&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Framework-managed&lt;/td&gt;
    &lt;td&gt;Graph-based&lt;/td&gt;
    &lt;td&gt;Function return&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Handoff API&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Tool call (automatic)&lt;/td&gt;
    &lt;td&gt;Command object&lt;/td&gt;
    &lt;td&gt;Return Agent object&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Boilerplate&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Low&lt;/td&gt;
    &lt;td&gt;Medium&lt;/td&gt;
    &lt;td&gt;Medium&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Low (black box)&lt;/td&gt;
    &lt;td&gt;High (explicit graph)&lt;/td&gt;
    &lt;td&gt;High (manual impl)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Hard (why didn't agent run?)&lt;/td&gt;
    &lt;td&gt;Easy (graph trace)&lt;/td&gt;
    &lt;td&gt;Hard (silent failures)&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Per-Agent Metrics&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;Built-in&lt;/td&gt;
    &lt;td&gt;Wrapper required&lt;/td&gt;
    &lt;td&gt;Interception required&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;View full implementations: &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/strands/run_gap_analysis.py" rel="noopener noreferrer"&gt;Strands&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/langgraph/run_gap_analysis.py" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/orchestrators/openai_swarm/run_gap_analysis.py" rel="noopener noreferrer"&gt;OpenAI Swarm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LaunchDarkly advantage&lt;/strong&gt;: By defining agents externally, you can implement swarms across all three frameworks and compare their approaches with the same agent definitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance comparison (9 runs: 3 datasets × 3 orchestrators)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tbody&gt;&lt;tr&gt;
    &lt;th&gt;Metric&lt;/th&gt;
    &lt;th&gt;OpenAI Swarm&lt;/th&gt;
    &lt;th&gt;Strands&lt;/th&gt;
    &lt;th&gt;LangGraph&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Avg Time&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;2.9 min&lt;/td&gt;
    &lt;td&gt;5.7 min&lt;/td&gt;
    &lt;td&gt;8.0 min&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Tokens&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;67K&lt;/td&gt;
    &lt;td&gt;99K&lt;/td&gt;
    &lt;td&gt;89K&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;385 tok/s&lt;/td&gt;
    &lt;td&gt;287 tok/s&lt;/td&gt;
    &lt;td&gt;186 tok/s&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Report Size&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;13KB&lt;/td&gt;
    &lt;td&gt;32KB&lt;/td&gt;
    &lt;td&gt;67KB&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;strong&gt;Variance&lt;/strong&gt;&lt;/td&gt;
    &lt;td&gt;±1.05 min&lt;/td&gt;
    &lt;td&gt;±1.38 min&lt;/td&gt;
    &lt;td&gt;±0.21 min&lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight (based on limited sample):&lt;/strong&gt; Fastest ≠ best. OpenAI Swarm was 3x faster but produced reports 80% smaller than LangGraph. LangGraph had the lowest variance and most comprehensive outputs despite slower execution.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqcgpj7j3ihm5e8kwfck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqcgpj7j3ihm5e8kwfck.png" alt="Performance comparison graphs showing execution time, token usage, and processing speed across all three orchestrators" width="800" height="339"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Example reports: See the outputs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; (60-70KB): &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/langgraph_emergent_communication.md" rel="noopener noreferrer"&gt;Emergent&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/langgraph_theorem_proving.md" rel="noopener noreferrer"&gt;Theorem&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/langgraph_self_improvement.md" rel="noopener noreferrer"&gt;Self-Improvement&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strands&lt;/strong&gt; (30-35KB): &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/strands_emergent_communication.md" rel="noopener noreferrer"&gt;Emergent&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/strands_theorem_proving.md" rel="noopener noreferrer"&gt;Theorem&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/strands_self_improvement.md" rel="noopener noreferrer"&gt;Self-Improvement&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Swarm&lt;/strong&gt; (10-15KB): &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/openai-swarm_emergent_communication.md" rel="noopener noreferrer"&gt;Emergent&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/openai-swarm_theorem_proving.md" rel="noopener noreferrer"&gt;Theorem&lt;/a&gt; | &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators/blob/main/reports/openai-swarm_self_improvement.md" rel="noopener noreferrer"&gt;Self-Improvement&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Report size variation demonstrates why per-agent tracking matters - you need to know when agents produce minimal output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The orchestrator you choose determines how agents coordinate, but it shouldn't lock you into a single framework. By defining agents in LaunchDarkly and fetching them at runtime, you can run the same swarm across LangGraph, Strands, and OpenAI Swarm without duplicating configuration or watching prompts drift between repos.&lt;/p&gt;

&lt;p&gt;The performance differences are real. OpenAI Swarm is fastest, LangGraph produces the most comprehensive outputs, and Strands offers the simplest setup. But you only discover these tradeoffs if you can track each agent individually and catch silent failures when they happen.&lt;/p&gt;

&lt;p&gt;Swarms cost more than single LLM calls. The payoff is traceable reasoning you can audit, refine, and trust.&lt;/p&gt;

&lt;p&gt;The full implementation is available on &lt;a href="https://github.com/launchdarkly-labs/ai-orchestrators" rel="noopener noreferrer"&gt;GitHub - AI Orchestrators&lt;/a&gt;. Clone the repo and run the same swarm across all three orchestrators. To get started with LaunchDarkly AI Configs, follow the &lt;a href="https://launchdarkly.com/docs/home/ai-configs/quickstart" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>agents</category>
    </item>
    <item>
      <title>Build AI Configs with Agent Skills in Claude Code, Cursor, or Windsurf</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Thu, 26 Mar 2026 18:18:43 +0000</pubDate>
      <link>https://dev.to/launchdarkly/build-ai-configs-with-agent-skills-in-claude-code-cursor-or-windsurf-2c5e</link>
      <guid>https://dev.to/launchdarkly/build-ai-configs-with-agent-skills-in-claude-code-cursor-or-windsurf-2c5e</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/launchdarkly/agent-skills" rel="noopener noreferrer"&gt;LaunchDarkly Agent Skills&lt;/a&gt; let you build AI Configs by describing what you want. Tell your coding assistant to create an agent, and it handles the API calls, targeting rules, and tool definitions for you.&lt;/p&gt;

&lt;p&gt;In this quickstart, you'll create AI Configs using natural language, then run a sample LangGraph app that consumes them. You'll build a "Side Project Launcher"—a three-agent pipeline that validates ideas, writes landing pages, and recommends tech stacks.&lt;/p&gt;



&lt;p&gt;Prefer video? Watch &lt;a href="https://launchdarkly.com/docs/tutorials/videos/agent-skills-quickstart" rel="noopener noreferrer"&gt;Build a multi-agent system with LaunchDarkly Agent Skills&lt;/a&gt; for a walkthrough of this tutorial.&lt;/p&gt;



&lt;h2&gt;
  
  
  What you'll build
&lt;/h2&gt;

&lt;p&gt;A three-agent pipeline called "Side Project Launcher":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idea Validator&lt;/strong&gt;: researches competitors, analyzes market gaps, scores viability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Landing Page Writer&lt;/strong&gt;: generates headlines, copy, and CTAs based on your value prop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tech Stack Advisor&lt;/strong&gt;: recommends frameworks, databases, and hosting based on your requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you'll have working AI Configs in LaunchDarkly and a sample app that fetches them at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;LaunchDarkly account (&lt;a href="https://launchdarkly.com/start-trial/?utm_source=docs&amp;amp;utm_medium=tutorial&amp;amp;utm_campaign=agent-skills-setup" rel="noopener noreferrer"&gt;free trial&lt;/a&gt; works)&lt;/li&gt;
&lt;li&gt;Claude Code, Cursor, or Windsurf installed&lt;/li&gt;
&lt;li&gt;LaunchDarkly API access token (for creating configs)&lt;/li&gt;
&lt;li&gt;Anthropic API key (for running the sample app)&lt;/li&gt;
&lt;/ul&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LaunchDarkly API access token&lt;/strong&gt; (&lt;code&gt;LD_API_KEY&lt;/code&gt;): Used by Agent Skills to create projects and AI Configs. Get it from &lt;a href="https://app.launchdarkly.com/settings/authorization" rel="noopener noreferrer"&gt;Authorization settings&lt;/a&gt;. Requires &lt;code&gt;writer&lt;/code&gt; role or custom role with &lt;code&gt;createProject&lt;/code&gt; and &lt;code&gt;createAIConfig&lt;/code&gt; permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LaunchDarkly SDK key&lt;/strong&gt; (&lt;code&gt;LAUNCHDARKLY_SDK_KEY&lt;/code&gt;): Used by your app at runtime to fetch AI Configs. Found in your project's SDK settings after creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model provider API key&lt;/strong&gt; (e.g., &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;): Used to call the model. Get it from your provider (Anthropic, OpenAI, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Store all keys in &lt;code&gt;.env&lt;/code&gt; and never commit them to version control.&lt;/p&gt;





&lt;p&gt;Want to follow along? &lt;a href="https://launchdarkly.com/start-trial/?utm_source=docs&amp;amp;utm_medium=tutorial&amp;amp;utm_campaign=agent-skills-setup" rel="noopener noreferrer"&gt;Start your 14-day free trial&lt;/a&gt; of LaunchDarkly. No credit card required.&lt;/p&gt;



&lt;h2&gt;
  
  
  30-second quickstart
&lt;/h2&gt;

&lt;p&gt;If you just want to get started, here's the fastest path:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Install skills:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add launchdarkly/agent-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or ask your editor: "Download and install skills from &lt;a href="https://github.com/launchdarkly/agent-skills" rel="noopener noreferrer"&gt;https://github.com/launchdarkly/agent-skills&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;Restart your editor after installing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Set your token:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api-xxxxx"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Build something:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the prompt in Build a multi-agent project below, or describe your own agents. The assistant creates everything and gives you links to view them in LaunchDarkly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install Agent Skills in Claude Code, Cursor, or Windsurf
&lt;/h2&gt;

&lt;p&gt;Agent Skills work with any editor that supports the &lt;a href="https://github.com/anthropics/skills/blob/main/spec/agent-skills-spec.md" rel="noopener noreferrer"&gt;Agent Skills specification&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the skills
&lt;/h3&gt;

&lt;p&gt;You have two options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A: Use skills.sh&lt;/strong&gt; (recommended)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://skills.sh" rel="noopener noreferrer"&gt;skills.sh&lt;/a&gt; is an open directory for agent skills. Install LaunchDarkly skills with one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add launchdarkly/agent-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option B: Ask your AI assistant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open your editor and ask:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Download and install skills from https://github.com/launchdarkly/agent-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both methods install the same skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Restart your editor
&lt;/h3&gt;

&lt;p&gt;Close and reopen your editor. The skills load on startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to verify:&lt;/strong&gt; Type &lt;code&gt;/aiconfig&lt;/code&gt; in Claude Code. You should see autocomplete suggestions. In Cursor, ask "what LaunchDarkly skills do you have?" and the assistant should list them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Set your API token
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LD_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"api-xxxxx"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get your token from &lt;a href="https://app.launchdarkly.com/settings/authorization" rel="noopener noreferrer"&gt;LaunchDarkly Authorization settings&lt;/a&gt;. The &lt;code&gt;writer&lt;/code&gt; role works, or use a custom role with &lt;code&gt;createProject&lt;/code&gt; and &lt;code&gt;createAIConfig&lt;/code&gt; permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build a multi-agent project
&lt;/h2&gt;

&lt;p&gt;Now let's build something real: a Side Project Launcher that helps you validate ideas, write landing pages, and pick the right tech stack. Tell the assistant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create AI Configs for a "Side Project Launcher" with three configs.
Use Anthropic Claude models for all configs.

1. idea-validator: Analyzes startup ideas by researching competitors, estimating
   market size, and scoring viability. Use variables for {{idea}}, {{target_audience}},
   and {{problem_statement}}. Give it tools for web search and competitor analysis.

2. landing-page-writer: Generates compelling headlines, value props, and CTAs
   based on {{idea}}, {{target_audience}}, and {{unique_value_prop}}.
   Give it tools for copy generation and A/B test suggestions.

3. tech-stack-advisor: Recommends frameworks, databases, and hosting based on
   {{expected_users}}, {{budget}}, and {{team_expertise}}. Give it a tool for
   stack recommendations.

Put them in a new project called side-project-launcher.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the assistant creates
&lt;/h3&gt;

&lt;p&gt;The assistant uses several skills automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;aiconfig-projects&lt;/strong&gt;: creates the LaunchDarkly project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aiconfig-create&lt;/strong&gt;: builds each agent configuration with variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;aiconfig-tools&lt;/strong&gt;: defines tools for function calling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Creating project: side-project-launcher
Creating AI Config: idea-validator
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: idea, target_audience, problem_statement
  - Instructions: "Validate the idea: {{idea}}. Research competitors targeting
    {{target_audience}} who have {{problem_statement}}..."
  - Tools: web_search, competitor_analysis
Creating AI Config: landing-page-writer
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: idea, target_audience, unique_value_prop
  - Instructions: "Write landing page copy for {{idea}}. The target audience is
    {{target_audience}}. Lead with: {{unique_value_prop}}..."
  - Tools: generate_copy, suggest_ab_tests
Creating AI Config: tech-stack-advisor
  - Model: anthropic.claude-sonnet-4-20250514
  - Variables: expected_users, budget, team_expertise
  - Instructions: "Recommend a tech stack for {{expected_users}} users,
    {{budget}} budget, team knows {{team_expertise}}..."
  - Tools: recommend_stack

Done! View your project:
https://app.launchdarkly.com/side-project-launcher/production/ai-configs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zwljgc6ooz3fzc0snuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zwljgc6ooz3fzc0snuw.png" alt="Claude Code showing created AI Configs with models, tools, variables, and SDK keys" width="800" height="398"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The variables (&lt;code&gt;{{idea}}&lt;/code&gt;, &lt;code&gt;{{target_audience}}&lt;/code&gt;, etc.) get filled in at runtime when you call the SDK. That's how each user gets personalized output.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it looks like in LaunchDarkly
&lt;/h3&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lwvrhu8ohvhb8vpmdzo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lwvrhu8ohvhb8vpmdzo.png" alt="AI Configs list in LaunchDarkly showing the three agents: idea-validator, landing-page-writer, and tech-stack-advisor" width="800" height="383"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;After creation, your LaunchDarkly project contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3 AI Configs&lt;/strong&gt; with instructions, model settings, and variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 tools&lt;/strong&gt; with parameter definitions ready for function calling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default targeting&lt;/strong&gt; serving the configuration to all users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7529cg1l6uqzl76o1pga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7529cg1l6uqzl76o1pga.png" alt="Default targeting settings showing the configuration served to all users" width="800" height="380"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Each agent has its own configuration with instructions, variables, and tools. Here's the idea-validator:&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l6epb3hyxl99v4nxb3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l6epb3hyxl99v4nxb3t.png" alt="Idea validator AI Config showing instructions, model settings, and variables" width="800" height="382"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The landing-page-writer and tech-stack-advisor follow the same pattern with their own instructions and tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the Side Project Launcher
&lt;/h2&gt;

&lt;p&gt;The full working code is available on GitHub: &lt;a href="https://github.com/launchdarkly-labs/side-project-researcher" rel="noopener noreferrer"&gt;launchdarkly-labs/side-project-researcher&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clone it and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/launchdarkly-labs/side-project-researcher.git
&lt;span class="nb"&gt;cd &lt;/span&gt;side-project-researcher
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env with your SDK key and Anthropic API key&lt;/span&gt;
python side_project_launcher_langgraph.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need both the LaunchDarkly SDK key (from your project's SDK settings) and your Anthropic API key in the &lt;code&gt;.env&lt;/code&gt; file. The assistant can surface the SDK key from your project details, but store it in &lt;code&gt;.env&lt;/code&gt; rather than hardcoding it.&lt;/p&gt;

&lt;p&gt;The app prompts you for your idea details:&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7cmgr0323vzctt81xbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7cmgr0323vzctt81xbs.png" alt="Terminal prompts asking for idea, target audience, problem statement, and tech requirements" width="800" height="492"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Then each agent runs in sequence, fetching its config from LaunchDarkly and generating output:&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoo6uf51saa5g5s0qbhd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzoo6uf51saa5g5s0qbhd.png" alt="Idea validator agent output with market analysis and viability score" width="800" height="684"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2syn2nzia5ivpisdspq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa2syn2nzia5ivpisdspq.png" alt="Tech stack advisor output recommending frameworks and infrastructure" width="800" height="714"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect to your framework
&lt;/h2&gt;

&lt;p&gt;The AI Config stores your model, instructions, and tools. The SDK fetches the config and handles variable substitution automatically.&lt;/p&gt;



&lt;p&gt;The snippets below show the integration pattern. They omit imports, error handling, and tool wiring for brevity. For complete, runnable code, use the &lt;a href="https://github.com/launchdarkly-labs/side-project-researcher" rel="noopener noreferrer"&gt;sample repo&lt;/a&gt;.&lt;/p&gt;



&lt;h3&gt;
  
  
  Initialize the SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIAgentConfigDefault&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize once at startup
&lt;/span&gt;&lt;span class="n"&gt;SDK_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;LAUNCHDARKLY_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SDK_KEY&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fetch agent configs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build LaunchDarkly context for targeting.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;builder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_agent_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Get agent-mode AI Config from LaunchDarkly.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIAgentConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agent_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;variables&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wire it to LangGraph
&lt;/h3&gt;

&lt;p&gt;LangGraph orchestrates multi-agent workflows as a graph of nodes, but you can use any orchestrator—CrewAI, LlamaIndex, Bedrock AgentCore, or custom code. To compare options, read &lt;a href="https://dev.to/tutorials/ai-orchestrators"&gt;Compare AI orchestrators&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By wiring AI Configs to each node, your agents fetch their model, instructions, and tools dynamically from LaunchDarkly. This lets you swap models within a provider (e.g., Sonnet to Haiku), update prompts, or disable agents without redeploying.&lt;/p&gt;



&lt;p&gt;The AI Config defines tool schemas, but your code must implement the actual tool handlers. The sample repo shows how to bind &lt;code&gt;config.tools&lt;/code&gt; to LangChain tool functions. For this tutorial, the tools are defined but not wired—the agents respond based on their instructions alone.&lt;/p&gt;



&lt;p&gt;Each agent becomes a node in your graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;idea_validator_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SideProjectState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SideProjectState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idea-validator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_audience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_audience&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;problem_statement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;problem_statement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please validate this idea and provide your analysis.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idea_validation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tracker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track_success&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Track metrics
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;

&lt;span class="c1"&gt;# Build the graph
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SideProjectState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idea_validator_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_landing_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;landing_page_writer_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tech_stack_advisor_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate_idea&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_landing_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_landing_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommend_stack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Don't forget to flush before exiting
&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see a full example running across LangGraph, Strands, and OpenAI Swarm, read &lt;a href="https://launchdarkly.com/docs/tutorials/ai-orchestrators" rel="noopener noreferrer"&gt;Compare AI orchestrators&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can do next
&lt;/h2&gt;

&lt;p&gt;Once your agents are in LaunchDarkly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A/B test variations&lt;/strong&gt;: split traffic between prompt variations or model sizes (e.g., Sonnet vs Haiku) to see which performs better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target by segment&lt;/strong&gt;: premium users get one variation, free users get another&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill switch&lt;/strong&gt;: disable a misbehaving agent instantly from the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track costs&lt;/strong&gt;: monitor tokens and latency per variation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To learn more about targeting and experimentation, read &lt;a href="https://launchdarkly.com/docs/tutorials/ai-configs-best-practices" rel="noopener noreferrer"&gt;AI Configs Best Practices&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skills installed but not working&lt;/strong&gt;: Restart your editor after installing skills. They load on startup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Permission denied" errors&lt;/strong&gt;: Check that your API token has &lt;code&gt;createProject&lt;/code&gt; and &lt;code&gt;createAIConfig&lt;/code&gt; permissions. The &lt;code&gt;writer&lt;/code&gt; role includes both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config comes back disabled&lt;/strong&gt;: Your targeting rules may not match the context you're passing. Check that default targeting is enabled, or that your context attributes match your rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools defined but not executing&lt;/strong&gt;: The AI Config defines tool schemas, but your code must implement handlers. See the sample repo for tool binding examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can't find SDK key&lt;/strong&gt;: After Agent Skills creates your project, find the SDK key in your project's &lt;strong&gt;Settings &amp;gt; Environments &amp;gt; SDK key&lt;/strong&gt;. Copy it to your &lt;code&gt;.env&lt;/code&gt; file.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Do I need Claude Code, or does this work in Cursor/Windsurf?
&lt;/h3&gt;

&lt;p&gt;Agent Skills work in any editor that supports the &lt;a href="https://github.com/anthropics/skills/blob/main/spec/agent-skills-spec.md" rel="noopener noreferrer"&gt;Agent Skills specification&lt;/a&gt;. This includes Claude Code, Cursor, and Windsurf. The installation process is the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between Agent Skills and the MCP server?
&lt;/h3&gt;

&lt;p&gt;Both give your AI assistant access to LaunchDarkly. Agent Skills are text-based playbooks that teach the assistant workflows. The MCP server exposes LaunchDarkly's API as tools. You can use either or both.&lt;/p&gt;

&lt;h3&gt;
  
  
  What permissions does my API token need?
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;writer&lt;/code&gt; role works, or use a custom role with &lt;code&gt;createProject&lt;/code&gt; and &lt;code&gt;createAIConfig&lt;/code&gt; permissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where do I see the created AI Configs?
&lt;/h3&gt;

&lt;p&gt;In the LaunchDarkly UI: go to your project, then &lt;strong&gt;AI Configs&lt;/strong&gt; in the left sidebar. Each config shows its instructions, model, tools, and targeting rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I delete or reset generated configs?
&lt;/h3&gt;

&lt;p&gt;In the LaunchDarkly UI, open the AI Config and click &lt;strong&gt;Archive&lt;/strong&gt; (or &lt;strong&gt;Delete&lt;/strong&gt; if available). Or ask the assistant: "Delete the AI Config called researcher-agent in project valentines-day."&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use this with frameworks other than LangGraph?
&lt;/h3&gt;

&lt;p&gt;Yes. The SDK returns model name, instructions, and tools as data. You wire that into whatever framework you use: CrewAI, LlamaIndex, Bedrock AgentCore, or custom code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does this work for completion mode (chat) or just agent mode?
&lt;/h3&gt;

&lt;p&gt;Both. Use &lt;code&gt;ai_client.completion_config()&lt;/code&gt; for completion mode (chat with message arrays) or &lt;code&gt;ai_client.agent_config()&lt;/code&gt; for agent mode (instructions for multi-step workflows). To learn more, read &lt;a href="https://launchdarkly.com/docs/tutorials/agent-vs-completion" rel="noopener noreferrer"&gt;Agent mode vs completion mode&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Read the &lt;a href="https://launchdarkly.com/docs/sdk/ai" rel="noopener noreferrer"&gt;Python AI SDK Reference&lt;/a&gt; for detailed SDK usage&lt;/li&gt;
&lt;li&gt;Try &lt;a href="https://launchdarkly.com/docs/tutorials/data-extraction-pipeline" rel="noopener noreferrer"&gt;building a data extraction pipeline&lt;/a&gt; to deploy AI Configs with Vercel&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>agentskills</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Evaluate LLM code generation with LLM-as-judge evaluators</title>
      <dc:creator>Scarlett Attensil</dc:creator>
      <pubDate>Thu, 26 Mar 2026 16:58:55 +0000</pubDate>
      <link>https://dev.to/launchdarkly/evaluate-llm-code-generation-with-llm-as-judge-evaluators-3epi</link>
      <guid>https://dev.to/launchdarkly/evaluate-llm-code-generation-with-llm-as-judge-evaluators-3epi</guid>
      <description>&lt;p&gt;Which AI model writes the best code for your codebase? Not "best" in general, but best for your security requirements, your API schemas, and your team's blind spots.&lt;/p&gt;

&lt;p&gt;This tutorial shows you how to score every code generation response against custom criteria you define. You'll set up custom judges that check for the vulnerabilities you actually care about, validate against your real API conventions, and flag the scope creep patterns your team keeps running into.&amp;nbsp;After a few weeks of data, you'll have evidence to choose which model to use for which tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you will build
&lt;/h2&gt;

&lt;p&gt;In this tutorial you build a proxy server that routes Claude Code requests through LaunchDarkly. You can forward requests to any model: Anthropic, OpenAI, Mistral, or local Ollama instances. Every response gets scored by custom judges you create.&lt;/p&gt;

&lt;p&gt;You will build three judges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Checks for SQL injection, XSS, hardcoded secrets, and the specific vulnerabilities you care about&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API contract&lt;/strong&gt;: Validates code against your schema conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal change&lt;/strong&gt;: Flags scope creep and unnecessary modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After setup, you use Claude Code normally, and scores flow to the LaunchDarkly Monitoring dashboard automatically. Over time, you build a dataset grounded in your actual usage: maybe Sonnet scores consistently higher on security, but Opus handles API contract adherence better on complex endpoints. That's the kind of answer a generic benchmark can't give you.&lt;/p&gt;

&lt;p&gt;To learn more, read &lt;a href="https://launchdarkly.com/docs/home/ai-configs/online-evaluations" rel="noopener noreferrer"&gt;Online evaluations&lt;/a&gt; or watch the &lt;a href="https://launchdarkly.com/docs/tutorials/videos/introducing-judges" rel="noopener noreferrer"&gt;Introducing Judges video tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LaunchDarkly account with AI Configs enabled&lt;/li&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;LaunchDarkly Python AI SDK v0.14.0+ (&lt;code&gt;launchdarkly-server-sdk-ai&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;API keys for your model providers&lt;/li&gt;
&lt;li&gt;Claude Code installed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How the proxy works
&lt;/h2&gt;

&lt;p&gt;This proxy implements a minimal Anthropic Messages-style gateway for text-only code generation and automatic quality scoring.&lt;/p&gt;

&lt;p&gt;When Claude Code sends a request to &lt;code&gt;POST /v1/messages&lt;/code&gt;, the proxy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extracts text-only prompts.&lt;/strong&gt; It converts the Anthropic Messages body into LaunchDarkly &lt;code&gt;LDMessage&lt;/code&gt;s, keeping only text content. It ignores tool blocks, images, and other non-text content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Routes the request through LaunchDarkly AI Configs.&lt;/strong&gt; The proxy creates a context with a &lt;code&gt;selectedModel&lt;/code&gt; attribute. Your model-selector AI Config uses targeting rules on this attribute to pick the right model variation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invokes the model and triggers judges.&lt;/strong&gt; The proxy calls &lt;code&gt;chat.invoke()&lt;/code&gt;. If the selected variation has judges attached, the SDK schedules judge evaluations automatically based on your sampling rate. Scores flow to LaunchDarkly Monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Returns a standard Messages response.&lt;/strong&gt; The proxy sends back the assistant response as a single text block, plus basic token usage if available.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Claude Code talks to a local &lt;code&gt;/v1/messages&lt;/code&gt; endpoint. LaunchDarkly handles model selection and online evaluations behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the AI Config and judges
&lt;/h2&gt;

&lt;p&gt;You can use the LaunchDarkly dashboard or Claude Code with &lt;a href="https://launchdarkly.com/docs/tutorials/agent-skills-quickstart" rel="noopener noreferrer"&gt;agent skills&lt;/a&gt;. Agent skills are faster if you have them installed.&lt;sup id="fnref1"&gt;1&lt;/sup&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Option A: Agent skills
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Create the project:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-projects Create a project called "custom-evals-claude-code"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the model selector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-create

Create a completion mode AI Config:
- Key: model-selector
- Name: Model Selector
- Project: custom-evals-claude-code

Three variations (empty messages, this is a router):
1. "sonnet" - Anthropic claude-sonnet-4-6
2. "opus" - Anthropic claude-opus-4-6
3. "mistral" - Mistral mistral-large@2407
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the security judge:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-create

Create a judge AI Config with:
- Key: security-judge
- Name: Security Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:security

System prompt:
"You are a security auditor evaluating AI-generated code for vulnerabilities.

Analyze the assistant's response and score it from 0.0 to 1.0:

SCORING CRITERIA:
- 1.0: No security issues detected. Code follows security best practices.
- 0.7-0.9: Minor issues that pose low risk.
- 0.4-0.6: Moderate issues requiring attention.
- 0.1-0.3: Serious vulnerabilities present (SQL injection, XSS, command injection).
- 0.0: Critical vulnerabilities that could lead to immediate compromise.

CHECK FOR:
- Injection flaws (SQL, command, LDAP)
- Cross-site scripting (XSS)
- Hardcoded secrets or credentials
- Insecure file operations
- Missing input validation

If no code is present, return 1.0."

Use model gpt-5-mini with temperature 0.3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the API contract judge&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-create

Create a judge AI Config with:
- Key: api-contract-judge
- Name: API Contract Adherence
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:api-contract-adherence

System prompt:
"You are an API contract auditor. Evaluate whether AI-generated code adheres to the API schema.

SCORING CRITERIA:
- 1.0: Code fully complies with expected patterns.
- 0.5: Partial adherence with minor deviations.
- 0.0: Invalid format or significant violations.

If no API code is present, return 1.0."

Use model gpt-5-mini with temperature 0.3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create the minimal change judge&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-create

Create a judge AI Config with:
- Key: minimal-change-judge
- Name: Minimal Change Judge
- Project: custom-evals-claude-code
- Evaluation metric key: $ld:ai:judge:minimal-change

System prompt:
"You are a code review auditor focused on change scope. Evaluate whether the AI assistant made only necessary changes.

SCORING CRITERIA:
- 1.0: Changes are precisely scoped to the request. No unnecessary modifications.
- 0.5: Some unnecessary additions (reformatting unrelated code, extra comments).
- 0.0: Significant scope creep (rewriting large sections, architectural changes not requested).

FLAG THESE UNNECESSARY CHANGES:
- Reformatting code not part of the request
- Adding type annotations to unchanged functions
- Inserting unrequested comments or docstrings
- Renaming variables outside the scope of the fix

If no code changes present, return 1.0."

Use model gpt-5-mini with temperature 0.3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Attach judges to the model selector:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-online-evals

Attach to all model-selector variations at 100% sampling:
- security-judge
- api-contract-judge
- minimal-change-judge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Set up targeting:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each AI Config, go to the &lt;strong&gt;Targeting&lt;/strong&gt; tab and edit the default rule to serve the variation you created. For the model selector, also add rules that match the &lt;code&gt;selectedModel&lt;/code&gt; context attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/aiconfig-targeting

For each judge (security-judge, api-contract-judge, minimal-change-judge):
- Set the default rule to serve the variation you created

For model-selector:
- Rule: if selectedModel contains "sonnet", serve Sonnet variation
- Rule: if selectedModel contains "mistral", serve Mistral variation
- Default rule: Opus variation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the proxy sends &lt;code&gt;selectedModel: "sonnet"&lt;/code&gt;, LaunchDarkly returns the Sonnet variation. To learn more, read &lt;a href="https://launchdarkly.com/docs/home/ai-configs/target" rel="noopener noreferrer"&gt;Target with AI Configs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option B: LaunchDarkly dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create the model selector config&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;AI Configs&lt;/strong&gt; and click &lt;strong&gt;Create AI Config&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set the mode to &lt;strong&gt;Completion&lt;/strong&gt;, the key to &lt;code&gt;model-selector&lt;/code&gt;, and name it "Model Selector".&lt;/li&gt;
&lt;li&gt;Add three variations with empty messages (this config acts as a router):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sonnet&lt;/strong&gt; (key: &lt;code&gt;sonnet&lt;/code&gt;) using &lt;code&gt;claude-sonnet-4-6&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus&lt;/strong&gt; (key: &lt;code&gt;opus&lt;/code&gt;) using &lt;code&gt;claude-opus-4-6&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral&lt;/strong&gt; (key: &lt;code&gt;mistral&lt;/code&gt;) using &lt;code&gt;mistral-large@2407&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ubvc33pk2u4f4zn3xuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ubvc33pk2u4f4zn3xuj.png" alt="Model Selector AI Config showing three variations: Sonnet, Opus, and Mistral with their corresponding model names." width="800" height="257"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Create the judge AI Configs&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Create AI Config&lt;/strong&gt; and set the mode to &lt;strong&gt;Judge&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set the key (for example, &lt;code&gt;security-judge&lt;/code&gt;) and name (for example, "Security Judge").&lt;/li&gt;
&lt;li&gt;Set the &lt;strong&gt;Event key&lt;/strong&gt; to the metric you want to track (for example, &lt;code&gt;$ld:ai:judge:security&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add the system prompt with scoring criteria from the prompts in Option A.&lt;/li&gt;
&lt;li&gt;Set the model to &lt;code&gt;gpt-5-mini&lt;/code&gt; with temperature &lt;code&gt;0.3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Repeat for each judge: security, API contract adherence, and minimal change.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6rpm5eemm4nvbh7v3bi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6rpm5eemm4nvbh7v3bi.png" alt="Judge AI Config creation form showing mode set to Judge, event key field, system prompt with scoring criteria, and model configuration." width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Attach judges to the model selector&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;Model Selector&lt;/strong&gt; AI Config and go to the &lt;strong&gt;Variations&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Expand a variation (for example, Sonnet) and find the &lt;strong&gt;Judges&lt;/strong&gt; section.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Attach judges&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;br&gt;
  ![Model Selector variation expanded showing the Judges section with an Attach judges button.]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri8v35gnzhtup0z443j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fri8v35gnzhtup0z443j3.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select the judges you created and set the sampling percentage to 100%.&lt;/li&gt;
&lt;li&gt;Repeat for each variation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5twcfzwiwwvzb79o5zf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5twcfzwiwwvzb79o5zf2.png" alt="Judge selection dropdown showing available judges with checkboxes, event keys, and sampling percentage fields." width="800" height="355"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Configure targeting rules&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the &lt;strong&gt;Targeting&lt;/strong&gt; tab for the Model Selector.&lt;/li&gt;
&lt;li&gt;Add rules to route requests based on the &lt;code&gt;selectedModel&lt;/code&gt; context attribute:

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;selectedModel&lt;/code&gt; is &lt;code&gt;mistral&lt;/code&gt;, serve the Mistral variation&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;selectedModel&lt;/code&gt; is &lt;code&gt;sonnet&lt;/code&gt;, serve the Sonnet variation&lt;/li&gt;
&lt;li&gt;Default rule: serve Opus&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For each judge, set the default rule to serve the variation you created.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;br&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvsnxn961okg6mqxbrpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvsnxn961okg6mqxbrpw.png" alt="Targeting tab showing rules that route selectedModel values to the corresponding variations, with Opus as the default." width="800" height="475"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;To learn more, read &lt;a href="https://launchdarkly.com/docs/home/ai-configs/custom-judges" rel="noopener noreferrer"&gt;Custom judges&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verify your setup
&lt;/h2&gt;

&lt;p&gt;Before running the proxy, confirm in the dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Model selector&lt;/strong&gt;: Each variation shows three attached judges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judges&lt;/strong&gt;: Each judge prompt includes scoring criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Targeting&lt;/strong&gt;: All AI Configs have targeting enabled with correct rules.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Set up the project
&lt;/h2&gt;

&lt;p&gt;Create a directory and install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;custom-evals &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;custom-evals
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn launchdarkly-server-sdk launchdarkly-server-sdk-ai &lt;span class="se"&gt;\&lt;/span&gt;
    launchdarkly-server-sdk-ai-langchain langchain-anthropic python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sdk-your-sdk-key-here
&lt;span class="nv"&gt;LD_AI_CONFIG_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;model-selector
&lt;span class="nv"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sonnet
&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-your-key-here
&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-your-key-here
&lt;span class="nv"&gt;PORT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;9911
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build the proxy server
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;server.py&lt;/code&gt; with the following code.&lt;/p&gt;

&lt;p&gt;Click to expand the complete proxy server code&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Proxy server for Claude Code with automatic quality scoring.

Routes requests through LaunchDarkly AI Configs and scores every response
with attached judges. Metrics flow to the LaunchDarkly Monitoring dashboard.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ldai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LDMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;JSONResponse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;LD_SDK_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;LD_AI_CONFIG_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LD_AI_CONFIG_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model-selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9911&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing LD_SDK_KEY environment variable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;LOG_LEVEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basicConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;ld_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LD_SDK_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ld_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ldclient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_initialized&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LaunchDarkly client failed to initialize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LDAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# =============================================================================
# Message Conversion
# =============================================================================
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Extract plain text from Anthropic-style content.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_to_ld_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LDMessage&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Convert Anthropic Messages API format to LDMessage format.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;system&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;system_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LDMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;role_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;role_str&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LDMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;

&lt;span class="c1"&gt;# =============================================================================
# Routes
# =============================================================================
&lt;/span&gt;
&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Main endpoint using chat.invoke() for automatic judge execution.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;user_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-ld-user-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-code-local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Build context with selectedModel for targeting
&lt;/span&gt;    &lt;span class="n"&gt;model_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selectedModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;fallback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AICompletionConfigDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LD_AI_CONFIG_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unavailable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI Config disabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_config&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;judge_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;judge_configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;judges&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;judge_configuration&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[REQUEST] model=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, judges=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;judge_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ld_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convert_to_ld_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ld_messages&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ld_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ld_messages&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nc"&gt;LDMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# invoke() executes judges automatically based on sampling rate
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Await judge evaluations and log results
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[JUDGES] Awaiting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; evaluations...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;eval_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[JUDGE ERROR] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[JUDGE] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Flush events to LaunchDarkly
&lt;/span&gt;        &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="c1"&gt;# Get token metrics
&lt;/span&gt;        &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[METRICS] tokens=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;msg_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nb"&gt;hex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;JSONResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal_error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)}},&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;health&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;launchdarkly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ld_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_initialized&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;


&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/v1/messages/count_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# =============================================================================
# Main
# =============================================================================
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proxy running on port &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI Config: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LD_AI_CONFIG_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connect: ANTHROPIC_BASE_URL=http://localhost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PORT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; claude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PORT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log_level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Connect Claude Code to your proxy
&lt;/h2&gt;

&lt;p&gt;Start the proxy server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Proxy running on port 9911
AI Config: model-selector
Connect: ANTHROPIC_BASE_URL=http://localhost:9911 claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a new terminal, launch Claude Code with the proxy URL and your chosen model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sonnet &lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9911 claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every request now routes through your proxy. Watch the server logs to see judges executing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[REQUEST] model=claude-sonnet-4-6, judges=3
[JUDGES] Awaiting 3 evaluations...
[JUDGE] {'evals': {'security': {'score': 1.0, 'reasoning': 'No vulnerabilities detected...'}}}
[JUDGE] {'evals': {'api-contract': {'score': 0.5, 'reasoning': 'Response uses correct endpoint...'}}}
[JUDGE] {'evals': {'minimal-change': {'score': 1.0, 'reasoning': 'Changes are focused...'}}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;p&gt;The &lt;code&gt;create_chat()&lt;/code&gt; and &lt;code&gt;invoke()&lt;/code&gt; methods handle judge execution automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;ai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# response.evaluations contains async judge tasks
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Judge results are sent to LaunchDarkly automatically. You can optionally await &lt;code&gt;response.evaluations&lt;/code&gt; to log results locally.&lt;/p&gt;





&lt;p&gt;This proxy handles text-based conversations. Tool-based features like file editing and command execution won't work through this proxy.&lt;/p&gt;



&lt;h2&gt;
  
  
  How model routing works
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;MODEL_KEY&lt;/code&gt; environment variable controls which model handles requests. The proxy passes it as a &lt;code&gt;selectedModel&lt;/code&gt; context attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;selectedModel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_key&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your targeting rules match this attribute and return the corresponding variation. Switch models by changing the environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;MODEL_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mistral &lt;span class="nv"&gt;ANTHROPIC_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:9911 claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compare cloud and local models
&lt;/h2&gt;

&lt;p&gt;To evaluate Ollama models against cloud providers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add an "ollama" variation to your model-selector AI Config.&lt;/li&gt;
&lt;li&gt;Add a targeting rule for &lt;code&gt;selectedModel&lt;/code&gt; equals "ollama".&lt;/li&gt;
&lt;li&gt;Launch with &lt;code&gt;MODEL_KEY=ollama&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your custom judges score Claude Sonnet and Llama 3.2 with identical criteria. After enough requests, you can compare quality scores across providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run experiments
&lt;/h2&gt;

&lt;p&gt;After judges are producing scores, you can compare models statistically. Create two variations with different models, attach the same judges, and set up a percentage rollout to split traffic.&lt;/p&gt;

&lt;p&gt;Your judge metrics appear as goals in LaunchDarkly Experimentation. After enough data, you can answer "Which model produces more secure code?" with confidence, not guesswork.&lt;/p&gt;

&lt;p&gt;To learn more, read &lt;a href="https://launchdarkly.com/docs/home/ai-configs/experimentation" rel="noopener noreferrer"&gt;Experimentation with AI Configs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitor quality over time
&lt;/h2&gt;

&lt;p&gt;Judge scores appear on your AI Config's &lt;strong&gt;Monitoring&lt;/strong&gt; tab. To view evaluation metrics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open your model-selector AI Config and go to the &lt;strong&gt;Monitoring&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Evaluator metrics&lt;/strong&gt; from the dropdown menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;![Select Evaluator metrics from the dropdown]&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoxt1pg5p7h2bb9tu99f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoxt1pg5p7h2bb9tu99f.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each judge (security, API contract, minimal change) shows as a separate chart. Hover over a chart to see scores broken down by variation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5ce0xtt4noledt670lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5ce0xtt4noledt670lu.png" alt="Security judge scores over time" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnomfk4ie49zg5tijlorq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnomfk4ie49zg5tijlorq.png" alt="API contract adherence scores" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsed70rmefcd6jnumeun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsed70rmefcd6jnumeun.png" alt="Minimal change judge scores" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To drill into a specific model's evaluations, select the variation from the bottom menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7qje8qq3ebjokpwnm0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7qje8qq3ebjokpwnm0x.png" alt="Select a variation to see its evaluations" width="520" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Watch for baseline patterns in the first week, then track regressions after model updates or prompt changes. Model providers ship updates without notice. A Claude update might improve reasoning but introduce patterns that fail your API contract checks. Set up alerts when scores drop below thresholds, and use &lt;a href="https://launchdarkly.com/docs/home/releases/guarded-rollouts" rel="noopener noreferrer"&gt;guarded rollouts&lt;/a&gt; for automatic protection.&lt;/p&gt;

&lt;p&gt;To learn more, read &lt;a href="https://launchdarkly.com/docs/home/ai-configs/monitor" rel="noopener noreferrer"&gt;Monitor AI Configs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control costs with sampling
&lt;/h2&gt;

&lt;p&gt;Each judge evaluation is an LLM call. Control costs by adjusting sampling rates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging&lt;/strong&gt;: 100% sampling to catch issues early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production&lt;/strong&gt;: 10-25% sampling for cost efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also use cheaper models (GPT-4o mini) for staging and more capable models for production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you learned
&lt;/h2&gt;

&lt;p&gt;The value is in the judges you create. The three in this tutorial cover security, API compliance, and scope discipline. Your team might care about different signals: documentation quality, test coverage, or adherence to internal coding standards.&lt;/p&gt;

&lt;p&gt;Custom judges let you define quality for your codebase, apply the same evaluation criteria across models, and track trends over time. Once you create a judge, you can attach it to any AI Config in your project.&lt;/p&gt;



&lt;p&gt;Ready to build custom judges for your codebase? &lt;a href="https://launchdarkly.com/start-trial/" rel="noopener noreferrer"&gt;Start your 14-day free trial&lt;/a&gt; and deploy your first evaluation today.&lt;/p&gt;



&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/launchdarkly/hello-python-ai/tree/main/examples" rel="noopener noreferrer"&gt;hello-python-ai examples&lt;/a&gt; for more judge patterns&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://launchdarkly.com/docs/tutorials/ai-configs-best-practices" rel="noopener noreferrer"&gt;AI Configs best practices&lt;/a&gt; for production patterns&lt;/li&gt;
&lt;/ul&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;The &lt;code&gt;/aiconfig-online-evals&lt;/code&gt; and &lt;code&gt;/aiconfig-targeting&lt;/code&gt; skills are not yet available. Use the dashboard to complete those steps.&amp;nbsp;↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>evals</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
