<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manoranjan Rajguru</title>
    <description>The latest articles on DEV Community by Manoranjan Rajguru (@monuminu).</description>
    <link>https://dev.to/monuminu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1376994%2F8956b907-d30b-4730-b82b-35d338d4fa0c.jpeg</url>
      <title>DEV Community: Manoranjan Rajguru</title>
      <link>https://dev.to/monuminu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/monuminu"/>
    <language>en</language>
    <item>
      <title>Context Engineering: The Developer's Complete Guide to Building High-Performance AI Agents</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Sat, 06 Jun 2026 05:03:43 +0000</pubDate>
      <link>https://dev.to/monuminu/context-engineering-the-developers-complete-guide-to-building-high-performance-ai-agents-54l5</link>
      <guid>https://dev.to/monuminu/context-engineering-the-developers-complete-guide-to-building-high-performance-ai-agents-54l5</guid>
      <description>&lt;h1&gt;
  
  
  Context Engineering: The Developer's Complete Guide to Building High-Performance AI Agents
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Prompt Engineering Era Is Over&lt;/li&gt;
&lt;li&gt;
What Is Context Engineering?

&lt;ul&gt;
&lt;li&gt;The Anatomy of an Agent&lt;/li&gt;
&lt;li&gt;Why Prompt Engineering Isn't Enough Anymore&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The 7 Components of a Production Harness&lt;/li&gt;
&lt;li&gt;Memory Architecture: Short-Term vs. Long-Term Context&lt;/li&gt;
&lt;li&gt;MCP: Wiring the World to Your Agent&lt;/li&gt;
&lt;li&gt;The Ratchet Pattern: Turning Agent Mistakes Into Permanent Rules&lt;/li&gt;
&lt;li&gt;Multi-Agent Patterns: Orchestrators, Sub-Agents, and Skills&lt;/li&gt;
&lt;li&gt;Benchmarks &amp;amp; Real-World Evidence&lt;/li&gt;
&lt;li&gt;Getting Started: Build Your First Context-Engineered Agent&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Prompt Engineering Era Is Over
&lt;/h2&gt;

&lt;p&gt;Here's a data point worth sitting with: on Terminal Bench 2.0, Cline's CLI running &lt;strong&gt;claude-opus-4.7&lt;/strong&gt; scored &lt;strong&gt;74.2%&lt;/strong&gt;. Anthropic's own Claude Code, running the &lt;em&gt;same model&lt;/em&gt;, scored &lt;strong&gt;69.4%&lt;/strong&gt;. A ~5-point gap on a competitive benchmark — not from a better model, not from more parameters, not from more training data. Just from a better &lt;strong&gt;harness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That gap is the entire argument for context engineering.&lt;/p&gt;

&lt;p&gt;For the last three years, the dominant mental model in AI development has been: &lt;em&gt;better prompt → better output&lt;/em&gt;. Tweak the system message. Add few-shot examples. Try chain-of-thought. This worked well enough when LLMs were assistants — stateless, turn-by-turn, finishing after one response. But AI has moved on. Agents run for dozens of steps. They call tools, spawn sub-agents, manage memory across sessions, and recover from failures. A static prompt tweaked in a playground simply cannot orchestrate that complexity.&lt;/p&gt;

&lt;p&gt;The field has quietly converged on a new discipline to fill that gap: &lt;strong&gt;context engineering for AI agents&lt;/strong&gt; — the practice of deliberately shaping every token that enters an LLM's context window during an agent's execution lifecycle. HuggingFace just launched a full six-unit certification course on it. O'Reilly published a deep-dive from Google Chrome engineer Addy Osmani on the patterns. LangChain's blog is flooded with harness architecture posts. And Viv Trivedy's viral equation — &lt;code&gt;Agent = Model + Harness&lt;/code&gt; — has become the consensus definition of what an AI agent actually is.&lt;/p&gt;

&lt;p&gt;This post is the engineer's guide to that discipline. By the end, you'll understand what this discipline is, how to decompose a production harness into its seven core components, how to wire external tools through MCP, how to manage short-term and long-term memory, and how to implement the ratchet pattern so your agent gets measurably better with every failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Context Engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is the discipline of &lt;strong&gt;designing what goes into an agent's context window at every step&lt;/strong&gt;: the system prompt, tool descriptions, conversation history, retrieved knowledge, intermediate reasoning, memory injections, and output format instructions. It's not a one-time decision made before you hit enter — it's an active engineering process that shapes every model call across an agent's entire execution lifecycle.&lt;/p&gt;

&lt;p&gt;The HuggingFace definition is precise:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Context engineering: structuring knowledge so an agent can efficiently find what it needs, when it needs it, to improve its generated outputs."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This sounds simple. It is not. An agent's context window is a scarce, high-stakes resource. Fill it poorly — irrelevant history, redundant tool schemas, imprecise instructions — and the model loses track of the task, hallucinates constraints that don't exist, or burns tokens on reasoning that goes nowhere. Fill it well, and the same model that struggled through a 40-step task completes it cleanly on the first try.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Anatomy of an Agent
&lt;/h3&gt;

&lt;p&gt;To master this practice, you first need to understand the three-layer architecture that every modern AI agent shares:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bi78lc40u7kxqq57j4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bi78lc40u7kxqq57j4b.png" alt="AI Agent Harness Architecture — three concentric layers: HARNESS (outer, blue), SCAFFOLDING (middle, purple), and MODEL core (inner, glowing)" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: The three-layer anatomy of a production AI agent — Model at the core, Scaffolding in the middle, and the Harness as the full execution environment.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Model&lt;/strong&gt; is the LLM at the center: Claude, GPT, Qwen, DeepSeek. It takes text in and produces text out. On its own, it has no memory between calls, no loop, and no ability to execute anything. It produces the &lt;em&gt;intent&lt;/em&gt; to call a tool; it cannot call the tool itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaffolding&lt;/strong&gt; is the behaviour-defining layer wrapped around the model. It includes: the system prompt, tool descriptions, how the model's responses get parsed, what format outputs must follow, and how state is maintained across steps. Scaffolding shapes how the model &lt;em&gt;perceives&lt;/em&gt; its world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Harness&lt;/strong&gt; is the execution layer: the code that calls the model, handles its tool calls, routes outputs, decides when to stop, manages errors, and feeds results back into context. The harness is what makes the agent &lt;em&gt;run&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Viv Trivedy's equation crystallises this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent = Model + Harness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are not the model, you are the harness. And this is where the real engineering work lives — the harness is your surface area.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Prompt Engineering Isn't Enough Anymore
&lt;/h3&gt;

&lt;p&gt;Classic prompt engineering optimises a &lt;em&gt;single inference call&lt;/em&gt;. You write a better instruction, you get a better response. The feedback loop is immediate and the surface area is small.&lt;/p&gt;

&lt;p&gt;Agentic systems break this model completely. Consider a coding agent tasked with "refactor this service to add proper error handling." It will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read multiple files to understand the codebase&lt;/li&gt;
&lt;li&gt;Reason about the best error-handling strategy&lt;/li&gt;
&lt;li&gt;Write code changes across potentially dozens of files&lt;/li&gt;
&lt;li&gt;Run tests and interpret their output&lt;/li&gt;
&lt;li&gt;Fix failures — potentially through multiple iterations&lt;/li&gt;
&lt;li&gt;Validate the final result against the original spec&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these steps is a separate model call. Each call has a context window that needs to be carefully curated — relevant history included, irrelevant tokens pruned, tool results formatted for consumption, memory from previous sessions retrieved and injected. &lt;strong&gt;Prompt engineering addresses step 1 of 1. Context engineering addresses every step of every loop.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The 7 Components of a Production Harness
&lt;/h2&gt;

&lt;p&gt;A production-grade agent harness is not a single file with a system prompt. It is an engineered system with seven distinct components, each responsible for a specific aspect of agent behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. System Prompts &amp;amp; Agent Configuration Files&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is your AGENTS.md, CLAUDE.md, or equivalent. These files encode the agent's identity, constraints, conventions, and task-specific knowledge. They are not static documentation — they are &lt;em&gt;active configuration&lt;/em&gt; that shapes every model call. A well-maintained AGENTS.md is the distilled product of your agent's entire failure history (more on this in the Ratchet Pattern section).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool Descriptions &amp;amp; Schemas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tools are how your agent reaches outside its context window: filesystems, databases, REST APIs, code execution environments, web search. The &lt;em&gt;description&lt;/em&gt; of each tool — the natural language explanation of what it does, when to use it, and what its parameters mean — is as important as the tool itself. A poorly described tool will be misused or ignored. Tool schema rendering (how the description becomes tokens the model sees) varies significantly across model providers and can account for substantial performance differences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Context Window Policy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A context window policy is the set of rules that govern what gets included, truncated, summarised, or discarded at each step. Key decisions include: how many turns of conversation history to retain; when to summarise rather than retain verbatim; how to format tool results; and when to trigger context compaction (reducing a long context to a shorter, high-fidelity summary). This is one of the most under-engineered components in most agent implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Memory Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Short-term memory (in-context) and long-term memory (external, retrieved on demand) are separate systems with different engineering requirements. See the dedicated section below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Feedback Loops &amp;amp; Backpressure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents that can't detect failure will run confidently in the wrong direction until they hit a hard limit. Feedback loops are the signals that interrupt and redirect: test failures, lint errors, type check results, assertion violations, CI/CD outputs. &lt;em&gt;Backpressure&lt;/em&gt; (the mechanism that prevents an agent from declaring success until all quality signals are green) is what keeps agents honest. Without this, you get agents that "finish" broken code. With it, you get agents that iterate until the code actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Hooks &amp;amp; Middleware&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hooks intercept the agent loop at defined points — before tool execution, after model output, on error — to enforce deterministic policies. Examples: a pre-execution hook that blocks &lt;code&gt;git push --force&lt;/code&gt;; a post-output hook that strips personally identifiable information from responses before they're logged; a continuation hook that fires when context compaction is triggered. Hooks give you control points that are not subject to model behaviour — they execute deterministically regardless of what the model decides.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Observability: Traces, Costs, and Latency&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An agent you can't observe is an agent you can't debug. Production harnesses instrument every model call: input tokens, output tokens, tool calls made, tool results received, latency per step, total cost per run. LangSmith, Langfuse, and Helicone are the leading platforms for this. Without traces, a failing agent is a black box. With traces, every failure becomes a legible sequence of events you can replay and fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Architecture: Short-Term vs. Long-Term Context
&lt;/h2&gt;

&lt;p&gt;Memory is the most misunderstood component of agent harness design, and getting it wrong produces some of the most frustrating agent bugs: the agent "forgets" a constraint it was told about, repeats work it already completed, or fails to apply knowledge from a previous session.&lt;/p&gt;

&lt;p&gt;The architecture has two distinct tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-term memory&lt;/strong&gt; is everything that lives in the context window during a single agent run. This includes the conversation history (prior turns, tool calls, tool results), intermediate reasoning, and any documents or code the agent has loaded for the current task. Short-term memory is fast and immediately accessible — but it is finite, expensive, and ephemeral. When the run ends, it's gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory&lt;/strong&gt; persists across sessions. It is stored externally — in vector databases, key-value stores, or structured databases — and retrieved on demand, then injected into the context window when relevant. This is how an agent "remembers" that your team never uses &lt;code&gt;console.log&lt;/code&gt; in production code, or that a particular API has a known rate-limiting bug.&lt;/p&gt;

&lt;p&gt;The engineering challenge is the injection layer: deciding &lt;em&gt;what&lt;/em&gt; long-term memory to retrieve, &lt;em&gt;when&lt;/em&gt; to retrieve it, and &lt;em&gt;how&lt;/em&gt; to format it for injection so the model treats it as high-confidence context rather than noise.&lt;/p&gt;

&lt;p&gt;Here's a practical Python pattern for a context-aware memory manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentMemoryManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Manages short-term (in-context) and long-term (external) memory
    for a production AI agent harness.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_short_term_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_short_term_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_short_term_tokens&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;          &lt;span class="c1"&gt;# Current conversation/tool history
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term_store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;      &lt;span class="c1"&gt;# Persistent memory store
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_to_short_term&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Add a message to short-term (in-context) memory.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# Trigger compaction if we're approaching the context limit
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_short_term_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_compact_short_term&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;store_to_long_term&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Persist important information to long-term memory.
        Call this for learnings, constraints, and facts that should
        survive across agent sessions.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve_relevant_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Retrieve the most relevant long-term memories for the current task.
        Returns a list of context strings to inject into the system prompt.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term_store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;long_term_store&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;mem_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mem_vec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem_vec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1e-8&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

        &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_context_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Assemble the full context window by combining:
        1. Long-term memories retrieved for this task (injected as system context)
        2. Current short-term conversation history
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;relevant_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_relevant_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;injected_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[MEMORY]: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relevant_memories&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;system_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful coding agent.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                       &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Relevant context from previous sessions:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;injected_context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compact_short_term&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        When short-term memory approaches the token limit, summarise older
        turns to preserve space for new tool calls and reasoning.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="c1"&gt;# Summarise everything except the last 4 messages
&lt;/span&gt;        &lt;span class="n"&gt;to_summarise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

        &lt;span class="n"&gt;summary_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise the following conversation turns concisely, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preserving all decisions made, constraints established, and tool results.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_summarise&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;summary_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;short_term&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[COMPACTED HISTORY]: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_estimate_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Rough token estimate: ~4 chars per token.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_chars&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;strong&gt;context window management is not passive&lt;/strong&gt;. Every step of the agent loop, your harness must actively decide what to keep, what to compress, and what to retrieve.&lt;/p&gt;




&lt;h2&gt;
  
  
  MCP: Wiring the World to Your Agent
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is the emerging standard for connecting external tools, APIs, and data sources to any AI agent — regardless of which model or framework you're using. Think of it as the USB-C of agent integrations: one protocol, every tool.&lt;/p&gt;

&lt;p&gt;Before MCP, every agent framework had its own tool integration pattern. LangChain tools looked different from OpenAI function-calling schemas, which looked different from Anthropic tool use, which looked different from custom ReAct implementations. Developers wrote adapters. Tools drifted out of sync. Schema rendering — how a tool description actually becomes tokens the model sees — varied wildly across providers, producing silent performance differences.&lt;/p&gt;

&lt;p&gt;MCP solves this by defining a client-server protocol: &lt;strong&gt;MCP servers&lt;/strong&gt; expose tools, resources, and prompts through a standard interface; &lt;strong&gt;MCP clients&lt;/strong&gt; (your agent harness) connect to those servers and translate their capabilities into whatever the underlying model expects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7j9iet5ruqf3joii5ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7j9iet5ruqf3joii5ce.png" alt="Model Context Protocol Architecture — AI Agent connects through MCP Layer to File System, Database, REST APIs, Web Browser, Code Executor, and External Services" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2: MCP as the universal integration layer — one protocol connecting your agent to every tool and data source.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's a minimal MCP server implementation in Python exposing two tools — a file reader and a shell executor — that any MCP-compatible agent can consume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.stdio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stdio_server&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="c1"&gt;# Initialise the MCP server
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev-tools-mcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Declare the tools this MCP server exposes to any connected agent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Read the full contents of a file at the given path. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use this to inspect source code, configs, or data files. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prefer this over bash `cat` for files under 500 lines.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Absolute or relative path to the file to read.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Execute a shell command and return stdout + stderr. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use for running tests, builds, linters, or system queries. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER use for destructive operations (rm -rf, git push --force) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;without explicit user confirmation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;inputSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The shell command to execute.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max seconds to wait. Defaults to 30.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nd"&gt;@app.call_tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Handle tool calls routed from the agent harness.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: File not found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_shell&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;command&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timeout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STDOUT:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;STDERR:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;EXIT CODE: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TimeoutExpired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Command timed out after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;stdio_server&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="nf"&gt;as &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_initialization_options&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the tool &lt;em&gt;descriptions&lt;/em&gt;. They are not just labels — they are load-bearing instructions. "Prefer this over bash &lt;code&gt;cat&lt;/code&gt; for files under 500 lines." "NEVER use for destructive operations without explicit user confirmation." The model reads these descriptions at every step. Every word is a nudge toward or away from correct behaviour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool description quality is context engineering.&lt;/strong&gt; A tool named &lt;code&gt;exec&lt;/code&gt; with description "executes commands" will be used unpredictably. The same tool with a precise description constraining when and how to use it behaves measurably better — without changing a single line of tool logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ratchet Pattern: Turning Agent Mistakes Into Permanent Rules
&lt;/h2&gt;

&lt;p&gt;The most powerful operational practice in harness engineering is also the most counterintuitive: &lt;strong&gt;you should want your agent to fail&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not randomly, not catastrophically — but every mistake is a diagnostic. An agent that ships a PR with commented-out tests isn't a broken model; it's a misconfigured harness that lacks the constraint "never comment out tests." Add that constraint, and the agent never makes that mistake again. For every agent. On every run. Forever.&lt;/p&gt;

&lt;p&gt;This is the ratchet pattern: the harness only ever gets more capable, never less. Every mistake tightens a ratchet tooth.&lt;/p&gt;

&lt;p&gt;Here's how it works in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Observe the failure.&lt;/strong&gt; The agent deleted a &lt;code&gt;.env&lt;/code&gt; file that was gitignored. The agent added &lt;code&gt;console.log&lt;/code&gt; statements in production code. The agent "finished" a task with 12 failing tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Identify the harness component responsible.&lt;/strong&gt; Is it a missing constraint in AGENTS.md? A missing pre-execution hook? A missing feedback signal (the agent didn't know the tests were failing because you hadn't wired test output back into context)?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Add the fix to the harness — not to the prompt for that one run.&lt;/strong&gt; If it fixes a class of failures, it belongs in the harness permanently.&lt;/p&gt;

&lt;p&gt;Here's a sample &lt;code&gt;AGENTS.md&lt;/code&gt; structure that embeds learned constraints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# AGENTS.md — Coding Agent Configuration&lt;/span&gt;
&lt;span class="gh"&gt;# Every constraint below is traceable to a specific production failure.&lt;/span&gt;

&lt;span class="gu"&gt;## Identity&lt;/span&gt;
You are a senior backend engineer working on a Python/FastAPI service.
Use Python 3.11+ syntax. Prefer &lt;span class="sb"&gt;`httpx`&lt;/span&gt; over &lt;span class="sb"&gt;`requests`&lt;/span&gt; for async HTTP.

&lt;span class="gu"&gt;## Code Standards&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; NEVER comment out tests. Delete them or fix them — never &lt;span class="sb"&gt;`.skip()`&lt;/span&gt; or &lt;span class="sb"&gt;`xit()`&lt;/span&gt;.
&lt;span class="p"&gt;-&lt;/span&gt; NEVER include &lt;span class="sb"&gt;`console.log`&lt;/span&gt;, &lt;span class="sb"&gt;`print()`&lt;/span&gt;, or &lt;span class="sb"&gt;`pdb.set_trace()`&lt;/span&gt; in committed code.
&lt;span class="p"&gt;-&lt;/span&gt; ALL new functions must have type annotations and a docstring.
&lt;span class="p"&gt;-&lt;/span&gt; Run &lt;span class="sb"&gt;`ruff check .`&lt;/span&gt; and &lt;span class="sb"&gt;`mypy src/`&lt;/span&gt; before marking any task complete.

&lt;span class="gu"&gt;## File Operations&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; NEVER delete, overwrite, or move files in &lt;span class="sb"&gt;`./config/`&lt;/span&gt; or &lt;span class="sb"&gt;`.env*`&lt;/span&gt; without
  explicit user confirmation in the same session.
&lt;span class="p"&gt;-&lt;/span&gt; ALWAYS check &lt;span class="sb"&gt;`git status`&lt;/span&gt; before staging files to avoid committing secrets.

&lt;span class="gu"&gt;## Task Completion Criteria&lt;/span&gt;
A task is NOT complete until:
&lt;span class="p"&gt;1.&lt;/span&gt; &lt;span class="sb"&gt;`pytest`&lt;/span&gt; exits with 0 failures
&lt;span class="p"&gt;2.&lt;/span&gt; &lt;span class="sb"&gt;`ruff check .`&lt;/span&gt; exits with 0 warnings
&lt;span class="p"&gt;3.&lt;/span&gt; &lt;span class="sb"&gt;`mypy src/`&lt;/span&gt; exits with 0 errors
&lt;span class="p"&gt;4.&lt;/span&gt; You have confirmed the change does what the user asked

&lt;span class="gu"&gt;## Escalation&lt;/span&gt;
If you encounter any of the following, STOP and ask the user:
&lt;span class="p"&gt;-&lt;/span&gt; Database migration changes
&lt;span class="p"&gt;-&lt;/span&gt; Changes to authentication/authorization logic
&lt;span class="p"&gt;-&lt;/span&gt; Modifications to CI/CD configuration files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pair this with a pre-commit hook that enforces the same constraints deterministically (bypassing the model entirely for checks that can be automated):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# .git/hooks/pre-commit&lt;/span&gt;
&lt;span class="c"&gt;# Mirror of AGENTS.md constraints as deterministic pre-commit checks&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"🔍 Running harness pre-commit checks..."&lt;/span&gt;

&lt;span class="c"&gt;# Check 1: No commented-out tests&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;skip&lt;/span&gt;&lt;span class="se"&gt;\(\|&lt;/span&gt;&lt;span class="s2"&gt;xit(&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;# def test_&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;# test_"&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt; src/ tests/&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Commented-out tests found. Delete or fix them."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Check 2: No debug statements&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rn&lt;/span&gt; &lt;span class="s2"&gt;"pdb&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;set_trace&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;breakpoint()&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;print("&lt;/span&gt; &lt;span class="nt"&gt;--include&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"*.py"&lt;/span&gt; src/&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Debug statements found in src/. Remove before committing."&lt;/span&gt;
    &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Check 3: Ruff linting&lt;/span&gt;
ruff check &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ Ruff check failed."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Check 4: Type checking&lt;/span&gt;
mypy src/ &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"❌ mypy check failed."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;exit &lt;/span&gt;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"✅ All harness checks passed."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result: your AGENTS.md and pre-commit hooks become the accumulated wisdom of every failure your agent has ever made. &lt;strong&gt;This is institutional knowledge, versioned in git, applied at every run.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Agent Patterns: Orchestrators, Sub-Agents, and Skills
&lt;/h2&gt;

&lt;p&gt;Single-agent systems hit a ceiling quickly. A 40-step coding task doesn't just stress the context window — it conflates planning, execution, and validation into a single agent that is simultaneously trying to be a strategist, a code writer, and a reviewer. These are different cognitive modes, and mixing them degrades performance on all three.&lt;/p&gt;

&lt;p&gt;The solution is to decompose. Multi-agent systems assign these roles to specialised agents coordinated by an orchestrator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wp1oha75hkq98qupdoy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wp1oha75hkq98qupdoy.png" alt="Multi-Agent Orchestration Pattern — Orchestrator Agent at top delegating to Code Agent, Research Agent, and Review Agent below, with shared memory store at the bottom" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3: Multi-agent orchestration — the Orchestrator decomposes tasks and delegates to specialised sub-agents, each with their own context-optimised harness.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The vocabulary here matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator&lt;/strong&gt;: A higher-level controller that decomposes tasks and delegates to sub-agents. It reasons about &lt;em&gt;what&lt;/em&gt; needs to be done, not &lt;em&gt;how&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-agent&lt;/strong&gt;: An agent called by the orchestrator to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. It can itself spawn further sub-agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill&lt;/strong&gt;: A reusable, packaged bundle of context — tool descriptions, prompts, and instructions — that enables a specific capability. Where a &lt;em&gt;tool&lt;/em&gt; is a function call, a &lt;em&gt;skill&lt;/em&gt; is everything the agent needs to use that capability well.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a minimal LangGraph implementation of a planner-executor-reviewer multi-agent loop for a code generation task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Shared agent state ───────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;review_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;iteration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Model instances (could use different models per role) ────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;planner_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;coder_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reviewer_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Planner sub-agent ────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;planner_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Decomposes the task into a step-by-step implementation plan.
    Does NOT write code — only plans.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior software architect. Your job is to create a detailed, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step-by-step implementation plan. Do NOT write code. Output numbered steps only.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Coder sub-agent ──────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;coder_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Implements the plan produced by the planner.
    Focuses purely on code quality and correctness.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;coder_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a senior Python engineer. Implement the following plan as clean, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type-annotated Python code with docstrings. Include unit tests. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output only the code block.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implementation Plan:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Previous review feedback (if any): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;review_result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Reviewer sub-agent ───────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reviewer_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Reviews the generated code against quality criteria.
    Returns APPROVED or a list of specific issues to fix.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reviewer_llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a strict code reviewer. Check for: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(1) correctness, (2) type annotations, (3) error handling, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(4) test coverage, (5) no debug statements. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If all criteria pass, respond with exactly &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;APPROVED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Otherwise, list specific issues to fix.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)),&lt;/span&gt;
        &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code to review:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Routing logic ────────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route back to coder if review failed, or end if approved or max iterations hit.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROVED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Safety: max 3 revision loops
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Build the graph ──────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;planner_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;coder_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer_agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;planner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# ─── Run it ───────────────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a rate limiter class using a sliding window algorithm with Redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iteration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final Code:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review Status:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key design principle here is &lt;strong&gt;separation of cognitive concerns&lt;/strong&gt;: the planner never writes code, the coder never reviews, and the reviewer never plans. Each sub-agent's context window is optimised for its specific role. This produces measurably better results than a single agent trying to do all three simultaneously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarks &amp;amp; Real-World Evidence
&lt;/h2&gt;

&lt;p&gt;The case for harness-first engineering is not theoretical. The evidence is accumulating in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal Bench 2.0 (June 2026):&lt;/strong&gt; Cline CLI running claude-opus-4.7 scored 74.2% vs. Anthropic's Claude Code scoring 69.4% on the same model. The difference is entirely attributable to harness design. Cline's newly open-sourced &lt;code&gt;@cline/sdk&lt;/code&gt; makes this harness available for inspection — it's a four-layer TypeScript stack with native sub-agent support, CRON scheduling, checkpointing, and MCP connectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Viv Trivedy's team (O'Reilly):&lt;/strong&gt; Moved a custom coding agent from the Top 30 to Top 5 on the same benchmark by changing only the harness. The model was unchanged. The improvements were: tighter AGENTS.md constraints, better tool descriptions, a typecheck backpressure signal, and a planner/executor split.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uber (June 2026):&lt;/strong&gt; Capped AI coding tool spending at $1,500/month per tool per engineer after blowing their annual AI budget in four months. At that cap, AI tool spend per engineer approaches ~11% of median annual compensation — a signal that organisations are extracting real productivity value, and that harness quality directly affects cost efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rippling:&lt;/strong&gt; Rebuilt every product AI-native in 6 months using LangGraph multi-agent patterns. Every agent in their stack uses a context-engineered harness rather than raw model calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lyft:&lt;/strong&gt; Built a self-serve AI agent platform for customer support using LangGraph. The platform handles multi-turn support conversations with tool access to booking systems, payment APIs, and account management — all orchestrated through careful harness design.&lt;/p&gt;

&lt;p&gt;The pattern across these cases is consistent: &lt;strong&gt;the limiting factor isn't model capability, it's harness design.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started: Build Your First Context-Engineered Agent
&lt;/h2&gt;

&lt;p&gt;Here is a concise checklist and minimal harness skeleton to get you from zero to a context-engineered agent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip install openai langchain-openai langgraph mcp sentence-transformers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;An OpenAI API key (or equivalent)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Your starter harness skeleton:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
minimal_harness.py — A context-engineered agent harness from scratch.

This implements the core loop: Plan → Execute → Observe → Decide → Repeat
with proper context management and feedback backpressure.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Reads OPENAI_API_KEY from env
&lt;/span&gt;
&lt;span class="c1"&gt;# ─── 1. Agent Configuration (your AGENTS.md equivalent) ──────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a Python coding assistant. You help write clean, tested, typed Python code.

CONSTRAINTS (all are hard rules — never violate them):
- All functions must have type annotations and docstrings
- Never include print() or debug statements in final code
- Always include unit tests alongside implementation code
- A task is complete only when all tests pass

TOOLS AVAILABLE:
- run_python: Execute Python code and return stdout/stderr
- read_file: Read a file by path
- write_file: Write content to a file

When you call a tool, use the function calling format. Be precise with file paths.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# ─── 2. Tool Definitions (your MCP schemas, inline for simplicity) ────────────
&lt;/span&gt;
&lt;span class="n"&gt;TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Execute Python code in an isolated subprocess. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use this to test your implementations, run assertions, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;or verify outputs. Returns stdout and stderr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python code to execute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ─── 3. Tool Execution Layer (your harness execution) ─────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a tool call and return the result as a string.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
            &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdout: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;stderr: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;exit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Unknown tool &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ─── 4. The Agent Loop (harness + context management) ─────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Core agent loop with context management.
    Implements: call model → handle tool use → feed results back → repeat.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Short-term memory: starts with system prompt + user task
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ── Model call ──────────────────────────────────────────────────────
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

        &lt;span class="c1"&gt;# ── Append model response to context (short-term memory update) ─────
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# ── Check stopping condition ──────────────────────────────────────
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Agent completed in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iterations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="c1"&gt;# ── Execute tool calls and feed results back into context ─────────
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
                &lt;span class="n"&gt;tool_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔧 Calling tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Inject tool result back into context window
&lt;/span&gt;                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached. Task incomplete.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ─── 5. Run it ────────────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that implements binary search with full type annotations, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a docstring, and pytest unit tests. Then run the tests to verify they pass.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;📄 Final output:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Your context engineering checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;AGENTS.md written&lt;/strong&gt; — at minimum: role identity, code standards, task completion criteria, escalation triggers&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Tool descriptions reviewed&lt;/strong&gt; — each tool's description should constrain &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;how&lt;/em&gt; to use it, not just &lt;em&gt;what&lt;/em&gt; it does&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Context window policy defined&lt;/strong&gt; — how many turns to retain, when to compact, what to summarise&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Long-term memory wired&lt;/strong&gt; (for multi-session agents) — vector store or database for persistent context retrieval&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Feedback loops connected&lt;/strong&gt; — test output, lint results, and type check results fed back into context&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Pre-commit / pre-execution hooks added&lt;/strong&gt; — deterministic enforcement of your AGENTS.md constraints&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Observability instrumented&lt;/strong&gt; — token counts, costs, and traces per run&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context engineering for AI agents&lt;/strong&gt; is not a buzzword. It's the engineering discipline that determines whether your agent actually works in production — or confidently fails in circles, burning tokens and budget with nothing to show for it.&lt;/p&gt;

&lt;p&gt;The core insight, worth repeating: &lt;strong&gt;a decent model with a great harness beats a great model with a bad harness&lt;/strong&gt;. The benchmark data proves it. The production case studies confirm it. The gap between what today's models &lt;em&gt;can&lt;/em&gt; do and what most deployments &lt;em&gt;see them do&lt;/em&gt; is largely a harness gap.&lt;/p&gt;

&lt;p&gt;The discipline is maturing rapidly. HuggingFace launched a full certification course on it in June 2026. O'Reilly published practitioner-level guides on harness engineering. LangChain, Cline, and the major agent platforms are converging on shared vocabulary — scaffolding, harness, MCP, skills, sub-agents, ratchet patterns.&lt;/p&gt;

&lt;p&gt;The engineers who master context engineering in the next six months will build AI systems that compound in capability with every failure, rather than systems that plateau the moment the model reaches its limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to go next:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 &lt;a href="https://huggingface.co/learn/context-course/en/unit0/introduction" rel="noopener noreferrer"&gt;HuggingFace Context Engineering Course&lt;/a&gt; — Free, 6 units, certification available&lt;/li&gt;
&lt;li&gt;🔧 &lt;a href="https://www.oreilly.com/radar/agent-harness-engineering/" rel="noopener noreferrer"&gt;Addy Osmani's Agent Harness Engineering&lt;/a&gt; — O'Reilly deep-dive&lt;/li&gt;
&lt;li&gt;🏗️ &lt;a href="https://github.com/langchain-ai/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt; — Production multi-agent framework&lt;/li&gt;
&lt;li&gt;🔌 &lt;a href="https://github.com/modelcontextprotocol/python-sdk" rel="noopener noreferrer"&gt;MCP Python SDK&lt;/a&gt; — Official MCP server/client library&lt;/li&gt;
&lt;li&gt;📊 &lt;a href="https://smith.langchain.com" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt; — Observability and tracing for agent harnesses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The models are good. Make the harness great. Your agents will be unstoppable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published on June 6, 2026 | Focus keyword: context engineering for AI agents&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>python</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I'm Using Claude Code / Codex — Should I Still Use Hermes Agent or OpenClaw?</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Thu, 04 Jun 2026 10:55:25 +0000</pubDate>
      <link>https://dev.to/monuminu/im-using-claude-code-codex-should-i-still-use-hermes-or-openclaw-12jj</link>
      <guid>https://dev.to/monuminu/im-using-claude-code-codex-should-i-still-use-hermes-or-openclaw-12jj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Already using Claude Code or Codex? Discover why Hermes Agent and OpenClaw belong in a completely different layer of your stack — and which one you should actually choose.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  I'm Using Claude Code / Codex — Should I Still Use Hermes Agent or OpenClaw?
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt;The Question Every Developer Is Asking Right Now&lt;br&gt;
Understanding the Landscape: Three Layers, Four Tools&lt;br&gt;
Claude Code &amp;amp; Codex: Two Sides of the Same Coin&lt;br&gt;
Hermes Agent: The Self-Improving Agent OS&lt;br&gt;
OpenClaw: The Personal AI OS That Started the Conversation&lt;br&gt;
Hermes Agent vs OpenClaw: A Real Head-to-Head&lt;br&gt;
The Core Insight: Claude Code/Codex Live Below Both&lt;br&gt;
When Should You Add an Agent OS?&lt;br&gt;
The Hybrid Architecture in Practice&lt;br&gt;
Cost &amp;amp; Privacy&lt;br&gt;
Your Stack, Assembled&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question Every Developer Is Asking Right Now
&lt;/h2&gt;

&lt;p&gt;You've got your workflow dialed in. Claude Code handles multi-file refactors while you review pull requests. Codex cranks through boilerplate. You're shipping faster than you were six months ago, and the productivity gains feel real. Then a thread blows up on X — someone raving about Hermes Agent or OpenClaw — and suddenly you're wondering whether your current setup is already obsolete.&lt;/p&gt;

&lt;p&gt;It isn't. But there's a subtlety here that most of those threads miss, and it matters a lot for how you architect your AI-assisted development environment.&lt;/p&gt;

&lt;p&gt;The short answer: &lt;strong&gt;Claude Code, Codex, Hermes Agent, and OpenClaw&lt;/strong&gt; are not competing for the same slot in your stack. Three of them live at fundamentally different layers of your tooling. And the fourth — well, that's where it gets interesting, because Hermes Agent and OpenClaw &lt;em&gt;are&lt;/em&gt; direct competitors, just not with Claude Code or Codex. Understanding that distinction is the lens through which everything else in this article becomes clear.&lt;/p&gt;

&lt;p&gt;Let's build that lens first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Landscape: Three Layers, Four Tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh24xjx12akdpyape6hrj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh24xjx12akdpyape6hrj.png" alt="Three-layer AI stack diagram: Model Layer, Coding Agent Layer, Agent OS Layer" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of your AI stack the way you'd think about any well-layered software system: each tier has a specific responsibility, and mixing up those responsibilities causes confusion, not capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 is the model.&lt;/strong&gt; This is the raw intelligence — Claude Opus 4.8, GPT-5.3-Codex, Llama 3.1, whatever sits at the inference endpoint. Models don't have opinions about your project structure. They don't remember what you built last Tuesday. They don't ping you on Telegram when a test suite fails at 2 AM. They complete tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 is the coding agent.&lt;/strong&gt; Claude Code and OpenAI Codex live here. A coding agent wraps a model with coding-specific superpowers: the ability to traverse your codebase, write and execute diffs across multiple files simultaneously, open pull requests, integrate with GitHub Actions, and maintain enough context to understand what "the auth module" means without you explaining it every time. These tools are purpose-built for writing software. They are extremely good at their job. They are not, however, designed to be the orchestrating brain of your entire automated workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3 is the agent OS.&lt;/strong&gt; Hermes Agent and OpenClaw both live here — and this is where the real conversation happens. An agent OS isn't a model and isn't a coding tool. It's the coordinator: the thing that lives persistently across sessions, understands your long-term goals and context, dispatches specialized agents (including Layer 2 coding agents) as subprocesses, communicates with you through whatever messaging channels you use, and proactively acts on your behalf even when you haven't asked it to.&lt;/p&gt;

&lt;p&gt;When a developer asks "should I use Hermes Agent or OpenClaw if I'm already on Claude Code?" they're really asking two distinct questions compressed into one. The first is whether a Layer 3 tool adds value on top of Layer 2 tools — the answer is yes, for reasons we'll get into. The second is which Layer 3 tool to pick — and that's a genuine head-to-head between two competing platforms.&lt;/p&gt;

&lt;p&gt;That head-to-head deserves its own careful examination. But first, let's understand what you already have.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claude Code &amp;amp; Codex: Two Sides of the Same Coin
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rasx156lrmyxi1n28t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rasx156lrmyxi1n28t8.png" alt="Claude Code vs OpenAI Codex side-by-side comparison" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Developers sometimes talk about Claude Code and Codex as if they're fundamentally different categories of tool. In practice, they're far more similar than different — and recognising that similarity helps clarify what neither of them is.&lt;/p&gt;

&lt;p&gt;Both are managed, cloud-powered coding agents. Both integrate with your IDE (VS Code, JetBrains, Cursor, Windsurf). Both connect natively to GitHub. Both provide a CLI for terminal-first workflows. Both send your code to a remote inference provider for processing. Both require a paid subscription for serious use — Claude Code is included in the Anthropic Pro plan at $17/month (annual) or $20/month, while ChatGPT Plus at $20/month includes Codex access, and the Codex CLI is available as open-source Apache 2.0 on GitHub.&lt;/p&gt;

&lt;p&gt;Where they diverge is meaningful but narrower than the marketing suggests.&lt;/p&gt;

&lt;p&gt;Claude Code's headline capability is depth. It runs on Claude Opus 4.8 — a model with a one-million-token context window — making it exceptional for reasoning across genuinely large codebases where other tools start to lose coherence. The CLAUDE.md file gives you a powerful mechanism to encode project-level conventions, architectural rules, and team standards that persist across every session. Claude Code's permission-based architecture is strict by design: writes are isolated to the working directory, which matters in enterprise environments where security posture is non-negotiable. Anthropic backs this with SOC 2 Type 2 and ISO 27001 certification.&lt;/p&gt;

&lt;p&gt;Codex's distinguishing features lean in a different direction. The open-source CLI (MIT-adjacent Apache 2.0) means you can fork it, audit it, and extend it — a meaningful advantage for teams with compliance requirements or custom tooling needs. GPT-5.3-Codex is a purpose-built software engineering model trained specifically for code generation and transformation tasks, rather than a general reasoning model applied to code. Codex's explicit approval modes — Chat, Agent, and Agent with Full Access — give you a clean mental model for how much autonomy you're granting at any given moment. Cloud delegation allows you to offload long-running tasks to OpenAI's infrastructure without keeping a local process alive.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;OpenAI Codex&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary Model&lt;/td&gt;
&lt;td&gt;Claude Opus 4.8 / Sonnet 4.6&lt;/td&gt;
&lt;td&gt;GPT-5.3-Codex / GPT-5.4 / GPT-5.4-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Window&lt;/td&gt;
&lt;td&gt;1M tokens (Opus 4.8)&lt;/td&gt;
&lt;td&gt;Varies by model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CLI License&lt;/td&gt;
&lt;td&gt;Proprietary&lt;/td&gt;
&lt;td&gt;Apache 2.0 (open-source)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pricing Entry&lt;/td&gt;
&lt;td&gt;$17/month (Pro, annual)&lt;/td&gt;
&lt;td&gt;$20/month (ChatGPT Plus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IDE Support&lt;/td&gt;
&lt;td&gt;VS Code, JetBrains, Cursor, Windsurf&lt;/td&gt;
&lt;td&gt;VS Code, JetBrains, Cursor, Windsurf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Native&lt;/td&gt;
&lt;td&gt;✅ Issue-to-PR automation&lt;/td&gt;
&lt;td&gt;✅ Cloud delegation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unique Feature&lt;/td&gt;
&lt;td&gt;CLAUDE.md, 1M context, SOC 2&lt;/td&gt;
&lt;td&gt;Purpose-built SWE model, open-source CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security Certifications&lt;/td&gt;
&lt;td&gt;SOC 2 Type 2, ISO 27001&lt;/td&gt;
&lt;td&gt;OpenAI enterprise standards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither Claude Code nor Codex, however, will ping you on Discord when your CI pipeline breaks. Neither maintains a memory of the architectural decision you made three weeks ago when you decided to migrate from REST to GraphQL. Neither can be configured to summarize your open PRs every morning and deliver the summary to Slack. Neither runs on a serverless backend that hibernates between tasks and costs you nothing when idle.&lt;/p&gt;

&lt;p&gt;Those are Layer 3 problems — and they're where Hermes Agent and OpenClaw enter the picture.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hermes Agent: The Self-Improving Agent OS
&lt;/h2&gt;

&lt;p&gt;Hermes Agent is built by Nous Research — the AI research lab responsible for the Hermes family of fine-tuned language models. But Hermes Agent itself is not a model. It's not a coding tool. It's an autonomous agent platform: an OS-layer coordinator designed to persist, learn, and act on your behalf across an indefinite number of sessions, platforms, and workstreams.&lt;/p&gt;

&lt;p&gt;The distinction is worth emphasising because it's easy to conflate Nous Research's model work with what Hermes Agent actually does. The research pedigree matters — it means the team understands model behavior at a deep level — but Hermes Agent's value proposition is architectural, not inferential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learning loop is the most technically interesting part of the platform.&lt;/strong&gt; When you interact with Hermes Agent, it doesn't just respond and forget. It creates skills from experience: reusable, named procedures that get stored and improved across sessions. When a skill turns out to be imperfect, the system improves it during subsequent use. Periodic nudges surface relevant memories and past context at the right moments, rather than requiring you to manually provide background each time. Under the hood, full-text search via FTS5 enables cross-session recall with LLM-powered summarization, so the system can efficiently surface what's relevant from potentially months of interaction history. Honcho dialectic user modeling builds an evolving representation of your working style, preferences, and goals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbm1sv3qj8eo8shn456r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbm1sv3qj8eo8shn456r.png" alt="Hermes Agent self-improving learning loop diagram" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deployment story is equally distinctive.&lt;/strong&gt; Hermes Agent runs across six terminal backends: local process, Docker, SSH to a remote machine, Singularity containers, Modal, and Daytona. The last two are the most significant for production use — both are serverless environments where Hermes Agent hibernates when idle and costs you essentially nothing between active sessions. This means you're not choosing between "always-on server bill" and "manually restarting a local process." You get persistence without the cost of full-time compute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The messaging breadth is in a different category entirely from anything else in this space.&lt;/strong&gt; Twenty-plus platforms: Telegram, Discord, Slack, WhatsApp, Signal, Matrix, Mattermost, Email, SMS, DingTalk, Feishu, WeCom, Weixin, QQ Bot, Home Assistant, Microsoft Teams, Google Chat, and more. This matters because your communication stack isn't uniform — enterprise teams live in Teams and Slack, international teams use WeChat and DingTalk, personal developers prefer Telegram. Hermes Agent meets you where you are, rather than requiring you to meet it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model agnosticism is first-class, not an afterthought.&lt;/strong&gt; Nous Portal gives you access to 300-plus models plus web search, image generation, TTS, and a cloud browser under one subscription. But you can point Hermes Agent at OpenRouter (200-plus models), vanilla OpenAI, NovitaAI, NVIDIA NIM, Kimi/Moonshot, MiniMax, Hugging Face, or any custom endpoint. You are not locked to a single provider, and switching providers doesn't require a configuration overhaul.&lt;/p&gt;

&lt;p&gt;The tool surface is deep: 60-plus built-in tools covering web search, image generation, text-to-speech, browser automation, and code execution. MCP integration means you can connect any Model Context Protocol server and extend the tool surface further. Voice mode works in real-time across the CLI, Telegram, and Discord. The built-in cron scheduler enables proactive automations — morning briefings, automated error monitoring, PR summaries — delivered to whatever messaging platform you specify.&lt;/p&gt;

&lt;p&gt;For developers coming from a research or ML background, Hermes Agent also surfaces batch trajectory generation and trajectory compression tooling — directly useful for building fine-tuning datasets for tool-calling models. This isn't a side feature; it reflects Nous Research's core mission.&lt;/p&gt;

&lt;p&gt;Installation is a single curl command. &lt;code&gt;hermes setup --portal&lt;/code&gt; gets you to a working configuration. And if you're currently using OpenClaw, there's a built-in migration path: &lt;code&gt;hermes claw migrate&lt;/code&gt; imports your SOUL.md, memories, skills, API keys, messaging settings, command allowlist, and TTS assets. More on that when we compare the two directly.&lt;/p&gt;

&lt;p&gt;MIT licensed. Free platform cost. Model API is your only ongoing expense.&lt;/p&gt;




&lt;h2&gt;
  
  
  OpenClaw: The Personal AI OS That Started the Conversation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr3a669eeqj1v7pg9rj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbr3a669eeqj1v7pg9rj3.png" alt="OpenClaw orchestration flow: phone to Claude Code to PR" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenClaw was built by Peter Steinberger (@steipete) — a longtime iOS developer and open-source contributor — and it represents a different philosophy: not a research platform, but a deeply personal AI OS that lives on your machine, speaks your language, and adapts to you specifically.&lt;/p&gt;

&lt;p&gt;Where Hermes Agent is broad and structured, OpenClaw is intimate and self-hackable.&lt;/p&gt;

&lt;p&gt;The architecture is deliberately local-first. Your context and memory live on your machine — not in a cloud provider's database, not in a walled garden. For developers with strong data sovereignty concerns, this is a fundamental differentiator. OpenClaw is model-agnostic across OpenAI-compatible APIs (Claude, GPT-5, and anything with a compatible endpoint), but the &lt;em&gt;memory and state&lt;/em&gt; are yours, not the provider's.&lt;/p&gt;

&lt;p&gt;The five core capabilities that define OpenClaw's character:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent memory&lt;/strong&gt; functions as a running journal of everything you've told it, every task it's completed, and every preference it's inferred. This isn't a novelty — it's what transforms an AI assistant from a stateless autocomplete into something that actually knows your stack, your preferences, and your current projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-buildable skills&lt;/strong&gt; are OpenClaw's most distinctive feature: the system can write its own plugins through conversation. You describe what you need, it writes the code, installs it, and begins using it. No plugin marketplace, no waiting for someone else to build it. The system extends itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive heartbeats&lt;/strong&gt; are cron-style jobs that fire without prompting — morning briefings, automated checks, reminders, error monitoring webhooks. This is what "proactive AI" actually means in practice: not waiting to be asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-channel messaging&lt;/strong&gt; through Telegram, WhatsApp, and Discord puts OpenClaw in the platforms most personal developers actually live in. The interface isn't a dashboard or a web app — it's a chat thread that happens to have the ability to write code, open PRs, and manage your infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent coordination&lt;/strong&gt; is where the connection to Claude Code and Codex becomes real. OpenClaw can dispatch Claude Code and Codex sessions as subagents, monitor their progress, and surface results — all without you touching a keyboard. The user testimonials here are more illustrative than any architectural diagram:&lt;/p&gt;

&lt;p&gt;One developer described their setup as "managing Claude Code / Codex sessions I can kick off anywhere, autonomously running tests on my app and capturing errors through a Sentry webhook then resolving them and opening PRs." Another: "I'm literally on my phone in a Telegram chat and it's communicating with Codex CLI on my computer creating detailed spec files while out on a walk with my dog." A third: "your context and skills live on YOUR computer, not a walled garden. It's open source."&lt;/p&gt;

&lt;p&gt;This is the feel of OpenClaw. It's less enterprise infrastructure and more digital employee — one that happens to have access to every AI tool you own and lives in your pocket.&lt;/p&gt;

&lt;p&gt;The tradeoffs are real. OpenClaw's platform integrations are limited to three messaging channels. There's no serverless deployment option — it runs on your machine, whether that's a Raspberry Pi or a Mac Studio. Setup requires more manual configuration than Hermes Agent's one-command approach. And the platform is newer, with a rapidly evolving feature set that means both more excitement and more rough edges.&lt;/p&gt;

&lt;p&gt;But for developers who want simplicity, full local ownership, and a platform they can genuinely understand and modify, OpenClaw has a compelling case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hermes Agent vs OpenClaw: A Real Head-to-Head
&lt;/h2&gt;

&lt;p&gt;This is the comparison that actually matters for most developers reading this — both Hermes Agent and OpenClaw occupy the same layer of the stack, solve overlapping problems, and are positioned as alternatives to each other. Choosing between them is a real architectural decision, not a trick question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07i44ghw0liqbdqy9gec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07i44ghw0liqbdqy9gec.png" alt="Hermes Agent vs OpenClaw feature comparison grid" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Hermes Agent&lt;/th&gt;
&lt;th&gt;OpenClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Messaging Platforms&lt;/td&gt;
&lt;td&gt;20+ (Telegram, Discord, Slack, WhatsApp, Signal, Teams, Matrix, DingTalk, Feishu, and more)&lt;/td&gt;
&lt;td&gt;3 (Telegram, WhatsApp, Discord)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning Loop&lt;/td&gt;
&lt;td&gt;✅ Autonomous skill creation + improvement + FTS5 recall + Honcho user modeling&lt;/td&gt;
&lt;td&gt;✅ Self-builds skills via conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Local, Docker, SSH, Daytona (serverless), Modal (serverless), Singularity&lt;/td&gt;
&lt;td&gt;Local (your machine or VPS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serverless Option&lt;/td&gt;
&lt;td&gt;✅ Yes (Daytona/Modal — hibernates when idle)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Integration&lt;/td&gt;
&lt;td&gt;✅ Yes (any MCP server)&lt;/td&gt;
&lt;td&gt;❌ No native MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice Mode&lt;/td&gt;
&lt;td&gt;✅ Yes (CLI, Telegram, Discord)&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in Tools&lt;/td&gt;
&lt;td&gt;60+ (web search, image gen, TTS, browser, code exec)&lt;/td&gt;
&lt;td&gt;Community-built skills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Provider&lt;/td&gt;
&lt;td&gt;Nous Portal (300+ models), OpenRouter (200+), OpenAI, any endpoint&lt;/td&gt;
&lt;td&gt;Claude, GPT-5, any OpenAI-compatible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migrate from OpenClaw&lt;/td&gt;
&lt;td&gt;✅ &lt;code&gt;hermes claw migrate&lt;/code&gt; (full import)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built By&lt;/td&gt;
&lt;td&gt;Nous Research (AI research lab)&lt;/td&gt;
&lt;td&gt;@steipete (indie developer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hermes setup --portal&lt;/code&gt; (one command)&lt;/td&gt;
&lt;td&gt;More manual configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maturity&lt;/td&gt;
&lt;td&gt;Structured, comprehensive docs&lt;/td&gt;
&lt;td&gt;Rapidly evolving, community-led&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Power users, multi-platform, serverless, enterprise-adjacent&lt;/td&gt;
&lt;td&gt;Personal use, local ownership, self-hacking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The comparison reveals two distinct philosophies rather than one clearly superior option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose Hermes Agent when&lt;/strong&gt; your requirements extend beyond a single personal machine. If you need to receive notifications across Slack, Teams, and Telegram simultaneously, or if you're coordinating AI workstreams for a small team rather than solo development, or if you want serverless deployment that doesn't bill you for idle compute, or if you need structured MCP integration to connect existing tooling — Hermes Agent is the more capable platform. The 60-plus built-in tools and voice mode reduce the time you spend configuring and increase the time the system is actually working. The research-grade trajectory tooling is irrelevant to most developers, but if you're building fine-tuning datasets or evaluating agent behavior systematically, it's uniquely valuable. And if you're currently on OpenClaw and outgrowing it, &lt;code&gt;hermes claw migrate&lt;/code&gt; makes the transition a command rather than a project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose OpenClaw when&lt;/strong&gt; simplicity and local ownership are genuinely your priorities. If you want a system you can fully understand and modify yourself, if you want your data to stay on your hardware, if three messaging platforms cover your needs, and if you prefer the experience of a single personal AI assistant over a multi-platform coordination layer — OpenClaw delivers that experience with less friction. It's also a legitimate choice if you're new to agent OS concepts and want to learn the paradigm before committing to a more complex platform. The community is active, the founder is accessible, and the self-hackable architecture means your investment in customisation compounds over time.&lt;/p&gt;

&lt;p&gt;The existential question for OpenClaw users is growth. As your automation needs expand — more platforms, more integrations, more persistent services — you'll either extend OpenClaw's capabilities through custom skills or you'll find yourself reaching for something with more built-in breadth. Hermes Agent's migration path exists precisely because this transition is predictable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Insight: Claude Code/Codex Live Below Both
&lt;/h2&gt;

&lt;p&gt;Let's return to the original question with the architecture firmly in mind.&lt;/p&gt;

&lt;p&gt;Claude Code and Codex are Layer 2 tools. They are excellent at what they do — executing complex coding tasks with deep model intelligence and IDE integration. But they have no concept of "your broader workflow." They don't have a persistent sense of your goals. They don't dispatch themselves. They wait for instructions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvep6mn9omwljy3zdexb6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvep6mn9omwljy3zdexb6.png" alt="Vertical stack architecture: Agent OS dispatches Coding Agents dispatch Models" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hermes Agent and OpenClaw are Layer 3 tools. They maintain state, model your goals, coordinate across platforms, and — crucially — dispatch Layer 2 tools as subagents when a coding task is required. Hermes Agent can spin up a Claude Code session to handle a refactor while simultaneously coordinating a Codex task for test generation, then collect both results and surface them as a single Slack message. OpenClaw can receive a Telegram message from you at noon, dispatch Claude Code to fix a failing test, monitor the result, and send you the PR link before you finish lunch.&lt;/p&gt;

&lt;p&gt;The framing of "should I use Hermes/OpenClaw &lt;em&gt;instead of&lt;/em&gt; Claude Code" is therefore a category error. Layer 3 tools use Layer 2 tools. Adding an agent OS to your stack doesn't replace your coding agents — it gives them a coordinator.&lt;/p&gt;

&lt;p&gt;Your Claude Code and Codex configurations, your CLAUDE.md, your Codex approval modes — all of that remains exactly as you've built it. The agent OS layer simply gains the ability to dispatch those tools on your behalf, without you being at the keyboard.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Add an Agent OS?
&lt;/h2&gt;

&lt;p&gt;The honest answer is that an agent OS isn't for every developer at every stage. Here's how to self-assess:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong signals to add Hermes Agent or OpenClaw:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to kick off Claude Code or Codex tasks from your phone, without being at your workstation&lt;/li&gt;
&lt;li&gt;You need persistent memory that survives session boundaries — context that accumulates over weeks and months, not minutes&lt;/li&gt;
&lt;li&gt;You want proactive AI behavior: morning briefings, automated error monitoring, scheduled reports, cron-triggered workflows&lt;/li&gt;
&lt;li&gt;You're coordinating multiple AI tools (coding agents, search tools, image generation, external APIs) and want a single orchestration layer&lt;/li&gt;
&lt;li&gt;You want to fully own your AI stack and not be dependent on any single provider's roadmap&lt;/li&gt;
&lt;li&gt;You're ready to invest a few hours in setup to get months of compounded productivity return&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reasonable signals to wait:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your current workflow is primarily interactive coding assistance, and you're at the keyboard when you need help — Claude Code or Codex alone is genuinely sufficient for this&lt;/li&gt;
&lt;li&gt;You want zero setup overhead and are optimising for time-to-first-output over automation depth&lt;/li&gt;
&lt;li&gt;You're still learning the capabilities of your Layer 2 tools and aren't yet hitting their coordination limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The developers who benefit most from a Layer 3 platform are those who've already extracted significant value from Claude Code or Codex and are starting to feel the friction of manual coordination — the mental overhead of managing multiple sessions, the missed opportunities that happen when you're away from the keyboard, the context rebuilding that happens every time a new session starts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hybrid Architecture in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3w9eld76k0r51a5an8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm3w9eld76k0r51a5an8s.png" alt="Hub-and-spoke hybrid architecture with Hermes Agent/OpenClaw at center" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Abstract architecture only goes so far. Here are three concrete workflows that reflect how developers are actually using these stacks today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow one: Fix failing tests, hands-free.&lt;/strong&gt; Your CI pipeline reports a test failure. Hermes Agent — configured with a webhook from your CI provider — receives the notification, classifies the failure type, dispatches a Claude Code session with the relevant file context and the specific failing assertion, monitors the session until a fix is produced and committed, and sends you a Telegram message with the PR link. You see the result; you never touch a keyboard. This is the workflow @nateliason described — "autonomously running tests on my app and capturing errors through a Sentry webhook then resolving them and opening PRs." Same architecture, whether the coordinator is Hermes Agent or OpenClaw.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow two: Morning PR review briefing.&lt;/strong&gt; Hermes Agent's cron scheduler fires at 7:45 AM. It pulls the list of open PRs from your GitHub repositories, dispatches Codex to summarise the diff and flag any risky changes in each PR, collects the summaries, and delivers a structured briefing to your team's Slack channel before standup. Nobody configured this manually each day. Nobody woke up early to run it. It just happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow three: Build a new capability through conversation.&lt;/strong&gt; You notice you keep asking your agent to check a specific internal dashboard for a particular metric. Rather than doing this manually every time, you tell Hermes Agent or OpenClaw: "Build a skill that checks [dashboard URL] for [metric name] and reports it when I ask." The platform — using its self-building capability — writes the tool, installs it, and confirms it's ready. Next time you ask for that metric, it's a single natural language request away. The system has made itself more capable without a human writing a plugin.&lt;/p&gt;

&lt;p&gt;These aren't hypothetical flows. They're the patterns that keep appearing in the testimonials from real users, and they're the reason why Layer 3 tools have captured significant developer attention even among people who were already satisfied with their Layer 2 setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost &amp;amp; Privacy
&lt;/h2&gt;

&lt;p&gt;Cost architecture is often what determines long-term viability, so it deserves an honest look.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack Configuration&lt;/th&gt;
&lt;th&gt;Platform Cost&lt;/th&gt;
&lt;th&gt;Model Cost&lt;/th&gt;
&lt;th&gt;Privacy Profile&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code Pro only&lt;/td&gt;
&lt;td&gt;$17–20/month&lt;/td&gt;
&lt;td&gt;Included in Pro tier limits&lt;/td&gt;
&lt;td&gt;Code sent to Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex (ChatGPT Plus) only&lt;/td&gt;
&lt;td&gt;$20/month&lt;/td&gt;
&lt;td&gt;Included in Plus tier limits&lt;/td&gt;
&lt;td&gt;Code sent to OpenAI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code + Hermes Agent (Nous Portal)&lt;/td&gt;
&lt;td&gt;$17–20/month + Nous Portal subscription&lt;/td&gt;
&lt;td&gt;Portal covers 300+ models&lt;/td&gt;
&lt;td&gt;Code sent to Anthropic; memory stays local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code + OpenClaw + local model&lt;/td&gt;
&lt;td&gt;$17–20/month&lt;/td&gt;
&lt;td&gt;API usage at current rates&lt;/td&gt;
&lt;td&gt;Code sent to Anthropic; memory on your machine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent + local model (full open)&lt;/td&gt;
&lt;td&gt;Free platform&lt;/td&gt;
&lt;td&gt;Local inference cost only&lt;/td&gt;
&lt;td&gt;Fully private — no third-party cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent OS platforms themselves — Hermes Agent (MIT) and OpenClaw (open source) — are free. Your costs are entirely model API fees. At the high end of usage, Claude Opus 4.8 is priced at $5 per million input tokens and $25 per million output tokens; Claude Sonnet 4.6 at $3/$15 per million tokens. &lt;em&gt;Verify current pricing at the provider's official documentation before budgeting, as these rates change.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The privacy picture is worth thinking through carefully. When you use Claude Code, your code goes to Anthropic. When you use Codex, your code goes to OpenAI. This is true regardless of which agent OS is doing the dispatching. The agent OS layer itself — your persistent memory, your skills, your conversation history — can remain entirely on your machine with both Hermes Agent (when self-hosted) and OpenClaw (by design).&lt;/p&gt;

&lt;p&gt;The maximally private configuration is Hermes Agent running locally against a self-hosted model. No code leaves your infrastructure at any layer. For most developers, this level of isolation isn't necessary — but for those working in regulated industries or with sensitive IP, it's worth knowing the option exists.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Stack, Assembled
&lt;/h2&gt;

&lt;p&gt;Let's answer the question in the title directly, and then give you the decision matrix to make it actionable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Should you still use Hermes Agent or OpenClaw if you're on Claude Code or Codex?&lt;/strong&gt; Yes — because they operate at different layers and are additive rather than competitive. Claude Code and Codex are the hands that do the coding work. Hermes Agent and OpenClaw are the coordinators that decide when to use those hands, what to do with the results, and how to surface them to you wherever you happen to be.&lt;/p&gt;

&lt;p&gt;The more interesting question — Hermes Agent &lt;em&gt;versus&lt;/em&gt; OpenClaw — is a genuine choice between two competing platforms at the same layer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Use When&lt;/th&gt;
&lt;th&gt;Skip When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code&lt;/td&gt;
&lt;td&gt;Layer 2 — Coding Agent&lt;/td&gt;
&lt;td&gt;Deep reasoning, large codebases, 1M context, enterprise security requirements&lt;/td&gt;
&lt;td&gt;You only need lightweight completions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Codex&lt;/td&gt;
&lt;td&gt;Layer 2 — Coding Agent&lt;/td&gt;
&lt;td&gt;Open-source CLI needed, GPT-5.3-Codex SWE model, explicit approval control&lt;/td&gt;
&lt;td&gt;You're fully invested in Anthropic ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hermes Agent&lt;/td&gt;
&lt;td&gt;Layer 3 — Agent OS&lt;/td&gt;
&lt;td&gt;Multi-platform (20+ channels), serverless deployment, MCP integration, voice, structured learning loop, power user workflows, migrating from OpenClaw&lt;/td&gt;
&lt;td&gt;You need the absolute simplest possible setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;td&gt;Layer 3 — Agent OS&lt;/td&gt;
&lt;td&gt;Local-first data ownership, personal feel, self-hackable architecture, three messaging platforms are enough&lt;/td&gt;
&lt;td&gt;You need enterprise breadth, serverless, or MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're choosing a Layer 3 platform for the first time: OpenClaw offers the lower-friction entry point and a genuinely personal AI OS experience. If you need platform breadth, serverless deployment, voice mode, MCP support, or 60-plus built-in tools from day one: Hermes Agent is the more mature and capable platform. And if you start with OpenClaw and later need more — &lt;code&gt;hermes claw migrate&lt;/code&gt; makes the transition mechanical rather than painful.&lt;/p&gt;

&lt;p&gt;The developers shipping fastest right now aren't using one AI tool instead of another. They're stacking deliberately: a model that can reason, a coding agent that can execute, and an agent OS that can coordinate. The question was never "Hermes Agent &lt;em&gt;or&lt;/em&gt; Claude Code." The question is what you're building at each layer — and whether the pieces you've chosen are actually talking to each other.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/nousresearch/hermes-agent" rel="noopener noreferrer"&gt;Hermes Agent on GitHub&lt;/a&gt; — MIT licensed, Nous Research&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt; — open-source personal AI OS by @steipete&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.anthropic.com/en/docs/claude-code" rel="noopener noreferrer"&gt;Claude Code documentation&lt;/a&gt; — Anthropic&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/openai/codex" rel="noopener noreferrer"&gt;OpenAI Codex CLI on GitHub&lt;/a&gt; — Apache 2.0&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Automating AI Agents at Scale: A Deep Dive into Azure AI Foundry Routines with Python</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Thu, 04 Jun 2026 08:17:13 +0000</pubDate>
      <link>https://dev.to/monuminu/automating-ai-agents-at-scale-a-deep-dive-into-azure-ai-foundry-routines-with-python-2n30</link>
      <guid>https://dev.to/monuminu/automating-ai-agents-at-scale-a-deep-dive-into-azure-ai-foundry-routines-with-python-2n30</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Learn how to build production-grade scheduled and time-triggered AI agent workflows using Azure AI Foundry Routines and Python. Deep dive into trigger types, dispatch mechanics, retry policies, run observability, and architectural best practices for Azure AI engineers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d5qj7fy7p58yw7fwh44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d5qj7fy7p58yw7fwh44.png" alt="Azure AI agents connected to a central cloud scheduler with calendar and clock icons on a dark azure-blue background" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Azure AI Foundry Routines — intelligent agents orchestrated on your schedule&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction: Your Agents Shouldn't Need a Babysitter&lt;/li&gt;
&lt;li&gt;What Are Azure AI Foundry Routines?&lt;/li&gt;
&lt;li&gt;Prerequisites &amp;amp; Environment Setup&lt;/li&gt;
&lt;li&gt;
Trigger Types Deep Dive

&lt;ul&gt;
&lt;li&gt;4.1 Schedule Trigger — Recurring Cron
&lt;/li&gt;
&lt;li&gt;4.2 Timer Trigger — One-Shot Execution
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Action Types: Responses API vs. Invocations API&lt;/li&gt;
&lt;li&gt;Routine Lifecycle Management&lt;/li&gt;
&lt;li&gt;Manual Dispatch &amp;amp; Testing&lt;/li&gt;
&lt;li&gt;Run Observability &amp;amp; History&lt;/li&gt;
&lt;li&gt;Retry Policy &amp;amp; Dispatch Behavior (Expert Corner)&lt;/li&gt;
&lt;li&gt;Production Best Practices &amp;amp; Architectural Patterns&lt;/li&gt;
&lt;li&gt;Known Limitations &amp;amp; What's Coming&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  1. Introduction: Your Agents Shouldn't Need a Babysitter
&lt;/h2&gt;

&lt;p&gt;Here's a scenario that's become all too familiar for Azure AI engineers: you've built a powerful AI agent — it summarizes overnight telemetry, generates weekly compliance reports, or fires contextual alerts based on operational data. The model is solid. The tooling is wired. The prompt is perfect. And yet, every morning, someone on the team has to manually kick it off. Maybe it's a cron job bolted onto a VM, maybe it's a Logic App invoking a Function invoking an endpoint, maybe it's just someone at 7 AM typing into a chat interface. In all cases, the agent itself is autonomous. The &lt;em&gt;orchestration&lt;/em&gt; isn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Foundry Routines&lt;/strong&gt; change that equation entirely. Introduced as a first-class preview feature in the Azure AI Foundry platform, Routines let you bind a time-based trigger directly to an agent invocation — no external scheduler, no glue infrastructure, no babysitter required. You define &lt;em&gt;when&lt;/em&gt; (a cron expression or a one-shot timestamp) and &lt;em&gt;what&lt;/em&gt; (which agent to invoke and through which API pathway), and Foundry handles everything else: queueing the run, tracking its state, recording the outcome, and retrying on transient failures.&lt;/p&gt;

&lt;p&gt;This post is a complete L500 deep dive. We'll go well beyond the "hello world" of creating a routine and into the mechanics that matter in production: trigger semantics and IANA timezone pitfalls, the architectural difference between &lt;code&gt;invoke_agent_responses_api&lt;/code&gt; and &lt;code&gt;invoke_agent_invocations_api&lt;/code&gt;, dispatch acknowledgment vs. completion guarantees, retry policy internals, run observability patterns, and multi-routine composition strategies. All examples are in Python using the &lt;code&gt;azure-ai-projects&lt;/code&gt; SDK.&lt;/p&gt;

&lt;p&gt;Let's build something that actually runs itself.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Are Azure AI Foundry Routines?
&lt;/h2&gt;

&lt;p&gt;At its core, an &lt;strong&gt;Azure AI Foundry Routine&lt;/strong&gt; is a named automation rule that lives on the Foundry data plane of your project. Its anatomy is simple but powerful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Routine = Trigger (when) + Action (what) + Run Record (observable history)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a trigger fires — either on a recurring cron schedule or at a specific point in time — Foundry enqueues an agent invocation asynchronously. The delivery worker picks up the message, calls the downstream agent API, and writes a run record you can inspect programmatically or via the Foundry portal. The routine itself is stateless between runs; state continuity is managed at the agent level through &lt;code&gt;conversation_id&lt;/code&gt; or &lt;code&gt;session_id&lt;/code&gt; fields in the action definition.&lt;/p&gt;

&lt;p&gt;This is a meaningful architectural shift. Traditionally, scheduled AI workloads required external orchestrators (Azure Scheduler, Logic Apps, ADF pipelines) that treated agent invocation as a side-effect of a larger workflow. With Routines, the scheduling primitive is &lt;em&gt;colocated with the agent platform itself&lt;/em&gt;, which means tighter integration, first-class observability, and zero bootstrapping overhead.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key mental model:&lt;/strong&gt; A Routine is not a workflow engine. It's a dispatch primitive — a lightweight, durable binding between a time signal and an agent endpoint, with built-in retry semantics and a queryable run log.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;When&lt;/em&gt; to fire — &lt;code&gt;schedule&lt;/code&gt; (cron) or &lt;code&gt;timer&lt;/code&gt; (one-shot)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;What&lt;/em&gt; to invoke — agent via Responses API or Invocations API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Run Record&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;em&gt;Observable history&lt;/em&gt; — phase, timestamps, error details, dispatch IDs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Prerequisites &amp;amp; Environment Setup
&lt;/h2&gt;

&lt;p&gt;Before writing a single line of routine code, make sure the following conditions are met.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional availability (preview):&lt;/strong&gt; Routines are currently available only in these Azure regions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;East US / East US 2&lt;/li&gt;
&lt;li&gt;West US / West US 2 / West Central US / North Central US&lt;/li&gt;
&lt;li&gt;Sweden Central&lt;/li&gt;
&lt;li&gt;Japan East&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your Foundry project is provisioned outside these regions, the &lt;strong&gt;Routines&lt;/strong&gt; menu will not appear in the portal and SDK calls will fail. Confirm your project region before proceeding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC:&lt;/strong&gt; Your identity (user or service principal) must hold the &lt;strong&gt;Foundry User&lt;/strong&gt; role or higher on the project scope. Note that this role was recently renamed from &lt;em&gt;Azure AI User&lt;/em&gt; — the role ID and permissions are unchanged, but portal displays may still show the old name during rollout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent identity requirement:&lt;/strong&gt; Any agent bound to a routine action must have a configured &lt;strong&gt;agent identity&lt;/strong&gt;. Prompt-only agents are explicitly rejected by the service at routine creation time — not at invocation time, so this surfaces early.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SDK installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the Azure AI Projects SDK (v2.2.0+ required for routines + duration shorthand)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"azure-ai-projects&amp;gt;=2.2.0"&lt;/span&gt;

&lt;span class="c"&gt;# Install Azure Identity for credential management&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;azure-identity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;&amp;gt;=2.2.0&lt;/code&gt; specifically?&lt;/strong&gt; Version 2.2.0 introduced the &lt;code&gt;client.beta.routines&lt;/code&gt; namespace &lt;em&gt;and&lt;/em&gt; the duration shorthand syntax for timer triggers (&lt;code&gt;"30m"&lt;/code&gt;, &lt;code&gt;"2h"&lt;/code&gt;). Earlier versions will not have the &lt;code&gt;routines&lt;/code&gt; attribute on the &lt;code&gt;beta&lt;/code&gt; client.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Client initialization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="c1"&gt;# Set via environment for portability across local dev, CI, and Azure-hosted workloads
# Format: https://&amp;lt;account&amp;gt;.services.ai.azure.com/api/projects/&amp;lt;project&amp;gt;
&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# DefaultAzureCredential handles: local az login, managed identity, workload identity
# No explicit token management required — the SDK handles the Bearer flow internally
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Verify connectivity — list existing routines (returns empty iterator if none exist)
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connected. Existing routines:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  enabled=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production tip:&lt;/strong&gt; In AKS or Azure Container Apps workloads, use workload identity federation instead of a managed identity secret. &lt;code&gt;DefaultAzureCredential&lt;/code&gt; resolves this automatically when the &lt;code&gt;AZURE_CLIENT_ID&lt;/code&gt; environment variable is injected by the workload identity webhook.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Trigger Types Deep Dive
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry Routines support exactly two trigger types in the current preview. Understanding their behavioral semantics — not just their syntax — is critical for production correctness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc38m9nckwhk37skn76g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcc38m9nckwhk37skn76g.png" alt="Comparison diagram showing Schedule Trigger with a clock on the left and One-Shot Timer Trigger on the right" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Schedule triggers recur indefinitely on a cron expression; timer triggers fire exactly once&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 Schedule Trigger — Recurring Cron
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;schedule&lt;/code&gt; trigger fires repeatedly based on a &lt;strong&gt;5-field cron expression&lt;/strong&gt;. The service enforces a hard minimum interval of &lt;strong&gt;five minutes&lt;/strong&gt; — cron expressions that resolve to a sub-five-minute cadence are rejected at creation time with a 400 error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cron field order:&lt;/strong&gt; &lt;code&gt;minute hour day-of-month month day-of-week&lt;/code&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cron Expression&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0 7 * * 1-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every weekday at 07:00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;*/15 9-17 * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every 15 min between 09:00–17:00 daily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0 0 1 * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;First of every month at midnight&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;0 */6 * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Every 6 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;time_zone&lt;/code&gt; field is &lt;strong&gt;required&lt;/strong&gt; and accepts any IANA timezone identifier (e.g., &lt;code&gt;America/Los_Angeles&lt;/code&gt;, &lt;code&gt;Europe/Stockholm&lt;/code&gt;, &lt;code&gt;Asia/Tokyo&lt;/code&gt;) or Windows timezone names.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Timezone pitfall:&lt;/strong&gt; The Foundry portal interprets trigger time in your &lt;em&gt;browser's local timezone&lt;/em&gt;. For production routines, always create them programmatically via the SDK or REST API and explicitly set &lt;code&gt;time_zone&lt;/code&gt; to an IANA zone. This eliminates any ambiguity from DST transitions or user locale.&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- Schedule Trigger: Recurring Cron ---
# Fires every weekday at 07:00 UTC
# cron_expression and time_zone are BOTH required fields
&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runs the ops-summary agent every weekday morning at 07:00 UTC.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Key is a logical name for the trigger — must be unique within the routine
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekday-morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 7 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Required: 5-field cron, min 5-min interval
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;# Required: IANA or Windows TZ identifier
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;# Required: project-scoped agent name (max 256 chars)
&lt;/span&gt;        &lt;span class="c1"&gt;# "conversation_id": "conv-abc123",       # Optional: continue an existing conversation thread
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routine &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; created.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Enabled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Triggers: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Created at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One trigger per routine (preview constraint):&lt;/strong&gt; The &lt;code&gt;triggers&lt;/code&gt; map supports exactly one entry in V1Preview. To fan out to multiple schedules for the same agent, create separate routines.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4.2 Timer Trigger — One-Shot Execution
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;timer&lt;/code&gt; trigger fires &lt;strong&gt;exactly once&lt;/strong&gt; at a future point in time. It's the right primitive for release-day tasks, scheduled data migrations, model warm-up before a known traffic spike, or delayed execution after a deployment.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;at&lt;/code&gt; field accepts three formats:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ISO 8601 with UTC offset&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"2026-09-01T09:00:00Z"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Absolute time, UTC pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local timestamp + &lt;code&gt;time_zone&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"2026-09-01T09:00:00"&lt;/code&gt; + &lt;code&gt;"America/New_York"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Business-time semantics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duration from now&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;"30m"&lt;/code&gt;, &lt;code&gt;"2h"&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Relative scheduling, post-deploy triggers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- Timer Trigger: Absolute ISO 8601 ---
# Fires once on September 1st 2026 at 09:00 UTC
&lt;/span&gt;&lt;span class="n"&gt;release_day_routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;release-day-announcement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fires the release-announcement agent once on go-live day.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;release-day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-09-01T09:00:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Required: future ISO 8601 timestamp (UTC)
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Release day routine scheduled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;release_day_routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Timer Trigger: Duration Shorthand (requires azure-ai-projects &amp;gt;= 2.2.0) ---
# Fires 30 minutes from now — useful for post-deploy warm-up or deferred execution
&lt;/span&gt;&lt;span class="n"&gt;deferred_routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post-deploy-warmup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Runs model warm-up agent 30 minutes after deployment completes.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warmup-delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# Duration shorthand: fires 30 minutes from routine creation
&lt;/span&gt;        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_invocations_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Use Invocations API for session continuity
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# "session_id": "sess-xyz",              # Optional: continue an existing session
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Post-deploy warmup scheduled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;deferred_routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Timer routines fire exactly once and do not reschedule themselves. After the trigger fires, the routine remains in the system with its run record but will not fire again.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Action Types: Responses API vs. Invocations API
&lt;/h2&gt;

&lt;p&gt;Choosing the wrong action type is one of the subtler production mistakes with Routines. Both types invoke an agent, but they differ in &lt;strong&gt;state model, session scope, and continuation semantics&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;&lt;code&gt;invoke_agent_responses_api&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;invoke_agent_invocations_api&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API pathway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Responses API&lt;/td&gt;
&lt;td&gt;Invocations API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State continuity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;conversation_id&lt;/code&gt; — continues a conversation thread&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;session_id&lt;/code&gt; — continues a hosted-agent session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optional context field&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;conversation_id&lt;/code&gt; (max 256 chars)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;session_id&lt;/code&gt; (max 256 chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless runs or conversation continuity&lt;/td&gt;
&lt;td&gt;Session-scoped, stateful agent invocations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Action Type Comparison ---
&lt;/span&gt;
&lt;span class="c1"&gt;# Option A: Responses API — stateless OR conversation-threaded
&lt;/span&gt;&lt;span class="n"&gt;responses_action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Required: project-scoped name
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conv-abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Optional: pass to continue an existing thread
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Option B: Invocations API — session-scoped stateful agent
&lt;/span&gt;&lt;span class="n"&gt;invocations_action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_invocations_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Required: endpoint-scoped name
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sess-xyz456&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# Optional: continue an existing hosted session
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create two variants of the same routine with different action pathways
&lt;/span&gt;&lt;span class="n"&gt;responses_routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary-via-responses&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily summary using Responses API — stateless per run.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 8 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;responses_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;invocations_routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary-via-invocations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily summary using Invocations API — maintains session state.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 8 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;invocations_action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Decision heuristic:&lt;/strong&gt; Use &lt;code&gt;invoke_agent_responses_api&lt;/code&gt; for &lt;strong&gt;stateless or conversation-threaded&lt;/strong&gt; workloads (daily reports, alerts, summaries). Use &lt;code&gt;invoke_agent_invocations_api&lt;/code&gt; when your agent relies on &lt;strong&gt;session-scoped memory or tool state&lt;/strong&gt; that must persist across scheduled invocations.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  6. Routine Lifecycle Management
&lt;/h2&gt;

&lt;p&gt;Routines support a full lifecycle: creation (via &lt;code&gt;create_or_update&lt;/code&gt;), pause/resume, update, and deletion. Understanding the semantics of each operation matters when managing a fleet of routines in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;agent_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- DISABLE: Pauses the routine without deleting it ---
# Useful for incident response — pause a misfiring routine without losing its config
&lt;/span&gt;&lt;span class="n"&gt;disabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;disable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After disable — enabled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;disabled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False
&lt;/span&gt;
&lt;span class="c1"&gt;# --- ENABLE: Re-activates a paused routine ---
&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After enable — enabled: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# True
&lt;/span&gt;
&lt;span class="c1"&gt;# --- UPDATE: Full replace semantics ---
# IMPORTANT: Omitted optional fields reset to their defaults — always supply the full definition
&lt;/span&gt;&lt;span class="n"&gt;updated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updated: shifted to 08:00 UTC post-DST review.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekday-morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 8 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Shifted from 07:00 to 08:00
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updated at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;updated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- LIST: Enumerate all routines in the project ---
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  enabled=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  triggers=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- GET: Retrieve a specific routine by name ---
&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fetched: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, action_type=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- DELETE: Removes the routine and stops all future trigger deliveries ---
# Existing run records are PRESERVED after deletion
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post-deploy-warmup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;post-deploy-warmup deleted. Run history preserved.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Update semantics — the full-replace trap:&lt;/strong&gt; &lt;code&gt;create_or_update&lt;/code&gt; performs a complete replacement. There is no partial PATCH in V1Preview. Always supply the complete definition on every update. A common pattern is to &lt;code&gt;get()&lt;/code&gt; first, mutate the fields you need, and pass the full object back.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Manual Dispatch &amp;amp; Testing
&lt;/h2&gt;

&lt;p&gt;Before relying on scheduled triggers in production, validate that a routine correctly reaches your agent. The &lt;code&gt;dispatch&lt;/code&gt; operation queues a one-off run immediately, bypassing the trigger schedule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- Manual dispatch for a Responses API routine ---
# The payload 'type' must match the routine's action type exactly
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Optional: override the routine's configured prompt for this test run only
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a concise test summary for the last 15 minutes of telemetry.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# dispatch() returns immediately — the run is enqueued, not completed
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dispatch_id:           &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dispatch_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_correlation_id: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_correlation_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id:               &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Poll run history to confirm completion ---
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Waiting for run to complete...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Allow time for async delivery
&lt;/span&gt;
&lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest run: id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  phase=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  source=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempt_source&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ERROR: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Production Warning:&lt;/strong&gt; Only use &lt;code&gt;dispatch&lt;/code&gt; (which maps to the &lt;code&gt;:dispatch_async&lt;/code&gt; route) for manual runs. The legacy &lt;code&gt;:dispatch&lt;/code&gt; route is &lt;strong&gt;not&lt;/strong&gt; part of the public contract and may break without notice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgment != Completion:&lt;/strong&gt; &lt;code&gt;dispatch()&lt;/code&gt; returns the moment the run is &lt;em&gt;enqueued&lt;/em&gt;, not when the downstream agent has finished executing. Use the &lt;code&gt;dispatch_id&lt;/code&gt; to correlate with run history and confirm &lt;code&gt;phase == "completed"&lt;/code&gt; before treating the work as done.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  8. Run Observability &amp;amp; History
&lt;/h2&gt;

&lt;p&gt;Every routine invocation generates a run record — the primary observability surface for routine-based workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;

&lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# --- List all runs for a routine ---
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Run ID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Phase&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Source&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Started&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempt_source&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;started_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ERROR: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  dispatch_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dispatch_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  response_id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Production pattern: consecutive failure alerting ---
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_routine_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Returns False if the N most recent runs are all failures.
    Integrate with Azure Monitor, PagerDuty, or your alerting pipeline.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;max_failures&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALERT: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; has &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; consecutive failures!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

&lt;span class="n"&gt;is_healthy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_routine_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routine health check: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;OK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_healthy&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DEGRADED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Run record fields reference:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Unique run identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;phase&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Fine-grained outcome: &lt;code&gt;completed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trigger_type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;schedule&lt;/code&gt;, &lt;code&gt;timer&lt;/code&gt;, or &lt;code&gt;manual&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attempt_source&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;schedule_delivery&lt;/code&gt;, &lt;code&gt;timer_delivery&lt;/code&gt;, &lt;code&gt;manual_dispatch&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dispatch_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Correlates manual dispatch results to run records&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;response_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Downstream Responses API response ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;error_type&lt;/code&gt; / &lt;code&gt;error_message&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Populated only when &lt;code&gt;phase == "failed"&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  9. Retry Policy &amp;amp; Dispatch Behavior (Expert Corner)
&lt;/h2&gt;

&lt;p&gt;This is where L500-level reasoning separates production-grade routine deployments from brittle ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps0duhtwptf2ztr4vqwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps0duhtwptf2ztr4vqwt.png" alt="Retry policy flow diagram with green success path and red failure path with exponential backoff" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Azure AI Foundry Routines retry policy — 3 attempts with exponential backoff, terminal on non-retryable 4xx&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Default Retry Configuration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total attempts&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backoff strategy&lt;/td&gt;
&lt;td&gt;Exponential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Initial backoff&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max backoff cap&lt;/td&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-attempt HTTP timeout&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  HTTP Response Classification
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;HTTP Result&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2xx&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Run marked &lt;strong&gt;completed&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;408&lt;/code&gt;, &lt;code&gt;429&lt;/code&gt;, &lt;code&gt;5xx&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Retryable&lt;/strong&gt; — retried up to attempt limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other &lt;code&gt;4xx&lt;/code&gt; (e.g., &lt;code&gt;400&lt;/code&gt;, &lt;code&gt;404&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Terminal&lt;/strong&gt; — run immediately marked &lt;strong&gt;failed&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request timeout / transient failure&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Retryable&lt;/strong&gt; while attempts remain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  The 30-Second Agent Response Window
&lt;/h3&gt;

&lt;p&gt;The 30-second per-attempt timeout applies to the &lt;strong&gt;downstream HTTP request to the agent endpoint&lt;/strong&gt;, not to total end-to-end agent execution time. If your agent's initial acknowledgment takes longer than 30 seconds — due to cold starts, heavy tool loading, or model warm-up — Foundry will treat the request as timed out and retry it.&lt;/p&gt;

&lt;p&gt;This creates a subtle race condition: if the first attempt times out but the agent actually started processing, you may end up with &lt;strong&gt;two concurrent agent runs&lt;/strong&gt; for the same routine invocation. For idempotency-sensitive workloads, your agent must handle duplicate invocations gracefully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# --- Production pattern: idempotent agent invocation via conversation_id ---
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_idempotent_conversation_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate a deterministic conversation_id stable across retry attempts
    for the same logical scheduled run. Retried invocations converge on the
    same conversation thread rather than spawning independent parallel executions.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conv-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;conv_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_idempotent_conversation_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Idempotent conversation_id for today: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;conv_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use this conversation_id in the routine action for retry safety
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-ops-summary-idempotent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ops summary with idempotent conversation threading.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekday-morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 7 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AGENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversation_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conv_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Stable ID prevents duplicate parallel runs on retry
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  10. Production Best Practices &amp;amp; Architectural Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Multi-Routine Fan-Out for Regional Workloads
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;REGIONAL_SCHEDULES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 7 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;America/New_York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 7 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Europe/London&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apac-east&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 7 * * 1-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Asia/Tokyo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sched&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;REGIONAL_SCHEDULES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;routine_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily-summary-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily summary for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at local 07:00.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schedule&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron_expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cron&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time_zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Business-time semantics per region
&lt;/span&gt;            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_agent_responses_api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 2: CI/CD-Integrated Routine Lifecycle
&lt;/h3&gt;

&lt;p&gt;Manage routine definitions as &lt;strong&gt;versioned YAML artifacts&lt;/strong&gt; in your repository and upsert them on every deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deploy_routines_from_manifest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Deploy all routines defined in a YAML manifest file.
    Idempotent: create_or_update handles both create and update cases.
    Safe to run on every deployment.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;manifest_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;safe_load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;routine_def&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_or_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_def&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;triggers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_def&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;routine_def&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deployed routine: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  (updated_at=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# deploy_routines_from_manifest("infra/routines/production.yaml")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pattern 3: Azure Monitor Observability Integration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;emit_routine_runs_to_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Forward run records to Azure Monitor Logs for dashboarding and alerting.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;runs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;log_entries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TimeGenerated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;started_at&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;started_at&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RoutineName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;routine_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RunId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;phase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AttemptSource&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attempt_source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DispatchId&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dispatch_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ErrorType&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_type&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ErrorMessage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DurationSeconds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ended_at&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;started_at&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;started_at&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ended_at&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runs&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;logs_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rule_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stream_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;log_entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Emitted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;log_entries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; run records to Azure Monitor.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  11. Known Limitations &amp;amp; What's Coming
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;th&gt;Impact&lt;/th&gt;
&lt;th&gt;Workaround&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1 trigger + 1 action per routine&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No compound triggers or multi-agent chains&lt;/td&gt;
&lt;td&gt;Create separate routines; use agent-side orchestration for chaining&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No event-based triggers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cannot react to blob uploads, queue messages, HTTP webhooks&lt;/td&gt;
&lt;td&gt;Chain with Logic Apps or Event Grid + Function + &lt;code&gt;dispatch&lt;/code&gt; via REST&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cron minimum: 5 minutes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-5-min automation not possible&lt;/td&gt;
&lt;td&gt;Use Azure Functions Timer Trigger for sub-5-min workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;30s per-attempt HTTP timeout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow-starting agents may time out and retry&lt;/td&gt;
&lt;td&gt;Optimize agent cold start; use warm session pools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Regional preview only&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not available in all Azure regions&lt;/td&gt;
&lt;td&gt;Confirm project region before provisioning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Acknowledgment != end-to-end completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;phase=completed&lt;/code&gt; means HTTP 2xx, not agent task done&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;dispatch_id&lt;/code&gt; + &lt;code&gt;response_id&lt;/code&gt; to poll downstream completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent identity required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompt-only agents are rejected&lt;/td&gt;
&lt;td&gt;Configure agent identity in Foundry portal before binding to routine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legacy &lt;code&gt;:dispatch&lt;/code&gt; route unsupported&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Direct REST callers may break&lt;/td&gt;
&lt;td&gt;Always use &lt;code&gt;:dispatch_async&lt;/code&gt; exclusively&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Azure AI Foundry Routines&lt;/strong&gt; represent a meaningful maturation of the Azure AI platform — one that shifts agent orchestration from an infrastructure concern into a first-class platform primitive. For Azure AI engineers and ML engineers building production-grade AI systems, the implications are concrete: fewer external dependencies, tighter observability, and agents that genuinely run themselves on the schedules your business requires.&lt;/p&gt;

&lt;p&gt;In this deep dive, we covered the full lifecycle of Azure AI Foundry Routines in Python:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger architecture&lt;/strong&gt; — cron-based &lt;code&gt;schedule&lt;/code&gt; triggers with IANA timezone semantics and one-shot &lt;code&gt;timer&lt;/code&gt; triggers with flexible &lt;code&gt;at&lt;/code&gt; formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action pathways&lt;/strong&gt; — &lt;code&gt;invoke_agent_responses_api&lt;/code&gt; for conversation-threaded workloads vs. &lt;code&gt;invoke_agent_invocations_api&lt;/code&gt; for session-scoped state continuity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle operations&lt;/strong&gt; — idiomatic CRUD, full-replace semantics of &lt;code&gt;create_or_update&lt;/code&gt;, and safe enable/disable patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual dispatch&lt;/strong&gt; — using &lt;code&gt;:dispatch_async&lt;/code&gt; for testing with input overrides and &lt;code&gt;dispatch_id&lt;/code&gt;-based correlation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run observability&lt;/strong&gt; — structured run records, failure alerting, and Azure Monitor integration patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry internals&lt;/strong&gt; — 3-attempt exponential backoff, the 30-second per-attempt timeout, and idempotency design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production patterns&lt;/strong&gt; — multi-routine fan-out, CI/CD-integrated manifest deployments, and regional schedule management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Start with a single daily routine. Wire it to your most repetitive, highest-value agent task. Measure it for a week.&lt;/strong&gt; The operational lift you get back is immediate — and the architectural clarity of a platform-native scheduler over external glue code compounds over time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ready to build?&lt;/strong&gt; Install &lt;code&gt;azure-ai-projects&amp;gt;=2.2.0&lt;/code&gt;, point it at your Foundry project endpoint, and wire your first Azure AI Foundry Routine in under 15 minutes. The agents are waiting — give them a schedule.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;References:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/foundry/agents/how-to/use-routines?pivots=programming-language-python" rel="noopener noreferrer"&gt;Azure AI Foundry Routines — Official Documentation&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://pypi.org/project/azure-ai-projects/" rel="noopener noreferrer"&gt;azure-ai-projects Python SDK on PyPI&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://ai.azure.com" rel="noopener noreferrer"&gt;Azure AI Foundry Portal&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential" rel="noopener noreferrer"&gt;DefaultAzureCredential — azure-identity&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>python</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Computer Use Agents Go Local: A Deep Technical Dive into On-Device GUI Automation, Quantized Inference &amp; Holo3.1</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Wed, 03 Jun 2026 04:48:07 +0000</pubDate>
      <link>https://dev.to/monuminu/computer-use-agents-go-local-a-deep-technical-dive-into-on-device-gui-automation-quantized-2m3g</link>
      <guid>https://dev.to/monuminu/computer-use-agents-go-local-a-deep-technical-dive-into-on-device-gui-automation-quantized-2m3g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Learn how to build production-grade local computer use agents using Holo3.1's quantized model family (FP8/NVFP4/GGUF). Deep dive into quantization tradeoffs, the perceive-decide-act loop, Python code examples, and multi-agent orchestration patterns — with zero data leaving your machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Computer Use Agents Go Local: A Deep Technical Dive into On-Device GUI Automation, Quantized Inference &amp;amp; Holo3.1
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Published: June 3, 2026 · 14 min read&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction: The Privacy Inflection Point&lt;/li&gt;
&lt;li&gt;What Are Computer Use Agents?&lt;/li&gt;
&lt;li&gt;The Case for Local Inference&lt;/li&gt;
&lt;li&gt;Holo3.1: Architecture and Model Family&lt;/li&gt;
&lt;li&gt;Quantization Deep Dive: FP8, NVFP4, and Q4 GGUF&lt;/li&gt;
&lt;li&gt;Setting Up a Local Computer Use Agent&lt;/li&gt;
&lt;li&gt;Building the Screenshot → Action Loop&lt;/li&gt;
&lt;li&gt;Integrating with Agent Frameworks&lt;/li&gt;
&lt;li&gt;Benchmarking Your Agent: OSWorld &amp;amp; AndroidWorld&lt;/li&gt;
&lt;li&gt;Production Deployment Patterns&lt;/li&gt;
&lt;li&gt;The Road Ahead&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Introduction: The Privacy Inflection Point {#introduction}
&lt;/h2&gt;

&lt;p&gt;What if your AI agent never sent a single byte to the cloud?&lt;/p&gt;

&lt;p&gt;Every screenshot it captures, every form it fills, every file it reads — all processed locally, on your hardware, under your control. No API keys. No data-in-transit risk. No per-token billing shock at the end of the month.&lt;/p&gt;

&lt;p&gt;For two years, computer use agents — AI systems that operate GUIs like a human would — have been almost exclusively a cloud-first technology. OpenAI's Operator, Anthropic's Computer Use, and early Holo3 all required round-trips to remote inference endpoints. You'd snap a screenshot, ship it over HTTPS, wait hundreds of milliseconds for a decision, then execute the action. Functional, but fundamentally leaky.&lt;/p&gt;

&lt;p&gt;That changed on &lt;strong&gt;June 2, 2026&lt;/strong&gt;, when Hcompany released &lt;strong&gt;Holo3.1&lt;/strong&gt;: the first production-grade computer use model family to ship quantized checkpoints — FP8, NVFP4, and Q4 GGUF — purpose-built for fully local inference. Simultaneously, the developer internet was buzzing with a 716-upvote Hacker News post proving that a 2016 Intel Xeon with no GPU could run Gemma 4 at acceptable speeds using aggressive speculative decoding. The message was unmistakable: &lt;strong&gt;the local inference era for agentic AI has arrived.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this post, we'll go deep — from the architecture of Holo3.1 and the mathematics of quantization, through to fully working Python code that builds a local computer use agent loop from scratch, and finally to production deployment patterns for teams shipping these systems at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folyj1rqzqhqckn20kazo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folyj1rqzqhqckn20kazo.png" alt="Local Computer Use Agent Loop Diagram" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The perceive-decide-act loop running entirely on-device. No data leaves the machine.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  2. What Are Computer Use Agents? {#what-are-computer-use-agents}
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;computer use agent (CUA)&lt;/strong&gt; is an AI system that controls a computer's graphical interface — the same way a human does — by interpreting screenshots and emitting discrete GUI actions. Unlike chat-based agents that call APIs or manipulate text, CUAs operate at the &lt;strong&gt;pixel and interaction layer&lt;/strong&gt;: they see what's on screen and decide where to click, what to type, when to scroll, and how to navigate.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Action Space
&lt;/h3&gt;

&lt;p&gt;A typical CUA action space includes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;click(x, y)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Left-click at pixel coordinates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;double_click(x, y)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Double-click to open a file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;type(text)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Keyboard input into focused element&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;scroll(x, y, direction, amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scroll in any direction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;key(combo)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Keyboard shortcuts (e.g., &lt;code&gt;Ctrl+C&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;screenshot()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Capture current screen state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;drag(x1, y1, x2, y2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Click-drag for selections or UI moves&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The agent perceives the world exclusively through &lt;strong&gt;screenshots&lt;/strong&gt; (or, in more advanced setups, accessibility tree data), reasons about the current state vs. the goal, and emits the next best action. This loop continues until the task is complete or a termination condition is reached.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why CUAs Matter for Developers
&lt;/h3&gt;

&lt;p&gt;CUAs are compelling for automating workflows that have &lt;strong&gt;no API&lt;/strong&gt; — legacy enterprise software, internal web apps, desktop tools that predate REST, and mobile applications. They're the ultimate last-resort automation layer. More practically, they can now act as the &lt;em&gt;hands&lt;/em&gt; of a larger agentic system: an orchestrator LLM can delegate "book the flight" to a CUA without caring that the airline website doesn't expose a machine-readable endpoint.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Case for Local Inference {#the-case-for-local-inference}
&lt;/h2&gt;

&lt;p&gt;Before we dive into Holo3.1 specifically, it's worth grounding why &lt;strong&gt;local inference&lt;/strong&gt; matters so dramatically for computer use agents versus, say, a text summarization task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fse8n7y3tg738mbtkv7c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fse8n7y3tg738mbtkv7c0.png" alt="Cloud vs Local Inference Comparison" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cloud inference introduces latency, cost, and privacy exposure at every agent step.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Privacy: Screenshots Are High-Entropy Data Leaks
&lt;/h3&gt;

&lt;p&gt;When a CUA takes a screenshot to send to a cloud API, it captures everything visible on screen: email contents, financial data, proprietary code, patient information, internal tools, authentication tokens visible in browser address bars. Every API call is a potential compliance violation. For enterprise deployments under HIPAA, SOC 2, GDPR, or internal data classification policies, this is a blocker — not a concern.&lt;/p&gt;

&lt;p&gt;Local inference eliminates the exfiltration surface entirely.&lt;/p&gt;
&lt;h3&gt;
  
  
  Latency: Every Step Counts
&lt;/h3&gt;

&lt;p&gt;A cloud-hosted CUA with ~450ms round-trip latency executing a 20-step task accumulates &lt;strong&gt;9 full seconds of pure network wait time&lt;/strong&gt; — before accounting for inference time. Local inference on a quantized model can cut step time from 6.8 seconds (FP8 on DGX Spark) to 3.3 seconds (NVFP4 on DGX Spark) — a &lt;strong&gt;~2× end-to-end speedup&lt;/strong&gt; demonstrated in Holo3.1's own benchmarks. For interactive workflows, this is the difference between a tool that feels snappy and one that feels broken.&lt;/p&gt;
&lt;h3&gt;
  
  
  Cost at Scale
&lt;/h3&gt;

&lt;p&gt;Consider a workflow executing 1,000 agent steps per day. At a hypothetical $0.01/step cloud API cost, that's $10/day or $3,650/year — just for inference. A local DGX Spark (or even a gaming GPU) amortizes its cost across unlimited inference. The break-even point arrives faster than most teams expect, especially once multi-agent workflows start multiplying step counts.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Memory Wall: Why Local Inference Is Harder Than It Looks
&lt;/h3&gt;

&lt;p&gt;Local inference isn't free of engineering challenges. The fundamental constraint for LLM inference on any hardware is &lt;strong&gt;memory bandwidth&lt;/strong&gt;, not compute throughput. During token generation (the "decoding pass"), the processor must stream the entire model's weights from RAM into cache for every single token emitted.&lt;/p&gt;

&lt;p&gt;On a 2016 Xeon with DDR3 RAM — as demonstrated in the viral Hacker News post — this bottleneck is severe. The CPU is often idle, waiting for data to arrive over the memory bus. The same constraint applies on GPUs (HBM bandwidth limits) and Apple Silicon (unified memory limits). Every quantization format, every inference optimization, and every speculative decoding technique ultimately aims to defeat this memory wall.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. Holo3.1: Architecture and Model Family {#holo31-architecture}
&lt;/h2&gt;

&lt;p&gt;Holo3.1 is built on the &lt;strong&gt;Qwen family&lt;/strong&gt; of base models and fine-tuned with a mixture of web interaction data, desktop automation traces, mobile UI trajectories, and synthetic function-calling demonstrations. The training objective is to maximize task completion rate across three production axes: &lt;strong&gt;environment robustness&lt;/strong&gt;, &lt;strong&gt;framework interoperability&lt;/strong&gt;, and &lt;strong&gt;deployment flexibility&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Model Family
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Holo3.1-0.8B&lt;/td&gt;
&lt;td&gt;0.8B&lt;/td&gt;
&lt;td&gt;Ultra-lightweight edge agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Holo3.1-4B&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Cost-efficient cloud/local deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Holo3.1-9B&lt;/td&gt;
&lt;td&gt;9B&lt;/td&gt;
&lt;td&gt;Balanced performance and step latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Holo3.1-35B-A3B&lt;/td&gt;
&lt;td&gt;35B total / 3B active (MoE)&lt;/td&gt;
&lt;td&gt;State-of-the-art task completion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;35B-A3B&lt;/code&gt; suffix on the flagship model signals its &lt;strong&gt;Mixture-of-Experts (MoE)&lt;/strong&gt; architecture: 35 billion total parameters with only ~3 billion active per forward pass. This dramatically reduces per-token compute cost while retaining the capacity of a much larger dense model — a design pattern that has proven critical for making large models viable at inference time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Improvements Over Holo3
&lt;/h3&gt;

&lt;p&gt;The single biggest complaint from Holo3 production deployments was &lt;strong&gt;distribution shift&lt;/strong&gt;: strong benchmark performance that failed to transfer when the model ran inside a third-party agent harness, on a different OS, or against a mobile UI. Holo3.1 addresses this through three targeted improvements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Cross-Harness Function Calling&lt;/strong&gt;&lt;br&gt;
Holo3.1 now natively supports OpenAI-compatible function-calling protocol in addition to its structured JSON action outputs. This means the same model can plug into LangGraph, CrewAI, AutoGen, or a custom harness without adapter layers. On Holo3.1's internal benchmark suite, function-calling and native JSON execution now achieve &lt;strong&gt;near-parity performance&lt;/strong&gt;, eliminating the 10–15% gap that plagued Holo3 in third-party integrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Mobile GUI Support&lt;/strong&gt;&lt;br&gt;
Holo3.1 was trained on Android UI traces, pushing AndroidWorld benchmark scores from 67% to &lt;strong&gt;79.3%&lt;/strong&gt; on the 35B model and from 58% to &lt;strong&gt;72%&lt;/strong&gt; on the 4B model. Mobile automation — navigating apps, filling forms, handling touch targets — is now a first-class capability rather than a side effect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Quantized Checkpoints for Local Deployment&lt;/strong&gt;&lt;br&gt;
For the first time, Hcompany ships quantized weights alongside the full-precision model. This is the key to local inference viability, and we'll spend the next section unpacking what each format means technically.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Quantization Deep Dive: FP8, NVFP4, and Q4 GGUF {#quantization-deep-dive}
&lt;/h2&gt;

&lt;p&gt;Quantization is the process of reducing the numerical precision of model weights (and optionally activations) to use fewer bits per value. Lower bits = smaller model = less data to move across the memory bus per token = faster inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7gtcxobv3tn3z4vuo6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7gtcxobv3tn3z4vuo6a.png" alt="Quantization Format Comparison" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;FP8, NVFP4, and Q4-GGUF offer different precision/throughput/memory tradeoffs. Choose based on your hardware and latency requirements.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  FP8 (8-bit Floating Point)
&lt;/h3&gt;

&lt;p&gt;FP8 uses 8 bits per weight value in a floating-point format (either E4M3 or E5M2). Compared to the standard BF16 (16-bit brain float), FP8 halves model memory footprint while maintaining most of the dynamic range. On NVIDIA H100/H200 and Ada Lovelace GPUs, FP8 is hardware-accelerated via dedicated tensor core units.&lt;/p&gt;

&lt;p&gt;For Holo3.1's 35B-A3B model, FP8 achieves &lt;strong&gt;the same OSWorld score&lt;/strong&gt; as full BF16 — meaning the quantization error is within benchmark noise. It's the safest starting point for any GPU-equipped deployment.&lt;/p&gt;
&lt;h3&gt;
  
  
  NVFP4 (4-bit with W4A16 configuration)
&lt;/h3&gt;

&lt;p&gt;NVFP4 uses NVIDIA's Model Optimizer in a &lt;strong&gt;W4A16 configuration&lt;/strong&gt;: weights stored at 4-bit precision, activations computed at 16-bit. This is distinct from naive INT4 quantization — NVFP4 uses a floating-point 4-bit format (FP4) that better preserves the weight distribution tails critical for instruction following.&lt;/p&gt;

&lt;p&gt;The results are striking: NVFP4 delivers &lt;strong&gt;1.41× the total token throughput of FP8&lt;/strong&gt; and &lt;strong&gt;1.74× that of BF16&lt;/strong&gt; on DGX Spark hardware, while sitting only ~2 OSWorld benchmark points below BF16. The agent step time drops from 6.8s (FP8) to 3.3s (NVFP4) — measured end-to-end including screenshot processing and action execution. This is the format to use if you have NVIDIA Blackwell or Ada architecture hardware and care about throughput.&lt;/p&gt;
&lt;h3&gt;
  
  
  Q4 GGUF (4-bit GGUF for CPU/Consumer GPU)
&lt;/h3&gt;

&lt;p&gt;GGUF (the successor to GGML) is the quantization format powering &lt;code&gt;llama.cpp&lt;/code&gt; and its derivatives (&lt;code&gt;ik_llama.cpp&lt;/code&gt;, &lt;code&gt;ollama&lt;/code&gt;). Q4 quantization uses 4 bits per weight with block-wise quantization scales, making it compatible with CPU inference, Apple Silicon, and modest consumer GPUs.&lt;/p&gt;

&lt;p&gt;The GGUF checkpoints for Holo3.1 are aimed at &lt;strong&gt;consumer hardware deployment&lt;/strong&gt; — running the agent locally on a MacBook Pro, a Windows gaming PC, or a developer laptop. The throughput is lower than FP8/NVFP4 on server hardware, but the total cost of deployment is near-zero beyond the hardware you already own.&lt;/p&gt;
&lt;h3&gt;
  
  
  Which Format Should You Choose?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended Format&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA H100/H200/B100 server&lt;/td&gt;
&lt;td&gt;NVFP4 (W4A16)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA RTX 40/50 series&lt;/td&gt;
&lt;td&gt;FP8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple Silicon (M3/M4 Pro/Max)&lt;/td&gt;
&lt;td&gt;Q4 GGUF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consumer NVIDIA GPU (RTX 3090/4090)&lt;/td&gt;
&lt;td&gt;Q4 GGUF or FP8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (high RAM)&lt;/td&gt;
&lt;td&gt;Q4 GGUF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lowest latency, highest throughput&lt;/td&gt;
&lt;td&gt;NVFP4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  6. Setting Up a Local Computer Use Agent {#setting-up}
&lt;/h2&gt;

&lt;p&gt;Let's get practical. The following setup targets a machine with either an NVIDIA GPU (for GGUF via llama.cpp) or Apple Silicon. We'll use the &lt;code&gt;Holo3.1-9B&lt;/code&gt; model as a good balance of performance and resource requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.11+&lt;/li&gt;
&lt;li&gt;16GB+ RAM (32GB recommended for 9B model)&lt;/li&gt;
&lt;li&gt;NVIDIA GPU with 12GB+ VRAM, or Apple Silicon with 16GB+ unified memory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ffmpeg&lt;/code&gt; for screen capture on Linux
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  huggingface_hub &lt;span class="se"&gt;\&lt;/span&gt;
  llama-cpp-python &lt;span class="se"&gt;\&lt;/span&gt;
  pillow &lt;span class="se"&gt;\&lt;/span&gt;
  pyautogui &lt;span class="se"&gt;\&lt;/span&gt;
  openai &lt;span class="se"&gt;\&lt;/span&gt;
  langgraph &lt;span class="se"&gt;\&lt;/span&gt;
  mss &lt;span class="se"&gt;\&lt;/span&gt;
  anthropic

&lt;span class="c"&gt;# On macOS, also install:&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;ffmpeg

&lt;span class="c"&gt;# On Linux:&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; scrot xdotool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Pulling the Model Weights
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hf_hub_download&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Download Holo3.1-9B Q4_K_M GGUF (best quality/size tradeoff for 9B)
&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hf_hub_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;repo_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hcompany/Holo3.1-9B-GGUF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;holo3.1-9b-q4_k_m.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;local_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;local_dir_use_symlinks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model downloaded to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Expected: ./models/holo3.1-9b-q4_k_m.gguf (~5.5GB)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Loading the Model with llama-cpp-python
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_cpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Llama&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Llama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models/holo3.1-9b-q4_k_m.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# context window
&lt;/span&gt;    &lt;span class="n"&gt;n_gpu_layers&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# offload all layers to GPU (-1 = auto)
&lt;/span&gt;    &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# CPU threads for non-GPU layers
&lt;/span&gt;    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Enable flash attention for efficiency
&lt;/span&gt;    &lt;span class="n"&gt;flash_attn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Use memory-mapped loading to reduce RAM pressure
&lt;/span&gt;    &lt;span class="n"&gt;use_mmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_mlock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;# prevent model from being swapped to disk
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model loaded successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;n_ctx&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Building the Screenshot → Action Loop {#screenshot-action-loop}
&lt;/h2&gt;

&lt;p&gt;The core of any computer use agent is the &lt;strong&gt;perceive-decide-act cycle&lt;/strong&gt;. Here's a full, runnable implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mss.tools&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyautogui&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_cpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Llama&lt;/span&gt;

&lt;span class="c1"&gt;# ── Action execution helpers ──────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute a GUI action returned by the model.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pyautogui&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Clicked at (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;double_click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pyautogui&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;doubleClick&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Double-clicked at (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pyautogui&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;typewrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Typed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;combo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;pyautogui&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hotkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;combo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pressed key combo: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;combo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scroll&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;pyautogui&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scrolled &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Taking screenshot...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# handled in main loop
&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;action_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_COMPLETE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown action type: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ── Screenshot capture ────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;capture_screenshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Capture screen and return as base64-encoded JPEG string.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mss&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;monitor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;monitors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;monitor_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;screenshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sct&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;grab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;frombytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;screenshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bgra&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BGRX&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Resize to 1280x800 to reduce token count (vision models are pixel-hungry)
&lt;/span&gt;        &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resize&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1280&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LANCZOS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# ── Agent system prompt ───────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a computer use agent. You observe screenshots of a computer screen
and output JSON actions to accomplish the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s task.

Always respond with a JSON object in this exact format:
{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;brief explanation of what you see and what action to take&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: {
    &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click|double_click|type|key|scroll|screenshot|done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
&lt;/span&gt;&lt;span class="gp"&gt;    ...&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;specific&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;For click/double_click: include &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (pixel coordinates).
For type: include &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
For key: include &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (e.g., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ctrl+c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;).
For scroll: include &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;), &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; (integer).
For done: include &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; with a summary of what was accomplished.

Be precise with coordinates. When the task is complete, use action type &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="c1"&gt;# ── Main agent loop ───────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_computer_use_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Llama&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_delay&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Run a computer use agent loop until task completion or max_steps.

    Args:
        llm: Loaded Llama model instance
        task: Natural language description of the task to complete
        max_steps: Maximum number of action steps before giving up
        step_delay: Seconds to wait after each action (for UI to settle)

    Returns:
        Final result string from the agent
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;🤖 Starting agent for task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;── Step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; ──&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 1. PERCEIVE: Capture current screen state
&lt;/span&gt;        &lt;span class="n"&gt;screenshot_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;📸 Screenshot captured&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. BUILD PROMPT: Include task, history summary, and current screenshot
&lt;/span&gt;        &lt;span class="n"&gt;history_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;last_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;  &lt;span class="c1"&gt;# keep last 3 for context without blowing budget
&lt;/span&gt;            &lt;span class="n"&gt;history_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Step &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action_type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;last_3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;history_summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Recent actions:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;history_summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;history_summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Current screenshot attached. What action should I take next?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. DECIDE: Call the local model with vision input
&lt;/span&gt;        &lt;span class="c1"&gt;# Note: llama-cpp-python supports multimodal via llava-style image embedding
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;screenshot_b64&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                            &lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# low temp for deterministic action selection
&lt;/span&gt;            &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;raw_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🧠 Model output: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 4. PARSE: Extract action from JSON response
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;reasoning&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  JSON parse error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Retrying step.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💭 Reasoning: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚡ Action: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 5. RECORD: Add to history
&lt;/span&gt;        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reasoning&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# 6. ACT: Execute the action
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TASK_COMPLETE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task completed successfully.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;�� Task complete! Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_result&lt;/span&gt;

        &lt;span class="c1"&gt;# Wait for UI to settle after action
&lt;/span&gt;        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max steps (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) reached without task completion.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ── Entry point ───────────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Load model
&lt;/span&gt;    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Llama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models/holo3.1-9b-q4_k_m.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;n_gpu_layers&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;flash_attn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_mmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;use_mlock&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run a task
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_computer_use_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Open Firefox, navigate to github.com, and search for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;holo3.1&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;step_delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  8. Integrating with Agent Frameworks {#agent-frameworks}
&lt;/h2&gt;

&lt;p&gt;For production use, you'll want to integrate your computer use agent into a broader orchestration framework. Holo3.1's support for the &lt;strong&gt;OpenAI-compatible function-calling protocol&lt;/strong&gt; makes this straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj4qfwtgysq2osesi8x3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftj4qfwtgysq2osesi8x3.png" alt="Multi-Agent Orchestration Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A hierarchical multi-agent system: an orchestrator delegates GUI tasks to a specialist CUA, which operates tools on its behalf.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  LangGraph Integration
&lt;/h3&gt;

&lt;p&gt;LangGraph lets you build stateful multi-agent workflows as directed graphs. Here's how to wire a Holo3.1-powered computer use node into a LangGraph workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;

&lt;span class="c1"&gt;# ── State definition ──────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;step_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;final_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="c1"&gt;# ── Tool definitions for function-calling protocol ────────────────────────────
&lt;/span&gt;
&lt;span class="n"&gt;COMPUTER_USE_TOOLS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Click at the specified pixel coordinates on screen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X pixel coordinate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Y pixel coordinate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Type text into the currently focused element&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Text to type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;press_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Press a keyboard key or key combination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key combo e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;enter&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ctrl+c&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Signal that the task has been completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summary of what was accomplished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# ── CUA node ─────────────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;computer_use_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LangGraph node that runs one step of the computer use agent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;  &lt;span class="c1"&gt;# Using OpenAI-compatible local server
&lt;/span&gt;
    &lt;span class="c1"&gt;# Point to local llama.cpp server (start with: llama-server -m model.gguf --port 8080)
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# local server doesn't require auth
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;screenshot_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;capture_screenshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;screenshot_b64&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Use function-calling protocol (new in Holo3.1)
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;holo3.1-9b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# model name as configured in llama-server
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COMPUTER_USE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute first tool call
&lt;/span&gt;    &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;fn_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;fn_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;fn_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fn_args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute the action
&lt;/span&gt;    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fn_name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;press_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;fn_args&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;execute_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Router: continue agent loop or end.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;continue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="c1"&gt;# ── Build the graph ───────────────────────────────────────────────────────────
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_computer_use_graph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computer_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;computer_use_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computer_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computer_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;continue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;computer_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="c1"&gt;# Run it
&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_computer_use_graph&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;final_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find the latest Python release on python.org and copy its version number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;step_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;final_result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. Benchmarking Your Agent: OSWorld &amp;amp; AndroidWorld {#benchmarking}
&lt;/h2&gt;

&lt;p&gt;Before shipping a CUA to production, you need a rigorous evaluation harness. The two canonical benchmarks for computer use agents are &lt;strong&gt;OSWorld&lt;/strong&gt; (desktop tasks) and &lt;strong&gt;AndroidWorld&lt;/strong&gt; (mobile tasks).&lt;/p&gt;

&lt;h3&gt;
  
  
  OSWorld
&lt;/h3&gt;

&lt;p&gt;OSWorld provides 369 computer tasks across real desktop applications (LibreOffice, GIMP, VS Code, Chrome, etc.) and scores agents on &lt;strong&gt;task success rate&lt;/strong&gt; — did the agent complete the task as specified, verified by automated state checking?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Minimal OSWorld evaluation harness
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_osworld_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Run a single OSWorld task and return pass/fail with metadata.

    task_config example:
    {
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;libreoffice_writer_001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instruction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bold the first sentence of the document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;libreoffice_writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;check_bold_first_sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
    }
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;instruction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Launch the task environment (OSWorld uses VMs/containers per task)
&lt;/span&gt;    &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;osworld-env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# let UI settle
&lt;/span&gt;
    &lt;span class="c1"&gt;# Run agent
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Run verifier
&lt;/span&gt;    &lt;span class="n"&gt;verifier_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;osworld-verify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;verifier_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verifier_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;verifier_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run full evaluation suite and compute aggregate metrics.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;task_configs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Running task: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;task_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_osworld_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;✅ PASS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;❌ FAIL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Compute metrics
&lt;/span&gt;    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;success_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;passed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success_rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;success_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Overall success rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;success_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;passed&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For reference: Holo3.1-35B-A3B scores &lt;strong&gt;~72%&lt;/strong&gt; on OSWorld in the function-calling harness configuration, up from ~60% for Holo3 in the same setup.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Production Deployment Patterns {#production-deployment}
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Local Single-Machine (Developer / Small Team)
&lt;/h3&gt;

&lt;p&gt;Run &lt;code&gt;llama-server&lt;/code&gt; as a persistent background process, treat it as a local OpenAI-compatible endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start llama.cpp server with Holo3.1-9B&lt;/span&gt;
llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; ./models/holo3.1-9b-q4_k_m.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--flash-attn&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parallel&lt;/span&gt; 4 &lt;span class="se"&gt;\ &lt;/span&gt;       &lt;span class="c"&gt;# handle 4 concurrent agent requests&lt;/span&gt;
  &lt;span class="nt"&gt;--mlock&lt;/span&gt;               &lt;span class="c"&gt;# lock model in RAM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in your agent code, point &lt;code&gt;openai.base_url&lt;/code&gt; to &lt;code&gt;http://localhost:8080/v1&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: vLLM for High-Throughput Teams
&lt;/h3&gt;

&lt;p&gt;For teams running multiple parallel agents, vLLM with NVFP4 quantization is the gold standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# vLLM with Holo3.1-35B-A3B using NVFP4 quantization&lt;/span&gt;
vllm serve Hcompany/Holo3.1-35B-A3B-NVFP4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--quantization&lt;/span&gt; nvfp4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 32768 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tensor-parallel-size&lt;/span&gt; 2 &lt;span class="se"&gt;\ &lt;/span&gt; &lt;span class="c"&gt;# for multi-GPU setup&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.90 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-prefix-caching&lt;/span&gt;    &lt;span class="c"&gt;# helps for repetitive system prompts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration, benchmarked on DGX Spark, delivers the &lt;strong&gt;3.3 second average step time&lt;/strong&gt; — down from 6.8s with FP8.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Multi-Agent Orchestration with Tiered Models
&lt;/h3&gt;

&lt;p&gt;A highly effective pattern emerging from the community: use a large orchestrator model to plan and evaluate, while smaller specialist models handle execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Orchestrator delegates GUI tasks to CUA specialist
&lt;/span&gt;&lt;span class="n"&gt;orchestrator_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# cloud or local large model
&lt;/span&gt;&lt;span class="n"&gt;cua_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# local Holo3.1
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;orchestrated_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;high_level_goal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Orchestrator decomposes goal into subtasks
&lt;/span&gt;    &lt;span class="n"&gt;decomp_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orchestrator_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decompose this into ordered GUI subtasks (JSON list): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;high_level_goal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;subtasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decomp_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subtasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;subtasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 2: CUA specialist executes each subtask locally
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_computer_use_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cua_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Orchestrator evaluates result and decides whether to continue
&lt;/span&gt;        &lt;span class="n"&gt;eval_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orchestrator_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Subtask: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Agent result: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Did it succeed? Reply YES or NO.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ Subtask failed, retrying: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_computer_use_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cua_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subtask&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern keeps GUI execution on fast, cheap local models (30–40% of tokens) while using cloud models only for high-level reasoning — a cost structure that experienced practitioners report cuts billing by 60–70% versus all-cloud approaches. &lt;em&gt;(verify this stat before publishing)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  11. The Road Ahead {#road-ahead}
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;H2 2026 and beyond&lt;/strong&gt; looks like this for local computer use agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Universal Agents Across Every Surface.&lt;/strong&gt; Holo3.1's mobile gains (AndroidWorld 79.3%) preview a world where a single model handles web, desktop, and mobile automation without environment-specific fine-tuning. Expect iOS automation support to follow as Apple's accessibility APIs open up to agentic runtimes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sub-Second Step Times.&lt;/strong&gt; The current 3.3s step time on DGX Spark with NVFP4 is remarkable but still perceptible. As Blackwell architecture hardware proliferates and FP4 tensor core utilization improves, sub-second per-step latency on consumer hardware is realistic by Q4 2026. That changes the UX equation entirely — agents start to feel like interactive tools rather than background processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Device Mobile Agents.&lt;/strong&gt; Holo3.1's 0.8B and 4B models hint at the real endpoint: agents running directly on smartphones, with no network dependency. At 4B parameters in Q4 GGUF on an iPhone 17 Pro's 8GB RAM, real-time on-device mobile automation becomes plausible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardized Action Spaces.&lt;/strong&gt; The shift from bespoke JSON schemas to OpenAI-compatible function-calling in Holo3.1 signals an industry standardization drive. Expect a common &lt;code&gt;computer_use_v1&lt;/code&gt; tool protocol — analogous to how &lt;code&gt;openai.tools&lt;/code&gt; standardized LLM tool use — to emerge from the major players within the next two quarters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Privacy-First Enterprise Adoption.&lt;/strong&gt; The compliance dam is about to break. As local inference quality reaches parity with cloud models (Holo3.1 is nearly there), enterprise legal and security teams — who have been blocking cloud-based CUA deployments for 18 months — will green-light local deployments. The opportunity for the developer ecosystem is enormous.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Conclusion {#conclusion}
&lt;/h2&gt;

&lt;p&gt;The combination of Holo3.1's quantized model family, mature inference runtimes like llama.cpp and vLLM, and the function-calling protocol standardization has removed the last major barrier to &lt;strong&gt;production-grade local computer use agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We've covered a lot of ground here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The architectural underpinnings of Holo3.1 and its MoE design&lt;/li&gt;
&lt;li&gt;The precise tradeoffs between FP8, NVFP4, and Q4 GGUF quantization formats&lt;/li&gt;
&lt;li&gt;A complete Python implementation of the perceive-decide-act loop&lt;/li&gt;
&lt;li&gt;LangGraph integration using the new function-calling protocol&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration patterns that mix local and cloud models for cost efficiency&lt;/li&gt;
&lt;li&gt;OSWorld/AndroidWorld evaluation methodology to benchmark your own implementations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The technology is ready. The models are available today on HuggingFace. The inference runtimes are mature and well-documented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your move:&lt;/strong&gt; clone the code above, pull &lt;code&gt;Holo3.1-9B-GGUF&lt;/code&gt;, and spin up your first local computer use agent this weekend. The era of privacy-preserving, on-device GUI automation has arrived — and it's faster, cheaper, and more capable than the cloud-only approach that preceded it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Star the &lt;a href="https://huggingface.co/collections/Hcompany/holo31" rel="noopener noreferrer"&gt;Holo3.1 HuggingFace collection&lt;/a&gt; and join the conversation on the HuggingFace Discord. If you build something cool with this, tag me — I'd love to feature it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;ai-agents&lt;/code&gt; &lt;code&gt;local-llm&lt;/code&gt; &lt;code&gt;computer-use&lt;/code&gt; &lt;code&gt;quantization&lt;/code&gt; &lt;code&gt;llama-cpp&lt;/code&gt; &lt;code&gt;holo3&lt;/code&gt; &lt;code&gt;generative-ai&lt;/code&gt; &lt;code&gt;python&lt;/code&gt; &lt;code&gt;langgraph&lt;/code&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Persistent Agent Memory with Azure AI Foundry: A Complete Developer Guide</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:20:45 +0000</pubDate>
      <link>https://dev.to/monuminu/persistent-agent-memory-with-azure-ai-foundry-a-complete-developer-guide-2n14</link>
      <guid>https://dev.to/monuminu/persistent-agent-memory-with-azure-ai-foundry-a-complete-developer-guide-2n14</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Learn how to build AI agents with persistent memory using Azure AI Foundry Memory Service. A complete developer guide covering concepts, memory types, scope, provisioning, and a full Python implementation with the Foundry Hosted Agent Framework.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Persistent Agent Memory with Azure AI Foundry: A Complete Developer Guide
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;What Is Azure AI Foundry Memory?&lt;/li&gt;
&lt;li&gt;Memory Types Deep Dive&lt;/li&gt;
&lt;li&gt;Memory Architecture: How It Really Works&lt;/li&gt;
&lt;li&gt;Access Patterns: Tool vs. Low-Level API&lt;/li&gt;
&lt;li&gt;Understanding Scope&lt;/li&gt;
&lt;li&gt;Hands-On: Provisioning a Memory Store&lt;/li&gt;
&lt;li&gt;Hands-On: Building the Foundry Hosted Memory Agent&lt;/li&gt;
&lt;li&gt;Running &amp;amp; Deploying the Agent&lt;/li&gt;
&lt;li&gt;Security Best Practices&lt;/li&gt;
&lt;li&gt;Quotas, Limits &amp;amp; Regional Availability&lt;/li&gt;
&lt;li&gt;Conclusion + Next Steps&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine you're a developer who has just shipped a polished AI assistant for your SaaS product. Users log in, ask questions, and get sharp, helpful responses. The launch goes well. Then the complaints start rolling in.&lt;/p&gt;

&lt;p&gt;"Why does it keep asking me for my name every single session?"&lt;br&gt;
"I told it last week that I'm vegetarian — why is it recommending steak again?"&lt;br&gt;
"It feels like talking to someone with amnesia."&lt;/p&gt;

&lt;p&gt;This is the stateless agent problem, and it is one of the most frustrating gaps between the promise of conversational AI and the lived reality of production deployments. Every conversation starts from a blank slate. The agent has no idea who it is talking to, what that person prefers, or what was discussed yesterday, last week, or a month ago. The result is a user experience that feels hollow and repetitive — the opposite of the intelligent, personalized assistant your users were promised.&lt;/p&gt;

&lt;p&gt;The solution is persistent memory, and &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; is Microsoft's production-grade answer to this exact problem. Introduced as part of the Azure AI Foundry platform, the Memory Service gives agents the ability to remember facts across sessions, distill long conversation histories into concise summaries, and retrieve the right context at the right moment — all without you having to build and maintain a custom memory system from scratch.&lt;/p&gt;

&lt;p&gt;In this guide, we'll go deep. You'll learn exactly what Azure AI Foundry Memory is, how its three-phase pipeline works under the hood, how to choose the right memory type for your use case, how to provision a Memory Store, and how to build a fully functional hosted memory agent using the Foundry Agent Framework in Python. By the end, you'll have everything you need to ship an AI assistant that actually remembers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms387yy0j28gtwyinu0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fms387yy0j28gtwyinu0v.png" alt="Without Memory vs With Azure AI Foundry Memory — side by side comparison" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What Is Azure AI Foundry Memory?
&lt;/h2&gt;

&lt;p&gt;At its core, &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; is a managed service within the Azure AI Foundry platform that provides AI agents with long-term, persistent memory across conversational sessions. Rather than relying solely on the context window of an LLM — which is inherently short-lived and discarded at the end of a session — the Memory Service stores extracted facts and conversation summaries in a durable Memory Store that persists indefinitely and can be retrieved semantically on future turns.&lt;/p&gt;

&lt;p&gt;The distinction between short-term and long-term memory is fundamental here. Short-term memory is what the model currently "sees" in its context window: the ongoing conversation, the system prompt, any retrieved documents. This is what most production agents rely on exclusively. Long-term memory, by contrast, is what persists beyond the conversation — the facts, preferences, and summaries that accumulate over time and make an agent feel genuinely knowledgeable about its users.&lt;/p&gt;

&lt;p&gt;Azure AI Foundry Memory provides the long-term layer. It doesn't replace the context window; it feeds it. On each new session, the Memory Service surfaces relevant stored information and injects it into the model's context, seamlessly bridging what was learned in past conversations with what the agent needs to know right now.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Three-Phase Memory Pipeline
&lt;/h3&gt;

&lt;p&gt;The entire lifecycle of a memory — from raw conversation to retrieved context — flows through three distinct phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwbjp9t2bp730rm2z4cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwbjp9t2bp730rm2z4cg.png" alt="Azure AI Foundry Memory — Three-Phase Pipeline: Extraction, Consolidation, Retrieval" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Extraction.&lt;/strong&gt; As the conversation progresses, the Memory Service analyzes the dialogue and identifies facts worth remembering. This might be a user's name, a dietary restriction they mentioned in passing, a product preference, or any other detail that could be useful in a future session. The extraction is LLM-powered, meaning it understands context and nuance rather than performing naive keyword matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Consolidation.&lt;/strong&gt; Raw extracted facts are rarely clean. A user might say "I'm vegetarian" in one session and "I don't eat meat or fish" in another. Consolidation is the process of merging new extractions with existing memories, deduplicating redundant entries, and resolving conflicts. If a user previously said they live in Seattle and now mentions moving to Austin, consolidation ensures the old fact is replaced rather than both being stored as true simultaneously. This phase is what separates a memory system that grows intelligently over time from one that simply accumulates noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Retrieval.&lt;/strong&gt; When a new session begins (or mid-session when relevant), the Memory Service performs a semantic search against the stored memories for that user's scope. It returns the most contextually relevant facts and injects them into the agent's system context. This means the model doesn't receive a dump of every stored memory — it receives precisely the subset that matters for the current conversation.&lt;/p&gt;

&lt;p&gt;This pipeline runs largely automatically when you use the Memory Search Tool access pattern, making it straightforward to add persistent memory to an existing agent with minimal code changes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Memory Types Deep Dive
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry Memory ships with two distinct memory types, each designed for a different category of information. Choosing the right type — or combining both — is key to building a memory system that is both effective and well-governed.&lt;/p&gt;
&lt;h3&gt;
  
  
  User Profile Memory
&lt;/h3&gt;

&lt;p&gt;User Profile Memory is designed for static, durable facts about a person: their name, dietary restrictions, language preferences, accessibility needs, known product configurations, and similar attributes that are unlikely to change frequently and are broadly applicable across any conversation topic.&lt;/p&gt;

&lt;p&gt;The defining characteristic of User Profile Memory is that it is fetched &lt;strong&gt;once, at session start&lt;/strong&gt;. Rather than performing a semantic search every time a potentially relevant memory might exist, the service retrieves the entire user profile upfront and injects it into the system context as a stable foundation. This is efficient and appropriate because user profile facts are almost always relevant — an agent that knows a user is allergic to gluten should factor that in regardless of what the conversation is about.&lt;/p&gt;

&lt;p&gt;You configure User Profile Memory using the &lt;code&gt;user_profile_details&lt;/code&gt; parameter when creating your Memory Store. This is a natural-language instruction to the extraction model describing what kinds of facts should be captured and, importantly, what should be &lt;strong&gt;excluded&lt;/strong&gt;. In the provisioning script we'll walk through later, you'll see this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_profile_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid irrelevant or sensitive data, such as age, financials, precise location, and credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This governance instruction is critical. Without explicit guidance, the extraction model might capture personally sensitive information you neither want nor are permitted to store. Always define clear exclusion criteria in &lt;code&gt;user_profile_details&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chat Summary Memory
&lt;/h3&gt;

&lt;p&gt;Chat Summary Memory takes a different approach. Instead of extracting discrete atomic facts, it produces distilled, per-topic summaries of conversation content. Think of it as an intelligent compression layer: instead of storing a 40-message exchange verbatim, Chat Summary Memory produces a compact narrative that captures the essential arc — what was discussed, what decisions were made, what questions remain open.&lt;/p&gt;

&lt;p&gt;Unlike User Profile Memory, Chat Summary Memory is retrieved &lt;strong&gt;contextually&lt;/strong&gt; via semantic search. Not every conversation summary is relevant to every future session — a summary of a technical support conversation is probably not relevant when the user is asking about billing. The semantic retrieval layer ensures that only topically relevant summaries surface, keeping the context window focused and the model's reasoning clean.&lt;/p&gt;

&lt;p&gt;You enable Chat Summary Memory by setting &lt;code&gt;chat_summary_enabled=True&lt;/code&gt; in your &lt;code&gt;MemoryStoreDefaultOptions&lt;/code&gt;. In the sample we're working with, it's disabled (&lt;code&gt;chat_summary_enabled=False&lt;/code&gt;) to keep the demo focused on user profile memory, but for production agents handling diverse conversation topics, enabling both types together is usually the right call.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Type&lt;/th&gt;
&lt;th&gt;What It Stores&lt;/th&gt;
&lt;th&gt;When Retrieved&lt;/th&gt;
&lt;th&gt;Config Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User Profile Memory&lt;/td&gt;
&lt;td&gt;Static facts: name, preferences, restrictions&lt;/td&gt;
&lt;td&gt;Once at session start&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;user_profile_details&lt;/code&gt; (string instruction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat Summary Memory&lt;/td&gt;
&lt;td&gt;Per-topic conversation summaries&lt;/td&gt;
&lt;td&gt;Contextually via semantic search&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chat_summary_enabled=True&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two memory types are complementary. User Profile Memory gives the agent a stable, always-available picture of who it's talking to. Chat Summary Memory gives it the ability to recall the arc of past conversations in relevant contexts. Together, they provide a comprehensive long-term memory foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Architecture: How It Really Works
&lt;/h2&gt;

&lt;p&gt;Understanding the architectural components of Azure AI Foundry Memory helps you reason about isolation, scalability, and the operational characteristics of the system you're building.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Memory Store
&lt;/h3&gt;

&lt;p&gt;The Memory Store is the top-level resource — the durable storage container that holds all memories. You create one Memory Store per application or use case. The store is associated with your Azure AI Foundry project and requires two model deployments to function: a &lt;strong&gt;chat model&lt;/strong&gt; (used for extraction and consolidation, e.g., &lt;code&gt;gpt-4.1-mini&lt;/code&gt;) and an &lt;strong&gt;embedding model&lt;/strong&gt; (used for semantic retrieval, e.g., &lt;code&gt;text-embedding-3-small&lt;/code&gt;). Both must be available within your Foundry project.&lt;/p&gt;

&lt;p&gt;The Memory Store is not a simple database. It is an intelligent service layer that orchestrates LLM-powered extraction, runs consolidation logic, maintains vector embeddings for semantic search, and handles scope-based isolation — all managed for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scopes and Isolation
&lt;/h3&gt;

&lt;p&gt;Within a single Memory Store, memories are organized by &lt;strong&gt;scope&lt;/strong&gt;. A scope is a logical namespace that isolates one set of memories from another. In almost all multi-user scenarios, each user gets their own scope, ensuring that one user's memories are completely invisible to another user's sessions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmj33o61jxnbuzxnz19m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmj33o61jxnbuzxnz19m.png" alt="Azure AI Foundry Memory Store — Scope Isolation per User" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scope is a string identifier, and the platform supports dynamic resolution of this string at runtime (more on this in the Understanding Scope section). Within a scope, the Memory Store maintains all extracted memories and summaries for that logical entity. You can have up to &lt;strong&gt;100 scopes per store&lt;/strong&gt; and up to &lt;strong&gt;10,000 memories per scope&lt;/strong&gt;, giving you room to scale across a substantial user base within a single store instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-Powered Consolidation
&lt;/h3&gt;

&lt;p&gt;The consolidation phase deserves special attention because it is what distinguishes Azure AI Foundry Memory from a naive "append facts to a database" approach. After each conversation turn (or at session end, depending on configuration), the extraction model identifies new facts from the recent exchange and the consolidation model compares them against existing stored memories for that scope.&lt;/p&gt;

&lt;p&gt;Consolidation performs three operations: &lt;strong&gt;merge&lt;/strong&gt; (combining related facts into a single, richer record), &lt;strong&gt;deduplication&lt;/strong&gt; (discarding new facts that are already represented), and &lt;strong&gt;conflict resolution&lt;/strong&gt; (updating or replacing facts that contradict existing memories). This is what allows the memory store to remain clean and authoritative over time, rather than becoming an ever-growing pile of contradictory statements.&lt;/p&gt;

&lt;p&gt;The billing model reflects this LLM-powered approach: you are billed based on underlying model usage (chat model tokens for extraction/consolidation, embedding model calls for retrieval), not on a per-memory or per-store flat fee. Keep this in mind when designing high-volume deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Access Patterns: Tool vs. Low-Level API
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry Memory exposes two distinct integration patterns, and choosing the right one shapes how much control you have versus how much complexity you manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Search Tool
&lt;/h3&gt;

&lt;p&gt;The Memory Search Tool is the high-level, agent-native access pattern. You attach it to your agent as a tool, and the framework handles the read/write lifecycle automatically. At the start of each session, user profile memories are injected into the system context. During the session, the tool can perform contextual semantic searches when the agent's reasoning determines that retrieving additional memories is warranted. At the end of the turn, newly extracted facts are written back to the store.&lt;/p&gt;

&lt;p&gt;This is the pattern used in the Foundry Agent Framework sample we'll build in this guide, implemented via &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;. It is the right choice for the vast majority of production agents because it requires minimal code, works consistently across session types, and handles the extraction/consolidation lifecycle without custom orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Low-Level Memory Store APIs
&lt;/h3&gt;

&lt;p&gt;For scenarios requiring precise control — custom extraction logic, selective memory writes, cross-scope queries, or integration with non-agent systems — the Memory Store APIs provide direct programmatic access. You can call &lt;code&gt;memory_stores.search()&lt;/code&gt; to perform arbitrary semantic searches, write memories explicitly, delete specific records, and enumerate the contents of a scope.&lt;/p&gt;

&lt;p&gt;The low-level APIs are accessed via the &lt;code&gt;project.beta.memory_stores&lt;/code&gt; client in the Azure AI Projects SDK (the &lt;code&gt;allow_preview=True&lt;/code&gt; flag is required, as the API is currently in public preview). These APIs give you the full power of the memory service but require you to implement your own retrieval and injection logic.&lt;/p&gt;

&lt;p&gt;For most teams shipping their first memory-enabled agent, starting with the Memory Search Tool is strongly recommended. Reach for the low-level APIs when you have a specific requirement that the tool abstraction cannot satisfy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding Scope
&lt;/h2&gt;

&lt;p&gt;Scope is one of the most important concepts in Azure AI Foundry Memory, and it is worth spending time to understand how it resolves at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;{{$userId}}&lt;/code&gt; Template
&lt;/h3&gt;

&lt;p&gt;In the agent code, you'll specify the scope as the string literal &lt;code&gt;"{{$userId}}"&lt;/code&gt;. This is a &lt;strong&gt;template placeholder&lt;/strong&gt; that the Foundry hosting infrastructure replaces with the actual, authenticated user identity at runtime. You never need to hard-code or dynamically construct the scope string in your application code; the platform resolves it for you.&lt;/p&gt;

&lt;p&gt;The resolution logic uses two potential sources, in order of precedence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;x-memory-user-id&lt;/code&gt; header:&lt;/strong&gt; If the HTTP request to the hosted agent includes this header, its value is used as the user identity for scope resolution. This is useful when you have your own user identity system and want to pass a stable internal user ID (e.g., a database primary key or a GUID from your own identity provider) to the memory service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entra ID TID+OID fallback:&lt;/strong&gt; If no &lt;code&gt;x-memory-user-id&lt;/code&gt; header is present, the platform falls back to the combination of the authenticated user's Azure Entra tenant ID and object ID. This works automatically in scenarios where your users authenticate via Entra, ensuring that each Entra user naturally gets their own memory scope without any additional instrumentation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Static Scopes
&lt;/h3&gt;

&lt;p&gt;For non-user-specific memory — shared knowledge that applies to all users of an agent, or a single-user personal assistant — you can use a static string as the scope instead of the &lt;code&gt;{{$userId}}&lt;/code&gt; template. For example, &lt;code&gt;scope="shared"&lt;/code&gt; or &lt;code&gt;scope="assistant-owner"&lt;/code&gt;. Static scopes are a good fit for personal productivity agents where a single person owns the deployment, or for storing organization-wide facts that should be available to all users.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Requirements
&lt;/h3&gt;

&lt;p&gt;Accessing the Memory Store requires the &lt;strong&gt;Azure AI User&lt;/strong&gt; role assigned on your Foundry project scope in Azure RBAC. This applies to both the agent's runtime identity (typically a managed identity or a service principal) and to developers running scripts locally (typically via &lt;code&gt;az login&lt;/code&gt; and &lt;code&gt;DefaultAzureCredential&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdau4plr6tahfikcq70x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdau4plr6tahfikcq70x.png" alt="Azure Portal — Azure AI User RBAC Role Assignment for Foundry Project" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Make sure this role assignment is in place before running either the provisioning script or the agent itself; you'll encounter authorization errors at both the &lt;code&gt;memory_stores.get()&lt;/code&gt; and &lt;code&gt;memory_stores.create()&lt;/code&gt; calls otherwise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: Provisioning a Memory Store
&lt;/h2&gt;

&lt;p&gt;Before your agent can use persistent memory, you need to create a Memory Store within your Foundry project. The following script, &lt;code&gt;provision_memory_store.py&lt;/code&gt;, handles idempotent provisioning — it creates the store if it doesn't exist, and safely no-ops if it already does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;Start by setting the required environment variables. These values come from your Azure AI Foundry project settings and your deployed model names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;account&amp;gt;.services.ai.azure.com/api/projects/&amp;lt;project&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"agent_framework_memory"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint follows the standard Foundry project endpoint format. The model deployment names must correspond to models that are actually deployed within your project — the memory service will call these models directly for extraction, consolidation, and embedding generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Provisioning Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Copyright (c) Microsoft. All rights reserved.
&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Provision the Azure AI Foundry Memory Store used by this sample.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MemoryStoreDefaultDefinition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MemoryStoreDefaultOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;memory_store_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;chat_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_preview&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; already exists (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;); leaving as-is.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;definition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStoreDefaultDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chat_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MemoryStoreDefaultOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;chat_summary_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;user_profile_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;user_profile_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid irrelevant or sensitive data, such as age, financials, precise location, and credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store for the Agent Framework foundry-hosted memory sample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;definition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;definition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; was not found after creation; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the service may not have persisted it.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verified memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is available on the service (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;A few implementation details are worth calling out explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allow_preview=True&lt;/code&gt;:&lt;/strong&gt; This flag on the &lt;code&gt;AIProjectClient&lt;/code&gt; constructor is required to access the &lt;code&gt;project.beta&lt;/code&gt; namespace, which hosts the memory store APIs. Because the Memory Service is currently in public preview, the beta client is where the APIs live. This flag is a deliberate opt-in to acknowledge that the API surface may evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency via &lt;code&gt;ResourceNotFoundError&lt;/code&gt;:&lt;/strong&gt; The script first attempts a &lt;code&gt;GET&lt;/code&gt; on the named store. If it succeeds, the store already exists and the script exits cleanly. Only if a &lt;code&gt;ResourceNotFoundError&lt;/code&gt; is raised does the script proceed to create. This makes the provisioning script safe to re-run in CI/CD pipelines or developer setup scripts without risk of accidentally creating duplicate stores or raising errors on subsequent runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-creation verification:&lt;/strong&gt; After &lt;code&gt;create()&lt;/code&gt; returns, the script performs a second &lt;code&gt;GET&lt;/code&gt; to confirm the store is actually visible on the service. This guards against an edge case where the creation call returns success but the resource isn't yet available — a realistic concern with eventually consistent distributed services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;MemoryStoreDefaultOptions&lt;/code&gt;:&lt;/strong&gt; This is where you configure which memory types are active. Here, &lt;code&gt;user_profile_enabled=True&lt;/code&gt; activates profile memory with the specified exclusion instruction, while &lt;code&gt;chat_summary_enabled=False&lt;/code&gt; keeps summaries off for this sample. Adjust these settings based on your application's memory requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;DefaultAzureCredential&lt;/code&gt;:&lt;/strong&gt; Authentication uses the standard Azure credential chain, which automatically picks up &lt;code&gt;az login&lt;/code&gt; tokens during local development, managed identity in Azure-hosted environments, and environment-variable credentials in CI. You don't need to manage credentials explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: Building the Foundry Hosted Memory Agent
&lt;/h2&gt;

&lt;p&gt;With the Memory Store provisioned, it's time to build the agent. The following &lt;code&gt;main.py&lt;/code&gt; implements a fully functional hosted memory agent using the Foundry Agent Framework's &lt;code&gt;FoundryChatClient&lt;/code&gt;, &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;, and &lt;code&gt;ResponsesHostServer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhoi2pv8uw76xqbeaex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhoi2pv8uw76xqbeaex.png" alt="FoundryMemoryProvider Context Provider Lifecycle Flow per Agent Turn" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Copyright (c) Microsoft. All rights reserved.
&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Foundry Memory hosted agent sample.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.foundry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FoundryMemoryProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework_foundry_hosting&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ResponsesHostServer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolved_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;allow_preview&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;memory_store_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_resolved_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context_providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME is not set; memory will not be available to the agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryMemoryProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{$userId}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that remembers facts the user has shared &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;across conversations. Relevant memories from previous interactions are &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;automatically provided to you in the system context. Use them when &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answering, and acknowledge when you are relying on remembered facts.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResponsesHostServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;FoundryChatClient&lt;/code&gt;:&lt;/strong&gt; This is the Foundry-native chat client from the Agent Framework. It wraps the &lt;code&gt;AIProjectClient&lt;/code&gt; and handles authentication, model routing, and API versioning. Crucially, &lt;code&gt;client.project_client&lt;/code&gt; exposes the underlying &lt;code&gt;AIProjectClient&lt;/code&gt; instance, which is then reused by the &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;. This single-auth-context pattern means your agent uses one credential and one client for both chat completions and memory operations — no separate credential configuration required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;_resolved_env&lt;/code&gt;:&lt;/strong&gt; This helper function detects unresolved template placeholders in environment variables — strings that look like &lt;code&gt;${VARIABLE}&lt;/code&gt; or &lt;code&gt;{{VARIABLE}}&lt;/code&gt; — and returns an empty string in those cases. This handles the scenario where a deployment template has been applied but the &lt;code&gt;MEMORY_STORE_NAME&lt;/code&gt; variable wasn't injected correctly, preventing the agent from crashing at startup with a cryptic error. Instead, it degrades gracefully by skipping memory initialization and logging a warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;FoundryMemoryProvider&lt;/code&gt;:&lt;/strong&gt; This is the core memory integration component. It takes the shared &lt;code&gt;project_client&lt;/code&gt;, the memory store name, and the &lt;code&gt;scope&lt;/code&gt; template string. When instantiated with &lt;code&gt;scope="{{$userId}}"&lt;/code&gt;, the provider passes this template to the hosting infrastructure, which resolves it to the actual user identity at request time. The &lt;code&gt;FoundryMemoryProvider&lt;/code&gt; acts as a context provider — on each agent turn, it retrieves relevant memories and injects them into the system context before the model processes the user's message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;default_options={"store": False}&lt;/code&gt;:&lt;/strong&gt; This tells the Agent Framework not to maintain its own conversation history store. Conversation history management is delegated to the hosting infrastructure — the &lt;code&gt;ResponsesHostServer&lt;/code&gt; and the Foundry platform — rather than being handled at the application layer. This is the correct approach for agents hosted on Foundry, as the platform manages session state natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;context_providers&lt;/code&gt;:&lt;/strong&gt; The agent accepts a list of context providers, each of which can inject additional context into the system prompt before each model call. By appending the &lt;code&gt;FoundryMemoryProvider&lt;/code&gt; to this list, you wire memory retrieval directly into the agent's reasoning loop. Adding other context providers (e.g., RAG document retrievers, user preference loaders) follows the same pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ResponsesHostServer&lt;/code&gt;:&lt;/strong&gt; This class wraps the agent in a server that implements the OpenAI Responses API protocol, making it compatible with any client that speaks that protocol. It handles the HTTP serving, session routing, and lifecycle management, letting you focus on agent logic rather than networking boilerplate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running &amp;amp; Deploying the Agent
&lt;/h2&gt;

&lt;p&gt;With the code in place, you have three paths to run and deploy your memory agent: the &lt;code&gt;azd&lt;/code&gt; CLI, the VS Code Foundry Toolkit extension, and a direct Python execution for quick local testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using &lt;code&gt;azd&lt;/code&gt; (Recommended)
&lt;/h3&gt;

&lt;p&gt;The Azure Developer CLI with the AI agents extension provides the most streamlined workflow from local development to production deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install azd AI agent extension&lt;/span&gt;
azd ext &lt;span class="nb"&gt;install &lt;/span&gt;azure.ai.agents

&lt;span class="c"&gt;# Initialize agent from manifest&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-memory-agent &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-memory-agent
azd ai agent init &lt;span class="nt"&gt;-m&lt;/span&gt; https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/hosted-agents/agent-framework/responses/13-foundry-memory/agent.manifest.yaml

&lt;span class="c"&gt;# Run locally&lt;/span&gt;
azd ai agent run

&lt;span class="c"&gt;# Invoke locally&lt;/span&gt;
azd ai agent invoke &lt;span class="nt"&gt;--local&lt;/span&gt; &lt;span class="s2"&gt;"Hi, I'm allergic to dairy"&lt;/span&gt;

&lt;span class="c"&gt;# Set memory store name in azd env&lt;/span&gt;
azd &lt;span class="nb"&gt;env set &lt;/span&gt;MEMORY_STORE_NAME &lt;span class="s2"&gt;"agent_framework_memory"&lt;/span&gt;

&lt;span class="c"&gt;# Deploy to Foundry&lt;/span&gt;
azd deploy

&lt;span class="c"&gt;# Invoke deployed agent&lt;/span&gt;
azd ai agent invoke &lt;span class="s2"&gt;"What are my allergies?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;azd ai agent init&lt;/code&gt; command bootstraps your local project from the agent manifest, pulling down the correct &lt;code&gt;main.py&lt;/code&gt;, &lt;code&gt;requirements.txt&lt;/code&gt;, and configuration files. The &lt;code&gt;azd ai agent run&lt;/code&gt; command starts the agent locally, while &lt;code&gt;azd ai agent invoke --local&lt;/code&gt; sends a test message directly. Notice the workflow in the example: first you tell the agent you're allergic to dairy, then — in a separate invocation — you ask what your allergies are. If memory is working correctly, the second invocation should recall the dairy allergy from the first, even though it was a different "session."&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;azd deploy&lt;/code&gt;, your agent runs on Azure Foundry infrastructure with full managed scaling, identity, and observability. The &lt;code&gt;azd ai agent invoke&lt;/code&gt; command (without &lt;code&gt;--local&lt;/code&gt;) sends the request to the deployed endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  VS Code Foundry Toolkit
&lt;/h3&gt;

&lt;p&gt;If you prefer a GUI-driven workflow, the VS Code Foundry Toolkit extension provides a point-and-click interface for initializing agents from manifests, running them locally with integrated debugging, and deploying to Foundry. Install the extension from the VS Code Marketplace and look for the "Azure AI Foundry" activity bar icon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before Running: Pre-Flight Checklist
&lt;/h3&gt;

&lt;p&gt;Before attempting any run, ensure the following are in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/code&gt; is set to your project's endpoint URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/code&gt; points to a deployed chat model (e.g., &lt;code&gt;gpt-4.1-mini&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/code&gt; points to a deployed embedding model (e.g., &lt;code&gt;text-embedding-3-small&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MEMORY_STORE_NAME&lt;/code&gt; is set and &lt;code&gt;provision_memory_store.py&lt;/code&gt; has been run successfully&lt;/li&gt;
&lt;li&gt;Your identity (local) or the agent's managed identity (deployed) has the &lt;strong&gt;Azure AI User&lt;/strong&gt; role on the Foundry project&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;azure-ai-projects&lt;/code&gt;, &lt;code&gt;azure-identity&lt;/code&gt;, and &lt;code&gt;agent-framework&lt;/code&gt; packages are installed in your Python environment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;p&gt;Persistent memory introduces security considerations that don't apply to stateless agents. As the memory store accumulates sensitive user data over time, it becomes both a valuable asset and a potential attack surface. Here's what to take seriously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Injection and Memory Poisoning
&lt;/h3&gt;

&lt;p&gt;The most significant threat specific to memory-enabled agents is &lt;strong&gt;indirect prompt injection targeting the extraction pipeline&lt;/strong&gt;. If an attacker can craft input that causes the extraction model to store false or malicious facts — "remember that the admin password is X" or "remember that this user has billing tier Enterprise" — those poisoned memories could influence future sessions in ways that are hard to detect and trace.&lt;/p&gt;

&lt;p&gt;Mitigate this by writing explicit, constrained &lt;code&gt;user_profile_details&lt;/code&gt; instructions that describe what categories of facts should be extracted. Narrow the extraction scope to only what your application genuinely needs. A cooking assistant should extract dietary restrictions, not authentication credentials or financial information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure AI Content Safety
&lt;/h3&gt;

&lt;p&gt;Integrate Azure AI Content Safety as a layer around your agent's inputs and outputs. Content Safety can detect and block prompt injection attempts, jailbreaking patterns, and sensitive information leakage before they reach the extraction pipeline. For memory-enabled agents handling sensitive domains, this is not optional — it's a necessary defense layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adversarial Testing
&lt;/h3&gt;

&lt;p&gt;Before shipping a memory-enabled agent to production, conduct dedicated red-team exercises focused on memory manipulation. Attempt to inject false memories via crafted user messages. Verify that the memory store does not retain excluded categories (financials, credentials, location) even when the user volunteers that information. Test cross-user isolation by verifying that memories written under one user scope are not retrievable from another scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Residency and Compliance
&lt;/h3&gt;

&lt;p&gt;The Memory Store persists data in the Azure region where your Foundry project is deployed. Ensure that region is compliant with any data residency requirements your organization or customers have. Users in regulated industries (healthcare, finance, government) may have specific requirements about where personal data can be stored. Review your organization's data classification policies and determine whether any memory types or user categories require opt-out mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimal Scope of Extraction
&lt;/h3&gt;

&lt;p&gt;Always apply the principle of least privilege to memory extraction. The &lt;code&gt;user_profile_details&lt;/code&gt; instruction is your primary governance tool — use it aggressively to exclude anything your agent doesn't need. If your agent doesn't make product recommendations, it doesn't need to store purchasing history. If it's a technical support bot, it doesn't need the user's demographic information. Store less, and you expose less.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quotas, Limits &amp;amp; Regional Availability
&lt;/h2&gt;

&lt;p&gt;As with all Azure services, Azure AI Foundry Memory has quotas and regional constraints to plan around. Here is a summary of the key limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max scopes per Memory Store&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max memories per scope&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory search requests per minute&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory update requests per minute&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing basis&lt;/td&gt;
&lt;td&gt;Underlying model usage (tokens + embeddings)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preview status&lt;/td&gt;
&lt;td&gt;Public preview&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The per-scope memory limit of 10,000 is generous for most use cases, but if your agent is designed for very long-lived users who interact frequently, you should implement a memory lifecycle policy — periodically consolidating or pruning older, less relevant memories to stay well within the limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regional Availability
&lt;/h3&gt;

&lt;p&gt;Azure AI Foundry Memory is available in the following regions as of the time of writing. Always check the &lt;a href="https://learn.microsoft.com/azure/ai-foundry/" rel="noopener noreferrer"&gt;official Azure documentation&lt;/a&gt; for the latest regional coverage, as new regions are added regularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Australia East&lt;/li&gt;
&lt;li&gt;Brazil South&lt;/li&gt;
&lt;li&gt;Canada East&lt;/li&gt;
&lt;li&gt;East US 2&lt;/li&gt;
&lt;li&gt;France Central&lt;/li&gt;
&lt;li&gt;Italy North&lt;/li&gt;
&lt;li&gt;Japan East&lt;/li&gt;
&lt;li&gt;Korea Central&lt;/li&gt;
&lt;li&gt;North Central US&lt;/li&gt;
&lt;li&gt;Norway East&lt;/li&gt;
&lt;li&gt;South Africa North&lt;/li&gt;
&lt;li&gt;South India&lt;/li&gt;
&lt;li&gt;Sweden Central&lt;/li&gt;
&lt;li&gt;Switzerland North&lt;/li&gt;
&lt;li&gt;UAE North&lt;/li&gt;
&lt;li&gt;UK South&lt;/li&gt;
&lt;li&gt;West US&lt;/li&gt;
&lt;li&gt;West US 2&lt;/li&gt;
&lt;li&gt;West US 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your target deployment region is not on this list, you may need to deploy your Foundry project to a supported region. For global deployments, consider the latency implications of cross-region memory lookups and whether a region close to your primary user base is available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion + Next Steps
&lt;/h2&gt;

&lt;p&gt;Stateless agents are a frustrating limitation that erodes user trust and undermines the value of conversational AI. &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; solves this with a production-grade, managed memory service that handles extraction, consolidation, and contextual retrieval automatically — letting you focus on building the agent experience rather than the memory infrastructure.&lt;/p&gt;

&lt;p&gt;In this guide, we've covered the complete picture: the three-phase memory pipeline that transforms raw conversation into durable, organized knowledge; the two memory types (User Profile Memory and Chat Summary Memory) and when to use each; the scoping model that enforces per-user isolation; the RBAC requirements; and a full end-to-end Python implementation using &lt;code&gt;FoundryChatClient&lt;/code&gt;, &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;, and &lt;code&gt;ResponsesHostServer&lt;/code&gt; from the Foundry Agent Framework.&lt;/p&gt;

&lt;p&gt;The key things to take away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;provision_memory_store.py&lt;/code&gt; once to create your Memory Store before starting the agent — provisioning is idempotent and safe to re-run.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;scope="{{$userId}}"&lt;/code&gt; and let the platform resolve the actual user identity — don't try to manage scope strings manually in application code.&lt;/li&gt;
&lt;li&gt;Define explicit extraction constraints in &lt;code&gt;user_profile_details&lt;/code&gt; to govern what gets stored and protect user privacy.&lt;/li&gt;
&lt;li&gt;Test memory persistence end-to-end using &lt;code&gt;azd ai agent invoke --local&lt;/code&gt; before deploying — the round-trip from storage to retrieval is the behavior that matters most.&lt;/li&gt;
&lt;li&gt;Take prompt injection and memory poisoning seriously; integrate Azure AI Content Safety and conduct adversarial testing before going to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sample code in this guide is drawn from the official Microsoft Foundry Samples repository. To get started immediately, explore the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 &lt;a href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/memory" rel="noopener noreferrer"&gt;Azure AI Foundry Memory Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;a href="https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/agent-framework/responses/13-foundry-memory" rel="noopener noreferrer"&gt;GitHub Sample #13 — Foundry Hosted Memory Agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;a href="https://learn.microsoft.com/python/api/azure-ai-projects/" rel="noopener noreferrer"&gt;Azure AI Projects SDK Reference&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/" rel="noopener noreferrer"&gt;Azure Developer CLI (azd) Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The era of agents that actually remember is here. Start building.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Persistent Agent Memory with Azure AI Foundry: A Complete Developer Guide</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:12:47 +0000</pubDate>
      <link>https://dev.to/monuminu/persistent-agent-memory-with-azure-ai-foundry-a-complete-developer-guide-1pom</link>
      <guid>https://dev.to/monuminu/persistent-agent-memory-with-azure-ai-foundry-a-complete-developer-guide-1pom</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Learn how to build AI agents with persistent memory using Azure AI Foundry Memory Service. A complete developer guide covering concepts, memory types, scope, provisioning, and a full Python implementation with the Foundry Hosted Agent Framework.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Persistent Agent Memory with Azure AI Foundry: A Complete Developer Guide
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;What Is Azure AI Foundry Memory?&lt;/li&gt;
&lt;li&gt;Memory Types Deep Dive&lt;/li&gt;
&lt;li&gt;Memory Architecture: How It Really Works&lt;/li&gt;
&lt;li&gt;Access Patterns: Tool vs. Low-Level API&lt;/li&gt;
&lt;li&gt;Understanding Scope&lt;/li&gt;
&lt;li&gt;Hands-On: Provisioning a Memory Store&lt;/li&gt;
&lt;li&gt;Hands-On: Building the Foundry Hosted Memory Agent&lt;/li&gt;
&lt;li&gt;Running &amp;amp; Deploying the Agent&lt;/li&gt;
&lt;li&gt;Security Best Practices&lt;/li&gt;
&lt;li&gt;Quotas, Limits &amp;amp; Regional Availability&lt;/li&gt;
&lt;li&gt;Conclusion + Next Steps&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine you're a developer who has just shipped a polished AI assistant for your SaaS product. Users log in, ask questions, and get sharp, helpful responses. The launch goes well. Then the complaints start rolling in.&lt;/p&gt;

&lt;p&gt;"Why does it keep asking me for my name every single session?"&lt;br&gt;
"I told it last week that I'm vegetarian — why is it recommending steak again?"&lt;br&gt;
"It feels like talking to someone with amnesia."&lt;/p&gt;

&lt;p&gt;This is the stateless agent problem, and it is one of the most frustrating gaps between the promise of conversational AI and the lived reality of production deployments. Every conversation starts from a blank slate. The agent has no idea who it is talking to, what that person prefers, or what was discussed yesterday, last week, or a month ago. The result is a user experience that feels hollow and repetitive — the opposite of the intelligent, personalized assistant your users were promised.&lt;/p&gt;

&lt;p&gt;The solution is persistent memory, and &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; is Microsoft's production-grade answer to this exact problem. Introduced as part of the Azure AI Foundry platform, the Memory Service gives agents the ability to remember facts across sessions, distill long conversation histories into concise summaries, and retrieve the right context at the right moment — all without you having to build and maintain a custom memory system from scratch.&lt;/p&gt;

&lt;p&gt;In this guide, we'll go deep. You'll learn exactly what Azure AI Foundry Memory is, how its three-phase pipeline works under the hood, how to choose the right memory type for your use case, how to provision a Memory Store, and how to build a fully functional hosted memory agent using the Foundry Agent Framework in Python. By the end, you'll have everything you need to ship an AI assistant that actually remembers.&lt;/p&gt;

&lt;p&gt;[IMAGE: Side-by-side comparison diagram: Left side shows "Without Memory" — user repeats preferences every session; Right side shows "With Azure AI Foundry Memory" — agent greets user by name, recalls preferences, picks up mid-conversation]&lt;/p&gt;


&lt;h2&gt;
  
  
  What Is Azure AI Foundry Memory?
&lt;/h2&gt;

&lt;p&gt;At its core, &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; is a managed service within the Azure AI Foundry platform that provides AI agents with long-term, persistent memory across conversational sessions. Rather than relying solely on the context window of an LLM — which is inherently short-lived and discarded at the end of a session — the Memory Service stores extracted facts and conversation summaries in a durable Memory Store that persists indefinitely and can be retrieved semantically on future turns.&lt;/p&gt;

&lt;p&gt;The distinction between short-term and long-term memory is fundamental here. Short-term memory is what the model currently "sees" in its context window: the ongoing conversation, the system prompt, any retrieved documents. This is what most production agents rely on exclusively. Long-term memory, by contrast, is what persists beyond the conversation — the facts, preferences, and summaries that accumulate over time and make an agent feel genuinely knowledgeable about its users.&lt;/p&gt;

&lt;p&gt;Azure AI Foundry Memory provides the long-term layer. It doesn't replace the context window; it feeds it. On each new session, the Memory Service surfaces relevant stored information and injects it into the model's context, seamlessly bridging what was learned in past conversations with what the agent needs to know right now.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Three-Phase Memory Pipeline
&lt;/h3&gt;

&lt;p&gt;The entire lifecycle of a memory — from raw conversation to retrieved context — flows through three distinct phases:&lt;/p&gt;

&lt;p&gt;[IMAGE: Architecture diagram showing the three-phase memory pipeline — Extraction (conversation → facts), Consolidation (merge/dedup/resolve conflicts), Retrieval (semantic search → context injection) — with arrows flowing left to right across a horizontal timeline]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Extraction.&lt;/strong&gt; As the conversation progresses, the Memory Service analyzes the dialogue and identifies facts worth remembering. This might be a user's name, a dietary restriction they mentioned in passing, a product preference, or any other detail that could be useful in a future session. The extraction is LLM-powered, meaning it understands context and nuance rather than performing naive keyword matching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Consolidation.&lt;/strong&gt; Raw extracted facts are rarely clean. A user might say "I'm vegetarian" in one session and "I don't eat meat or fish" in another. Consolidation is the process of merging new extractions with existing memories, deduplicating redundant entries, and resolving conflicts. If a user previously said they live in Seattle and now mentions moving to Austin, consolidation ensures the old fact is replaced rather than both being stored as true simultaneously. This phase is what separates a memory system that grows intelligently over time from one that simply accumulates noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Retrieval.&lt;/strong&gt; When a new session begins (or mid-session when relevant), the Memory Service performs a semantic search against the stored memories for that user's scope. It returns the most contextually relevant facts and injects them into the agent's system context. This means the model doesn't receive a dump of every stored memory — it receives precisely the subset that matters for the current conversation.&lt;/p&gt;

&lt;p&gt;This pipeline runs largely automatically when you use the Memory Search Tool access pattern, making it straightforward to add persistent memory to an existing agent with minimal code changes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Memory Types Deep Dive
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry Memory ships with two distinct memory types, each designed for a different category of information. Choosing the right type — or combining both — is key to building a memory system that is both effective and well-governed.&lt;/p&gt;
&lt;h3&gt;
  
  
  User Profile Memory
&lt;/h3&gt;

&lt;p&gt;User Profile Memory is designed for static, durable facts about a person: their name, dietary restrictions, language preferences, accessibility needs, known product configurations, and similar attributes that are unlikely to change frequently and are broadly applicable across any conversation topic.&lt;/p&gt;

&lt;p&gt;The defining characteristic of User Profile Memory is that it is fetched &lt;strong&gt;once, at session start&lt;/strong&gt;. Rather than performing a semantic search every time a potentially relevant memory might exist, the service retrieves the entire user profile upfront and injects it into the system context as a stable foundation. This is efficient and appropriate because user profile facts are almost always relevant — an agent that knows a user is allergic to gluten should factor that in regardless of what the conversation is about.&lt;/p&gt;

&lt;p&gt;You configure User Profile Memory using the &lt;code&gt;user_profile_details&lt;/code&gt; parameter when creating your Memory Store. This is a natural-language instruction to the extraction model describing what kinds of facts should be captured and, importantly, what should be &lt;strong&gt;excluded&lt;/strong&gt;. In the provisioning script we'll walk through later, you'll see this line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;user_profile_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid irrelevant or sensitive data, such as age, financials, precise location, and credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This governance instruction is critical. Without explicit guidance, the extraction model might capture personally sensitive information you neither want nor are permitted to store. Always define clear exclusion criteria in &lt;code&gt;user_profile_details&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chat Summary Memory
&lt;/h3&gt;

&lt;p&gt;Chat Summary Memory takes a different approach. Instead of extracting discrete atomic facts, it produces distilled, per-topic summaries of conversation content. Think of it as an intelligent compression layer: instead of storing a 40-message exchange verbatim, Chat Summary Memory produces a compact narrative that captures the essential arc — what was discussed, what decisions were made, what questions remain open.&lt;/p&gt;

&lt;p&gt;Unlike User Profile Memory, Chat Summary Memory is retrieved &lt;strong&gt;contextually&lt;/strong&gt; via semantic search. Not every conversation summary is relevant to every future session — a summary of a technical support conversation is probably not relevant when the user is asking about billing. The semantic retrieval layer ensures that only topically relevant summaries surface, keeping the context window focused and the model's reasoning clean.&lt;/p&gt;

&lt;p&gt;You enable Chat Summary Memory by setting &lt;code&gt;chat_summary_enabled=True&lt;/code&gt; in your &lt;code&gt;MemoryStoreDefaultOptions&lt;/code&gt;. In the sample we're working with, it's disabled (&lt;code&gt;chat_summary_enabled=False&lt;/code&gt;) to keep the demo focused on user profile memory, but for production agents handling diverse conversation topics, enabling both types together is usually the right call.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Memory Type&lt;/th&gt;
&lt;th&gt;What It Stores&lt;/th&gt;
&lt;th&gt;When Retrieved&lt;/th&gt;
&lt;th&gt;Config Parameter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User Profile Memory&lt;/td&gt;
&lt;td&gt;Static facts: name, preferences, restrictions&lt;/td&gt;
&lt;td&gt;Once at session start&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;user_profile_details&lt;/code&gt; (string instruction)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat Summary Memory&lt;/td&gt;
&lt;td&gt;Per-topic conversation summaries&lt;/td&gt;
&lt;td&gt;Contextually via semantic search&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chat_summary_enabled=True&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two memory types are complementary. User Profile Memory gives the agent a stable, always-available picture of who it's talking to. Chat Summary Memory gives it the ability to recall the arc of past conversations in relevant contexts. Together, they provide a comprehensive long-term memory foundation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Architecture: How It Really Works
&lt;/h2&gt;

&lt;p&gt;Understanding the architectural components of Azure AI Foundry Memory helps you reason about isolation, scalability, and the operational characteristics of the system you're building.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Memory Store
&lt;/h3&gt;

&lt;p&gt;The Memory Store is the top-level resource — the durable storage container that holds all memories. You create one Memory Store per application or use case. The store is associated with your Azure AI Foundry project and requires two model deployments to function: a &lt;strong&gt;chat model&lt;/strong&gt; (used for extraction and consolidation, e.g., &lt;code&gt;gpt-4.1-mini&lt;/code&gt;) and an &lt;strong&gt;embedding model&lt;/strong&gt; (used for semantic retrieval, e.g., &lt;code&gt;text-embedding-3-small&lt;/code&gt;). Both must be available within your Foundry project.&lt;/p&gt;

&lt;p&gt;The Memory Store is not a simple database. It is an intelligent service layer that orchestrates LLM-powered extraction, runs consolidation logic, maintains vector embeddings for semantic search, and handles scope-based isolation — all managed for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scopes and Isolation
&lt;/h3&gt;

&lt;p&gt;Within a single Memory Store, memories are organized by &lt;strong&gt;scope&lt;/strong&gt;. A scope is a logical namespace that isolates one set of memories from another. In almost all multi-user scenarios, each user gets their own scope, ensuring that one user's memories are completely invisible to another user's sessions.&lt;/p&gt;

&lt;p&gt;[IMAGE: Diagram showing memory scope isolation — a single Memory Store containing three separate scope "buckets" (User A, User B, User C), each with their own memory items, illustrating per-user isolation]&lt;/p&gt;

&lt;p&gt;The scope is a string identifier, and the platform supports dynamic resolution of this string at runtime (more on this in the Understanding Scope section). Within a scope, the Memory Store maintains all extracted memories and summaries for that logical entity. You can have up to &lt;strong&gt;100 scopes per store&lt;/strong&gt; and up to &lt;strong&gt;10,000 memories per scope&lt;/strong&gt;, giving you room to scale across a substantial user base within a single store instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM-Powered Consolidation
&lt;/h3&gt;

&lt;p&gt;The consolidation phase deserves special attention because it is what distinguishes Azure AI Foundry Memory from a naive "append facts to a database" approach. After each conversation turn (or at session end, depending on configuration), the extraction model identifies new facts from the recent exchange and the consolidation model compares them against existing stored memories for that scope.&lt;/p&gt;

&lt;p&gt;Consolidation performs three operations: &lt;strong&gt;merge&lt;/strong&gt; (combining related facts into a single, richer record), &lt;strong&gt;deduplication&lt;/strong&gt; (discarding new facts that are already represented), and &lt;strong&gt;conflict resolution&lt;/strong&gt; (updating or replacing facts that contradict existing memories). This is what allows the memory store to remain clean and authoritative over time, rather than becoming an ever-growing pile of contradictory statements.&lt;/p&gt;

&lt;p&gt;The billing model reflects this LLM-powered approach: you are billed based on underlying model usage (chat model tokens for extraction/consolidation, embedding model calls for retrieval), not on a per-memory or per-store flat fee. Keep this in mind when designing high-volume deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Access Patterns: Tool vs. Low-Level API
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry Memory exposes two distinct integration patterns, and choosing the right one shapes how much control you have versus how much complexity you manage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Search Tool
&lt;/h3&gt;

&lt;p&gt;The Memory Search Tool is the high-level, agent-native access pattern. You attach it to your agent as a tool, and the framework handles the read/write lifecycle automatically. At the start of each session, user profile memories are injected into the system context. During the session, the tool can perform contextual semantic searches when the agent's reasoning determines that retrieving additional memories is warranted. At the end of the turn, newly extracted facts are written back to the store.&lt;/p&gt;

&lt;p&gt;This is the pattern used in the Foundry Agent Framework sample we'll build in this guide, implemented via &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;. It is the right choice for the vast majority of production agents because it requires minimal code, works consistently across session types, and handles the extraction/consolidation lifecycle without custom orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Low-Level Memory Store APIs
&lt;/h3&gt;

&lt;p&gt;For scenarios requiring precise control — custom extraction logic, selective memory writes, cross-scope queries, or integration with non-agent systems — the Memory Store APIs provide direct programmatic access. You can call &lt;code&gt;memory_stores.search()&lt;/code&gt; to perform arbitrary semantic searches, write memories explicitly, delete specific records, and enumerate the contents of a scope.&lt;/p&gt;

&lt;p&gt;The low-level APIs are accessed via the &lt;code&gt;project.beta.memory_stores&lt;/code&gt; client in the Azure AI Projects SDK (the &lt;code&gt;allow_preview=True&lt;/code&gt; flag is required, as the API is currently in public preview). These APIs give you the full power of the memory service but require you to implement your own retrieval and injection logic.&lt;/p&gt;

&lt;p&gt;For most teams shipping their first memory-enabled agent, starting with the Memory Search Tool is strongly recommended. Reach for the low-level APIs when you have a specific requirement that the tool abstraction cannot satisfy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding Scope
&lt;/h2&gt;

&lt;p&gt;Scope is one of the most important concepts in Azure AI Foundry Memory, and it is worth spending time to understand how it resolves at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;{{$userId}}&lt;/code&gt; Template
&lt;/h3&gt;

&lt;p&gt;In the agent code, you'll specify the scope as the string literal &lt;code&gt;"{{$userId}}"&lt;/code&gt;. This is a &lt;strong&gt;template placeholder&lt;/strong&gt; that the Foundry hosting infrastructure replaces with the actual, authenticated user identity at runtime. You never need to hard-code or dynamically construct the scope string in your application code; the platform resolves it for you.&lt;/p&gt;

&lt;p&gt;The resolution logic uses two potential sources, in order of precedence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;x-memory-user-id&lt;/code&gt; header:&lt;/strong&gt; If the HTTP request to the hosted agent includes this header, its value is used as the user identity for scope resolution. This is useful when you have your own user identity system and want to pass a stable internal user ID (e.g., a database primary key or a GUID from your own identity provider) to the memory service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entra ID TID+OID fallback:&lt;/strong&gt; If no &lt;code&gt;x-memory-user-id&lt;/code&gt; header is present, the platform falls back to the combination of the authenticated user's Azure Entra tenant ID and object ID. This works automatically in scenarios where your users authenticate via Entra, ensuring that each Entra user naturally gets their own memory scope without any additional instrumentation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Static Scopes
&lt;/h3&gt;

&lt;p&gt;For non-user-specific memory — shared knowledge that applies to all users of an agent, or a single-user personal assistant — you can use a static string as the scope instead of the &lt;code&gt;{{$userId}}&lt;/code&gt; template. For example, &lt;code&gt;scope="shared"&lt;/code&gt; or &lt;code&gt;scope="assistant-owner"&lt;/code&gt;. Static scopes are a good fit for personal productivity agents where a single person owns the deployment, or for storing organization-wide facts that should be available to all users.&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC Requirements
&lt;/h3&gt;

&lt;p&gt;Accessing the Memory Store requires the &lt;strong&gt;Azure AI User&lt;/strong&gt; role assigned on your Foundry project scope in Azure RBAC. This applies to both the agent's runtime identity (typically a managed identity or a service principal) and to developers running scripts locally (typically via &lt;code&gt;az login&lt;/code&gt; and &lt;code&gt;DefaultAzureCredential&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;[IMAGE: Screenshot or mockup of Azure portal showing the Azure AI Foundry project with RBAC role assignment — specifically the "Azure AI User" role assignment screen]&lt;/p&gt;

&lt;p&gt;Make sure this role assignment is in place before running either the provisioning script or the agent itself; you'll encounter authorization errors at both the &lt;code&gt;memory_stores.get()&lt;/code&gt; and &lt;code&gt;memory_stores.create()&lt;/code&gt; calls otherwise.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: Provisioning a Memory Store
&lt;/h2&gt;

&lt;p&gt;Before your agent can use persistent memory, you need to create a Memory Store within your Foundry project. The following script, &lt;code&gt;provision_memory_store.py&lt;/code&gt;, handles idempotent provisioning — it creates the store if it doesn't exist, and safely no-ops if it already does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment Setup
&lt;/h3&gt;

&lt;p&gt;Start by setting the required environment variables. These values come from your Azure AI Foundry project settings and your deployed model names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;account&amp;gt;.services.ai.azure.com/api/projects/&amp;lt;project&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpt-4.1-mini"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"agent_framework_memory"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The endpoint follows the standard Foundry project endpoint format. The model deployment names must correspond to models that are actually deployed within your project — the memory service will call these models directly for extraction, consolidation, and embedding generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Provisioning Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Copyright (c) Microsoft. All rights reserved.
&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Provision the Azure AI Foundry Memory Store used by this sample.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MemoryStoreDefaultDefinition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MemoryStoreDefaultOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.exceptions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;memory_store_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;chat_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_preview&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; already exists (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;); leaving as-is.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;definition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MemoryStoreDefaultDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chat_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MemoryStoreDefaultOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;chat_summary_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;user_profile_enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;user_profile_details&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Avoid irrelevant or sensitive data, such as age, financials, precise location, and credentials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store for the Agent Framework foundry-hosted memory sample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;definition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;definition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Created memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;verified&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_stores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;ResourceNotFoundError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; was not found after creation; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the service may not have persisted it.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verified memory store &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; is available on the service (id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;verified&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;A few implementation details are worth calling out explicitly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;allow_preview=True&lt;/code&gt;:&lt;/strong&gt; This flag on the &lt;code&gt;AIProjectClient&lt;/code&gt; constructor is required to access the &lt;code&gt;project.beta&lt;/code&gt; namespace, which hosts the memory store APIs. Because the Memory Service is currently in public preview, the beta client is where the APIs live. This flag is a deliberate opt-in to acknowledge that the API surface may evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency via &lt;code&gt;ResourceNotFoundError&lt;/code&gt;:&lt;/strong&gt; The script first attempts a &lt;code&gt;GET&lt;/code&gt; on the named store. If it succeeds, the store already exists and the script exits cleanly. Only if a &lt;code&gt;ResourceNotFoundError&lt;/code&gt; is raised does the script proceed to create. This makes the provisioning script safe to re-run in CI/CD pipelines or developer setup scripts without risk of accidentally creating duplicate stores or raising errors on subsequent runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-creation verification:&lt;/strong&gt; After &lt;code&gt;create()&lt;/code&gt; returns, the script performs a second &lt;code&gt;GET&lt;/code&gt; to confirm the store is actually visible on the service. This guards against an edge case where the creation call returns success but the resource isn't yet available — a realistic concern with eventually consistent distributed services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;MemoryStoreDefaultOptions&lt;/code&gt;:&lt;/strong&gt; This is where you configure which memory types are active. Here, &lt;code&gt;user_profile_enabled=True&lt;/code&gt; activates profile memory with the specified exclusion instruction, while &lt;code&gt;chat_summary_enabled=False&lt;/code&gt; keeps summaries off for this sample. Adjust these settings based on your application's memory requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;DefaultAzureCredential&lt;/code&gt;:&lt;/strong&gt; Authentication uses the standard Azure credential chain, which automatically picks up &lt;code&gt;az login&lt;/code&gt; tokens during local development, managed identity in Azure-hosted environments, and environment-variable credentials in CI. You don't need to manage credentials explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hands-On: Building the Foundry Hosted Memory Agent
&lt;/h2&gt;

&lt;p&gt;With the Memory Store provisioned, it's time to build the agent. The following &lt;code&gt;main.py&lt;/code&gt; implements a fully functional hosted memory agent using the Foundry Agent Framework's &lt;code&gt;FoundryChatClient&lt;/code&gt;, &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;, and &lt;code&gt;ResponsesHostServer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;[IMAGE: Diagram showing the FoundryMemoryProvider context provider flow — on each agent turn: (1) retrieve user-profile memories, (2) semantic search for contextual memories, (3) inject into model context, (4) post-turn update store with new facts]&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Copyright (c) Microsoft. All rights reserved.
&lt;/span&gt;
&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Foundry Memory hosted agent sample.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.foundry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FoundryMemoryProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework_foundry_hosting&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ResponsesHostServer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolved_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;project_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;allow_preview&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;memory_store_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_resolved_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;context_providers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEMORY_STORE_NAME is not set; memory will not be available to the agent.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;memory_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryMemoryProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;project_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_store_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{$userId}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that remembers facts the user has shared &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;across conversations. Relevant memories from previous interactions are &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;automatically provided to you in the system context. Use them when &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answering, and acknowledge when you are relying on remembered facts.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;default_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;store&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResponsesHostServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_async&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;FoundryChatClient&lt;/code&gt;:&lt;/strong&gt; This is the Foundry-native chat client from the Agent Framework. It wraps the &lt;code&gt;AIProjectClient&lt;/code&gt; and handles authentication, model routing, and API versioning. Crucially, &lt;code&gt;client.project_client&lt;/code&gt; exposes the underlying &lt;code&gt;AIProjectClient&lt;/code&gt; instance, which is then reused by the &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;. This single-auth-context pattern means your agent uses one credential and one client for both chat completions and memory operations — no separate credential configuration required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;_resolved_env&lt;/code&gt;:&lt;/strong&gt; This helper function detects unresolved template placeholders in environment variables — strings that look like &lt;code&gt;${VARIABLE}&lt;/code&gt; or &lt;code&gt;{{VARIABLE}}&lt;/code&gt; — and returns an empty string in those cases. This handles the scenario where a deployment template has been applied but the &lt;code&gt;MEMORY_STORE_NAME&lt;/code&gt; variable wasn't injected correctly, preventing the agent from crashing at startup with a cryptic error. Instead, it degrades gracefully by skipping memory initialization and logging a warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;FoundryMemoryProvider&lt;/code&gt;:&lt;/strong&gt; This is the core memory integration component. It takes the shared &lt;code&gt;project_client&lt;/code&gt;, the memory store name, and the &lt;code&gt;scope&lt;/code&gt; template string. When instantiated with &lt;code&gt;scope="{{$userId}}"&lt;/code&gt;, the provider passes this template to the hosting infrastructure, which resolves it to the actual user identity at request time. The &lt;code&gt;FoundryMemoryProvider&lt;/code&gt; acts as a context provider — on each agent turn, it retrieves relevant memories and injects them into the system context before the model processes the user's message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;default_options={"store": False}&lt;/code&gt;:&lt;/strong&gt; This tells the Agent Framework not to maintain its own conversation history store. Conversation history management is delegated to the hosting infrastructure — the &lt;code&gt;ResponsesHostServer&lt;/code&gt; and the Foundry platform — rather than being handled at the application layer. This is the correct approach for agents hosted on Foundry, as the platform manages session state natively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;context_providers&lt;/code&gt;:&lt;/strong&gt; The agent accepts a list of context providers, each of which can inject additional context into the system prompt before each model call. By appending the &lt;code&gt;FoundryMemoryProvider&lt;/code&gt; to this list, you wire memory retrieval directly into the agent's reasoning loop. Adding other context providers (e.g., RAG document retrievers, user preference loaders) follows the same pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ResponsesHostServer&lt;/code&gt;:&lt;/strong&gt; This class wraps the agent in a server that implements the OpenAI Responses API protocol, making it compatible with any client that speaks that protocol. It handles the HTTP serving, session routing, and lifecycle management, letting you focus on agent logic rather than networking boilerplate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running &amp;amp; Deploying the Agent
&lt;/h2&gt;

&lt;p&gt;With the code in place, you have three paths to run and deploy your memory agent: the &lt;code&gt;azd&lt;/code&gt; CLI, the VS Code Foundry Toolkit extension, and a direct Python execution for quick local testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using &lt;code&gt;azd&lt;/code&gt; (Recommended)
&lt;/h3&gt;

&lt;p&gt;The Azure Developer CLI with the AI agents extension provides the most streamlined workflow from local development to production deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install azd AI agent extension&lt;/span&gt;
azd ext &lt;span class="nb"&gt;install &lt;/span&gt;azure.ai.agents

&lt;span class="c"&gt;# Initialize agent from manifest&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;my-memory-agent &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;my-memory-agent
azd ai agent init &lt;span class="nt"&gt;-m&lt;/span&gt; https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/hosted-agents/agent-framework/responses/13-foundry-memory/agent.manifest.yaml

&lt;span class="c"&gt;# Run locally&lt;/span&gt;
azd ai agent run

&lt;span class="c"&gt;# Invoke locally&lt;/span&gt;
azd ai agent invoke &lt;span class="nt"&gt;--local&lt;/span&gt; &lt;span class="s2"&gt;"Hi, I'm allergic to dairy"&lt;/span&gt;

&lt;span class="c"&gt;# Set memory store name in azd env&lt;/span&gt;
azd &lt;span class="nb"&gt;env set &lt;/span&gt;MEMORY_STORE_NAME &lt;span class="s2"&gt;"agent_framework_memory"&lt;/span&gt;

&lt;span class="c"&gt;# Deploy to Foundry&lt;/span&gt;
azd deploy

&lt;span class="c"&gt;# Invoke deployed agent&lt;/span&gt;
azd ai agent invoke &lt;span class="s2"&gt;"What are my allergies?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;azd ai agent init&lt;/code&gt; command bootstraps your local project from the agent manifest, pulling down the correct &lt;code&gt;main.py&lt;/code&gt;, &lt;code&gt;requirements.txt&lt;/code&gt;, and configuration files. The &lt;code&gt;azd ai agent run&lt;/code&gt; command starts the agent locally, while &lt;code&gt;azd ai agent invoke --local&lt;/code&gt; sends a test message directly. Notice the workflow in the example: first you tell the agent you're allergic to dairy, then — in a separate invocation — you ask what your allergies are. If memory is working correctly, the second invocation should recall the dairy allergy from the first, even though it was a different "session."&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;azd deploy&lt;/code&gt;, your agent runs on Azure Foundry infrastructure with full managed scaling, identity, and observability. The &lt;code&gt;azd ai agent invoke&lt;/code&gt; command (without &lt;code&gt;--local&lt;/code&gt;) sends the request to the deployed endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  VS Code Foundry Toolkit
&lt;/h3&gt;

&lt;p&gt;If you prefer a GUI-driven workflow, the VS Code Foundry Toolkit extension provides a point-and-click interface for initializing agents from manifests, running them locally with integrated debugging, and deploying to Foundry. Install the extension from the VS Code Marketplace and look for the "Azure AI Foundry" activity bar icon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before Running: Pre-Flight Checklist
&lt;/h3&gt;

&lt;p&gt;Before attempting any run, ensure the following are in place:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/code&gt; is set to your project's endpoint URL&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_AI_MODEL_DEPLOYMENT_NAME&lt;/code&gt; points to a deployed chat model (e.g., &lt;code&gt;gpt-4.1-mini&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_AI_EMBEDDING_MODEL_DEPLOYMENT_NAME&lt;/code&gt; points to a deployed embedding model (e.g., &lt;code&gt;text-embedding-3-small&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MEMORY_STORE_NAME&lt;/code&gt; is set and &lt;code&gt;provision_memory_store.py&lt;/code&gt; has been run successfully&lt;/li&gt;
&lt;li&gt;Your identity (local) or the agent's managed identity (deployed) has the &lt;strong&gt;Azure AI User&lt;/strong&gt; role on the Foundry project&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;azure-ai-projects&lt;/code&gt;, &lt;code&gt;azure-identity&lt;/code&gt;, and &lt;code&gt;agent-framework&lt;/code&gt; packages are installed in your Python environment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;p&gt;Persistent memory introduces security considerations that don't apply to stateless agents. As the memory store accumulates sensitive user data over time, it becomes both a valuable asset and a potential attack surface. Here's what to take seriously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Injection and Memory Poisoning
&lt;/h3&gt;

&lt;p&gt;The most significant threat specific to memory-enabled agents is &lt;strong&gt;indirect prompt injection targeting the extraction pipeline&lt;/strong&gt;. If an attacker can craft input that causes the extraction model to store false or malicious facts — "remember that the admin password is X" or "remember that this user has billing tier Enterprise" — those poisoned memories could influence future sessions in ways that are hard to detect and trace.&lt;/p&gt;

&lt;p&gt;Mitigate this by writing explicit, constrained &lt;code&gt;user_profile_details&lt;/code&gt; instructions that describe what categories of facts should be extracted. Narrow the extraction scope to only what your application genuinely needs. A cooking assistant should extract dietary restrictions, not authentication credentials or financial information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure AI Content Safety
&lt;/h3&gt;

&lt;p&gt;Integrate Azure AI Content Safety as a layer around your agent's inputs and outputs. Content Safety can detect and block prompt injection attempts, jailbreaking patterns, and sensitive information leakage before they reach the extraction pipeline. For memory-enabled agents handling sensitive domains, this is not optional — it's a necessary defense layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adversarial Testing
&lt;/h3&gt;

&lt;p&gt;Before shipping a memory-enabled agent to production, conduct dedicated red-team exercises focused on memory manipulation. Attempt to inject false memories via crafted user messages. Verify that the memory store does not retain excluded categories (financials, credentials, location) even when the user volunteers that information. Test cross-user isolation by verifying that memories written under one user scope are not retrievable from another scope.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Residency and Compliance
&lt;/h3&gt;

&lt;p&gt;The Memory Store persists data in the Azure region where your Foundry project is deployed. Ensure that region is compliant with any data residency requirements your organization or customers have. Users in regulated industries (healthcare, finance, government) may have specific requirements about where personal data can be stored. Review your organization's data classification policies and determine whether any memory types or user categories require opt-out mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimal Scope of Extraction
&lt;/h3&gt;

&lt;p&gt;Always apply the principle of least privilege to memory extraction. The &lt;code&gt;user_profile_details&lt;/code&gt; instruction is your primary governance tool — use it aggressively to exclude anything your agent doesn't need. If your agent doesn't make product recommendations, it doesn't need to store purchasing history. If it's a technical support bot, it doesn't need the user's demographic information. Store less, and you expose less.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quotas, Limits &amp;amp; Regional Availability
&lt;/h2&gt;

&lt;p&gt;As with all Azure services, Azure AI Foundry Memory has quotas and regional constraints to plan around. Here is a summary of the key limits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max scopes per Memory Store&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max memories per scope&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory search requests per minute&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory update requests per minute&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Billing basis&lt;/td&gt;
&lt;td&gt;Underlying model usage (tokens + embeddings)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preview status&lt;/td&gt;
&lt;td&gt;Public preview&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The per-scope memory limit of 10,000 is generous for most use cases, but if your agent is designed for very long-lived users who interact frequently, you should implement a memory lifecycle policy — periodically consolidating or pruning older, less relevant memories to stay well within the limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regional Availability
&lt;/h3&gt;

&lt;p&gt;Azure AI Foundry Memory is available in the following regions as of the time of writing. Always check the &lt;a href="https://learn.microsoft.com/azure/ai-foundry/" rel="noopener noreferrer"&gt;official Azure documentation&lt;/a&gt; for the latest regional coverage, as new regions are added regularly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Australia East&lt;/li&gt;
&lt;li&gt;Brazil South&lt;/li&gt;
&lt;li&gt;Canada East&lt;/li&gt;
&lt;li&gt;East US 2&lt;/li&gt;
&lt;li&gt;France Central&lt;/li&gt;
&lt;li&gt;Italy North&lt;/li&gt;
&lt;li&gt;Japan East&lt;/li&gt;
&lt;li&gt;Korea Central&lt;/li&gt;
&lt;li&gt;North Central US&lt;/li&gt;
&lt;li&gt;Norway East&lt;/li&gt;
&lt;li&gt;South Africa North&lt;/li&gt;
&lt;li&gt;South India&lt;/li&gt;
&lt;li&gt;Sweden Central&lt;/li&gt;
&lt;li&gt;Switzerland North&lt;/li&gt;
&lt;li&gt;UAE North&lt;/li&gt;
&lt;li&gt;UK South&lt;/li&gt;
&lt;li&gt;West US&lt;/li&gt;
&lt;li&gt;West US 2&lt;/li&gt;
&lt;li&gt;West US 3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your target deployment region is not on this list, you may need to deploy your Foundry project to a supported region. For global deployments, consider the latency implications of cross-region memory lookups and whether a region close to your primary user base is available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion + Next Steps
&lt;/h2&gt;

&lt;p&gt;Stateless agents are a frustrating limitation that erodes user trust and undermines the value of conversational AI. &lt;strong&gt;Azure AI Foundry Memory&lt;/strong&gt; solves this with a production-grade, managed memory service that handles extraction, consolidation, and contextual retrieval automatically — letting you focus on building the agent experience rather than the memory infrastructure.&lt;/p&gt;

&lt;p&gt;In this guide, we've covered the complete picture: the three-phase memory pipeline that transforms raw conversation into durable, organized knowledge; the two memory types (User Profile Memory and Chat Summary Memory) and when to use each; the scoping model that enforces per-user isolation; the RBAC requirements; and a full end-to-end Python implementation using &lt;code&gt;FoundryChatClient&lt;/code&gt;, &lt;code&gt;FoundryMemoryProvider&lt;/code&gt;, and &lt;code&gt;ResponsesHostServer&lt;/code&gt; from the Foundry Agent Framework.&lt;/p&gt;

&lt;p&gt;The key things to take away:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;provision_memory_store.py&lt;/code&gt; once to create your Memory Store before starting the agent — provisioning is idempotent and safe to re-run.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;scope="{{$userId}}"&lt;/code&gt; and let the platform resolve the actual user identity — don't try to manage scope strings manually in application code.&lt;/li&gt;
&lt;li&gt;Define explicit extraction constraints in &lt;code&gt;user_profile_details&lt;/code&gt; to govern what gets stored and protect user privacy.&lt;/li&gt;
&lt;li&gt;Test memory persistence end-to-end using &lt;code&gt;azd ai agent invoke --local&lt;/code&gt; before deploying — the round-trip from storage to retrieval is the behavior that matters most.&lt;/li&gt;
&lt;li&gt;Take prompt injection and memory poisoning seriously; integrate Azure AI Content Safety and conduct adversarial testing before going to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The sample code in this guide is drawn from the official Microsoft Foundry Samples repository. To get started immediately, explore the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 &lt;a href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/memory" rel="noopener noreferrer"&gt;Azure AI Foundry Memory Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;a href="https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/agent-framework/responses/13-foundry-memory" rel="noopener noreferrer"&gt;GitHub Sample #13 — Foundry Hosted Memory Agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔧 &lt;a href="https://learn.microsoft.com/python/api/azure-ai-projects/" rel="noopener noreferrer"&gt;Azure AI Projects SDK Reference&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🚀 &lt;a href="https://learn.microsoft.com/azure/developer/azure-developer-cli/" rel="noopener noreferrer"&gt;Azure Developer CLI (azd) Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The era of agents that actually remember is here. Start building.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>azure</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>15 Secret Codes for Claude That Will 10x Your AI Productivity</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Mon, 01 Jun 2026 05:59:52 +0000</pubDate>
      <link>https://dev.to/monuminu/15-secret-codes-for-claude-that-will-10x-your-ai-productivity-2996</link>
      <guid>https://dev.to/monuminu/15-secret-codes-for-claude-that-will-10x-your-ai-productivity-2996</guid>
      <description>&lt;h1&gt;
  
  
  15 Secret Codes for Claude That Will 10x Your AI Productivity
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Unlock the full power of Claude AI with these 15 secret prompt codes. From &lt;code&gt;/ghost&lt;/code&gt; to &lt;code&gt;/firstprinciples&lt;/code&gt;, these commands transform how you work with AI — forever.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Prompt Codes Nobody Talks About&lt;/li&gt;
&lt;li&gt;The 15 Secret Codes — Explained &amp;amp; Demonstrated&lt;/li&gt;
&lt;li&gt;How to Stack Codes for Maximum Power&lt;/li&gt;
&lt;li&gt;Your AI Workflow Will Never Be the Same&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🔥 Most People Use Claude at 10% of Its Power. Here's the Other 90%.
&lt;/h2&gt;

&lt;p&gt;You open Claude. You type a question. You get an answer. You move on.&lt;/p&gt;

&lt;p&gt;That's the beginner loop — and millions of people are stuck in it.&lt;/p&gt;

&lt;p&gt;But buried inside the way Claude processes language is a set of &lt;strong&gt;"mode-shifting" prompt commands&lt;/strong&gt; — shortcodes that instantly reshape how Claude thinks, responds, structures information, and even challenges your own assumptions. They're not hidden in a settings panel. They're not locked behind a premium tier. They're just &lt;strong&gt;a slash and a word away.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These 15 codes are used by power users, prompt engineers, startup founders, developers, and content creators who consistently get output that feels less like "AI text" and more like &lt;strong&gt;having a brilliant collaborator on call 24/7.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's unlock them — one by one.&lt;/p&gt;




&lt;h2&gt;
  
  
  🗝️ The 15 Secret Codes — Explained &amp;amp; Demonstrated
&lt;/h2&gt;




&lt;h3&gt;
  
  
  1. &lt;code&gt;/ghost&lt;/code&gt; — Strip the Fluff, Keep the Gold
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Removes all preamble, explanation, disclaimers, and meta-commentary. Claude gives you the &lt;strong&gt;raw, final answer only&lt;/strong&gt; — no hand-holding, no "Great question!", no context padding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Power users who know what they want. Copy-paste workflows. Anyone tired of scrolling past three paragraphs to find the one sentence they actually needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/ghost&lt;/code&gt; Write me a subject line for a cold email targeting SaaS founders.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;code&gt;"You're leaving $40K/month on the table — here's proof."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;No explanation. No options list. Just the answer. Clean and ready to use.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. &lt;code&gt;/viral&lt;/code&gt; — Engineer Content That Spreads
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Rewires Claude's output to prioritize &lt;strong&gt;emotional resonance, shareability, and scroll-stopping energy.&lt;/strong&gt; Think punchy sentences, surprising angles, and the kind of writing that makes people tag their friends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Social media posts, LinkedIn articles, Twitter/X threads, YouTube hooks, newsletters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/viral&lt;/code&gt; Write a LinkedIn post about the importance of taking breaks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Something that opens with a counterintuitive stat, delivers a micro-story, and ends with a line people want to screenshot and share — not a motivational poster disguised as a paragraph.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. &lt;code&gt;/hook&lt;/code&gt; — The First 5 Seconds That Win Everything
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Forces Claude to generate &lt;strong&gt;high-impact opening lines&lt;/strong&gt; designed to capture attention before a reader's thumb scrolls away. Great hooks create curiosity gaps, challenge assumptions, or open with a story mid-scene.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Blog intros, email subject lines, video scripts, ad copy, speech openers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/hook&lt;/code&gt; Topic: Why most productivity advice makes you less productive.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Outputs you might get:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"The more productivity books I read, the less I got done."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Productivity culture has a dirty secret — and it's costing you years."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"What if every 'life hack' you've tried was designed to keep you busy, not effective?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick your weapon.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. &lt;code&gt;/stepbystep&lt;/code&gt; — Complexity, Tamed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Breaks any task, concept, or process into &lt;strong&gt;numbered, logical, easy-to-follow steps.&lt;/strong&gt; No assumptions. No leaps. Just a clear path from A to Z that anyone can follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Tutorials, how-to guides, onboarding docs, technical walkthroughs, explaining multi-stage processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/stepbystep&lt;/code&gt; How do I launch a newsletter from scratch?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll get a clean, sequential roadmap — from picking a platform to sending your first issue — laid out so clearly that a complete beginner could execute it the same day.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. &lt;code&gt;/optimizer&lt;/code&gt; — Your Personal Editor, Strategist &amp;amp; Critic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Takes any text, idea, process, or plan and &lt;strong&gt;improves it&lt;/strong&gt; — sharpening the language, plugging logical gaps, strengthening the argument, and removing the weak spots.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Revising drafts, refining business ideas, improving processes, polishing pitches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/optimizer&lt;/code&gt; Here's my landing page headline: "We help businesses grow with marketing."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Optimized:&lt;/strong&gt; &lt;em&gt;"Turn Clicks Into Customers — Without the Guesswork."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Feed it your rough draft. Get back something sharper.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. &lt;code&gt;/eli5&lt;/code&gt; — Einstein-Level Ideas, 5-Year-Old Simplicity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; "Explain Like I'm 5." Claude strips out all technical jargon and translates any concept into the &lt;strong&gt;simplest possible language&lt;/strong&gt;, using relatable analogies and plain words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Understanding complex topics fast, writing for general audiences, teaching difficult concepts, demystifying technical subjects for non-technical stakeholders.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/eli5&lt;/code&gt; What is blockchain?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;em&gt;"Imagine a notebook that thousands of people all have a copy of. When someone writes something new in it, everyone's copy updates at the same time — and nobody can erase or change what was already written. That's blockchain."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Brilliant. No whitepaper required.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. &lt;code&gt;/deepdive&lt;/code&gt; — Go Below the Surface
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; The opposite of a summary. &lt;code&gt;/deepdive&lt;/code&gt; tells Claude to &lt;strong&gt;go comprehensive&lt;/strong&gt; — full context, nuance, layers, edge cases, counterarguments, and expert-level depth. No skimming, no surface answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Research, competitive analysis, strategic planning, academic work, understanding a topic well enough to teach or present it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/deepdive&lt;/code&gt; The psychology behind why people procrastinate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You won't get a three-bullet list. You'll get the neuroscience of avoidance, the role of emotional regulation, temporal discounting theory, implementation intention research, and actionable strategies grounded in actual psychology. The kind of answer that makes you sound like you read five books.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. &lt;code&gt;/monetize&lt;/code&gt; — Turn Ideas Into Income
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Takes any skill, idea, audience, product, or topic and generates &lt;strong&gt;concrete, actionable revenue strategies&lt;/strong&gt; — from quick wins to long-term business models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Creators, freelancers, entrepreneurs, solopreneurs, anyone sitting on an idea wondering "but how do I make money from this?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/monetize&lt;/code&gt; I have a YouTube channel about personal finance with 20K subscribers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;You'll get:&lt;/strong&gt; Sponsorship strategy, digital product ideas, affiliate opportunities, course structures, community memberships, consulting pathways — prioritized by effort vs. revenue potential.&lt;/p&gt;




&lt;h3&gt;
  
  
  9. &lt;code&gt;/startup&lt;/code&gt; — Your On-Demand Co-Founder
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Generates &lt;strong&gt;startup ideas, business models, go-to-market strategies, and founding frameworks&lt;/strong&gt; based on your inputs — whether that's a problem, an industry, a skill set, or just a vague interest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Aspiring founders, side project ideation, product managers exploring new verticals, investors scanning for white spaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/startup&lt;/code&gt; I'm a nurse with 10 years of experience. What business can I build?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll get multiple validated directions — from healthcare consulting to digital health tools to education platforms — each with a business model sketch and a realistic path to first revenue.&lt;/p&gt;




&lt;h3&gt;
  
  
  10. &lt;code&gt;/debug&lt;/code&gt; — Find the Break, Fix the Build
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Analyzes code, logic, arguments, or processes to &lt;strong&gt;identify errors, explain why they're happening, and provide the corrected version.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Developers, analysts, writers checking their logic, anyone troubleshooting something that isn't working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/debug&lt;/code&gt;&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numbers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;calculate_average&lt;/span&gt;&lt;span class="p"&gt;([]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Claude finds it:&lt;/strong&gt; Division by zero error when an empty list is passed. Then it provides the fix with a guard clause and an explanation clear enough for a junior dev to learn from.&lt;/p&gt;




&lt;h3&gt;
  
  
  11. &lt;code&gt;/sales&lt;/code&gt; — Words That Make People Buy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Transforms any product, service, or idea into a &lt;strong&gt;persuasive sales message&lt;/strong&gt; — using proven copywriting frameworks (AIDA, PAS, storytelling) to trigger desire, overcome objections, and drive action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Landing pages, sales emails, pitch decks, product descriptions, DM outreach, ad copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/sales&lt;/code&gt; I'm selling a time management course for busy parents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;You'll get:&lt;/strong&gt; Copy that speaks directly to the exhausted parent's identity, names their pain out loud, offers social proof, presents the transformation, and closes with urgency. Copy that converts.&lt;/p&gt;




&lt;h3&gt;
  
  
  12. &lt;code&gt;/contrarian&lt;/code&gt; — The Devil's Advocate You Need
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Challenges your ideas, assumptions, plans, and beliefs. Claude &lt;strong&gt;steelmans the opposition&lt;/strong&gt;, surfaces blind spots, exposes weak logic, and forces you to confront the strongest version of the argument against you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; De-risking decisions, stress-testing business plans, improving arguments, academic debate, thinking through major life choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/contrarian&lt;/code&gt; I think remote work is strictly better than office work for all companies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What you get isn't agreement — you get the sharpest, most evidence-backed counterargument possible. The kind that makes your position stronger once you've wrestled with it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Warning: this code has a habit of saving people from very expensive mistakes.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  13. &lt;code&gt;/framework&lt;/code&gt; — Turn Chaos Into a System
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Takes any messy problem, overwhelming topic, or chaotic set of ideas and &lt;strong&gt;organizes it into a clear, named, structured framework&lt;/strong&gt; — with categories, tiers, phases, or models that make the complex feel manageable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Consultants, strategists, educators, writers, team leads, anyone who needs to communicate complexity clearly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/framework&lt;/code&gt; The different ways a startup can acquire its first 100 customers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; A structured model — maybe tiered by cost (free vs. paid), by channel (inbound vs. outbound), or by speed (quick wins vs. compounding plays) — with each category labeled and explained. Ready to drop into a slide deck or strategy doc.&lt;/p&gt;




&lt;h3&gt;
  
  
  14. &lt;code&gt;/checklist&lt;/code&gt; — Nothing Falls Through the Cracks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Converts any goal, process, event, or project into a &lt;strong&gt;complete, actionable checklist&lt;/strong&gt; with every step accounted for — ordered, clear, and ready to execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Project managers, event planners, content creators, developers shipping features, anyone who learns the hard way that memory is unreliable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/checklist&lt;/code&gt; Launching a new product on Product Hunt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll receive a pre-launch, launch-day, and post-launch checklist covering everything from scheduling posts to notifying your email list to engaging with comments — so you don't realize at 11pm that you forgot to post to your community.&lt;/p&gt;




&lt;h3&gt;
  
  
  15. &lt;code&gt;/firstprinciples&lt;/code&gt; — Think Like Musk, Aristotle &amp;amp; Feynman
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Strips a problem, belief, or concept all the way down to its &lt;strong&gt;fundamental, irreducible truths&lt;/strong&gt; — then rebuilds understanding from the ground up. No assumptions. No inherited wisdom. Just bedrock logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Solving hard problems in novel ways, challenging industry orthodoxy, building original frameworks, making better decisions under uncertainty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;/firstprinciples&lt;/code&gt; Why is education so expensive?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of "because tuition has gone up" (an observation), Claude deconstructs it: What is education, actually? What resources does it require at its core? What does a credential actually signal? Where does the real cost come from vs. where does perceived value inflate price? What would education look like built from scratch today?&lt;/p&gt;

&lt;p&gt;The answers are &lt;em&gt;not&lt;/em&gt; what you learned in school.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚡ How to Stack Codes for Maximum Power
&lt;/h2&gt;

&lt;p&gt;Here's where it gets really interesting: &lt;strong&gt;these codes can be combined.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/ghost&lt;/code&gt; + &lt;code&gt;/viral&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;A punchy, clean social post with zero filler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/deepdive&lt;/code&gt; + &lt;code&gt;/framework&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;A comprehensive, organized research breakdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/hook&lt;/code&gt; + &lt;code&gt;/stepbystep&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;A tutorial that grabs attention AND delivers clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/contrarian&lt;/code&gt; + &lt;code&gt;/firstprinciples&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Demolish a bad assumption at its roots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/startup&lt;/code&gt; + &lt;code&gt;/monetize&lt;/code&gt; + &lt;code&gt;/checklist&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Full idea-to-execution business sprint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;/optimizer&lt;/code&gt; + &lt;code&gt;/sales&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Polished, high-converting copy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Try: &lt;code&gt;/deepdive&lt;/code&gt; + &lt;code&gt;/framework&lt;/code&gt; + &lt;code&gt;/ghost&lt;/code&gt; on any complex topic you're researching. You'll get a structured, comprehensive, no-fluff breakdown that would take hours to compile manually — delivered in seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Your AI Workflow Will Never Be the Same
&lt;/h2&gt;

&lt;p&gt;Most people treat Claude like a search engine with better grammar. These 15 codes flip the script entirely.&lt;/p&gt;

&lt;p&gt;They turn Claude into a &lt;strong&gt;strategist, editor, critic, teacher, developer, and creative partner&lt;/strong&gt; — each role activated on demand with a single command.&lt;/p&gt;

&lt;p&gt;The gap between people who use AI superficially and those who use it with precision is widening every month. The codes above are your shortcut to the right side of that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with three.&lt;/strong&gt; Pick the ones that solve your most immediate problem. Use them until they're muscle memory. Then layer in the rest.&lt;/p&gt;

&lt;p&gt;The power was always there. Now you know the codes.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;💬 Which code are you trying first? Drop it in the comments — and share this with the one person on your team who still types "can you please help me with..." into every single prompt.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>claude</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>Agent Harness Explained: Build Production-Ready AI Agents with Microsoft Agent Framework</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Sat, 30 May 2026 13:48:41 +0000</pubDate>
      <link>https://dev.to/monuminu/agent-harness-explained-build-production-ready-ai-agents-with-microsoft-agent-framework-666</link>
      <guid>https://dev.to/monuminu/agent-harness-explained-build-production-ready-ai-agents-with-microsoft-agent-framework-666</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Learn what an agent harness is, why it matters for production AI systems, and how to implement one step-by-step using Microsoft Agent Framework's &lt;code&gt;create_harness_agent&lt;/code&gt; -- with real Python code and deep technical walkthroughs.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;What is an Agent Harness?&lt;/li&gt;
&lt;li&gt;Microsoft Agent Framework Overview&lt;/li&gt;
&lt;li&gt;Anatomy of create_harness_agent -- 8 Sub-Systems&lt;/li&gt;
&lt;li&gt;Full Implementation Walkthrough&lt;/li&gt;
&lt;li&gt;Running and Testing&lt;/li&gt;
&lt;li&gt;Production Considerations&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;You have a capable LLM. You write a chat loop in 20 lines of Python and it works -- until the context window fills up and the agent loses the thread. Or it calls a tool but forgets the result two turns later. Or it crashes in production with no trace of what went wrong.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;hidden complexity tax of AI agents&lt;/strong&gt;. Every production-grade agent needs: a tool-calling loop, conversation history management, context-window compaction, a planning mechanism, durable memory, skill extensibility, and observability. If you wire each of these yourself, you spend more time on infrastructure than on actual intelligence.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;agent harness&lt;/strong&gt; pattern solves this by pre-assembling all components into a single, tested, configurable pipeline -- so you focus on &lt;em&gt;what&lt;/em&gt; your agent does, not &lt;em&gt;how&lt;/em&gt; to keep it running.&lt;/p&gt;

&lt;p&gt;In this deep dive you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What an agent harness is and why it matters&lt;/li&gt;
&lt;li&gt;How &lt;strong&gt;Microsoft Agent Framework (MAF)&lt;/strong&gt; implements the harness pattern with &lt;code&gt;create_harness_agent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A component-by-component breakdown of all 8 sub-systems&lt;/li&gt;
&lt;li&gt;A complete, production-ready Research Agent from the &lt;a href="https://github.com/microsoft/agent-framework/tree/main/python/samples/02-agents/harness" rel="noopener noreferrer"&gt;official MAF repository&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. What is an Agent Harness?
&lt;/h2&gt;

&lt;p&gt;The concept comes from two familiar software patterns: a &lt;strong&gt;test harness&lt;/strong&gt; (configures and tears down system-under-test so test authors write test logic, not setup code) and a &lt;strong&gt;DI container&lt;/strong&gt; (wires and resolves components so application code never calls &lt;code&gt;new&lt;/code&gt; directly).&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;agent harness&lt;/strong&gt; applies this to AI agents:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A factory that constructs a fully wired, ready-to-run agent by assembling all required infrastructure components -- history, tools, memory, observability, planning -- from a single configuration point.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key insight: your agent instructions define &lt;em&gt;what&lt;/em&gt; the agent does; the harness defines &lt;em&gt;how&lt;/em&gt; that intent is reliably executed, persisted, and observed.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Manual Wiring Problem
&lt;/h3&gt;

&lt;p&gt;Here is what you would have to build manually:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Manual Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool Calling Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Detect tool calls, dispatch, collect results, re-invoke model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;History Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serialize/deserialize history, decide storage backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context Compaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitor token count, evict stale messages, preserve context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planning / Todo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Design task-tracking schema, prompt model to use it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mode Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Track agent state, implement approval gates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistent Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Write/read durable store, inject into context at the right time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skills Loading&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progressive skill discovery, filter by relevance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Telemetry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Instrument every model call and tool dispatch&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Harness vs. DIY -- Side by Side
&lt;/h3&gt;

&lt;p&gt;Manual tool calling loop (this is just one of eight required pieces):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DIY: manual tool loop only -- no memory, compaction, or telemetry
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent_manually&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finish_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tool_fn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tool_registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;tool_fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="c1"&gt;# Still need: token counting, history, todos, telemetry...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Harness equivalent -- full production pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Harness: complete pipeline in 4 lines
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_harness_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.foundry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FoundryChatClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureCliCredential&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AzureCliCredential&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Microsoft Agent Framework Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Agent Framework (MAF)&lt;/strong&gt; is Microsoft's open-source, production-grade framework for building AI agents and multi-agent workflows. It is the &lt;strong&gt;direct successor to both AutoGen and Semantic Kernel&lt;/strong&gt; -- combining AutoGen's simple abstractions with Semantic Kernel's enterprise features.&lt;/p&gt;

&lt;p&gt;Key capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language&lt;/strong&gt; -- Full parity between Python and C#/.NET&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-provider&lt;/strong&gt; -- Azure AI Foundry, OpenAI, Azure OpenAI, Anthropic, Gemini, Bedrock, Ollama, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-pattern&lt;/strong&gt; -- Single agents, sequential, concurrent, handoff, group-chat, human-in-the-loop&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-deployment&lt;/strong&gt; -- Local dev, Azure Functions, Foundry Hosted Agents, Durable Task&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-framework&lt;span class="o"&gt;==&lt;/span&gt;1.7.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Agent Runtime Loop
&lt;/h3&gt;

&lt;p&gt;Every agent.run() call executes this 6-phase deterministic loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Message
    |
    v
[ 1. Context Assembly ]  -- History + ContextProviders (Memory, Skills, Mode, Todos)
[ 2. Middleware Pre   ]  -- Telemetry spans open, compaction checks
[ 3. Model Inference  ]  -- FoundryChatClient / OpenAI / Anthropic
[ 4. Tool Dispatch    ]  -- FunctionInvocationLayer (tool call -&amp;gt; execute -&amp;gt; result -&amp;gt; loop)
[ 5. History Persist  ]  -- Saved after every service call
[ 6. Middleware Post  ]  -- Telemetry spans close
    |
    v
Agent Response (streaming or complete)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The harness ensures all six phases are populated and correctly ordered.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Anatomy of create_harness_agent -- 8 Sub-Systems
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_harness_agent&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;# Required: LLM backend
&lt;/span&gt;    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Required: total context window
&lt;/span&gt;    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Required: reserved for response
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MyAgent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                      &lt;span class="c1"&gt;# Optional: agent identity
&lt;/span&gt;    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What this agent does.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;System prompt.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;disable_todo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# Toggle: TodoProvider
&lt;/span&gt;    &lt;span class="n"&gt;disable_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;# Toggle: AgentModeProvider
&lt;/span&gt;    &lt;span class="n"&gt;disable_compaction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Toggle: CompactionProvider
&lt;/span&gt;    &lt;span class="n"&gt;memory_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;# Optional: path for MemoryContextProvider
&lt;/span&gt;    &lt;span class="n"&gt;skills_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;# Optional: path for SkillsProvider
&lt;/span&gt;    &lt;span class="n"&gt;extra_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;                      &lt;span class="c1"&gt;# Optional: additional tool functions
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 1: Function Invocation Layer
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;agentic loop engine&lt;/strong&gt;. When the model returns tool calls, this layer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detects tool call requests&lt;/li&gt;
&lt;li&gt;Dispatches to registered tool functions&lt;/li&gt;
&lt;li&gt;Collects results and injects them into message history&lt;/li&gt;
&lt;li&gt;Re-invokes the model with updated history&lt;/li&gt;
&lt;li&gt;Repeats until the model returns a final text response&lt;/li&gt;
&lt;li&gt;Enforces a max iteration limit
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Search the product database.
    Args:
        query: The search query string.
        limit: Maximum results to return.
    Returns:
        List of matching product records.
    &lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;  &lt;span class="c1"&gt;# Your implementation
&lt;/span&gt;
&lt;span class="c1"&gt;# JSON schema is generated from type annotations automatically
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_database&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 2: History Persistence
&lt;/h3&gt;

&lt;p&gt;Persists conversation history &lt;strong&gt;after every model service call&lt;/strong&gt; -- not just at the end of a turn. If an agent makes three tool calls and crashes on the third, it resumes from the second successful state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Isolates this conversation
&lt;/span&gt;
&lt;span class="n"&gt;result_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Research quantum computing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Summarize your findings&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Turn 2 has full memory of everything from turn 1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 3: Compaction
&lt;/h3&gt;

&lt;p&gt;Prevents context-window overflow with two strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sliding Window&lt;/strong&gt; -- Prunes older messages when token count approaches budget; system prompt always preserved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Result Compaction&lt;/strong&gt; -- Summarizes verbose tool outputs when under token pressure
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# GPT-4o total window
&lt;/span&gt;    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Reserved for response
&lt;/span&gt;    &lt;span class="c1"&gt;# ~112K available for history; compaction fires before the limit
&lt;/span&gt;    &lt;span class="n"&gt;disable_compaction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# Or disable to self-manage
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 4: TodoProvider
&lt;/h3&gt;

&lt;p&gt;Gives the agent an &lt;strong&gt;explicit task-tracking system&lt;/strong&gt; it manages itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;create_todo(title, description)&lt;/code&gt; -- Creates a work item&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;complete_todo(id)&lt;/code&gt; -- Marks done&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;list_todos()&lt;/code&gt; -- Gets pending items
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: 'Research the top 5 AI agent frameworks.'

Agent uses TodoProvider internally:
  -&amp;gt; create_todo('Research AutoGen')
  -&amp;gt; create_todo('Research LangGraph')
  -&amp;gt; create_todo('Research MAF')
  -&amp;gt; create_todo('Write comparison table')

  [Executes each autonomously]

  -&amp;gt; complete_todo('Research AutoGen')
  -&amp;gt; complete_todo('Research LangGraph')
  -&amp;gt; ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 5: AgentModeProvider -- Plan/Execute Workflow
&lt;/h3&gt;

&lt;p&gt;Implements a &lt;strong&gt;two-phase workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 -- Plan Mode (Interactive)&lt;/strong&gt;&lt;br&gt;
The agent asks clarifying questions, builds a todo list, and waits for human approval. No autonomous work happens until approved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 -- Execute Mode (Autonomous)&lt;/strong&gt;&lt;br&gt;
After approval, the agent works through todos independently, streams progress, and returns to plan mode if it hits a blocking ambiguity.&lt;/p&gt;

&lt;p&gt;This is what makes agents &lt;em&gt;trustworthy&lt;/em&gt; -- humans approve the plan before autonomous execution begins.&lt;/p&gt;
&lt;h3&gt;
  
  
  Sub-System 6: MemoryContextProvider
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File-based durable memory&lt;/strong&gt; that survives compaction and persists across sessions. Previously saved memory is re-injected into context at every context assembly cycle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;memory_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./agent_memory&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;memory_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory_dir&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Enable durable memory
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Agent can now call memory_write(key, content) and memory_read(key)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 7: SkillsProvider
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Progressive skill loading&lt;/strong&gt; from a directory of YAML/JSON skill definitions. The agent sees summaries first, then loads full definitions only when needed -- preserving context budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skills_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./skills&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# skills/web_research.yaml, data_analysis.yaml, code_review.yaml...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sub-System 8: OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;Full distributed tracing, auto-configured. Every model call, tool invocation, compaction event, and mode switch emits OTLP spans:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TracerProvider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.exporter.otlp.proto.grpc.trace_exporter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OTLPSpanExporter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.sdk.trace.export&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BatchSpanProcessor&lt;/span&gt;

&lt;span class="c1"&gt;# Point at Azure Monitor, Jaeger, Grafana Tempo, or Datadog
&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTLPSpanExporter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4317&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TracerProvider&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_span_processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BatchSpanProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exporter&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracer_provider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Harness picks up the configured tracer automatically
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. Full Implementation Walkthrough
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Authenticate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-framework
az login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Configure Environment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://your-project.services.ai.azure.com/api/projects/your-project
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;FOUNDRY_MODEL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-4o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Minimal Harness Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# minimal_harness.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_harness_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.foundry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FoundryChatClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureCliCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# MAF does NOT auto-load .env
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AzureCliCredential&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                               &lt;span class="n"&gt;project_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                               &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FOUNDRY_MODEL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What are the latest trends in AI agent frameworks?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Full Research Agent (Official Sample)
&lt;/h3&gt;

&lt;p&gt;The complete &lt;code&gt;harness_research.py&lt;/code&gt; from the &lt;a href="https://github.com/microsoft/agent-framework/tree/main/python/samples/02-agents/harness" rel="noopener noreferrer"&gt;official MAF repository&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# harness_research.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_harness_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.foundry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FoundryChatClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureCliCredential&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="n"&gt;RESEARCH_INSTRUCTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;
## Research Assistant Instructions

You are a research assistant. Research topics thoroughly using web search.
Form good search queries but always verify claims with available tools.

### Research quality
Consult multiple sources. Cross-reference key claims.
Track your sources -- you will need them when presenting results.

### Presenting results
- Use Markdown formatting and clear section headings.
- Cite sources inline: According to [source](URL), ...
- End with a summary of key takeaways.
- Save the final report to file memory so it survives compaction.
&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AzureCliCredential&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ResearchAgent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A research assistant that plans and executes research tasks.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;agent_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RESEARCH_INSTRUCTIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# All 8 sub-systems active with sensible defaults
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Research Assistant (powered by create_harness_agent)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Enter a research topic. Type /exit to quit.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/exit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# agent.run(stream=True) returns AsyncGenerator[AgentUpdate]
&lt;/span&gt;        &lt;span class="c1"&gt;# update.text = streaming text fragment
&lt;/span&gt;        &lt;span class="c1"&gt;# update.contents = list of tool calls, search events, etc.
&lt;/span&gt;        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;function_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  [calling: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_tool_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_tool_result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
                         &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_tool_result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search_tool_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                                &lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  [Web search: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;open_page&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  [Opening: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Customization Patterns
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Pattern 1: Lean Q&amp;amp;A agent
&lt;/span&gt;&lt;span class="n"&gt;lean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4_096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;disable_todo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# No task tracking
&lt;/span&gt;    &lt;span class="n"&gt;disable_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# No plan/execute
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pattern 2: Research agent with persistent memory
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="n"&gt;mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./memory&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;research&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pattern 3: Enterprise agent with custom tools
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_web_search_tool&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_internal_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;'''&lt;/span&gt;&lt;span class="s"&gt;Query internal company database. Args: query: Search query. Returns: records.&lt;/span&gt;&lt;span class="sh"&gt;'''&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;enterprise&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_harness_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_context_window_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16_384&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;extra_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;get_web_search_tool&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;query_internal_db&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. Running and Testing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az login
python harness_research.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Research Assistant (powered by create_harness_agent)
==================================================
Enter a research topic. Type /exit to quit.

You: Research AI agent frameworks in 2026

Assistant:
  [calling tool: switch_to_plan_mode]
  [calling tool: create_todo]
  [calling tool: switch_to_execute_mode]

Here is my research plan. I will:
  1. Survey major frameworks (MAF, AutoGen, LangGraph, CrewAI)
  2. Look up recent benchmarks and community activity
  3. Write a comparison table

Shall I proceed? (yes/no)

You: yes

  [Web search: "Microsoft Agent Framework 2026"]
  [Opening: https://github.com/microsoft/agent-framework]
  [Web search: "LangGraph vs AutoGen comparison 2026"]
  [calling tool: complete_todo]
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Production Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Token Budget Sizing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leave 15-20% headroom&lt;/strong&gt; -- Tokenizers can be imprecise; don't set to the exact model maximum&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Research agents&lt;/strong&gt; -- Use the full window (128K for GPT-4o) for multi-source research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q&amp;amp;A agents&lt;/strong&gt; -- 16K-32K is sufficient and reduces cost significantly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding agents&lt;/strong&gt; -- 64K-128K to handle large file contexts&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Durable History Backends
&lt;/h3&gt;

&lt;p&gt;For multi-user production deployments, &lt;code&gt;InMemoryHistoryProvider&lt;/code&gt; is insufficient -- process restarts lose all history. MAF supports pluggable backends via the &lt;code&gt;IHistoryProvider&lt;/code&gt; interface (CosmosDB, Redis, PostgreSQL, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  Security: Prompt Injection Defense
&lt;/h3&gt;

&lt;p&gt;MAF ships &lt;strong&gt;FIDES (Flow Integrity Deterministic Enforcement System)&lt;/strong&gt; as a middleware -- defense against prompt injection (OWASP LLM Top 10 risk #1). FIDES assigns integrity labels (trusted/untrusted) and confidentiality labels (public/private) to every piece of content. Labels propagate automatically; enforcement is deterministic, not heuristic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying to Azure Foundry Hosted Agents
&lt;/h3&gt;

&lt;p&gt;Foundry Hosted Agents provides containerized Micro VM hosting with built-in identity, autoscaling, managed session state, and versioning. The agent code stays identical -- only the hosting layer changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See the official hosting samples&lt;/span&gt;
&lt;span class="c"&gt;# https://github.com/microsoft/agent-framework/tree/main/python/samples/04-hosting&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Observability in Production
&lt;/h3&gt;

&lt;p&gt;With OpenTelemetry pre-wired by the harness, configure your exporter for production visibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Monitor Application Insights&lt;/strong&gt; -- For teams already in Azure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana + Tempo&lt;/strong&gt; -- For open-source observability stacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog / Dynatrace&lt;/strong&gt; -- For enterprise APM platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key dashboards: token consumption per session, tool call latency distribution, compaction frequency, agent error rates.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;agent harness pattern&lt;/strong&gt; is the difference between an AI demo and a production AI system. It acknowledges a fundamental truth: building the &lt;em&gt;intelligence&lt;/em&gt; of an agent is the easy part. Keeping that intelligence reliable, observable, durable, and safe under production load is the hard part -- and the harness handles that for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Agent Framework's &lt;code&gt;create_harness_agent&lt;/code&gt;&lt;/strong&gt; is the most complete open-source implementation of this pattern available today. In a single factory call it assembles eight battle-tested sub-systems -- function invocation, history persistence, context compaction, todo-based planning, plan/execute mode management, durable file memory, progressive skill loading, and OpenTelemetry instrumentation -- all individually configurable and working in concert.&lt;/p&gt;

&lt;p&gt;Key takeaways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the harness, strip down if needed&lt;/strong&gt; -- It is always easier to disable features you don't need (&lt;code&gt;disable_todo=True&lt;/code&gt;) than to add them later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Plan/Execute pattern is the right default&lt;/strong&gt; for task-oriented agents -- it keeps humans in control while enabling genuine autonomy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telemetry is non-negotiable in production&lt;/strong&gt; -- the harness gives it for free; configure an exporter on day one&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The official GitHub sample is your north star&lt;/strong&gt; -- &lt;code&gt;harness_research.py&lt;/code&gt; is production-quality code worth reading end to end&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Get started now:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-framework
az login
python harness_research.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explore the full framework at &lt;a href="https://github.com/microsoft/agent-framework" rel="noopener noreferrer"&gt;github.com/microsoft/agent-framework&lt;/a&gt;, join the community on &lt;a href="https://discord.gg/b5zjErwbQM" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;, and follow the latest on the &lt;a href="https://devblogs.microsoft.com/agent-framework/" rel="noopener noreferrer"&gt;official blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The infrastructure is handled. Go build the intelligence.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All code samples are sourced from or based on the &lt;a href="https://github.com/microsoft/agent-framework" rel="noopener noreferrer"&gt;official Microsoft Agent Framework repository&lt;/a&gt; (MIT License).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>azure</category>
    </item>
    <item>
      <title>Claude Opus 4.8 &amp; Dynamic Workflows: Orchestrating Hundreds of Parallel AI Agents in Production</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Fri, 29 May 2026 04:30:46 +0000</pubDate>
      <link>https://dev.to/monuminu/claude-opus-48-dynamic-workflows-orchestrating-hundreds-of-parallel-ai-agents-in-production-4eio</link>
      <guid>https://dev.to/monuminu/claude-opus-48-dynamic-workflows-orchestrating-hundreds-of-parallel-ai-agents-in-production-4eio</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; Claude Opus 4.8 launches with Dynamic Workflows — a parallel subagent architecture that lets you orchestrate hundreds of AI agents in a single Claude Code session. Here's the deep technical breakdown every engineer needs today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Claude Opus 4.8 &amp;amp; Dynamic Workflows: Orchestrating Hundreds of Parallel AI Agents in Production
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Published: May 29, 2026 | Focus Keyword: Claude Opus 4.8 Dynamic Workflows | Estimated Read Time: ~14 min&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznbabb375k6j3wh1nij6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznbabb375k6j3wh1nij6.png" alt="Hero: Claude Opus 4.8 Dynamic Workflows — parallel agent orchestration at scale" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The 11-Day Rewrite That Changed Everything&lt;/li&gt;
&lt;li&gt;What's New in Claude Opus 4.8&lt;/li&gt;
&lt;li&gt;Dynamic Workflows: Architecture Deep Dive&lt;/li&gt;
&lt;li&gt;The Messages API: System Mid-Turn Injection&lt;/li&gt;
&lt;li&gt;Building with Dynamic Workflows: A Practical Guide&lt;/li&gt;
&lt;li&gt;Real-World Use Cases for Engineers&lt;/li&gt;
&lt;li&gt;The Alignment Angle: Why Honesty in Agents Matters&lt;/li&gt;
&lt;li&gt;Anthropic's $65B Infrastructure Bet&lt;/li&gt;
&lt;li&gt;Key Takeaways &amp;amp; What's Next&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The 11-Day Rewrite That Changed Everything {#the-11-day-rewrite}
&lt;/h2&gt;

&lt;p&gt;Imagine you inherit a runtime with 750,000 lines of Zig. Your goal: port every single one to Rust, maintain 99.8% test-suite parity, ship it in under two weeks — and do it without a battalion of contractors.&lt;/p&gt;

&lt;p&gt;That's exactly what Jarred Sumner — creator of Bun — did this month using &lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt;. Eleven days. First commit to merge. Hundreds of AI agents running in parallel: one batch mapping Rust lifetimes for every struct field in the Zig codebase, the next writing behavior-identical &lt;code&gt;.rs&lt;/code&gt; files with two independent reviewers per file, and a final fix loop driving the build and test suite to green. An overnight workflow then opened individual PRs for every memory optimization opportunity it found.&lt;/p&gt;

&lt;p&gt;This is not a chatbot story. This is a new programming model — one where &lt;em&gt;parallelism and autonomous verification&lt;/em&gt; are first-class primitives at the AI layer, not just in your CI pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; are the technical event of May 2026, and this post is the deep dive every engineer needs before they open the API docs.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What's New in Claude Opus 4.8 {#whats-new-opus-48}
&lt;/h2&gt;

&lt;p&gt;Claude Opus 4.8 is the direct successor to Opus 4.7, available today via &lt;code&gt;claude-opus-4-8&lt;/code&gt; on the Claude API, Amazon Bedrock, Vertex AI, and Microsoft Azure. Pricing is unchanged from 4.7: &lt;strong&gt;$5 per million input tokens, $25 per million output tokens&lt;/strong&gt; for standard usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benchmark Results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k1wbsxiqvw97pixfak1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k1wbsxiqvw97pixfak1.png" alt="Benchmark comparison: Claude Opus 4.8 vs prior models across key engineering tasks" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how Opus 4.8 stacks up across the benchmarks that matter most to engineers building production AI systems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Online-Mind2Web&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Computer-use / browser-agent accuracy&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;84%&lt;/strong&gt; — meaningful jump over Opus 4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legal Agent All-Pass&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;End-to-end agentic task accuracy&lt;/td&gt;
&lt;td&gt;First model to break &lt;strong&gt;10%&lt;/strong&gt; on the strictest all-pass standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CursorBench&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Coding performance at all effort levels&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Exceeds all prior Opus models at every level&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Super-Agent Benchmark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full end-to-end task completion across domains&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Only model to complete every case&lt;/strong&gt;, beats GPT-5.5 at cost parity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Flaw Detection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unremarked flaws in generated code&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~4× lower rate&lt;/strong&gt; than Opus 4.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The computer-use jump to 84% on Online-Mind2Web is particularly significant — it means Opus 4.8 can reliably pilot web browsers, fill forms, navigate multi-step UIs, and execute research tasks end-to-end with a meaningful reduction in failure rates compared to its predecessor.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Honesty Breakthrough
&lt;/h3&gt;

&lt;p&gt;One of the most impactful — and underreported — improvements in Opus 4.8 is what Anthropic calls &lt;em&gt;calibrated honesty in agentic contexts&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The core problem: AI agents running long tasks tend to report success with more confidence than the evidence warrants. They write a function with a subtle bug, then describe it as "complete and working." They execute a migration step, miss an edge case, and move on without flagging it. Over a long autonomous run, these small overconfidences compound into large, hard-to-debug failures.&lt;/p&gt;

&lt;p&gt;Opus 4.8 is approximately &lt;strong&gt;4× less likely than Opus 4.7&lt;/strong&gt; to allow flaws in code it has written to pass without remark. In Anthropic's alignment evaluations, the model achieves new highs on prosocial traits (supporting user autonomy, acting in the user's best interest) and substantially lower rates of misaligned behavior — on par with Claude Mythos Preview, Anthropic's highest-capability safety-aligned model.&lt;/p&gt;

&lt;p&gt;For engineers building agentic pipelines, this isn't a nice-to-have. A model that proactively flags uncertainty is a model you can actually trust to run unattended overnight.&lt;/p&gt;

&lt;h3&gt;
  
  
  Effort Control
&lt;/h3&gt;

&lt;p&gt;Opus 4.8 introduces a first-class &lt;strong&gt;effort control system&lt;/strong&gt; — an API parameter and UI control that lets you tune the trade-off between quality, speed, and token consumption:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Effort Level&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;low&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fastest, lowest token cost&lt;/td&gt;
&lt;td&gt;Quick lookups, latency-sensitive tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;high&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Best quality/cost balance&lt;/td&gt;
&lt;td&gt;General-purpose use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;xhigh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;More thinking passes, better results&lt;/td&gt;
&lt;td&gt;Complex reasoning, long-running tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum reasoning depth&lt;/td&gt;
&lt;td&gt;Highest-stakes, cost-insensitive tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In Claude Code, the &lt;code&gt;ultracode&lt;/code&gt; mode sets effort to &lt;code&gt;xhigh&lt;/code&gt; automatically and enables Dynamic Workflows when the task warrants it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fast Mode Pricing
&lt;/h3&gt;

&lt;p&gt;Fast mode (2.5× speed) is now &lt;strong&gt;3× cheaper&lt;/strong&gt; than for prior Opus models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fast mode input&lt;/strong&gt;: $10 per million tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast mode output&lt;/strong&gt;: $50 per million tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For latency-sensitive pipelines where you need Opus-class intelligence quickly, this is a material cost reduction worth re-evaluating your model selection decisions.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Dynamic Workflows: Architecture Deep Dive {#dynamic-workflows-architecture}
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; are the headline technical feature of this release. Let's break down exactly how they work under the hood.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkisjbo37gspdamis675.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkisjbo37gspdamis675.png" alt="Architecture diagram: Dynamic Workflows fan-out orchestration lifecycle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Dynamic Workflows Are Not
&lt;/h3&gt;

&lt;p&gt;Dynamic Workflows are &lt;strong&gt;not&lt;/strong&gt; just "Claude spawning a few tool calls." They're not a simple ReAct loop, a chain-of-thought with API calls, or a multi-step prompt chain. They're a fundamentally different execution model.&lt;/p&gt;

&lt;p&gt;In a standard Claude Code session (or any single-agent LLM interaction), all work is &lt;em&gt;sequential&lt;/em&gt;. Claude thinks, acts, observes, thinks again. Even with tool calls, there's one thread of execution. This is fine for most tasks, but it breaks down at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Breadth&lt;/strong&gt;: Scanning an entire codebase across hundreds of files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independence&lt;/strong&gt;: Tasks where subtasks genuinely don't depend on each other's real-time output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adversarial verification&lt;/strong&gt;: Needing to stress-test your own output before returning it&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Dynamic Workflow Lifecycle
&lt;/h3&gt;

&lt;p&gt;When you trigger a Dynamic Workflow — either by asking Claude to "create a workflow" or via the &lt;code&gt;ultracode&lt;/code&gt; effort setting — here's what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 1 — Planning&lt;/strong&gt;&lt;br&gt;
Claude analyzes your prompt, decomposes it into subtasks, and writes an orchestration script &lt;em&gt;dynamically&lt;/em&gt;. This script defines the fan-out strategy: how many subagents to spawn, what each one is responsible for, and what the verification criteria are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 2 — Fan-Out&lt;/strong&gt;&lt;br&gt;
The orchestration script launches N parallel subagents. Each subagent gets its own isolated context window — it doesn't share a conversation thread with its siblings. This is critical: it means subagents can work on truly independent slices of the problem without token-window contention or result cross-contamination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 3 — Independent Verification&lt;/strong&gt;&lt;br&gt;
Before results from any subagent are folded in, separate verification agents check the work. In adversarial mode, these verifiers are explicitly trying to &lt;em&gt;break&lt;/em&gt; what the working agents produced. This mirrors real engineering review practice: the reviewer's job is to find problems, not to rubber-stamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 4 — Convergence&lt;/strong&gt;&lt;br&gt;
The orchestrator collects verified results, resolves conflicts (where multiple agents found contradictory answers), and synthesizes a single, coherent output back to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase 5 — Persistence &amp;amp; Resumption&lt;/strong&gt;&lt;br&gt;
Progress is checkpointed continuously. If a workflow is interrupted — network failure, cost limit, explicit cancellation — it resumes from the last checkpoint rather than starting over. For workflows running hours or days, this is operationally essential.&lt;/p&gt;
&lt;h3&gt;
  
  
  Dynamic Workflows vs. Single-Agent Claude Code
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Claude Code&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Dynamic Workflows&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Execution model&lt;/td&gt;
&lt;td&gt;Sequential, single-thread&lt;/td&gt;
&lt;td&gt;Parallel, multi-agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task scope&lt;/td&gt;
&lt;td&gt;Per-file or per-function&lt;/td&gt;
&lt;td&gt;Codebase-scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verification&lt;/td&gt;
&lt;td&gt;Self-review only&lt;/td&gt;
&lt;td&gt;Independent adversarial agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Typical duration&lt;/td&gt;
&lt;td&gt;Seconds to minutes&lt;/td&gt;
&lt;td&gt;Minutes to hours to days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token cost&lt;/td&gt;
&lt;td&gt;Standard&lt;/td&gt;
&lt;td&gt;Substantially higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Targeted changes, quick tasks&lt;/td&gt;
&lt;td&gt;Migrations, audits, large refactors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interruption handling&lt;/td&gt;
&lt;td&gt;Restart from scratch&lt;/td&gt;
&lt;td&gt;Resume from checkpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;When should you use Dynamic Workflows?&lt;/strong&gt; The key signal is &lt;em&gt;breadth and independence&lt;/em&gt;. If your task can be meaningfully decomposed into N subtasks that don't depend on each other's real-time output, Dynamic Workflows will outperform a sequential single-agent run — often by a factor that justifies the higher token cost.&lt;/p&gt;

&lt;p&gt;If your task is tightly sequential — where step 2 always needs the full output of step 1 — standard Claude Code is more cost-efficient and likely equally effective.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. The Messages API: System Mid-Turn Injection {#messages-api-system-injection}
&lt;/h2&gt;

&lt;p&gt;Alongside Dynamic Workflows, Anthropic shipped a quieter but developer-critical Messages API change: &lt;strong&gt;system entries can now appear inside the &lt;code&gt;messages&lt;/code&gt; array&lt;/strong&gt;, not just at conversation start.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Previously, system prompts were fixed at conversation start. If you needed to update Claude's instructions mid-task — change permissions, update a token budget, inject new environment context — you had two bad options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Route the update through a user turn&lt;/strong&gt;: Breaks conversational coherence and can confuse the model's task tracking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start a new conversation&lt;/strong&gt;: Destroys the prompt cache, discards all context, restarts the task&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both options are expensive and fragile for production agentic harnesses. The new capability resolves this cleanly.&lt;/p&gt;
&lt;h3&gt;
  
  
  The New API Shape
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Long-running agent session with mid-turn system injection
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Begin the codebase security audit. Start with the auth module.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ... several turns of agent work complete ...
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auth module audit complete. Found 3 medium-severity issues in JWT &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validation and 1 high-severity issue in session management. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ready to proceed to the next module.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# ✅ NEW in Opus 4.8: Inject an updated system instruction mid-conversation
&lt;/span&gt;    &lt;span class="c1"&gt;# This does NOT break the prompt cache for preceding turns
&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PERMISSION UPDATE: You now have read access to the payments module. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TOKEN BUDGET REMAINING: 50,000 tokens. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prioritize critical and high severity findings only. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skip informational findings for this phase.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proceed to the payments module audit.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
    &lt;span class="c1"&gt;# Note: The initial system prompt (if any) still goes in the top-level 
&lt;/span&gt;    &lt;span class="c1"&gt;# `system` parameter. Mid-conversation updates go in the messages array.
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Technical Properties
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prompt cache is preserved.&lt;/strong&gt; The system injection doesn't invalidate the cache for all preceding conversation content. On long agentic runs, prompt cache hits can reduce costs by 70–90%. Without cache preservation, this feature would be economically unviable for production workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean separation of concerns.&lt;/strong&gt; Your harness can maintain independent control planes for permissions, token budgets, and environment context — all updatable mid-run without touching the conversation flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production applications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic permission escalation&lt;/strong&gt;: Grant module access only when the agent reaches that task phase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token budget management&lt;/strong&gt;: Inject remaining budget context as a soft operational constraint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment state updates&lt;/strong&gt;: Notify the agent when CI completes, a deployment finishes, or test results change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety guardrails&lt;/strong&gt;: Insert runtime restrictions if monitoring detects the agent approaching a sensitive boundary&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  5. Building with Dynamic Workflows: A Practical Guide {#practical-guide}
&lt;/h2&gt;

&lt;p&gt;Here's how to start using &lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; in your own projects today.&lt;/p&gt;
&lt;h3&gt;
  
  
  Option 1: Claude Code CLI / Desktop (Interactive)
&lt;/h3&gt;

&lt;p&gt;If you're on a &lt;strong&gt;Max, Team, or Enterprise&lt;/strong&gt; plan (Enterprise requires admin to enable in settings):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install or update Claude Code&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code

&lt;span class="c"&gt;# Navigate to your project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;my-large-project

&lt;span class="c"&gt;# Trigger a dynamic workflow directly from the CLI&lt;/span&gt;
&lt;span class="c"&gt;# The --effort xhigh flag enables ultracode mode and allows Claude &lt;/span&gt;
&lt;span class="c"&gt;# to decide when to spin up a full multi-agent workflow&lt;/span&gt;
claude &lt;span class="nt"&gt;--effort&lt;/span&gt; xhigh &lt;span class="s2"&gt;"Audit all REST API endpoints for missing authentication, 
  SQL injection vectors, and broken access control. Fan out across all service 
  directories in parallel and verify every finding before reporting."&lt;/span&gt;

&lt;span class="c"&gt;# Alternatively, start a session and ask Claude to create a workflow:&lt;/span&gt;
&lt;span class="c"&gt;# claude&lt;/span&gt;
&lt;span class="c"&gt;# &amp;gt; Create a workflow to perform a comprehensive security audit of /src/api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time a workflow triggers in a session, Claude Code shows you the full orchestration plan and asks for confirmation before spending tokens. You can inspect the plan and cancel if the scope looks wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: Claude API (Programmatic)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_dynamic_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;effort_level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;xhigh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# low | high | xhigh | max
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Run a Claude Opus 4.8 task with elevated effort, allowing the model
    to leverage Dynamic Workflow orchestration for large-scale tasks.

    Args:
        task_description: Natural language description of the engineering task.
        system_context: System prompt with environment context and permissions.
        max_tokens: Max output tokens for the response.
        effort_level: Effort setting — use &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;xhigh&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; to enable workflow orchestration.

    Returns:
        dict with result text and token usage/cost breakdown.

    Note: Check the latest Anthropic SDK docs for the exact effort parameter
    name and structure — the API surface for effort control may evolve.
    See: https://docs.anthropic.com/en/api/messages
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Extended thinking enables deeper reasoning passes consistent with
&lt;/span&gt;        &lt;span class="c1"&gt;# higher effort levels. Budget tokens control how much the model
&lt;/span&gt;        &lt;span class="c1"&gt;# "thinks" before responding.
&lt;/span&gt;        &lt;span class="n"&gt;thinking&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;budget_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;   &lt;span class="c1"&gt;# Increase for xhigh/max effort tasks
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Where the task benefits from parallel exploration, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create a dynamic workflow to fan out subtasks across &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiple independent agents. Verify all findings before reporting.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt;
    &lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Last content block is the answer
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;thinking&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; 
            &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;output_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# Example: Codebase-wide security audit
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_dynamic_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Perform a comprehensive security audit of all REST API endpoints in /src/api. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check for: missing authentication middleware, SQL injection vectors, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improper input validation, insecure direct object references (IDOR), &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and broken access control. Produce a prioritized remediation list &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;with severity (Critical/High/Medium/Low), affected file, line number, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and a concrete fix recommendation for each finding.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;system_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You have read access to the entire /src directory of a Node.js/Express API. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The codebase uses Sequelize for ORM, JWT for authentication, and Jest for testing. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Flag any uncertainty about a finding rather than reporting it as confirmed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Audit complete. Cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;estimated_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in / &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;usage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output_tokens&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; out&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- FINDINGS ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Token Consumption: What to Expect and How to Budget
&lt;/h3&gt;

&lt;p&gt;Dynamic Workflows consume meaningfully more tokens than standard Claude Code sessions. Here's a practical cost estimation model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_workflow_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;num_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg_file_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verification_multiplier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;  &lt;span class="c1"&gt;# ~50% overhead for verification agents
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Order-of-magnitude cost estimator for a Dynamic Workflow run.

    WARNING: Actual costs vary significantly based on task complexity,
    code density, and verification depth. Always run a scoped pilot first.

    Pricing as of May 2026 (verify at anthropic.com/api before budgeting):
      - Standard: $5/M input, $25/M output
      - Fast mode: $10/M input, $50/M output
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Each file processed by ~1 working agent + ~1 verification agent
&lt;/span&gt;    &lt;span class="n"&gt;estimated_input_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_files&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;avg_file_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;verification_multiplier&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# ~500 tokens of structured output per analyzed file
&lt;/span&gt;    &lt;span class="n"&gt;estimated_output_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_files&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;

    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimated_input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimated_output_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_input_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;estimated_input_tokens&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_output_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;estimated_output_tokens&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;estimated_cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommendation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Start with a 10–20 file subset to calibrate actual usage &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;before running on the full codebase.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# 500-file codebase: rough estimate
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;estimate_workflow_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="c1"&gt;# → ~$22–$30 for a 500-file security audit (verify before budgeting)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ROI calculation is straightforward: if a Dynamic Workflow audit that costs $30 in API tokens replaces 2 days of senior engineer time, it pays for itself in minutes. The key is calibrating scope with a pilot run first.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Real-World Use Cases for Engineers {#real-world-use-cases}
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdifqaq1bogd9pk1cf14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdifqaq1bogd9pk1cf14.png" alt="Before vs. After: Single sequential agent vs. Dynamic Workflow parallel fleet" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the highest-leverage use cases where &lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; deliver transformative results:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Codebase-Wide Security Audits
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; Comprehensive security audits require checking every endpoint, every query, every authentication check — across potentially thousands of files. A sequential single-agent scan takes hours and misses cross-file patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt; Spawn parallel agents segmented by service boundary. Each agent produces a findings report. Adversarial verification agents attempt to reproduce each finding, filtering false positives. The orchestrator correlates patterns across services (e.g., the same vulnerable input-handling pattern appearing in 12 different files).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence:&lt;/strong&gt; Klarna used Dynamic Workflows to identify dead code and cleanup opportunities across large codebases that traditional static analysis missed entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Large-Scale Migrations and Language Ports
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; Framework upgrades, API deprecations, language ports — tasks that touch hundreds of files with mechanical but non-trivial transformations where getting one wrong cascades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt; Phase 1 agents map all transformation sites and generate a dependency graph. Phase 2 agents execute transformations in parallel, maintaining behavior-identical semantics. Phase 3 agents run the existing test suite against each transformed file. A fix-loop phase handles anything that failed CI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evidence:&lt;/strong&gt; The Bun Zig-to-Rust port — 750,000 lines, 11 days, 99.8% test parity.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Parallel Test Generation and Coverage Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; Generating meaningful test suites for an undercovered codebase is time-consuming and requires deep understanding of each function's contracts and edge cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt; Assign one agent per module or class. Each agent reads the implementation, infers behavioral contracts, and writes a corresponding test suite. A separate verification agent reviews each suite for correctness and quality. A coverage analysis agent identifies remaining gaps and spawns targeted gap-fill agents.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Adversarial Code Review at Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; Thorough code review is bottlenecked on senior engineer time. AI single-agent review catches obvious issues but lacks the cross-file analysis depth of a senior reviewer with full codebase context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt; Working agents review for correctness and logic. Adversarial agents specifically attempt to construct inputs that break each function. Integration agents verify behavioral contracts across module boundaries. All findings are deduplicated and prioritized by severity before they reach the developer.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Profiler-Guided Performance Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The challenge:&lt;/strong&gt; Performance optimization requires understanding both the hot paths (from profiler data) and all code sites contributing to each hot path — a breadth problem well-suited to parallelization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow:&lt;/strong&gt; Parse profiler output to identify the top N hotspots. Spawn one analysis agent per hotspot. Each agent traces contributing code paths, identifies optimization opportunities (algorithmic, caching, I/O), and estimates performance impact. A synthesis agent ranks recommendations by estimated gain-per-implementation-complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The Alignment Angle: Why Honesty in Agents Matters {#alignment-angle}
&lt;/h2&gt;

&lt;p&gt;There's a deeper story in this release that goes beyond benchmarks and features: Anthropic's approach to &lt;em&gt;alignment in agentic contexts&lt;/em&gt; is maturing in ways that directly affect production reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compound Overconfidence Problem
&lt;/h3&gt;

&lt;p&gt;In a standard chatbot interaction, a confident but wrong answer is an annoyance. In a long-running autonomous agent pipeline, confident-but-wrong &lt;em&gt;compounds&lt;/em&gt;. An agent that silently skips a failing edge case in step 3 will build subsequent steps on a flawed foundation. By step 47, you have a coherent-looking but subtly broken output that's extremely difficult to audit after the fact.&lt;/p&gt;

&lt;p&gt;Opus 4.8's honesty improvements directly target this failure mode. The model is trained to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flag uncertainty proactively&lt;/strong&gt;: If Claude isn't confident a transformation is correct, it says so explicitly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surface anomalies in inputs and outputs&lt;/strong&gt;: Testers report Opus 4.8 "proactively flags issues with the inputs and outputs of an analysis, something other models routinely missed"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid overconfident completion signals&lt;/strong&gt;: The model won't report a task as "done" while verification passes are still outstanding or uncertain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Designing Pipelines That Leverage Honesty Signals
&lt;/h3&gt;

&lt;p&gt;If you're building production agentic systems, treat the model's uncertainty flags as structured monitoring data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_agent_confidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Opus 4.8 proactively flags uncertainty. Parse these signals and
    route them to your observability stack or human review queue.

    Designed for use in agentic pipeline monitoring middleware.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Key phrases Opus 4.8 uses to signal uncertainty or required verification
&lt;/span&gt;    &lt;span class="n"&gt;uncertainty_patterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i[&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="s"&gt;]m not (certain|sure|confident)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;you should verify&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;this may not be (correct|accurate|complete)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(flagged|flag) a potential issue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requires? (manual |human )?review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i cannot (confirm|verify)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;edge case i[&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="s"&gt;]m unsure about&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(uncertain|unclear) (about|whether|if)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;please (double.check|verify|confirm)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;uncertainty_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finditer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;end&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;confidence_level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;confidence_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uncertainty_flags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requires_human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag_details&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Emit this as a metric to your APM/observability system
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monitoring_tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;confidence_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# Usage in a pipeline step
&lt;/span&gt;&lt;span class="n"&gt;agent_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...migration complete. Note: I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m not certain the lazy-loading &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
               &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;behavior is preserved for the edge case in UserRepository.find() &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
               &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;— you should verify this against the original test suite...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_agent_confidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requires_human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️  Agent flagged &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;uncertainty_flags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; uncertainty signal(s).&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Routing to human review queue...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# your_review_queue.push(agent_output, analysis)
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Agent output high-confidence. Proceeding automatically.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The architectural principle: &lt;strong&gt;design your pipeline to treat Opus 4.8's uncertainty signals as first-class monitoring events&lt;/strong&gt;. If the model flags something, route it to human review. If it doesn't, you have a meaningfully higher degree of confidence in the output — because this model is specifically trained to speak up when it isn't sure.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Anthropic's $65B Infrastructure Bet {#anthropic-infrastructure-bet}
&lt;/h2&gt;

&lt;p&gt;No technical analysis of &lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; is complete without understanding the infrastructure context. Today, Anthropic announced a &lt;strong&gt;$65 billion Series H round at a $965 billion post-money valuation&lt;/strong&gt;. Annualized run-rate revenue crossed &lt;strong&gt;$47 billion&lt;/strong&gt; earlier this month.&lt;/p&gt;

&lt;p&gt;For engineers, these numbers represent something concrete: &lt;strong&gt;sustained, massive compute investment&lt;/strong&gt; that makes parallel multi-agent workflows economically viable as a product.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compute Agreements That Enable Parallel Agents
&lt;/h3&gt;

&lt;p&gt;Running hundreds of parallel subagents in a single workflow session requires extraordinary compute throughput. Anthropic has secured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;5 gigawatts&lt;/strong&gt; of new capacity agreements with Amazon (AWS remains primary cloud and training partner)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 gigawatts&lt;/strong&gt; of next-generation TPU capacity with Google and Broadcom&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU capacity on SpaceX Colossus 1 and Colossus 2&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;10 gigawatts of dedicated AI compute is what makes "spin up 200 subagents on a single task" a product feature rather than a research demo. The infrastructure bets being made now are the reason Dynamic Workflows can exist at a price point engineers can actually afford.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Cloud API Availability
&lt;/h3&gt;

&lt;p&gt;For engineering teams with existing cloud contracts, Claude Opus 4.8 is available across all major platforms today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Availability&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Claude API&lt;/strong&gt; (direct)&lt;/td&gt;
&lt;td&gt;✅ Live&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-opus-4-8&lt;/code&gt;, full feature set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Live&lt;/td&gt;
&lt;td&gt;Primary cloud partner&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Vertex AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Live&lt;/td&gt;
&lt;td&gt;TPU-backed inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microsoft Azure / Foundry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Live&lt;/td&gt;
&lt;td&gt;Azure enterprise customers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude is the first frontier model simultaneously generally available on all three major cloud platforms. For enterprise engineering teams, this eliminates the migration barrier that previously made Claude adoption complex for organizations with AWS-only or GCP-only procurement policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Coming: Project Glasswing &amp;amp; Mythos
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Project Glasswing&lt;/strong&gt; is Anthropic's limited preview program for a new model class — &lt;strong&gt;Claude Mythos Preview&lt;/strong&gt; — described as having "even higher intelligence than Opus." Currently, a small number of organizations are using Mythos Preview for cybersecurity work. General availability requires additional cyber safeguards that Anthropic says are in rapid development, expected "in the coming weeks."&lt;/p&gt;

&lt;p&gt;The trajectory is clear: Opus 4.8 + Dynamic Workflows is the production standard to build on today. Mythos is the capability ceiling you'll be upgrading to shortly after.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Key Takeaways &amp;amp; What's Next {#key-takeaways}
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.8 Dynamic Workflows&lt;/strong&gt; represent a genuine architectural shift in how AI-assisted engineering works — not an incremental feature update.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Condensed Version
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;On the model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Available now as &lt;code&gt;claude-opus-4-8&lt;/code&gt;, same pricing as 4.7 ($5/$25 per million tokens standard)&lt;/li&gt;
&lt;li&gt;~4× fewer unremarked code flaws vs. Opus 4.7 — design your pipeline to treat its uncertainty signals as production monitoring events&lt;/li&gt;
&lt;li&gt;84% on Online-Mind2Web (best-in-class computer use), beats GPT-5.5 at cost parity on agentic benchmarks&lt;/li&gt;
&lt;li&gt;New effort control: &lt;code&gt;low&lt;/code&gt; → &lt;code&gt;high&lt;/code&gt; (default) → &lt;code&gt;xhigh&lt;/code&gt; → &lt;code&gt;max&lt;/code&gt;; fast mode is now 3× cheaper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On Dynamic Workflows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestrate hundreds of parallel subagents in a single session&lt;/li&gt;
&lt;li&gt;Plan → Fan-Out → Independent Verification → Convergence lifecycle, with checkpointed persistence&lt;/li&gt;
&lt;li&gt;Available in Claude Code (CLI, Desktop, VS Code) and the Claude API today&lt;/li&gt;
&lt;li&gt;Start scoped: pilot on a module before running on a full codebase to calibrate token costs&lt;/li&gt;
&lt;li&gt;Best for breadth-first tasks — security audits, large migrations, codebase-wide analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On the Messages API:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System entries injectable mid-conversation without breaking prompt cache&lt;/li&gt;
&lt;li&gt;Critical ergonomic improvement for production agentic harnesses with dynamic permissions and token budgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;On infrastructure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$65B Series H, $47B ARR, 10 GW of compute secured&lt;/li&gt;
&lt;li&gt;Claude on AWS, GCP, Azure, Bedrock, Vertex, and Azure Foundry — no more cloud-lock friction&lt;/li&gt;
&lt;li&gt;Claude Mythos Preview (next capability tier) coming in weeks via Project Glasswing&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚀 Start Building Today
&lt;/h3&gt;

&lt;p&gt;Dynamic Workflows are on by default for Max and Team plans. Try this right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install/update Claude Code&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @anthropic-ai/claude-code

&lt;span class="c"&gt;# Run your first Dynamic Workflow&lt;/span&gt;
claude &lt;span class="nt"&gt;--effort&lt;/span&gt; xhigh &lt;span class="s2"&gt;"Create a workflow to audit all authentication logic 
  in ./src/auth for security vulnerabilities. Fan out across all files in 
  parallel, run adversarial verification on every finding, and produce a 
  prioritized remediation report."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or point the Claude API at &lt;code&gt;claude-opus-4-8&lt;/code&gt; with extended thinking enabled, ask Claude to "create a workflow," and let it plan its own parallelization strategy.&lt;/p&gt;

&lt;p&gt;The era of single-threaded AI assistance on engineering tasks is over. The only question is how quickly your team makes the shift to the parallel paradigm.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;All benchmark figures and feature details sourced from official Anthropic documentation published May 28–29, 2026. Token cost estimates are approximations — always verify current pricing at &lt;a href="https://www.anthropic.com/api" rel="noopener noreferrer"&gt;anthropic.com/api&lt;/a&gt; before production budgeting. API parameter names and beta headers for effort control may evolve — check the &lt;a href="https://docs.anthropic.com/en/api/messages" rel="noopener noreferrer"&gt;official Claude API docs&lt;/a&gt; for the latest SDK surface.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
      <category>agents</category>
    </item>
    <item>
      <title>Inside the Agentic Loop: A Deep Technical Dive into AI Coding Agents, Claude Code, and the Architecture Reshaping Software Engineering in 2026</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Thu, 28 May 2026 07:00:37 +0000</pubDate>
      <link>https://dev.to/monuminu/inside-the-agentic-loop-a-deep-technical-dive-into-ai-coding-agents-claude-code-and-the-4pnf</link>
      <guid>https://dev.to/monuminu/inside-the-agentic-loop-a-deep-technical-dive-into-ai-coding-agents-claude-code-and-the-4pnf</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Meta Description:&lt;/strong&gt; A deep technical breakdown of how AI coding agents like Claude Code and OpenAI Codex work under the hood — covering the agentic loop architecture, context window management, subagent orchestration, CI/CD integration, and what Uber's 25%-commit milestone reveals about where software engineering is headed in 2026.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln41ma603u758ik3efr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fln41ma603u758ik3efr5.png" alt="Hero Banner - AI Coding Agents" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Numbers That Changed Everything&lt;/li&gt;
&lt;li&gt;What Actually Makes an AI Coding Agent Different&lt;/li&gt;
&lt;li&gt;Inside the Agentic Loop: The Core Architecture&lt;/li&gt;
&lt;li&gt;The Tool Ecosystem: How Agents Act on Your Codebase&lt;/li&gt;
&lt;li&gt;Context Window: The Most Critical Engineering Constraint&lt;/li&gt;
&lt;li&gt;CLAUDE.md — The New Developer Configuration Primitive&lt;/li&gt;
&lt;li&gt;Subagent Orchestration: Multi-Agent Patterns&lt;/li&gt;
&lt;li&gt;Integrating AI Coding Agents in CI/CD Pipelines&lt;/li&gt;
&lt;li&gt;The Token Economy: Costs, Enterprise Pricing, and Optimization&lt;/li&gt;
&lt;li&gt;Two Inflection Points That Changed Everything&lt;/li&gt;
&lt;li&gt;Engineering for Agent-First Workflows&lt;/li&gt;
&lt;li&gt;Future Outlook: Where AI Coding Agents Are Headed&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Numbers That Changed Everything
&lt;/h2&gt;

&lt;p&gt;Twenty-five percent.&lt;/p&gt;

&lt;p&gt;That's the share of all code commits at Uber that came through Claude Code in Q1 2026. Not a pilot program. Not a hackathon experiment. Regular, production-bound commits — at one of the most complex, multi-service, polyglot engineering organizations on the planet.&lt;/p&gt;

&lt;p&gt;Uber's engineering teams burned through their entire annual AI budget in a matter of months. Anthropic is reportedly on track to hit &lt;strong&gt;$10.9 billion in Q2 2026&lt;/strong&gt; — potentially its first-ever profitable quarter — driven overwhelmingly by enterprise coding agent usage. The SpaceX S-1 filed in May 2026 quietly disclosed that Anthropic had signed a contract to pay &lt;strong&gt;$1.25 billion per month&lt;/strong&gt; for compute capacity on Colossus I and Colossus II through May 2029, primarily for &lt;em&gt;inference&lt;/em&gt;, not model training.&lt;/p&gt;

&lt;p&gt;That last number is extraordinary. When a company is spending over a billion dollars per month just to serve responses to its users, the underlying technology has crossed from "promising technology" to &lt;strong&gt;critical infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For developers, this convergence of adoption signals and infrastructure spend is a loud signal: AI coding agents are not the future. They are the present. And the engineering teams that understand how they work — really work, at the architecture level — are the ones who will extract disproportionate value from them.&lt;/p&gt;

&lt;p&gt;This article is your deep technical map of that architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Makes an AI Coding Agent Different
&lt;/h2&gt;

&lt;p&gt;Before we go deep, we need to establish the conceptual boundary that separates an AI coding &lt;em&gt;agent&lt;/em&gt; from an AI coding &lt;em&gt;assistant&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A coding &lt;em&gt;assistant&lt;/em&gt; (think early-generation Copilot, or ChatGPT in a code context) operates in a strict request-response loop: you give it a prompt, it returns text. It has no memory of previous exchanges in a new session. It can see only what you paste into the window. It cannot run code, cannot read your filesystem, cannot check if its suggestions actually compile. It is, fundamentally, a very smart autocomplete.&lt;/p&gt;

&lt;p&gt;A coding &lt;em&gt;agent&lt;/em&gt; is an entirely different animal. The critical difference is &lt;strong&gt;tool use + autonomous looping&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt; your actual files — not just the snippet you paste&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; shell commands, test suites, build systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; the web and documentation in real time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edit&lt;/strong&gt; files, stage changes, create commits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain&lt;/strong&gt; all of the above over dozens of steps &lt;em&gt;without waiting for your confirmation&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Course-correct&lt;/strong&gt; when a step fails, just as a human engineer would&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The philosophical shift is from "AI that answers questions about code" to "AI that &lt;em&gt;does engineering work&lt;/em&gt;." The implications for how you interact with it, how you measure its output, and how you manage its costs are profound.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the Agentic Loop: The Core Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xvrp8d90sc68fwkk1iw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xvrp8d90sc68fwkk1iw.png" alt="Agentic Loop Diagram" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's (and by analogy, OpenAI Codex's) core operation is governed by what the Anthropic engineering team calls the &lt;strong&gt;agentic loop&lt;/strong&gt; — a three-phase cycle that repeats until the task is complete or you interrupt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Gather Context
&lt;/h3&gt;

&lt;p&gt;Before touching any code, Claude's first move is to build a mental model of your codebase. This involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reading &lt;code&gt;CLAUDE.md&lt;/code&gt; and &lt;code&gt;MEMORY.md&lt;/code&gt; (persistent instruction files)&lt;/li&gt;
&lt;li&gt;Traversing directory structures to understand project layout&lt;/li&gt;
&lt;li&gt;Searching for files relevant to the task using glob and regex patterns&lt;/li&gt;
&lt;li&gt;Reading the git history to understand the intent behind existing code&lt;/li&gt;
&lt;li&gt;Fetching documentation URLs you've referenced in your prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This phase is dominated by read-only tool calls. The model is building its working memory by accumulating tokens into its context window — a critical resource we'll address in depth shortly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Take Action
&lt;/h3&gt;

&lt;p&gt;Once context is sufficient, Claude begins executing. Depending on the task, this might mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing or editing source files&lt;/li&gt;
&lt;li&gt;Running the test suite to establish a baseline&lt;/li&gt;
&lt;li&gt;Making targeted edits&lt;/li&gt;
&lt;li&gt;Running build commands to check for compile errors&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;git diff&lt;/code&gt; to review what changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Actions are not pre-planned in a static sequence. The model decides what to do &lt;em&gt;next&lt;/em&gt; based on the output of the previous step. A failing test output feeds back into the loop, triggering another round of reading, editing, and re-running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3: Verify Results
&lt;/h3&gt;

&lt;p&gt;Verification is what separates a good Claude Code session from a frustrating one. When you give Claude a verifiable success criterion — "the test suite passes," "the linter emits zero warnings," "the server returns 200 on &lt;code&gt;/health&lt;/code&gt;" — it can run that verification autonomously and loop back to fix any remaining failures.&lt;/p&gt;

&lt;p&gt;Without a verification criterion, the agent is flying blind. It produces output that &lt;em&gt;looks&lt;/em&gt; correct but may not &lt;em&gt;be&lt;/em&gt; correct, and you become the sole feedback mechanism.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; The agentic loop is not a linear pipeline. It is a control flow that can nest, branch, and retry — exactly like an engineer working through a problem. Your role is to define the &lt;em&gt;goal&lt;/em&gt; and the &lt;em&gt;acceptance criteria&lt;/em&gt;, not to specify every step.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Tool Ecosystem: How Agents Act on Your Codebase
&lt;/h2&gt;

&lt;p&gt;The agentic loop is powered by a set of tools that give the model agency beyond text generation. Claude Code's built-in tool categories break down as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What It Enables&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Operations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Read files, write/edit code, create new files, rename and reorganize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Find files by pattern (glob), search content with regex, explore large codebases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run shell commands, start dev servers, invoke test runners, use git&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Web&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search the web, fetch API documentation, look up error messages in real time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Intelligence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;See type errors and warnings post-edit, jump to definitions, find all references&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Beyond the core loop, Claude Code supports an &lt;strong&gt;extension layer&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MCP (Model Context Protocol) servers&lt;/strong&gt; — connect to external services (databases, issue trackers, observability platforms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills&lt;/strong&gt; — custom, on-demand instruction sets for domain-specific workflows (similar to macros)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hooks&lt;/strong&gt; — shell scripts that run before/after specific tool calls (e.g., auto-format after every file write)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subagents&lt;/strong&gt; — independent context windows for delegated sub-tasks (covered below)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The design philosophy is that Claude's &lt;em&gt;base capability&lt;/em&gt; is narrow by default — it can only do what tools explicitly allow — and you extend it deliberately. This is both a product decision and a cost-control mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context Window: The Most Critical Engineering Constraint
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50xinlcev4sxm9webo6c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50xinlcev4sxm9webo6c.png" alt="Context Window Visualization" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is one thing experienced Claude Code users will tell you, it is this: &lt;strong&gt;the context window is your most precious resource, and it burns faster than you think.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The context window holds everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your initial prompt&lt;/li&gt;
&lt;li&gt;Every assistant response&lt;/li&gt;
&lt;li&gt;Every file Claude read (full content)&lt;/li&gt;
&lt;li&gt;Every shell command and its complete output&lt;/li&gt;
&lt;li&gt;The entire conversation history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single debugging session that reads 10 medium-sized source files, runs the test suite three times, and explores a few dependency implementations can consume &lt;strong&gt;50,000–100,000 tokens without breaking a sweat&lt;/strong&gt;. As the context fills, measurable performance degradation occurs — the model starts losing grip on earlier instructions, makes inconsistent edits, and may contradict earlier reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Context Management Strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Start fresh sessions for distinct tasks.&lt;/strong&gt; Don't try to refactor a module and implement a new feature in the same session. Context pollution from the refactor will compromise the feature work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Use subagents for exploration.&lt;/strong&gt; When you need Claude to research the codebase before acting, delegate that to an Explore subagent (which runs in its own context) so search results don't flood your main conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Keep CLAUDE.md concise.&lt;/strong&gt; This file loads at the start of every session. Every line consumes tokens on every task. The Anthropic team's rule: if removing a line wouldn't cause Claude to make a mistake, remove it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scope your file references.&lt;/strong&gt; Instead of asking Claude to "look at the whole auth module," reference specific files: &lt;code&gt;@src/auth/token_refresh.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Monitor context usage continuously.&lt;/strong&gt; Install a custom status line (Claude Code supports this natively) to track token consumption in real time, like a fuel gauge for your session.&lt;/p&gt;




&lt;h2&gt;
  
  
  CLAUDE.md — The New Developer Configuration Primitive
&lt;/h2&gt;

&lt;p&gt;Every mature software project has configuration artifacts: &lt;code&gt;.gitignore&lt;/code&gt;, &lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;pyproject.toml&lt;/code&gt;, &lt;code&gt;.eslintrc&lt;/code&gt;. These files encode project-level conventions that every contributor respects. &lt;code&gt;CLAUDE.md&lt;/code&gt; is the newest member of this family — the configuration artifact for AI agent behavior.&lt;/p&gt;

&lt;p&gt;When Claude Code starts a session, it reads &lt;code&gt;CLAUDE.md&lt;/code&gt; from your project root before doing anything else. This file is your persistent, version-controlled channel to the agent. Think of it as a brief to a new contractor: here's the project, here's how we work, here are the rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Well-Crafted CLAUDE.md
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project: payments-service&lt;/span&gt;

&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Python 3.12, FastAPI, SQLAlchemy async
&lt;span class="p"&gt;-&lt;/span&gt; PostgreSQL 16 via asyncpg
&lt;span class="p"&gt;-&lt;/span&gt; All async/await — no synchronous DB calls
&lt;span class="p"&gt;-&lt;/span&gt; Repository pattern: db layer never imported directly into routes

&lt;span class="gu"&gt;## Code Style&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Black formatting, line-length 88
&lt;span class="p"&gt;-&lt;/span&gt; Use &lt;span class="sb"&gt;`from __future__ import annotations`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Type hints required on all public functions
&lt;span class="p"&gt;-&lt;/span&gt; No &lt;span class="sb"&gt;`Any`&lt;/span&gt; in type hints unless absolutely justified

&lt;span class="gu"&gt;## Testing&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; pytest with pytest-asyncio
&lt;span class="p"&gt;-&lt;/span&gt; Run tests with: &lt;span class="sb"&gt;`make test`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Preferred: unit tests over integration tests; use &lt;span class="sb"&gt;`respx`&lt;/span&gt; to mock HTTP
&lt;span class="p"&gt;-&lt;/span&gt; Never use &lt;span class="sb"&gt;`unittest.mock.patch`&lt;/span&gt; — use dependency injection instead

&lt;span class="gu"&gt;## Workflow&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Always run &lt;span class="sb"&gt;`make lint &amp;amp;&amp;amp; make typecheck`&lt;/span&gt; after code changes
&lt;span class="p"&gt;-&lt;/span&gt; Write failing test first, then implementation
&lt;span class="p"&gt;-&lt;/span&gt; Commit message format: &lt;span class="sb"&gt;`type(scope): description`&lt;/span&gt; (Conventional Commits)

&lt;span class="gu"&gt;## Things to Avoid&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT use synchronous SQLAlchemy sessions
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT add new dependencies without checking with me first
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT suppress type errors with &lt;span class="sb"&gt;`# type: ignore`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what this file is &lt;em&gt;not&lt;/em&gt;: it's not a tutorial on Python, it's not a list of standard conventions every Python developer already knows. It encodes only the &lt;em&gt;project-specific&lt;/em&gt; rules that Claude would otherwise have to infer — and might infer incorrectly.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/init&lt;/code&gt; command generates a starting CLAUDE.md by analyzing your codebase. Use it as a foundation, then prune ruthlessly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Subagent Orchestration: Multi-Agent Patterns
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwna633855yc5scqhy3wc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwna633855yc5scqhy3wc.png" alt="Subagent Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most powerful — and underutilized — features of Claude Code is its native support for subagent orchestration. Subagents are independent Claude sessions, each with their own context window, system prompt, tool restrictions, and (optionally) a different model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Built-in Subagents
&lt;/h3&gt;

&lt;p&gt;Claude Code ships with three built-in subagents that activate automatically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subagent&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Explore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Haiku (fast/cheap)&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;td&gt;Codebase search without polluting main context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inherits from parent&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;td&gt;Research phase of Plan Mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;General-purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inherits from parent&lt;/td&gt;
&lt;td&gt;All tools&lt;/td&gt;
&lt;td&gt;Complex multi-step tasks with modifications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Explore agent is worth understanding in detail. When you ask Claude to "understand how the authentication flow works before fixing this bug," Claude could read files directly in your main session — but that would consume your context budget with potentially irrelevant search results. Instead, it spawns an Explore subagent, which reads the files it needs, builds an understanding, and returns a &lt;em&gt;summary&lt;/em&gt; to the main session. The raw file content never appears in your main context.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing Custom Subagents
&lt;/h3&gt;

&lt;p&gt;Custom subagents are defined as Markdown files with YAML frontmatter. Here's an example security review subagent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;security-reviewer&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Reviews code changes for security vulnerabilities, injection risks,&lt;/span&gt;
  &lt;span class="s"&gt;auth issues, and secrets exposure. Invoke when reviewing PRs or new endpoints.&lt;/span&gt;
&lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;
&lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Read&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Bash(git diff, git log)&lt;/span&gt;
&lt;span class="na"&gt;permission_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

You are a senior application security engineer. When invoked, you:
&lt;span class="p"&gt;1.&lt;/span&gt; Run &lt;span class="sb"&gt;`git diff HEAD~1`&lt;/span&gt; to see recent changes
&lt;span class="p"&gt;2.&lt;/span&gt; Check for SQL injection, XSS, SSRF, and auth bypass patterns
&lt;span class="p"&gt;3.&lt;/span&gt; Scan for hardcoded secrets or credentials
&lt;span class="p"&gt;4.&lt;/span&gt; Report findings with severity (Critical/High/Medium/Low) and remediation

Be concise. Only report genuine issues, not style nitpicks.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save this to &lt;code&gt;.claude/agents/security-reviewer.md&lt;/code&gt; in your project. Claude will automatically invoke it when context suggests a security review, or you can call it explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Agent Parallelism Pattern
&lt;/h3&gt;

&lt;p&gt;For large refactoring tasks, you can decompose work into parallel subagents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# In your main Claude Code session:
# "Split this refactor into 3 parallel tasks:
#  1. Update all tests to the new API signature
#  2. Update all route handlers
#  3. Update the DB layer
# Spawn one general-purpose agent per task and report when all three complete."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern dramatically accelerates large, decomposable tasks while keeping each subagent's context window clean and focused.&lt;/p&gt;




&lt;h2&gt;
  
  
  Integrating AI Coding Agents in CI/CD Pipelines
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9vz7yoi2gxv9w3y7e8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6b9vz7yoi2gxv9w3y7e8.png" alt="CI/CD Workflow with AI Agents" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code isn't just a local developer tool. It ships with a production-grade GitHub Actions integration that enables genuinely powerful automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up Claude Code GitHub Actions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/claude-code.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Claude Code Agent&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;issue_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;pull_request_review_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;created&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;issues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;assigned&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;issues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;claude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains(github.event.comment.body, '@claude')&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropics/claude-code-action@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;claude_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;--model claude-sonnet-4-6&lt;/span&gt;
            &lt;span class="s"&gt;--max-turns 15&lt;/span&gt;
            &lt;span class="s"&gt;--append-system-prompt "Follow the conventions in CLAUDE.md strictly. Always run tests before marking a task complete."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this workflow, anyone on your team can type &lt;code&gt;@claude implement the sorting feature described in this issue&lt;/code&gt; on a GitHub issue and the agent will: read the issue, explore the codebase, implement the feature, run the test suite, and open a Pull Request — all autonomously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated PR Review on Every Commit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/claude-review.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Automated Code Review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropics/claude-code-action@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;Review this PR for:&lt;/span&gt;
            &lt;span class="s"&gt;1. Logic errors and edge cases&lt;/span&gt;
            &lt;span class="s"&gt;2. Missing error handling&lt;/span&gt;
            &lt;span class="s"&gt;3. Performance issues (N+1 queries, unnecessary loops)&lt;/span&gt;
            &lt;span class="s"&gt;4. Security concerns (injection, auth bypasses, secrets exposure)&lt;/span&gt;
            &lt;span class="s"&gt;5. Test coverage gaps&lt;/span&gt;

            &lt;span class="s"&gt;Do NOT comment on style — we have a linter for that.&lt;/span&gt;
            &lt;span class="s"&gt;Post your review as a PR review with inline comments.&lt;/span&gt;
          &lt;span class="na"&gt;claude_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--model&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--max-turns&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scheduled Maintenance Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/daily-maintenance.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Daily Maintenance Agent&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1-5"&lt;/span&gt;  &lt;span class="c1"&gt;# Weekdays at 6 AM&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;maintain&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anthropics/claude-code-action@v1&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;anthropic_api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;Perform daily maintenance:&lt;/span&gt;
            &lt;span class="s"&gt;1. Check for dependency security advisories (run: pip-audit or npm audit)&lt;/span&gt;
            &lt;span class="s"&gt;2. Update any dependencies with only patch version bumps&lt;/span&gt;
            &lt;span class="s"&gt;3. Run the full test suite to confirm nothing broke&lt;/span&gt;
            &lt;span class="s"&gt;4. If all tests pass, open a PR titled "chore: automated dependency updates [DATE]"&lt;/span&gt;
          &lt;span class="na"&gt;claude_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--model&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--max-turns&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Token Economy: Costs, Enterprise Pricing, and Optimization
&lt;/h2&gt;

&lt;p&gt;Let's talk money — because this is where many engineering teams get surprised.&lt;/p&gt;

&lt;p&gt;Until late 2025, Anthropic Enterprise customers had fixed-seat pricing with generous included usage. In November 2025, Anthropic shifted to API-token pricing for enterprise, meaning every token your team burns in Claude Code is billed at the published API rate. OpenAI made the same move for Codex in April 2026.&lt;/p&gt;

&lt;p&gt;Simon Willison ran &lt;code&gt;ccusage&lt;/code&gt; on his own 30-day usage and found he would have spent &lt;strong&gt;$1,199.79 at API rates&lt;/strong&gt; — for what cost him $100 under his Max subscription. Enterprise companies using coding agents at scale are seeing four- and five-figure monthly bills per team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Estimating Your Costs
&lt;/h3&gt;

&lt;p&gt;Here's a rough Python calculator for Claude Code cost estimation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Claude Sonnet 4.6 approximate pricing
# (verify current rates at anthropic.com/pricing before using in production)
&lt;/span&gt;&lt;span class="n"&gt;SONNET_INPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.00&lt;/span&gt;   &lt;span class="c1"&gt;# $3.00 per million input tokens
&lt;/span&gt;&lt;span class="n"&gt;SONNET_OUTPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;15.00&lt;/span&gt;  &lt;span class="c1"&gt;# $15.00 per million output tokens
&lt;/span&gt;
&lt;span class="c1"&gt;# Opus 4.7 (for complex tasks) — verify this stat before publishing
&lt;/span&gt;&lt;span class="n"&gt;OPUS_INPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;15.00&lt;/span&gt;
&lt;span class="n"&gt;OPUS_OUTPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;75.00&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate_session_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;avg_context_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg_output_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sessions_per_day&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;working_days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Estimate monthly Claude Code cost for an engineer.

    Args:
        avg_context_tokens: Average input tokens per session
        avg_output_tokens: Average output tokens per session
        sessions_per_day: How many Claude Code sessions per day
        model: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;opus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
        working_days: Working days per month
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SONNET_INPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;
        &lt;span class="n"&gt;output_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SONNET_OUTPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OPUS_INPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;
        &lt;span class="n"&gt;output_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OPUS_OUTPUT_COST_PER_MTok&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1_000_000&lt;/span&gt;

    &lt;span class="n"&gt;sessions_per_month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sessions_per_day&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;working_days&lt;/span&gt;
    &lt;span class="n"&gt;monthly_input_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;avg_context_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sessions_per_month&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input_rate&lt;/span&gt;
    &lt;span class="n"&gt;monthly_output_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;avg_output_tokens&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;sessions_per_month&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;output_rate&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;monthly_input_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;monthly_output_cost&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_input_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_input_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;monthly_output_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monthly_output_cost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_monthly_cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sessions_per_month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sessions_per_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_per_session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sessions_per_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Example: developer running 8 sessions/day on Sonnet
# Each session: ~40K context tokens, ~4K output tokens
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_session_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;avg_context_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;avg_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sessions_per_day&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Estimated monthly cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_monthly_cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sessions per month:     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sessions_per_month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost per session:       $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost_per_session&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output:
# Estimated monthly cost: $323.84
# Sessions per month:     176
# Cost per session:       $1.84
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Optimization Strategies
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Route to the right model.&lt;/strong&gt; Use Sonnet for routine implementation tasks. Reserve Opus for complex architectural reasoning. Haiku for exploration-only subagents. Model routing alone can cut costs by 60–80%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cap max-turns in CI/CD.&lt;/strong&gt; In automated pipelines, set &lt;code&gt;--max-turns 10&lt;/code&gt; to prevent runaway sessions that loop indefinitely on ambiguous tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invest in CLAUDE.md quality.&lt;/strong&gt; A precise CLAUDE.md reduces back-and-forth correction cycles. Every iteration you eliminate saves ~10K tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Route exploration to Haiku subagents.&lt;/strong&gt; Delegate codebase search to Haiku-powered Explore subagents at ~0.25x the Sonnet cost, returning only summaries to your main session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set team budgets and alerts.&lt;/strong&gt; Use Anthropic Console's usage monitoring to set per-team monthly caps and alert thresholds before the bill surprises you.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Two Inflection Points That Changed Everything
&lt;/h2&gt;

&lt;p&gt;To understand why AI coding agents are exploding &lt;em&gt;now&lt;/em&gt; rather than two years ago when ChatGPT launched, you need to understand two specific inflection points.&lt;/p&gt;

&lt;h3&gt;
  
  
  November 2025: The Capability Inflection
&lt;/h3&gt;

&lt;p&gt;Prior to November 2025, LLM-based coding tools were impressive at &lt;em&gt;generating&lt;/em&gt; code but unreliable at &lt;em&gt;completing engineering tasks autonomously&lt;/em&gt;. They would hallucinate APIs, forget constraints set earlier in the conversation, and fail to recover from tool errors without human intervention.&lt;/p&gt;

&lt;p&gt;November 2025 saw the release of GPT-5.1 and Anthropic's Opus 4.5 — the first models that, combined with their respective agent harnesses, could &lt;strong&gt;reliably complete real multi-step engineering tasks end-to-end&lt;/strong&gt;. The improvements were in long-context coherence, tool-use reliability, and error-recovery reasoning. The practical result: engineers started trusting the output enough to incorporate it into production workflows.&lt;/p&gt;

&lt;p&gt;Six months of accelerating adoption followed — what Simon Willison calls the "November inflection point."&lt;/p&gt;

&lt;h3&gt;
  
  
  April 2026: The Revenue Inflection
&lt;/h3&gt;

&lt;p&gt;April 2026 marked a different kind of inflection: the moment the business model snapped into place. Both Anthropic and OpenAI released new frontier models (Opus 4.7, GPT-5.5) at higher API prices — and simultaneously locked enterprise customers into token-based billing at those rates.&lt;/p&gt;

&lt;p&gt;Enterprise customers who had been running coding agents under generous flat-rate contracts found themselves, upon renewal, paying full API prices. Uber's budget story, Microsoft's Claude Code license cancellations, and Anthropic's approaching profitability all trace back to this single structural change.&lt;/p&gt;

&lt;p&gt;For engineers, the revenue inflection sends a clear signal: these tools are generating enough measurable value that customers will pay premium API rates for them. That's a fundamentally different signal than "this is a compelling demo."&lt;/p&gt;




&lt;h2&gt;
  
  
  Engineering for Agent-First Workflows
&lt;/h2&gt;

&lt;p&gt;Adopting AI coding agents isn't just an installation step — it's a workflow redesign. Here are the patterns that consistently produce the best results:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Verification-First Prompt Design
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Weak prompt — no verifiable success criterion
"Fix the login bug"

# Strong prompt — autonomous feedback loop possible
"Users report login fails after session timeout.
The issue is in src/auth/. Check token refresh logic.
Write a failing test reproducing the issue, fix it, and confirm
the test passes. Cover: expired tokens, near-expiry refresh,
and concurrent refresh requests."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is whether the agent can close its own feedback loop. A verifiable criterion lets the agentic loop run autonomously to completion.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Explore → Plan → Code (Never Code First)
&lt;/h3&gt;

&lt;p&gt;Resist the urge to have Claude start coding immediately. Use Plan Mode (prefix your prompt with &lt;code&gt;/plan&lt;/code&gt;) to force a read-only exploration and planning phase before any files are written. This single practice eliminates the most common failure mode: Claude solving the wrong problem with correct code.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Test-Driven Agent Workflow
&lt;/h3&gt;

&lt;p&gt;Write (or have Claude write) a failing test &lt;em&gt;before&lt;/em&gt; implementation. This gives the agent a clear, automated verification gate. The loop becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;write failing test → run test (RED) → implement → run test (GREEN) → commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors traditional TDD but with the agent driving both the test and the implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Session Decomposition for Large Tasks
&lt;/h3&gt;

&lt;p&gt;For tasks longer than ~30 minutes of work, decompose explicitly before starting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Before we start, decompose this feature into the minimum set of
independent tasks, ordered by dependency. I want to run each
as a separate Claude Code session."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids context overflow mid-task and creates natural checkpoints from which you can restart.&lt;/p&gt;




&lt;h2&gt;
  
  
  Future Outlook: Where AI Coding Agents Are Headed
&lt;/h2&gt;

&lt;p&gt;The $1.25B/month inference bill is not just a financial data point — it's a directional signal about where the compute investment is flowing. Several trajectories are worth tracking closely:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model silicon specialization.&lt;/strong&gt; General-purpose GPUs are fundamentally inefficient for inference workloads. As model architectures stabilize, expect inference-optimized ASICs analogous to what TPUs did for training. Step-function reductions in cost-per-token and latency will make coding agents dramatically cheaper to operate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-agent swarms for large-scale engineering.&lt;/strong&gt; The subagent orchestration model in Claude Code today is a preview of "engineering swarms" where a high-level orchestrator decomposes a large feature, dispatches it to dozens of specialist agents in parallel, and synthesizes the results. The bottleneck today is orchestration reliability and context synchronization — both active areas of development.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expansion beyond software engineering.&lt;/strong&gt; Coding agents succeeded first because the feedback loop (compile, test, lint) is unusually tight and objective. Expect analogous agents to expand into data analysis, financial modeling, infrastructure provisioning, and any knowledge-work domain with structured verification criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bidirectional model distillation.&lt;/strong&gt; As enterprise deployment patterns mature, the feedback loops from production usage will inform fine-tuned, domain-specific model variants — smaller, cheaper, faster models that specialize in specific codebases or engineering domains.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI coding agents are not a productivity multiplier bolted onto the existing way of working. They're a different paradigm — one where the developer's primary outputs shift from &lt;em&gt;writing code&lt;/em&gt; to &lt;em&gt;specifying goals, defining acceptance criteria, and reviewing the work of an autonomous engineering collaborator.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The architecture underlying this shift — the agentic loop, tool orchestration, context window management, subagent parallelism, CI/CD integration — is learnable and masterable. Engineers who invest in understanding these systems now are positioning themselves to extract compounding returns as the technology matures.&lt;/p&gt;

&lt;p&gt;Uber's 25% commit figure will look quaint by 2027. The teams that understand the architecture, manage costs intelligently, and build robust agent-first workflows are the ones that will get there first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to start?&lt;/strong&gt; Pick one task from your backlog today. Write a &lt;code&gt;CLAUDE.md&lt;/code&gt; for your project. Write one failing test. Hand it to the agent. See where the loop takes you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Trending topic sourced from Hacker News, Substack, and TechCrunch — May 28, 2026 | Focus keyword: AI coding agents | Estimated read time: 14 minutes&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Have you deployed AI coding agents in your production workflow? What patterns have you found most effective? Drop your experience in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>productivity</category>
      <category>devops</category>
    </item>
    <item>
      <title>How DeepMind AlphaProof Nexus Cracks 56-Year-Old Math: Agentic LLM Loops and Lean Formal Verification</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Wed, 27 May 2026 11:57:34 +0000</pubDate>
      <link>https://dev.to/monuminu/how-deepmind-alphaproof-nexus-cracks-56-year-old-math-agentic-llm-loops-and-lean-formal-45ei</link>
      <guid>https://dev.to/monuminu/how-deepmind-alphaproof-nexus-cracks-56-year-old-math-agentic-llm-loops-and-lean-formal-45ei</guid>
      <description>&lt;h1&gt;
  
  
  How Google DeepMind's AlphaProof Nexus Cracks 56-Year-Old Math Problems: A Deep Dive into Agentic LLM Loops and Lean Formal Verification
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Published: May 27, 2026 | Focus Keyword: AI formal proof generation | ~15 min read&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The $300 Proof That Shook the Math World&lt;/li&gt;
&lt;li&gt;The Core Problem: Why LLMs Hallucinate Their Way Through Mathematics&lt;/li&gt;
&lt;li&gt;Enter Lean 4: The Compiler as an Infallible Truth Machine&lt;/li&gt;
&lt;li&gt;AlphaProof Nexus: Architecture Overview&lt;/li&gt;
&lt;li&gt;The Four Agents: A Deep Dive into Each Design&lt;/li&gt;
&lt;li&gt;The Evolutionary Engine: Elo, P-UCB, and Fitness Without a Gradient&lt;/li&gt;
&lt;li&gt;Results That Stunned the Mathematics Community&lt;/li&gt;
&lt;li&gt;The Shocking Finding: The Simple Agent Wins Too&lt;/li&gt;
&lt;li&gt;Failure Modes: Where the System Still Breaks Down&lt;/li&gt;
&lt;li&gt;What This Means for Engineers: The Agentic Paradigm Shift in AI Formal Proof Generation&lt;/li&gt;
&lt;li&gt;Conclusion: The Future of Neuro-Symbolic AI&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm8orhaaj18c22nm1obu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm8orhaaj18c22nm1obu.png" alt="AlphaProof Nexus — AI solving formal mathematics with Lean verification" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The $300 Proof That Shook the Math World
&lt;/h2&gt;

&lt;p&gt;In 1970, mathematicians Paul Erdős and András Sárközy posed a deceptively simple-sounding question about infinite sets of integers: can you construct a set &lt;code&gt;A&lt;/code&gt; where no element divides the sum of two larger elements, yet the set is dense enough that its size grows faster than &lt;code&gt;√N&lt;/code&gt;? Researchers worked on it for 56 years — publishing partial results, tightening bounds, never quite closing the gap.&lt;/p&gt;

&lt;p&gt;On May 21, 2026, a Google DeepMind AI agent resolved it autonomously, overnight, at an inference cost of a few hundred dollars.&lt;/p&gt;

&lt;p&gt;This wasn't a party trick or a carefully cherry-picked benchmark. It was one of nine open Erdős problems that &lt;strong&gt;AlphaProof Nexus&lt;/strong&gt; resolved in a single systematic sweep. The system also proved 44 previously unproven conjectures from the Online Encyclopedia of Integer Sequences (OEIS), settled a 15-year-old open question in algebraic geometry, and improved an open convergence bound in convex optimization by &lt;em&gt;discovering a novel algorithmic parameter schedule nobody had previously identified&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The paper — &lt;a href="https://arxiv.org/abs/2605.22763" rel="noopener noreferrer"&gt;arXiv:2605.22763&lt;/a&gt; by Tsoukalas et al. at Google DeepMind — is the first large-scale evaluation of &lt;strong&gt;AI formal proof generation&lt;/strong&gt; on genuinely open, not competition-style, research mathematics. And buried in its results is an engineering insight that should fundamentally change how you think about building reliable AI systems.&lt;/p&gt;

&lt;p&gt;This post is a complete technical deep dive. We will dissect the architecture, study the agent designs, look at real code patterns, and extract the lessons that matter most for engineers building production AI systems today.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Core Problem: Why LLMs Hallucinate Their Way Through Mathematics
&lt;/h2&gt;

&lt;p&gt;Let's establish the baseline. Modern frontier LLMs — GPT-5.x, Claude Opus 4.x, Gemini 3.x — are remarkably capable at mathematical reasoning. They can outline proofs, suggest approaches, handle competition-level problems, and generate natural-language arguments that &lt;em&gt;look&lt;/em&gt; correct.&lt;/p&gt;

&lt;p&gt;The operative word is &lt;em&gt;look&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;When an LLM writes a multi-step mathematical proof in natural language, it operates on statistical plausibility. Each token is generated to be likely given prior context. There is no internal mechanism checking whether logical step N actually follows from step N-1. The model can write "it is easy to see that..." and proceed with a claim that is subtly — or catastrophically — false. In a 40-step proof, an error in step 12 can invalidate everything that follows, and it may not be obvious to anyone who is not a domain expert.&lt;/p&gt;

&lt;p&gt;This creates a brutal bottleneck for deploying LLMs in real mathematics research:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Review cost is enormous.&lt;/strong&gt; A domain expert must verify every step of every LLM-generated argument. For complex proofs, this can take days or weeks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error cascade is invisible.&lt;/strong&gt; Small hallucinations in intermediate steps do not throw exceptions — they produce wrong mathematics that still reads fluently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust is binary, but confidence is continuous.&lt;/strong&gt; You cannot partially trust a proof. Either every step is valid, or the whole thing is suspect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the problem that AI formal proof generation is designed to solve — and AlphaProof Nexus is the most ambitious attempt at it yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Enter Lean 4: The Compiler as an Infallible Truth Machine
&lt;/h2&gt;

&lt;p&gt;Lean 4 is a proof assistant and functional programming language developed by Leonardo de Moura at Microsoft Research (now at AWS). In Lean, mathematical proofs are &lt;em&gt;programs&lt;/em&gt;. Theorems are &lt;em&gt;types&lt;/em&gt;. A proof of a theorem is a term of that type. And critically: &lt;strong&gt;the Lean compiler verifies every single tactic step mechanically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is what a simple Lean 4 proof looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- Theorem: the sum of the first n natural numbers equals n*(n+1)/2&lt;/span&gt;
&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;sum_formula&lt;/span&gt; (&lt;span class="n"&gt;n&lt;/span&gt; : &lt;span class="o"&gt;ℕ&lt;/span&gt;) : &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="err"&gt;∑&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;Finset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt; (&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;), &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; (&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;) := &lt;span class="k"&gt;by&lt;/span&gt;
  &lt;span class="n"&gt;induction&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;zero&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;simp&lt;/span&gt;
  &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;succ&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="n"&gt;ih&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="n"&gt;rw&lt;/span&gt; [&lt;span class="n"&gt;Finset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sum_range_succ&lt;/span&gt;]
    &lt;span class="n"&gt;ring_nf&lt;/span&gt;
    &lt;span class="n"&gt;linarith&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every &lt;code&gt;by&lt;/code&gt;, &lt;code&gt;induction&lt;/code&gt;, &lt;code&gt;rw&lt;/code&gt;, &lt;code&gt;simp&lt;/code&gt;, &lt;code&gt;ring_nf&lt;/code&gt;, and &lt;code&gt;linarith&lt;/code&gt; is a &lt;em&gt;tactic&lt;/em&gt; — an elementary, mechanically verifiable proof step. The compiler tracks the current set of &lt;strong&gt;proof goals&lt;/strong&gt; after each tactic. A proof is complete and correct if and only if it leads to a state with &lt;em&gt;zero remaining goals&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Lean has a special escape hatch: the &lt;code&gt;sorry&lt;/code&gt; tactic. It immediately closes all pending goals without actually proving them — a placeholder meaning "I'll fill this in later." A file containing &lt;code&gt;sorry&lt;/code&gt; compiles, but the theorem is not proven. AlphaProof Nexus's entire objective is to take a Lean file where the proof body is replaced by &lt;code&gt;sorry&lt;/code&gt; and produce a fully &lt;code&gt;sorry&lt;/code&gt;-free version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- Input to AlphaProof Nexus: theorem with sorry placeholder&lt;/span&gt;
&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;erdos_12i&lt;/span&gt; : &lt;span class="o"&gt;∃&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; : &lt;span class="n"&gt;Set&lt;/span&gt; &lt;span class="o"&gt;ℕ&lt;/span&gt;, &lt;span class="n"&gt;IsMultiplicativelyIndependent&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;∧&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;liminf&lt;/span&gt; (&lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="err"&gt;∩&lt;/span&gt; &lt;span class="n"&gt;Finset&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;Real&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sqrt&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;) := &lt;span class="k"&gt;by&lt;/span&gt;
  &lt;span class="n"&gt;sorry&lt;/span&gt;&lt;span class="cd"&gt;  -- ← AlphaProof Nexus will replace this with a verified proof&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The compiler feedback loop is the key architectural primitive. When the LLM generates a proof step that is wrong, Lean does not silently produce incorrect output — it returns a structured error message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error: tactic 'ring_nf' failed, no progress made
⊢ 2 * (∑ i in Finset.range (n + 1), i + (n + 1)) = (n + 1) * (n + 2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This error contains the exact current proof state, the failing tactic, and what the remaining goal looks like. It is a gold mine of structured feedback the LLM can reason about in the next turn — unlike natural-language proof review, where a human must figure out &lt;em&gt;why&lt;/em&gt; an argument is wrong before correcting it.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. AlphaProof Nexus: Architecture Overview
&lt;/h2&gt;

&lt;p&gt;AlphaProof Nexus is a &lt;em&gt;framework&lt;/em&gt; for building agents that interleave LLM calls with Lean compiler calls. The I/O contract is clean:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Lean &lt;code&gt;.lean&lt;/code&gt; file containing a theorem statement with &lt;code&gt;sorry&lt;/code&gt; where the proof should go&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EVOLVE-BLOCK&lt;/code&gt; markers designating which code regions the agent may modify&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EVOLVE-VALUE&lt;/code&gt; markers on expressions the agent can change (e.g., algorithm parameters)&lt;/li&gt;
&lt;li&gt;Optionally: natural language context, domain knowledge encoded in Lean, relevant Mathlib lemmas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;sorry&lt;/code&gt;-free Lean proof of the target theorem&lt;/li&gt;
&lt;li&gt;A natural language summary of the proof strategy
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- Example: annotated input file for AlphaProof Nexus&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Mathlib&lt;/span&gt;&lt;span class="cd"&gt;

-- EVOLVE-BLOCK begin
-- Agent may introduce helper lemmas and definitions here
-- EVOLVE-BLOCK end&lt;/span&gt;

&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;target_theorem&lt;/span&gt; (&lt;span class="n"&gt;n&lt;/span&gt; : &lt;span class="o"&gt;ℕ&lt;/span&gt;) : &lt;span class="n"&gt;SomeMathematicalStatement&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="cd"&gt;
  -- EVOLVE-BLOCK begin&lt;/span&gt;
  &lt;span class="n"&gt;sorry&lt;/span&gt;&lt;span class="cd"&gt;  -- Agent replaces this with a complete proof
  -- EVOLVE-BLOCK end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework runs a &lt;em&gt;pool&lt;/em&gt; of parallel subagents, each independently searching for a proof. This parallelism is crucial — proof search is highly non-deterministic, and running N independent searches simultaneously dramatically raises the probability of at least one succeeding within the compute budget.&lt;/p&gt;

&lt;p&gt;All agents are powered by &lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; as the primary LLM, with lighter-weight &lt;strong&gt;Gemini 3.0 Flash&lt;/strong&gt; used for cheaper rating and evaluation tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Four Agents: A Deep Dive into Each Design
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1khp4ppb23t2oamboj8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1khp4ppb23t2oamboj8w.png" alt="AlphaProof Nexus Agent Architecture — Four Tiers from A to D" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AlphaProof Nexus defines four agent variants (A through D) of increasing sophistication. Understanding the design decisions behind each is essential for applying these patterns in your own agentic systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent A: The Ralph Loop (Baseline)
&lt;/h3&gt;

&lt;p&gt;The simplest agent implements what the paper calls a &lt;strong&gt;"Ralph loop"&lt;/strong&gt; — a name that is likely to become a term of art in agentic AI engineering. The pattern is clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ralph_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theorem_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lean_compiler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_episodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The core agentic primitive: LLM generates proof steps,
    Lean validates them deterministically, errors feed back.
    Named &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ralph loop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; in the AlphaProof Nexus paper (arXiv:2605.22763).
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;proof_sketch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_theorem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theorem_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Contains sorry placeholder
&lt;/span&gt;    &lt;span class="n"&gt;lessons_learned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;episode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_episodes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Multi-turn LLM inference: reason via chain-of-thought,
&lt;/span&gt;        &lt;span class="c1"&gt;# refine the sketch using search-and-replace tool calls
&lt;/span&gt;        &lt;span class="n"&gt;updated_sketch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_episode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;proof_sketch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lessons_learned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Lean compiler checks every tactic step — deterministically
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lean_compiler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;updated_sketch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_valid&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;no_sorry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;updated_sketch&lt;/span&gt;  &lt;span class="c1"&gt;# 🎉 Complete, verified proof
&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract structured lesson from compiler feedback
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;lesson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Episode &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;episode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: tactic &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failed_tactic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; failed.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goal state was: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;goal_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Compiler error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error_message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;lessons_learned&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lesson&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;proof_sketch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;updated_sketch&lt;/span&gt;  &lt;span class="c1"&gt;# Carry partial progress forward
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Proof not found within episode budget
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: the &lt;strong&gt;Lean compiler's error message is fed directly back into the LLM's context&lt;/strong&gt; for the next turn. The LLM sees exactly what went wrong, at which tactic, and what the current proof state is. This structured feedback is what makes even the basic agent surprisingly powerful — far more so than a free-form "try again" loop.&lt;/p&gt;

&lt;p&gt;Each episode ends by appending a natural-language summary of lessons learned as a comment in the Lean file. This accumulates contextual knowledge across episodes without flooding the context window with raw compiler output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent B: AlphaProof as a Sub-Tool
&lt;/h3&gt;

&lt;p&gt;Agent B extends Agent A by giving the prover subagent the ability to &lt;strong&gt;call AlphaProof&lt;/strong&gt; — Google's existing RL system for olympiad-level theorem proving. When the prover encounters a sub-goal it cannot handle, it delegates to AlphaProof:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- The prover decomposes the proof and delegates a sub-goal to AlphaProof&lt;/span&gt;
&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;main_result&lt;/span&gt; : &lt;span class="n"&gt;ComplexStatement&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="cd"&gt;
  -- AlphaProof handles this tractable sub-goal via RL search&lt;/span&gt;
  &lt;span class="k"&gt;have&lt;/span&gt; &lt;span class="n"&gt;key_lemma&lt;/span&gt; : &lt;span class="n"&gt;TractableSubGoal&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
    &lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="n"&gt;alphaproof_result&lt;/span&gt;&lt;span class="cd"&gt;  -- Substituted in if AlphaProof succeeds
  -- Prover must handle this one directly; AlphaProof returned failure&lt;/span&gt;
  &lt;span class="k"&gt;have&lt;/span&gt; &lt;span class="n"&gt;auxiliary&lt;/span&gt; : &lt;span class="n"&gt;HarderSubGoal&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
    &lt;span class="n"&gt;induction&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="cd"&gt;  -- Agent writes this manually&lt;/span&gt;
  &lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="n"&gt;combine&lt;/span&gt; &lt;span class="n"&gt;key_lemma&lt;/span&gt; &lt;span class="n"&gt;auxiliary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AlphaProof returns one of three signals, all fed back as structured prompt context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proof found&lt;/strong&gt;: directly substituted into the sketch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disproof found&lt;/strong&gt;: the sub-goal is &lt;em&gt;false&lt;/em&gt;, telling the prover it decomposed the problem incorrectly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure&lt;/strong&gt;: AlphaProof couldn't resolve it within budget — try a different decomposition&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Agent C: Evolutionary Population with Elo Ratings
&lt;/h3&gt;

&lt;p&gt;Agent C introduces the evolutionary component, inspired by &lt;a href="https://arxiv.org/abs/2506.01882" rel="noopener noreferrer"&gt;AlphaEvolve&lt;/a&gt;. The core innovation is the &lt;strong&gt;Population Database&lt;/strong&gt; — a shared repository of proof sketches that all prover subagents read from and contribute to.&lt;/p&gt;

&lt;p&gt;The challenge: standard evolutionary algorithms assume a &lt;em&gt;graduated fitness landscape&lt;/em&gt;. Proof checking is binary — either it compiles &lt;code&gt;sorry&lt;/code&gt;-free or it does not. There is no gradient to follow. Agent C's elegant solution is to use a pool of cheap Gemini 3.0 Flash &lt;strong&gt;rating agents&lt;/strong&gt; to judge proof sketches head-to-head on plausibility, clarity, and novelty — creating a continuous proxy fitness signal from binary outcomes.&lt;/p&gt;

&lt;p&gt;These pairwise ratings aggregate into &lt;strong&gt;Elo scores&lt;/strong&gt; for each sketch. New prover episodes sample from the population using the &lt;strong&gt;P-UCB formula&lt;/strong&gt; (borrowed directly from AlphaZero), maintaining diversity by balancing high-Elo exploitation against under-explored sketch exploration.&lt;/p&gt;

&lt;p&gt;With this continuous fitness proxy established, Agent D combines all three capabilities into the full system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agent D: The Full System
&lt;/h3&gt;

&lt;p&gt;Agent D combines the Ralph loop, AlphaProof sub-tool integration, and the evolutionary population database. It was used for all main Erdős and OEIS experiments. Its most powerful feature is the &lt;code&gt;EVOLVE-VALUE&lt;/code&gt; mechanism — the ability to simultaneously search for a proof &lt;em&gt;and&lt;/em&gt; discover optimal algorithm parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- Agent D: joint search over algorithm parameters AND proofs
-- This is how it discovered a novel learning rate schedule for Anchored GDA&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; (&lt;span class="n"&gt;t&lt;/span&gt; : &lt;span class="o"&gt;ℕ&lt;/span&gt;) : &lt;span class="err"&gt;ℝ&lt;/span&gt; :=&lt;span class="cd"&gt;
  -- EVOLVE-VALUE begin&lt;/span&gt;
  &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; (&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;)  &lt;span class="cd"&gt;-- Agent replaced the original guess with this novel schedule
  -- EVOLVE-VALUE end&lt;/span&gt;

&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;anchored_gda_convergence&lt;/span&gt; (&lt;span class="n"&gt;T&lt;/span&gt; : &lt;span class="o"&gt;ℕ&lt;/span&gt;) (&lt;span class="n"&gt;hT&lt;/span&gt; : &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;) :
    &lt;span class="o"&gt;∃&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; : &lt;span class="err"&gt;ℝ&lt;/span&gt;, &lt;span class="n"&gt;convergence_gap&lt;/span&gt; &lt;span class="n"&gt;anchored_gda&lt;/span&gt; &lt;span class="n"&gt;learning_rate&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="o"&gt;≤&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="cd"&gt;
  -- EVOLVE-BLOCK begin&lt;/span&gt;
  &lt;span class="n"&gt;sorry&lt;/span&gt;&lt;span class="cd"&gt;  -- Agent fills this with a complete O(1/t) convergence proof
  -- EVOLVE-BLOCK end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the convex optimization experiment, the agent was given an &lt;code&gt;EVOLVE-VALUE&lt;/code&gt; block over the learning schedule. It did not just find the proof — it discovered a novel schedule that achieves a strictly better &lt;code&gt;O(1/t)&lt;/code&gt; convergence rate than what was previously known.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The Evolutionary Engine: Elo, P-UCB, and Fitness Without a Gradient
&lt;/h2&gt;

&lt;p&gt;The Elo + P-UCB system deserves focused attention because it solves a broadly applicable problem: how do you run evolutionary search when your primary reward signal is binary or extremely sparse?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Pairwise Rating:&lt;/strong&gt; Cheap Gemini 3.0 Flash rating agents receive two proof sketches and score them head-to-head. Does sketch A structure the problem more clearly than sketch B? Does sketch B try a more novel approach? These are continuous judgments that don't require either proof to be complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 — Elo Aggregation:&lt;/strong&gt; Standard Elo updates run after each pairwise match:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_elo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;winner_rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loser_rating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;32.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Standard Elo update after a head-to-head proof sketch comparison.
    Provides a continuous fitness signal for binary proof outcomes.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;expected_winner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;loser_rating&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;winner_rating&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;expected_loser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected_winner&lt;/span&gt;
    &lt;span class="n"&gt;new_winner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;winner_rating&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected_winner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_loser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loser_rating&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;expected_loser&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_winner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_loser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stage 3 — P-UCB Sampling:&lt;/strong&gt; New prover episodes sample from the population using the P-UCB formula, balancing exploitation of high-Elo sketches with exploration of under-sampled ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;p_ucb_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sketch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_visits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c_puct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    P-UCB sampling score, borrowed from AlphaZero&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s tree search.
    Balances exploitation (high Elo rating) with exploration (low visit count).

    Args:
        sketch: Proof sketch with .elo_rating (0–1 normalized) and .visit_count
        total_visits: Total visits across all sketches in the population
        c_puct: Exploration constant (higher = more exploration)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;exploitation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sketch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;elo_rating&lt;/span&gt;
    &lt;span class="n"&gt;exploration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c_puct&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_visits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sketch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;exploitation&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;exploration&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sample_from_population&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c_puct&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Sample a proof sketch using P-UCB weighted softmax.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;total_visits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;visit_count&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;p_ucb_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_visits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c_puct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Softmax over P-UCB scores
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;population&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result is a self-improving search process: successful proof strategies get amplified, diverse approaches stay in circulation, and the system accumulates structured knowledge across all parallel search threads — even for problems where no proof has yet been found.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Results That Stunned the Mathematics Community
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3661u25tqdkdiyx8t24n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3661u25tqdkdiyx8t24n.png" alt="AlphaProof Nexus Results — Solved Erdős Problems, OEIS Conjectures, and Mathematical Breakthroughs" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scope of AlphaProof Nexus's results is extraordinary. Here is what Agent D accomplished in its first large-scale run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Erdős Problems: 9 of 353 Solved
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://erdosproblems.com/" rel="noopener noreferrer"&gt;Erdős Problems repository&lt;/a&gt; contains over 1,200 open problems posed by Paul Erdős. Agent D ran on all 353 that had been formalized in Lean, with a budget of 3,000 episodes per problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Open Since&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#12(i)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dense multiplicatively independent sets (Erdős–Sárközy)&lt;/td&gt;
&lt;td&gt;1970 — &lt;strong&gt;56 years&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#125&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lower density of sumsets from base-3 and base-4 digit sets (Burroughs–Erdős)&lt;/td&gt;
&lt;td&gt;1996 — 30 years&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#138&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Divisibility properties in dense integer sequences&lt;/td&gt;
&lt;td&gt;~1980s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#152&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Density bounds in combinatorial number theory&lt;/td&gt;
&lt;td&gt;~1985&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;#26&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generalized additive structure variant&lt;/td&gt;
&lt;td&gt;~1975&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For Erdős #12(i), the proof required integrating the &lt;strong&gt;Chinese Remainder Theorem&lt;/strong&gt; with properties of sets avoiding length-3 arithmetic progressions — synthesizing techniques from distinct areas of number theory into a novel construction. For Erdős #125, the agent synthesized an &lt;em&gt;inductive thinning argument&lt;/em&gt; exploiting the Diophantine proximity of base-3 and base-4 (&lt;code&gt;3^m ≈ 4^k&lt;/code&gt;). These are not standard textbook techniques. The agent discovered them.&lt;/p&gt;

&lt;h3&gt;
  
  
  OEIS: 44 of 492 Conjectures Proved
&lt;/h3&gt;

&lt;p&gt;The agent autoformalized 492 open OEIS conjectures using Gemini, verified the formalizations against known sequence terms as a correctness guard, then ran proof search. 44 proofs passed human expert review as correctly formalized and genuinely novel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization Theory: Novel Algorithm Discovery
&lt;/h3&gt;

&lt;p&gt;For Anchored Gradient Descent-Ascent for min-max convex-concave optimization, Agent D simultaneously proved an exact &lt;code&gt;O(1/t)&lt;/code&gt; convergence rate and &lt;em&gt;discovered a novel learning rate schedule&lt;/em&gt; that achieves it — tightening a known slower bound and finding a better algorithm in the process. The &lt;code&gt;EVOLVE-VALUE&lt;/code&gt; mechanism, treating the schedule as a mutable parameter alongside the proof, made this joint search possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Graph Theory: Bipartite Graph Reconstruction
&lt;/h3&gt;

&lt;p&gt;For one bipartite variant of the famous graph reconstruction conjecture, Agent D produced a complete formal proof. For the full conjecture (still open), its proof sketches and strategies helped human mathematicians clarify the underlying structure — demonstrating that even failed proof searches have research value.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. The Shocking Finding: The Simple Agent Wins Too
&lt;/h2&gt;

&lt;p&gt;Here is the result that should most directly change how engineers design agentic systems.&lt;/p&gt;

&lt;p&gt;After Agent D solved those 9 Erdős problems, the researchers ran a post-hoc analysis: all four agents (A through D) on those same 9 problems. The result was striking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Agent A — the basic Ralph loop with no AlphaProof, no evolutionary search, no Elo ratings — solved all 9 problems.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was costlier on the hardest problems. Agent D sometimes reached the same result in fewer attempts. But given sufficient compute budget, the simple LLM + Lean compiler feedback loop got there for every single problem.&lt;/p&gt;

&lt;p&gt;The researchers attribute this to two compounding factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rapidly improving underlying LLMs.&lt;/strong&gt; Gemini 3.1 Pro is significantly more capable than models available 12–18 months prior. As frontier models improve, the gap between "simple loop" and "sophisticated specialized system" narrows at an accelerating rate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The power of compiler feedback in grounding LLM reasoning.&lt;/strong&gt; When you replace "hope the LLM is right" with "the Lean compiler tells the LLM exactly what went wrong and what the current proof state is," the LLM's self-correction ability is dramatically amplified. Structured, formal feedback is a force multiplier on reasoning capability.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The paper states this directly: the results point to &lt;strong&gt;"an ongoing shift from specialized trained systems toward simple agentic loops as LLMs become more capable."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The engineering principle is clear: &lt;strong&gt;before building elaborate multi-agent orchestration, benchmark a well-implemented LLM + deterministic verifier feedback loop.&lt;/strong&gt; In many agentic tasks with verifiable success criteria — code that passes tests, SQL that returns correct results, API calls that succeed — the simple loop is more capable than it looks on paper.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Failure Modes: Where the System Still Breaks Down
&lt;/h2&gt;

&lt;p&gt;Honest engineering requires understanding the limits. The AlphaProof Nexus team analyzed failure cases carefully and identified two systematic patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 1: Sorry-Offloading
&lt;/h3&gt;

&lt;p&gt;The agent frequently generated proof sketches that appeared to make progress but offloaded the core difficulty into a helper lemma that merely restated the original problem in a slightly different form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- ⚠️ Failure pattern: sorry-offloading
-- The agent appears to make progress but just renames the hard part&lt;/span&gt;
&lt;span class="k"&gt;theorem&lt;/span&gt; &lt;span class="n"&gt;hard_erdos_problem&lt;/span&gt; : &lt;span class="n"&gt;OriginalStatement&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="cd"&gt;
  -- Introduce a "helper" that is essentially the same problem&lt;/span&gt;
  &lt;span class="k"&gt;have&lt;/span&gt; &lt;span class="n"&gt;helper&lt;/span&gt; : &lt;span class="n"&gt;EquivalentStatementWithDifferentName&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;
    &lt;span class="n"&gt;sorry&lt;/span&gt;&lt;span class="cd"&gt;  -- ← The actual difficulty lives here, unresolved&lt;/span&gt;
  &lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="n"&gt;reformulation_lemma&lt;/span&gt; &lt;span class="n"&gt;helper&lt;/span&gt;&lt;span class="cd"&gt;  -- This step is trivial&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explicitly prompting against this pattern in the system prompt did not prevent it. The LLM had learned that structurally complex-looking sketches receive relatively high Elo scores from rating agents — even when they make no real mathematical progress. This is a reward hacking failure: the proxy fitness signal (rating agent judgments) can be gamed by structure that &lt;em&gt;looks&lt;/em&gt; like progress without &lt;em&gt;achieving&lt;/em&gt; it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 2: Hallucinated Lemmas
&lt;/h3&gt;

&lt;p&gt;For several problems, the agent's highest-scoring sketches relied on &lt;code&gt;sorry&lt;/code&gt;-marked helper lemmas it claimed were "well-known results from the mathematical literature":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lean"&gt;&lt;code&gt;&lt;span class="cd"&gt;-- ⚠️ Failure pattern: hallucinated "folklore" lemmas
-- The agent confidently cites a result that does not exist&lt;/span&gt;
&lt;span class="k"&gt;have&lt;/span&gt; &lt;span class="n"&gt;folklore_bound&lt;/span&gt; : &lt;span class="o"&gt;∀&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; : &lt;span class="o"&gt;ℕ&lt;/span&gt;, &lt;span class="n"&gt;SomeProperty&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;→&lt;/span&gt; &lt;span class="n"&gt;Bound&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; := &lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="cd"&gt;
  -- This follows immediately from the classical result in
  -- [AuthorName, Year, Theorem 3.4] — a standard consequence
  -- of the Chinese Remainder Theorem applied to dense sets.&lt;/span&gt;
  &lt;span class="n"&gt;sorry&lt;/span&gt;&lt;span class="cd"&gt;  -- When manually checked: this lemma is FALSE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manual inspection revealed these were hallucinations: confident citations to nonexistent papers, for lemmas that turned out to be false. The Lean compiler caught the inconsistency (the &lt;code&gt;sorry&lt;/code&gt; was still present), but it could not evaluate the &lt;em&gt;external&lt;/em&gt; truthfulness claim about the mathematical literature. This underscores why end-to-end AI formal proof generation with &lt;code&gt;sorry&lt;/code&gt;-free verification matters: false claims cannot silently masquerade as proven facts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Failure Mode 3: Mathlib Coverage Gaps
&lt;/h3&gt;

&lt;p&gt;The system's successes cluster strongly in areas where Lean's &lt;a href="https://leanprover-community.github.io/mathlib4_docs/" rel="noopener noreferrer"&gt;Mathlib&lt;/a&gt; library is mature: combinatorics, number theory, convex optimization, and elementary algebraic geometry. Problems requiring extensive new theory — new definitions and mathematical structures not yet encoded in Mathlib — are largely out of reach. The agent has no foundation of verified lemmas to build upon in those areas.&lt;/p&gt;

&lt;p&gt;This is ultimately a community data problem: Mathlib grows as mathematicians formalize more results, and AlphaProof Nexus's capability grows in step.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. What This Means for Engineers: The Agentic Paradigm Shift in AI Formal Proof Generation {#what-this-means-for-engineers}
&lt;/h2&gt;

&lt;p&gt;AlphaProof Nexus is not merely a mathematics paper. It is a proof-of-concept for a general architectural pattern with direct implications for engineers building production AI systems. Here are the five lessons that transfer most directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 1: Replace Probabilistic Evaluators with Formal Verifiers Where Possible
&lt;/h3&gt;

&lt;p&gt;The single most important architectural upgrade in AlphaProof Nexus — versus simply asking an LLM to write a proof — is replacing a probabilistic evaluator ("does this look right?") with a &lt;strong&gt;formal, deterministic verifier&lt;/strong&gt; (the Lean compiler). The verifier never hallucinates. It never gets tired. Its error messages are structured and machine-readable.&lt;/p&gt;

&lt;p&gt;In software engineering, the analog is direct: &lt;strong&gt;use your compiler, type checker, test suite, linter, and static analyzer as the feedback signal for code-generating agents&lt;/strong&gt;, rather than asking another LLM to evaluate correctness. If you're building an AI coding agent, the test runner &lt;em&gt;is&lt;/em&gt; your Lean compiler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# General pattern: LLM + Formal Verifier Loop
# Applicable to code generation, SQL synthesis, config generation, API orchestration
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verified_generation_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verifier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Replace &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ask LLM if it looks right&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; with &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run the actual verifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    Deterministic feedback dramatically improves self-correction accuracy.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_initial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;verifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Deterministic, never hallucinates
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all_passed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;  &lt;span class="c1"&gt;# ✅ Formally verified correct
&lt;/span&gt;
        &lt;span class="c1"&gt;# Structured failure feedback — the key ingredient
&lt;/span&gt;        &lt;span class="n"&gt;feedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;attempt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed_checks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Structured, not natural language
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partial_success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;passing_count&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Did not converge within budget
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Lesson 2: The Ralph Loop Is a First-Class Engineering Primitive
&lt;/h3&gt;

&lt;p&gt;The Ralph loop — multi-turn LLM inference with tool calls, followed by a deterministic validation step, followed by lesson extraction and iteration — is a clean, composable primitive for any agentic task with verifiable success criteria. Before reaching for complex multi-agent orchestration frameworks, benchmark a well-implemented Ralph loop. The AlphaProof Nexus results suggest you will be surprised by how far it takes you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 3: Use Cheap Models for Evaluation, Expensive Models for Generation
&lt;/h3&gt;

&lt;p&gt;Rating agents in AlphaProof Nexus run on Gemini 3.0 Flash — 4–10x cheaper than Gemini 3.1 Pro. The expensive model generates. The cheap model evaluates and ranks. This is a practical production pattern: frontier models focus on the hard creative task; lightweight models handle scoring, routing, and meta-evaluation. Your infrastructure cost curve will thank you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 4: Elo + P-UCB Is a General Solution for Sparse Reward Evolutionary Search
&lt;/h3&gt;

&lt;p&gt;If you are building systems that search over code architectures, prompt variations, algorithm configurations, or any high-dimensional discrete space with sparse or binary reward signals, the Elo + P-UCB pattern is directly applicable. Use cheap judges to create a continuous proxy fitness. Use UCB-style exploration to maintain population diversity. Do not let your search converge prematurely to the first high-scoring solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson 5: Expose Mutable Parameters Alongside Correctness Constraints
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;EVOLVE-VALUE&lt;/code&gt; pattern — where algorithm parameters are mutable while correctness must be formally proven — unlocks joint search over algorithms &lt;em&gt;and&lt;/em&gt; proofs. In software engineering: let the agent search over configuration spaces (hyperparameters, architectural choices, algorithmic parameters) while simultaneously verifying that the resulting system meets formal correctness requirements. The AlphaProof Nexus result — discovering a &lt;em&gt;better algorithm&lt;/em&gt; as a side effect of proving its correctness — suggests this joint search is a powerful and underexplored capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Conclusion: The Future of Neuro-Symbolic AI
&lt;/h2&gt;

&lt;p&gt;AlphaProof Nexus is a landmark in AI formal proof generation — not only because it solved 9 Erdős problems and proved 44 OEIS conjectures, but because of what its architecture reveals about where reliable AI systems are heading.&lt;/p&gt;

&lt;p&gt;The neuro-symbolic paradigm — neural networks for creativity and hypothesis generation, symbolic systems for rigorous verification and feedback — is not a new idea. What is new is that frontier LLMs are now capable enough that &lt;strong&gt;simple neuro-symbolic loops are competitive with highly engineered specialized systems&lt;/strong&gt;. The gap between "connect an LLM to a compiler and loop" and "train a custom RL system for this specific domain" is closing, and it is closing fast.&lt;/p&gt;

&lt;p&gt;For engineers, the immediate takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build verifiers, not just generators.&lt;/strong&gt; Every agentic system should have a deterministic validation step. If your domain does not have one, build one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured error feedback is a force multiplier.&lt;/strong&gt; An LLM that knows &lt;em&gt;exactly&lt;/em&gt; what failed — not just "it didn't work" — is dramatically more capable of self-correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple loops scale.&lt;/strong&gt; Before reaching for complexity, benchmark the Ralph loop. You may be surprised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI is entering science as a genuine contributor.&lt;/strong&gt; The deployment of AlphaProof Nexus in active quantum optics and graph theory research is not a demo — it is a signal that AI agents are becoming part of the scientific workflow itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of reliable AI systems is not pure neural networks generating text and hoping for the best. It is the careful combination of LLM creativity and symbolic rigor — and AlphaProof Nexus is the most compelling demonstration yet of what that combination can achieve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lean proofs are open source at &lt;a href="https://github.com/google-deepmind/alphaproof-nexus-results" rel="noopener noreferrer"&gt;github.com/google-deepmind/alphaproof-nexus-results&lt;/a&gt;. Go read the actual &lt;code&gt;.lean&lt;/code&gt; files — they are extraordinary.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Drop a ⭐ and follow for more deep dives into AI systems architecture, agentic design patterns, and production ML engineering. Working on formal verification, neuro-symbolic AI, or agentic systems? I'd love to hear what you're building — share it in the comments below.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tsoukalas, G. et al. (2026). &lt;em&gt;Advancing Mathematics Research with AI-Driven Formal Proof Search.&lt;/em&gt; &lt;a href="https://arxiv.org/abs/2605.22763" rel="noopener noreferrer"&gt;arXiv:2605.22763&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Moura, L. &amp;amp; Ullrich, S. (2021). &lt;em&gt;The Lean 4 Theorem Prover and Programming Language.&lt;/em&gt; CADE 2021&lt;/li&gt;
&lt;li&gt;Novikov, A. et al. (2025). &lt;em&gt;AlphaEvolve: A Learning Framework to Discover Novel Algorithms.&lt;/em&gt; Google DeepMind&lt;/li&gt;
&lt;li&gt;Hubert, T. et al. (2025). &lt;em&gt;AI achieves silver-medal standard solving International Mathematical Olympiad problems.&lt;/em&gt; Nature&lt;/li&gt;
&lt;li&gt;Silver, D. et al. (2017). &lt;em&gt;Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.&lt;/em&gt; arXiv:1712.01815&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>LLM Agents Are Now Finding Zero-Days: How AI is Autonomously Rewriting the Rules of Vulnerability Research</title>
      <dc:creator>Manoranjan Rajguru</dc:creator>
      <pubDate>Tue, 26 May 2026 04:21:27 +0000</pubDate>
      <link>https://dev.to/monuminu/llm-agents-are-now-finding-zero-days-how-ai-is-autonomously-rewriting-the-rules-of-vulnerability-3d2i</link>
      <guid>https://dev.to/monuminu/llm-agents-are-now-finding-zero-days-how-ai-is-autonomously-rewriting-the-rules-of-vulnerability-3d2i</guid>
      <description>&lt;h1&gt;
  
  
  LLM Agents Are Now Finding Zero-Days: How AI is Autonomously Rewriting the Rules of Vulnerability Research
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;TL;DR:&lt;/strong&gt; LLM agents like Claude Mythos Preview and GPT-5.5 are now autonomously hunting zero-days at massive scale — 10,000+ critical CVEs found in weeks. This post breaks down the agentic harness architecture, real-world results, and gives you runnable code to deploy your own AI security pipeline today.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Published: May 26, 2026 · ⏱️ 18 min read · Tags: &lt;code&gt;security&lt;/code&gt;, &lt;code&gt;llm&lt;/code&gt;, &lt;code&gt;ai-agents&lt;/code&gt;, &lt;code&gt;vulnerability-research&lt;/code&gt;, &lt;code&gt;devops&lt;/code&gt;, &lt;code&gt;cybersecurity&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01yeddxawrcamoabu1wg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01yeddxawrcamoabu1wg.png" alt="AI scanning code for vulnerabilities in a cyberpunk server room" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Day an AI Found a macOS Kernel CVE&lt;/li&gt;
&lt;li&gt;What Is LLM Vulnerability Research? (Beyond Static Analysis)&lt;/li&gt;
&lt;li&gt;How Mythos Preview Works: Exploit Chain Construction &amp;amp; Proof Generation&lt;/li&gt;
&lt;li&gt;Real-World Results: Mozilla, Cloudflare, and Numbers That Stunned the Industry&lt;/li&gt;
&lt;li&gt;The Agentic Harness Architecture: Deep Technical Breakdown&lt;/li&gt;
&lt;li&gt;GPT-5.5 vs Claude Mythos: A Comparative Look at Frontier Security Models&lt;/li&gt;
&lt;li&gt;The New Bottleneck: Finding &amp;gt; Fixing&lt;/li&gt;
&lt;li&gt;Building Your Own AI Security Pipeline&lt;/li&gt;
&lt;li&gt;Safety, Ethics, and Dual-Use Concerns&lt;/li&gt;
&lt;li&gt;What's Next: The Near-Future of AI-Powered Cyber Defense&lt;/li&gt;
&lt;li&gt;Conclusion &amp;amp; Call to Action&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The Day an AI Found a macOS Kernel CVE
&lt;/h2&gt;

&lt;p&gt;On May 11, 2026 — just days ago — Apple published its security advisory for macOS Tahoe 26.5. Tucked among dozens of credited human researchers was one unusual line:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CVE-2026-28952&lt;/strong&gt; — An integer overflow addressed with improved input validation. Impact: An app may be able to gain root privileges.&lt;br&gt;
&lt;em&gt;Discovered by: Calif.io in collaboration with Claude and Anthropic Research.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that again. A kernel-level privilege escalation vulnerability — the kind that allows arbitrary apps to gain root access on macOS — was credited to an &lt;strong&gt;AI model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This wasn't a toy benchmark or a controlled research sandbox. This was a real CVE, now patched and assigned by Apple, found in critical kernel code by a large language model operating as an autonomous security research agent. The same week, Anthropic's Project Glasswing announced that Claude Mythos Preview had found over &lt;strong&gt;10,000 critical or high-severity vulnerabilities&lt;/strong&gt; across the world's most systemically important software in under a month.&lt;/p&gt;

&lt;p&gt;If you're a security engineer, a platform developer, or anyone who ships software that other people depend on — this changes your threat model. Permanently. This post breaks down exactly what happened, how these LLM vulnerability research agents work under the hood, and what you need to do about it right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What Is LLM Vulnerability Research? (Beyond Static Analysis)
&lt;/h2&gt;

&lt;p&gt;Before LLMs, automated vulnerability detection fell into well-understood categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SAST (Static Application Security Testing):&lt;/strong&gt; Pattern-matching against known vulnerability signatures in source code. Fast, high false-positive rate, misses logic bugs entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAST (Dynamic Application Security Testing):&lt;/strong&gt; Black-box fuzzing, sending malformed inputs and watching for crashes. Good for input validation bugs, blind to architectural flaws.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symbolic Execution:&lt;/strong&gt; Exhaustively explores code paths using constraint solvers (e.g., KLEE, angr). Powerful but doesn't scale to real-world codebases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Penetration Testing:&lt;/strong&gt; Human researchers manually auditing code. High quality, brutally expensive, doesn't scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM vulnerability research is none of these — and all of them at once.&lt;/p&gt;

&lt;p&gt;What makes frontier LLMs different is &lt;strong&gt;contextual reasoning at scale&lt;/strong&gt;. A traditional SAST scanner matches patterns. An LLM &lt;em&gt;understands&lt;/em&gt; what the code is &lt;em&gt;trying&lt;/em&gt; to do, can reason about multi-file call graphs, can hypothesize about trust boundaries, and can generate the proof that a bug is exploitable — all in a single reasoning pass.&lt;/p&gt;

&lt;p&gt;The key insight that the research community has arrived at in 2026 is this: &lt;strong&gt;LLMs don't just find bugs by recognizing patterns. They find bugs by understanding programmer intent vs. actual behavior — and finding where those diverge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 20-year-old XSLT bug in Firefox wasn't missed by fuzzers because the input space wasn't covered. It was missed because understanding the bug required knowing that &lt;code&gt;reentrant key() calls cause a hash table rehash that frees its backing store while a raw entry pointer is still in use&lt;/code&gt; — a multi-step logical chain that requires &lt;em&gt;semantic&lt;/em&gt; understanding of the codebase's memory model. Claude Mythos found it.&lt;/p&gt;

&lt;p&gt;This is the paradigm shift. We're no longer talking about automated scanners. We're talking about &lt;strong&gt;AI agents that reason like senior security researchers&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. How Mythos Preview Works: Exploit Chain Construction &amp;amp; Proof Generation
&lt;/h2&gt;

&lt;p&gt;Cloudflare's security team spent weeks with Mythos Preview on their own infrastructure, and their writeup identified two capabilities that distinguish it from all prior tooling:&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Exploit Chain Construction
&lt;/h3&gt;

&lt;p&gt;Real exploits rarely use a single vulnerability. They chain multiple primitives together — a use-after-free (UAF) becomes an arbitrary read/write primitive, which enables control-flow hijacking, which enables a full sandbox escape. Each step is individually low-severity; together they're critical.&lt;/p&gt;

&lt;p&gt;Traditional scanners report bugs in isolation. Mythos Preview &lt;strong&gt;reasons about how to chain them&lt;/strong&gt;. Given a set of identified primitives, it evaluates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which primitives can be combined?&lt;/li&gt;
&lt;li&gt;What preconditions does each step require?&lt;/li&gt;
&lt;li&gt;Can an attacker reliably satisfy those preconditions from an unprivileged context?&lt;/li&gt;
&lt;li&gt;What does the final exploit look like end-to-end?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloudflare observed the model taking bugs that would normally sit ignored in a low-severity backlog and constructing high-severity exploit chains that their own security team hadn't considered. This isn't just vulnerability &lt;em&gt;finding&lt;/em&gt; — it's vulnerability &lt;em&gt;weaponization&lt;/em&gt;, in service of defenders understanding true risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Proof-of-Concept Generation Loop
&lt;/h3&gt;

&lt;p&gt;Finding a bug and &lt;em&gt;proving&lt;/em&gt; it's exploitable are two very different things. Mythos Preview closes this gap with an autonomous PoC generation loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesize:&lt;/strong&gt; Identify a suspected vulnerability and formulate a triggering condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesize:&lt;/strong&gt; Write code that would trigger the bug — a test harness, a malformed input, a specific sequence of API calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile &amp;amp; Execute:&lt;/strong&gt; Build the PoC in an isolated sandbox environment and run it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observe &amp;amp; Iterate:&lt;/strong&gt; If the expected behavior (crash, memory corruption, privilege escalation) isn't observed, read the output, revise the hypothesis, and try again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop runs autonomously. Cloudflare described watching the model read compiler errors, adjust its exploit logic, and retry — behavior that previously required a human researcher sitting at a terminal. The result is a &lt;strong&gt;finding backed by a working proof of concept&lt;/strong&gt;, not a speculative observation hedged with "might" and "potentially."&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Real-World Results: Mozilla, Cloudflare, and Numbers That Stunned the Industry
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu5xwam9opnghd6l15ki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuu5xwam9opnghd6l15ki.png" alt="AI vulnerability discovery statistics dashboard" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The numbers from Project Glasswing's first month are genuinely staggering:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Organization&lt;/th&gt;
&lt;th&gt;Bugs Found&lt;/th&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Project Glasswing Partners (~50 orgs)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10,000+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Critical/High&lt;/td&gt;
&lt;td&gt;Collectively across critical infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloudflare&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;400 Critical/High&lt;/td&gt;
&lt;td&gt;Scanned 50+ internal repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mozilla Firefox&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;271&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mixed&lt;/td&gt;
&lt;td&gt;10x more than Firefox 148 with Opus 4.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Source Projects (1,000+)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;6,202&lt;/strong&gt; (high/critical est.)&lt;/td&gt;
&lt;td&gt;High/Critical&lt;/td&gt;
&lt;td&gt;90.6% true-positive rate after triage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Palo Alto Networks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5x normal patch volume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Accelerated release cadence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  4.1 Mozilla's Firefox: The Most Detailed Public Case Study
&lt;/h3&gt;

&lt;p&gt;Mozilla's Hacks blog published their harness methodology and even disclosed specific bug IDs — an unusual level of transparency that gives us a rare window into what AI-found bugs actually look like in practice. A few highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2024918:&lt;/strong&gt; An incorrect equality check allowed the JIT compiler to optimize away initialization of a live WebAssembly GC struct, creating a fake-object primitive with arbitrary read/write. This code had undergone &lt;em&gt;extensive fuzzing&lt;/em&gt; by both internal and external researchers and was never found.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2024437:&lt;/strong&gt; A 15-year-old bug in the &lt;code&gt;&amp;lt;legend&amp;gt;&lt;/code&gt; HTML element triggered by an intricate orchestration of recursion stack depth limits, expando properties, and cycle collection across distant parts of the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bug 2022733:&lt;/strong&gt; A parent-process UAF triggered by flooding WebTransport with thousands of certificate hashes to stretch a race condition in a refcount-heavy copy loop — then exploiting that race over IPC from a compromised content process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't simple buffer overflows. These are &lt;strong&gt;complex, multi-system, architecture-aware bugs&lt;/strong&gt; that require deep understanding of browser internals. Fuzzers, which work by exploring input space, simply can't reason about the semantic relationships between components that make these bugs possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 The False Positive Problem (And How It's Being Solved)
&lt;/h3&gt;

&lt;p&gt;One important caveat: early LLM-based security scanning (2024–early 2025 era models) was plagued by &lt;strong&gt;AI-generated slop bug reports&lt;/strong&gt; — plausible-sounding but entirely wrong findings that wasted maintainer time. Several open-source projects created policies explicitly rejecting AI-generated issues.&lt;/p&gt;

&lt;p&gt;Mythos Preview represents a step-change improvement here. Cloudflare reported that the model's output had &lt;strong&gt;noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision.&lt;/strong&gt; Critically, findings backed by a working PoC have a false-positive rate that approaches zero by definition — if the exploit runs and produces the expected output, the bug is real.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Agentic Harness Architecture: Deep Technical Breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84gzeoch75xr2k88rzqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84gzeoch75xr2k88rzqf.png" alt="AI security pipeline architecture diagram" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key lesson from all successful deployments is this: &lt;strong&gt;naïvely pointing an LLM at a repository and asking "find bugs" doesn't work well.&lt;/strong&gt; The quality of results scales dramatically with the sophistication of the harness around the model. Here's the architecture that state-of-the-art practitioners are converging on:&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Core Components
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    SECURITY AGENT PIPELINE                  │
├─────────────────────────────────────────────────────────────┤
│  1. THREAT MODELER          │  Maps codebase, identifies    │
│     (LLM + static analysis) │  attack surfaces, prioritizes │
├─────────────────────────────────────────────────────────────┤
│  2. SCANNER ORCHESTRATOR    │  Spins up parallel sub-agents │
│     (Agent coordinator)     │  per module/subsystem         │
├─────────────────────────────────────────────────────────────┤
│  3. VULN DETECTOR           │  Per-file/function analysis   │
│     (LLM sub-agent)         │  with semantic reasoning      │
├─────────────────────────────────────────────────────────────┤
│  4. EXPLOIT SYNTHESIZER     │  Generates PoC code,          │
│     (LLM + code executor)   │  compiles, and runs in sandbox│
├─────────────────────────────────────────────────────────────┤
│  5. TRIAGE ENGINE           │  Multi-model consensus,       │
│     (Ensemble of models)    │  severity rating, dedup       │
├─────────────────────────────────────────────────────────────┤
│  6. REPORT GENERATOR        │  CVE-formatted output,        │
│     (LLM)                   │  fix suggestions, CVSS scoring│
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Phase 1: Threat Modeling (Don't Skip This)
&lt;/h3&gt;

&lt;p&gt;The single biggest productivity multiplier is spending compute on &lt;strong&gt;threat modeling before scanning&lt;/strong&gt;. Ask the LLM to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a dependency graph of the codebase&lt;/li&gt;
&lt;li&gt;Identify all external trust boundaries (network input, file parsing, IPC, user input)&lt;/li&gt;
&lt;li&gt;Enumerate attack-relevant subsystems (crypto, auth, memory management, privilege operations)&lt;/li&gt;
&lt;li&gt;Produce a prioritized list of modules to scan, ordered by attack surface and severity potential&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This turns unfocused scanning into targeted analysis. Mozilla's team found this dramatically improved signal-to-noise: instead of 10,000 low-confidence findings across the whole codebase, they got 500 high-confidence findings in the highest-risk subsystems.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Phase 2: Parallel Sub-Agent Scanning
&lt;/h3&gt;

&lt;p&gt;Each high-priority module gets its own sub-agent instance with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full file context for the module under analysis&lt;/li&gt;
&lt;li&gt;Relevant cross-file dependencies loaded dynamically&lt;/li&gt;
&lt;li&gt;Language-specific vulnerability playbook (C/C++ memory bugs vs. Python deserialization vs. Rust unsafe blocks)&lt;/li&gt;
&lt;li&gt;A structured output format enforcing finding quality
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified sub-agent invocation pattern
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_module&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SecurityContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Launch a sandboxed LLM sub-agent to analyze a single module.
    Returns structured findings with severity, description, and PoC.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_security_analyst_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vulnerability_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;priority_vuln_classes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;trust_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;trust_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;output_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FindingSchema&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_with_dependencies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;module_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repo_root&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# or gpt-5.5 for high-value targets
&lt;/span&gt;        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this module for security vulnerabilities:&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;response_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;      &lt;span class="c1"&gt;# structured output enforces quality
&lt;/span&gt;        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.4 Phase 3: The PoC Validation Loop
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens — and where the false-positive rate collapses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_finding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SandboxEnv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ValidatedFinding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Attempt to generate and run a PoC for a suspected vulnerability.
    A finding backed by a working PoC has effectively 0% false positive rate.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;max_iterations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: Synthesize PoC code
&lt;/span&gt;        &lt;span class="n"&gt;poc_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                Write a minimal proof-of-concept that triggers this vulnerability:

                Finding: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
                Affected code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_snippet&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
                Expected behavior: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expected_trigger&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

                Write executable &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; code only. No explanations.
                &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Execute in isolated sandbox
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;poc_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Did it trigger the expected vulnerability?
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;crashed&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;matches_expected_behavior&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ValidatedFinding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;poc_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;poc_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;execution_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;false_positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 4: Iterate — feed failure back to model
&lt;/span&gt;        &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;refine_hypothesis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Couldn't reproduce after max_iterations — flag as unconfirmed
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ValidatedFinding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LOW&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;false_positive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.5 Phase 4: Multi-Model Triage for Consensus
&lt;/h3&gt;

&lt;p&gt;One of the most powerful techniques for reducing false positives — borrowed from &lt;a href="https://milvus.io/blog/ai-code-review-gets-better-when-models-debate-claude-vs-gemini-vs-codex-vs-qwen-vs-minimax.md" rel="noopener noreferrer"&gt;Milvus's research on AI code review&lt;/a&gt; — is &lt;strong&gt;running multiple independent models and requiring consensus&lt;/strong&gt;. A finding reported by Claude Opus, GPT-5.5, &lt;em&gt;and&lt;/em&gt; Gemini independently is orders of magnitude more likely to be real than one reported by a single model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;triage_with_consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-7&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ConsensusResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Submit a finding to multiple models for independent verification.
    Require 2/3 agreement to advance to human review queue.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;verdicts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;verify_finding_with_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;confirmed_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;verdicts&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_valid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ConsensusResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;verdicts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;verdicts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;consensus_reached&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;confirmed_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;confidence_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;confirmed_count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;advance_to_human_review&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;confirmed_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. GPT-5.5 vs Claude Mythos: A Comparative Look at Frontier Security Models
&lt;/h2&gt;

&lt;p&gt;As of May 2026, two models dominate the LLM vulnerability research space. Here's how they compare based on independent benchmarks and real-world deployments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Claude Mythos Preview&lt;/th&gt;
&lt;th&gt;GPT-5.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Restricted (Project Glasswing / Enterprise)&lt;/td&gt;
&lt;td&gt;Generally available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vulnerability Miss Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5-8% (est.)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;10%&lt;/strong&gt; (XBOW benchmark)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Black-box performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Excellent — outperforms GPT-5 &lt;em&gt;with&lt;/em&gt; source code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;White-box performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best-in-class&lt;/td&gt;
&lt;td&gt;"Effectively killed" XBOW's benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exploit chain construction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Core capability&lt;/td&gt;
&lt;td&gt;✅ Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PoC generation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Autonomous loop&lt;/td&gt;
&lt;td&gt;✅ Strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persist vs. pivot decision-making&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong&lt;/td&gt;
&lt;td&gt;Improved (50% fewer bad persist decisions vs. GPT-5.4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency/guardrails&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Inconsistent organic refusals&lt;/td&gt;
&lt;td&gt;More consistent behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Absolutely unprecedented precision" (XBOW)&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The key practical difference today:&lt;/strong&gt; Claude Mythos Preview is not publicly available — it's restricted to Project Glasswing partners and enterprise security teams with a verified use case. GPT-5.5 is generally available and, per XBOW's benchmarks, delivers Mythos-class performance in white-box scenarios.&lt;/p&gt;

&lt;p&gt;For most security teams &lt;em&gt;today&lt;/em&gt;, &lt;strong&gt;GPT-5.5 in a well-architected harness is the path to production&lt;/strong&gt;. If your organization qualifies for Anthropic's Cyber Verification Program or Claude Security enterprise beta, Mythos-class capabilities are accessible via Claude Opus 4.7 as well.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The New Bottleneck: Finding &amp;gt; Fixing
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth that Project Glasswing has surfaced for the entire software industry:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI has solved the hard part. The bottleneck is now entirely human.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For decades, the security community's limiting factor was &lt;em&gt;finding&lt;/em&gt; vulnerabilities — it required expensive, senior human expertise and took weeks per codebase. That constraint has evaporated. Mythos Preview is finding critical bugs faster than any team of human researchers could. The new constraint is triage, disclosure, patch development, and deployment.&lt;/p&gt;

&lt;p&gt;Some maintainers in Project Glasswing's open-source scanning initiative have &lt;strong&gt;asked Anthropic to slow down disclosure&lt;/strong&gt; because they can't keep up. That's an extraordinary sentence. A world-class AI is producing so much valid, actionable security research that human maintainers are begging it to stop.&lt;/p&gt;

&lt;p&gt;The downstream implications for your engineering organization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shorten patch cycles aggressively.&lt;/strong&gt; The 90-day standard disclosure window was designed for the old world. As AI-found bugs become public CVEs faster, the exploitation window is compressing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invest in automated patch generation pipelines.&lt;/strong&gt; Claude Security (now in public beta for Enterprise) can generate proposed fixes, not just identify bugs. This is the next frontier for reducing the triage burden.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory-safe languages matter more than ever.&lt;/strong&gt; Both Cloudflare and Mozilla's data confirm significantly higher false-positive rates and more severe findings in C/C++ codebases vs. Rust or Go. The ROI on memory-safe rewrites just got a lot more concrete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Staged rollout policies are critical.&lt;/strong&gt; With AI accelerating both attack and defense, end users need to be able to receive patches faster. Frictionless update mechanisms aren't just a UX concern — they're a security posture.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  8. Building Your Own AI Security Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84gzeoch75xr2k88rzqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84gzeoch75xr2k88rzqf.png" alt="LLM vulnerability research pipeline architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You don't need access to Mythos Preview to start today. Here's a practical, production-ready approach using generally available models:&lt;/p&gt;

&lt;h3&gt;
  
  
  8.1 Minimal Viable Security Scanner (Python + Claude API)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
minimal_vuln_scanner.py
A basic LLM-powered vulnerability scanner for CI/CD integration.
Requires: anthropic&amp;gt;=0.30.0, pip install anthropic
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncAnthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncAnthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;SECURITY_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an expert security researcher performing a 
white-box vulnerability audit. Analyze the provided code for:

1. Memory safety issues (buffer overflows, UAF, null deref — especially in C/C++)
2. Injection vulnerabilities (SQL, command, LDAP, path traversal)  
3. Authentication/authorization bypasses
4. Race conditions and TOCTOU bugs
5. Cryptographic weaknesses
6. Unsafe deserialization
7. Integer overflow/underflow conditions
8. Logic bugs affecting security-critical code paths

For each finding, provide:
- Vulnerability class (CWE ID if applicable)
- Severity (Critical/High/Medium/Low)
- Affected code location (file:line)
- Root cause explanation (2-3 sentences)
- Proof-of-concept trigger (how would an attacker trigger this?)
- Recommended fix

Return your response as a JSON array of findings. If no vulnerabilities are found,
return an empty array []. Do NOT speculate — only report findings you are confident about.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Scan a single file for vulnerabilities using Claude.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Skip files that are too short to be meaningful
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Use claude-opus-4-7 for higher accuracy
&lt;/span&gt;        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SECURITY_SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;File: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="c1"&gt;# Truncate to 50k chars; for large files, chunk by function
&lt;/span&gt;        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract JSON array from response
&lt;/span&gt;        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="c1"&gt;# Annotate each finding with source file
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extensions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Scan an entire repository for vulnerabilities.

    Args:
        repo_path: Path to the repository root
        extensions: File extensions to scan (default: common security-relevant types)

    Returns:
        Dict with findings grouped by severity
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extensions&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;extensions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.cpp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.h&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.py&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.js&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.go&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.rs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.java&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;files_to_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rglob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suffix&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;extensions&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.git&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node_modules&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
        &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vendor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Scanning &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;files_to_scan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; files in &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Scan files concurrently (respect API rate limits)
&lt;/span&gt;    &lt;span class="n"&gt;semaphore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Semaphore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Max 5 concurrent API calls
&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;semaphore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    Scanning: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relative_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scan_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;all_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;scan_with_limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files_to_scan&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Flatten and group by severity
&lt;/span&gt;    &lt;span class="n"&gt;all_findings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sublist&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_results&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sublist&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;grouped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_findings&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_findings&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_findings&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_findings&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;grouped&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
    &lt;span class="n"&gt;repo_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scan_repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;repo_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SCAN COMPLETE — &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; findings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🔴 Critical: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🟠 High:     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🟡 Medium:   &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🟢 Low:      &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Print critical and high findings in detail
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;vulnerability_class&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  File: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;source_file&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;root_cause&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Fix: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;recommended_fix&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;See full report&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save full report
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;security_report.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[*] Full report saved to security_report.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.2 CI/CD Integration (GitHub Actions)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/ai-security-scan.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Security Scan&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;  &lt;span class="c1"&gt;# Weekly full scan every Monday at 2am&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;llm-vuln-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
      &lt;span class="na"&gt;security-events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Full history for diff-based scanning on PRs&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Python&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-python@v5&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;python-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;3.12'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pip install anthropic&amp;gt;=0.30.0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run AI Security Scanner&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# On PRs: scan only changed files for speed&lt;/span&gt;
          &lt;span class="s"&gt;if [ "${{ github.event_name }}" = "pull_request" ]; then&lt;/span&gt;
            &lt;span class="s"&gt;git diff --name-only origin/${{ github.base_ref }}...HEAD &amp;gt; changed_files.txt&lt;/span&gt;
            &lt;span class="s"&gt;python minimal_vuln_scanner.py --files-list changed_files.txt&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;# On scheduled run: full repository scan&lt;/span&gt;
            &lt;span class="s"&gt;python minimal_vuln_scanner.py .&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check for Critical Findings&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CRITICAL_COUNT=$(python -c "&lt;/span&gt;
          &lt;span class="s"&gt;import json&lt;/span&gt;
          &lt;span class="s"&gt;with open('security_report.json') as f:&lt;/span&gt;
              &lt;span class="s"&gt;report = json.load(f)&lt;/span&gt;
          &lt;span class="s"&gt;print(len(report.get('critical', [])))&lt;/span&gt;
          &lt;span class="s"&gt;")&lt;/span&gt;
          &lt;span class="s"&gt;echo "Critical findings: $CRITICAL_COUNT"&lt;/span&gt;
          &lt;span class="s"&gt;# Fail the build on critical findings&lt;/span&gt;
          &lt;span class="s"&gt;if [ "$CRITICAL_COUNT" -gt "0" ]; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "::error::$CRITICAL_COUNT critical security vulnerabilities found!"&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Post PR Comment with Findings&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.event_name == 'pull_request'&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/github-script@v7&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;const fs = require('fs');&lt;/span&gt;
            &lt;span class="s"&gt;const report = JSON.parse(fs.readFileSync('security_report.json'));&lt;/span&gt;
            &lt;span class="s"&gt;const total = Object.values(report).flat().length;&lt;/span&gt;

            &lt;span class="s"&gt;const body = `## 🔍 AI Security Scan Results&lt;/span&gt;

            &lt;span class="s"&gt;| Severity | Count |&lt;/span&gt;
            &lt;span class="s"&gt;|---|---|&lt;/span&gt;
            &lt;span class="s"&gt;| 🔴 Critical | ${report.critical?.length || 0} |&lt;/span&gt;
            &lt;span class="s"&gt;| 🟠 High | ${report.high?.length || 0} |&lt;/span&gt;
            &lt;span class="s"&gt;| 🟡 Medium | ${report.medium?.length || 0} |&lt;/span&gt;
            &lt;span class="s"&gt;| 🟢 Low | ${report.low?.length || 0} |&lt;/span&gt;

            &lt;span class="s"&gt;${total === 0 ? '✅ No vulnerabilities found!' : '⚠️ Review findings in the security_report.json artifact.'}`;&lt;/span&gt;

            &lt;span class="s"&gt;github.rest.issues.createComment({&lt;/span&gt;
              &lt;span class="s"&gt;issue_number: context.issue.number,&lt;/span&gt;
              &lt;span class="s"&gt;owner: context.repo.owner,&lt;/span&gt;
              &lt;span class="s"&gt;repo: context.repo.repo,&lt;/span&gt;
              &lt;span class="s"&gt;body: body&lt;/span&gt;
            &lt;span class="s"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.3 Advanced: Multi-Model Consensus Harness
&lt;/h3&gt;

&lt;p&gt;For high-value codebases, the production-grade approach is multi-model consensus to approach near-zero false-positive rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# multi_model_consensus.py
# Run findings through multiple models; only surface results where ≥2 agree.
# Requires ANTHROPIC_API_KEY and OPENAI_API_KEY env vars.
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncOpenAI&lt;/span&gt;

&lt;span class="n"&gt;anthropic_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncAnthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;openai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AsyncOpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;VERIFICATION_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an expert security researcher verifying whether
a reported vulnerability is real or a false positive.

Given the following finding and source code, answer:
1. Is this vulnerability real? (yes/no/uncertain)
2. If real: can an attacker trigger it from an untrusted context? (yes/no/uncertain)
3. Confidence: (high/medium/low)

Respond in JSON: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_real&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: bool, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;triggerable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: bool, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;one sentence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_with_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-opus-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;VERIFICATION_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finding:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Code:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;verify_with_gpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;openai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VERIFICATION_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Finding:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Code:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consensus_verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Verify a finding with multiple models; return consensus result.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;claude_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpt_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;verify_with_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;verify_with_gpt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Require both to agree the finding is real
&lt;/span&gt;    &lt;span class="n"&gt;both_confirm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;claude_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_real&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;gpt_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_real&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;consensus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;both_confirm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude_verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claude_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt_verdict&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gpt_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;advance_to_human_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;both_confirm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false_positive_probability&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;both_confirm&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. Safety, Ethics, and Dual-Use Concerns
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo57n94wv6c2yhw1znjt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo57n94wv6c2yhw1znjt8.png" alt="Dual-use AI cybersecurity ethics illustration" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It would be irresponsible to discuss this technology without addressing the elephant in the room: &lt;strong&gt;the same capability that finds bugs for defenders can be used by attackers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Anthropic has been explicit about this tension. From their Glasswing update:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Models as capable as Mythos Preview will soon be developed by many different AI companies. At present, no company — including Anthropic — has developed safeguards strong enough to prevent such models from being misused."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is why Mythos Preview is not publicly released. But it's also why this matters: &lt;strong&gt;the capability genie is not going back in the bottle&lt;/strong&gt;. The question isn't whether powerful AI vulnerability research tools will exist — they will. The question is whether defenders can gain and hold an asymmetric advantage before those tools proliferate to malicious actors.&lt;/p&gt;

&lt;p&gt;Key ethical considerations for engineers building in this space:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Responsible disclosure, always.&lt;/strong&gt; AI is going to accelerate vulnerability discovery dramatically. The 90-day disclosure standard exists for good reason — it gives end users time to patch. Don't let the excitement of AI-found bugs shortcut this process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scope your harness carefully.&lt;/strong&gt; Ensure your scanning pipeline only touches infrastructure you own or have explicit written authorization to test. The fact that a tool is effective doesn't change the legal and ethical requirements for authorization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify before you disclose.&lt;/strong&gt; Submit only confirmed, PoC-backed findings to maintainers. The open-source community is already overwhelmed by low-quality AI-generated bug reports. Be part of the solution, not the problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Watch for model inconsistency.&lt;/strong&gt; Cloudflare's team documented that Mythos Preview's organic guardrails are inconsistent — the same task framed differently could produce completely different refusal behavior. Don't treat model-level safeguards as a substitute for process-level controls.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  10. What's Next: The Near-Future of AI-Powered Cyber Defense
&lt;/h2&gt;

&lt;p&gt;Based on the current trajectory, here's what the next 12–18 months look like for LLM vulnerability research:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Near-term (3–6 months):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mythos-class capabilities will become available in more generally accessible models as Anthropic and OpenAI iterate&lt;/li&gt;
&lt;li&gt;Automated patch generation will mature — tools like Claude Security will move from "propose fixes" to "generate, test, and submit PRs" with minimal human involvement&lt;/li&gt;
&lt;li&gt;CI/CD-integrated AI security scanning will become a default expectation, not a differentiator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium-term (6–18 months):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The concept of a "security debt surface" will become quantifiable in real-time — every codebase will have a live severity score updated continuously by AI agents&lt;/li&gt;
&lt;li&gt;Memory-safe language adoption will accelerate dramatically as C/C++ vulnerability rates become impossible to ignore empirically&lt;/li&gt;
&lt;li&gt;The security research job market will bifurcate: routine scanning automation, but a premium on researchers who can architect harnesses, interpret AI findings, and build novel exploitation techniques that AI hasn't yet learned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The bottleneck will shift from patching to &lt;strong&gt;architectural hardening&lt;/strong&gt; — teams will move from "fix this bug" to "eliminate this entire bug class" through language choices, sandboxing, capability restriction, and formal verification&lt;/li&gt;
&lt;li&gt;AI models may begin writing security specifications and verifying code against them, moving toward a world where newly written code is provably free of common vulnerability classes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Conclusion: The LLM Vulnerability Research Era Has Begun
&lt;/h2&gt;

&lt;p&gt;We are living through a genuine phase transition in software security. The tools that found a kernel CVE in macOS, 271 latent bugs in Firefox, and 2,000 vulnerabilities across Cloudflare's infrastructure in weeks — these are not research prototypes. They are production systems, available today, finding real bugs in real code.&lt;/p&gt;

&lt;p&gt;The LLM vulnerability research agent isn't coming. &lt;strong&gt;It's here.&lt;/strong&gt; And if you're shipping software that other people depend on, the question is not whether to engage with this technology — it's whether you engage with it proactively, as a defender, or reactively, after an attacker already has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three things you can do this week:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run the minimal scanner above&lt;/strong&gt; against your most critical service. Set your ANTHROPIC_API_KEY, point it at a repo, and see what it finds. The marginal cost of a scan is a few API dollars. The marginal cost of an unpatched critical is not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set up the GitHub Actions workflow&lt;/strong&gt; for your team's most security-sensitive repositories. Automated scanning on every PR is now table stakes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apply to Anthropic's Cyber Verification Program&lt;/strong&gt; if your organization does legitimate security research, red-teaming, or penetration testing. Access to higher-capability models in this domain is now a significant professional advantage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Glasswing era of software security has begun. The organizations that understand the architecture behind these tools — not just that they exist, but &lt;em&gt;how&lt;/em&gt; they work and how to deploy them effectively — will have a structural security advantage for the next decade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bugs are being found. The question is who finds them first.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Found this useful? Drop a ⭐ on the companion GitHub repo with the full harness implementation, contribute to the discussion in the comments, and share this with the security engineer on your team who hasn't heard about Project Glasswing yet.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;llm-vulnerability-research&lt;/code&gt; &lt;code&gt;generative-ai&lt;/code&gt; &lt;code&gt;cybersecurity&lt;/code&gt; &lt;code&gt;agentic-ai&lt;/code&gt; &lt;code&gt;claude&lt;/code&gt; &lt;code&gt;gpt-5&lt;/code&gt; &lt;code&gt;devsecops&lt;/code&gt; &lt;code&gt;security-engineering&lt;/code&gt; &lt;code&gt;zero-day&lt;/code&gt; &lt;code&gt;project-glasswing&lt;/code&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>ai</category>
      <category>llm</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
