<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dimitrios Milonopoulos</title>
    <description>The latest articles on DEV Community by Dimitrios Milonopoulos (@dimimil).</description>
    <link>https://dev.to/dimimil</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3827638%2F628a69b9-1b91-4c63-93b6-f1d1aa683496.png</url>
      <title>DEV Community: Dimitrios Milonopoulos</title>
      <link>https://dev.to/dimimil</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimimil"/>
    <language>en</language>
    <item>
      <title>License to Skill: Everything You Need to Take Your AI Agent Game to the Next Level</title>
      <dc:creator>Dimitrios Milonopoulos</dc:creator>
      <pubDate>Tue, 17 Mar 2026 15:14:16 +0000</pubDate>
      <link>https://dev.to/dimimil/license-to-skill-everything-you-need-to-take-your-ai-agent-game-to-the-next-level-3noi</link>
      <guid>https://dev.to/dimimil/license-to-skill-everything-you-need-to-take-your-ai-agent-game-to-the-next-level-3noi</guid>
      <description>&lt;p&gt;At &lt;a href="https://dryft.ai/" rel="noopener noreferrer"&gt;Dryft&lt;/a&gt;, we build systems that replicate human decisions in industrial operations, through a combination of AI agents and mathematical optimization and simulation. Our agents analyze data and enterpise context to provide actionable recommendations, all in real-time conversations with domain experts.&lt;/p&gt;

&lt;p&gt;All of our agents are built on &lt;a href="https://ai.pydantic.dev/" rel="noopener noreferrer"&gt;Pydantic AI&lt;/a&gt; (we are big fans of Pydantic). In this article we use Pydantic AI as our point of reference for building agents, but the techniques and concepts should be applicable to any agentic framework.&lt;/p&gt;

&lt;p&gt;The agents live under their own domain, as we follow Domain-Driven Design (DDD). That means all core agent logic is implemented in its own distinct domain, not mixed with API routes, database models, or other concerns. This keeps the codebase clean and maintainable.&lt;/p&gt;

&lt;p&gt;This post will explore some agent structuring patterns, from the simplest pattern to the most complex, along with the reasoning behind when to use each. &lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
Agent Anatomy

&lt;ul&gt;
&lt;li&gt;1. Model &amp;amp; Settings&lt;/li&gt;
&lt;li&gt;2. Dependencies (Deps)&lt;/li&gt;
&lt;li&gt;3. System Prompt &amp;amp; Instructions&lt;/li&gt;
&lt;li&gt;4. Tools&lt;/li&gt;
&lt;li&gt;5. Output Type&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Progressive Complexity

&lt;ul&gt;
&lt;li&gt;Level 1: Simple Structured Output&lt;/li&gt;
&lt;li&gt;Level 2: Tools + Dynamic Context + Post-Processing&lt;/li&gt;
&lt;li&gt;Level 3: Full Agentic Workflow with Streaming&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Modular Prompts

&lt;ul&gt;
&lt;li&gt;XML Over Markdown in Prompts&lt;/li&gt;
&lt;li&gt;Constants and Dynamic Sections&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Internationalization for LLM Agents&lt;/li&gt;

&lt;li&gt;

Agents Reason, Tools Compute

&lt;ul&gt;
&lt;li&gt;Feature Flags&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Streaming &amp;amp; Observability&lt;/li&gt;

&lt;li&gt;

Open Problems

&lt;ul&gt;
&lt;li&gt;Secure Code Execution by Agents&lt;/li&gt;
&lt;li&gt;Dynamic Context&lt;/li&gt;
&lt;li&gt;Testing Non-Deterministic Flows&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Closing Thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Agent Anatomy
&lt;/h2&gt;

&lt;p&gt;Every agent we build is composed of at least these five building blocks. Understanding these makes it straightforward to go from "I need an agent that does X" to a working implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sjuurtt7c3gto86ap2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sjuurtt7c3gto86ap2y.png" alt="Mermaid Diagram of Agent Anatomy" width="800" height="831"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Model &amp;amp; Settings
&lt;/h3&gt;

&lt;p&gt;We centralize model settings in a single config. Each model has predefined settings for temperature, max tokens, and (for reasoning models) reasoning effort level. These can of course be overridden at the agent level depending on the needs.&lt;/p&gt;

&lt;p&gt;The LLM provider is defined in a single factory method, and we fully manage its lifecycle — we do a lot of heavy asynchronous workflows and we want our agents to be &lt;a href="https://en.wikipedia.org/wiki/Thread_safety" rel="noopener noreferrer"&gt;thread safe&lt;/a&gt;. The idea is to be able to experiment with other providers via a single line of code.&lt;/p&gt;

&lt;p&gt;We distinguish between &lt;strong&gt;reasoning models&lt;/strong&gt; (like gpt-5.4, or the now-dated o3) which expose a &lt;code&gt;reasoning_effort&lt;/code&gt; parameter, and &lt;strong&gt;standard models&lt;/strong&gt; (like gpt-4.1) which use traditional temperature control. Choose reasoning models for complex analysis and cheaper standard models for simpler tasks like matching or classification.&lt;/p&gt;

&lt;p&gt;For a gentle introduction to temperature, max tokens, and LLM decoding we recommend the &lt;a href="https://huggingface.co/blog/how-to-generate" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt; blog post. Reasoning effort is inspired by the concept of &lt;a href="https://proceedings.neurips.cc/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html" rel="noopener noreferrer"&gt;Chain-of-Thought prompting&lt;/a&gt;, which each LLM provider implements in their own way and describes in their docs. &lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dependencies (Deps)
&lt;/h3&gt;

&lt;p&gt;Every agent declares a &lt;strong&gt;dependency type&lt;/strong&gt; — a dataclass or Pydantic model that gets injected into tools at runtime via Pydantic AI's &lt;code&gt;RunContext&lt;/code&gt;. Deps are the agent's "working memory" across tool calls.&lt;/p&gt;

&lt;p&gt;They range in complexity depending on what the agent needs to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimal&lt;/strong&gt;: A simple dataclass with just basic identifiers for the agent to run (e.g., a classification agent)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich&lt;/strong&gt;: A Pydantic model with computed fields and a factory method (e.g., an agent that pre-computes deltas so the LLM doesn't have to do arithmetic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full&lt;/strong&gt;: A dataclass with 30+ fields, different factory methods for hydration, and caching.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key pattern: &lt;a href="https://ai.pydantic.dev/dependencies/" rel="noopener noreferrer"&gt;deps&lt;/a&gt; &lt;strong&gt;start sparse and get hydrated by tools&lt;/strong&gt;, if their hydration depends on LLM inference. For example, we don't know which entity the user is asking about until we parse their prompt — so the first tool call resolves that, and subsequent tools reuse what's already loaded. Other agents can hydrate their deps entirely upon initialization.&lt;/p&gt;

&lt;p&gt;Caching the state of an agent's deps and its interactions (what is usually referred to as the “conversation”) is also a powerful pattern for multi-turn conversations. From a user's perspective, it ensures that one can resume their tasks without any delays. From a developer's perspective, it provides an easy way to debug and analyze agent interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. System Prompt &amp;amp; Instructions
&lt;/h3&gt;

&lt;p&gt;There are two mechanisms for injecting prompts into the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://ai.pydantic.dev/agent/#system-prompts" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;system_prompt&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt;: A static string set at construction time. Used when the prompt doesn't need runtime data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.pydantic.dev/agent/#instructions" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;code&gt;instructions&lt;/code&gt;&lt;/strong&gt;&lt;/a&gt;: A list of callables that receive &lt;code&gt;RunContext&lt;/code&gt; and return strings. Evaluated at runtime with full access to deps. &lt;strong&gt;This is the preferred pattern&lt;/strong&gt; for dynamic prompts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reason we prefer &lt;code&gt;instructions&lt;/code&gt; is that we can dynamically inject context into the prompt to keep relationships between data and instructions as close as possible. We found that this makes the agent much more likely to take the correct context into account, and it leads to more understandable instructions (e.g., instead of saying "Given the following data, do X", we can say "Given that the &lt;code&gt;efficiency_ratio&lt;/code&gt; is 0.65, which is below the acceptable threshold of 0.8, analyze the potential causes and recommend improvements"). This also avoids the problem of having the context too far away from the instructions, which can lead to the LLM forgetting or ignoring it.&lt;/p&gt;

&lt;p&gt;As Donald Hebb said, "Neurons that fire together wire together" — the closer the context and instructions are in the prompt, the more likely the LLM is to associate them correctly. This is also the main reason we use elaborate tool docstrings and Model Field descriptions, in both input and output Tool Fields.&lt;/p&gt;
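&lt;p&gt;A self-contained sketch of such a dynamic instruction (in real Pydantic AI code the function would be registered on the agent and receive the framework's own &lt;code&gt;RunContext&lt;/code&gt;; here we use a minimal stand-in so the example runs on its own):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class RunContext:
    """Minimal stand-in for Pydantic AI's RunContext, just for illustration."""
    deps: object

@dataclass
class Deps:
    efficiency_ratio: float
    threshold: float

def efficiency_instructions(ctx):
    """Dynamic instruction: the data sits right next to what to do with it."""
    d = ctx.deps
    return (
        f"Given that the efficiency_ratio is {d.efficiency_ratio}, "
        f"against an acceptable threshold of {d.threshold}, "
        "analyze the potential causes and recommend improvements."
    )
```

&lt;p&gt;Because the function is evaluated at runtime, each run gets a prompt built from that run's deps, instead of a generic "given the following data" preamble.&lt;/p&gt;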

&lt;h3&gt;
  
  
  4. Tools
&lt;/h3&gt;

&lt;p&gt;Tools are async functions that do the &lt;strong&gt;heavy deterministic work&lt;/strong&gt; and return &lt;strong&gt;Pydantic models&lt;/strong&gt;. This is a core design principle: the agent decides &lt;em&gt;what&lt;/em&gt; to do, the tools do the actual computation. Essentially we believe that the LLM works best as the magic glue that connects data and decisions, while deterministic tools should be responsible for doing the actual computations and the math.&lt;/p&gt;

&lt;p&gt;A tool receives &lt;code&gt;RunContext[DepsType]&lt;/code&gt; as its first argument (auto-injected by Pydantic AI). The function's docstring and parameter annotations become the tool description the LLM sees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Pydantic models as return types?&lt;/strong&gt; By annotating return model fields with &lt;code&gt;Field(description=...)&lt;/code&gt;, the LLM gets self-documenting data. Each field carries its own explanation — what it means, its unit, its range. This is far more effective than returning raw dicts or strings, because the LLM can reason about the data accurately without needing extra prompt instructions, and it significantly reduces the chances of misinterpretation.&lt;/p&gt;

&lt;p&gt;The second point, not directly related to the LLM, is that this leads to a much better developer experience. The more effort we put into understandable data structures, the easier it is to keep our colleagues happy and productive.&lt;/p&gt;

&lt;p&gt;For example, imagine a tool that returns a cost analysis model with fields like &lt;code&gt;efficiency_ratio&lt;/code&gt; described as &lt;em&gt;"Fraction of demand fulfilled on time (0.0 to 1.0)"&lt;/em&gt; and &lt;code&gt;performance_breakdown&lt;/code&gt; described as &lt;em&gt;"Detailed statistics including delays and demand type breakdown"&lt;/em&gt;. The LLM reads these descriptions and understands exactly what it's looking at. That, along with data constraints (e.g., &lt;code&gt;ge=0.0&lt;/code&gt; and &lt;code&gt;le=1.0&lt;/code&gt;) makes it much more likely the LLM will interpret the results correctly and make informed decisions.&lt;/p&gt;
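&lt;p&gt;Declared with Pydantic, that example could look roughly like this (the model is hypothetical; the field names follow the example above):&lt;/p&gt;

```python
from pydantic import BaseModel, Field

class CostAnalysis(BaseModel):
    """Self-documenting tool output: every field explains itself to the LLM."""
    efficiency_ratio: float = Field(
        ge=0.0, le=1.0,
        description="Fraction of demand fulfilled on time (0.0 to 1.0)",
    )
    performance_breakdown: dict = Field(
        default_factory=dict,
        description="Detailed statistics including delays and demand type breakdown",
    )
```

&lt;p&gt;The &lt;code&gt;ge&lt;/code&gt;/&lt;code&gt;le&lt;/code&gt; constraints double as documentation and as validation: an out-of-range value never silently reaches the LLM.&lt;/p&gt;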

&lt;p&gt;Key conventions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools can mutate deps to share state (e.g., caching fetched data for later tools).&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;ModelRetry&lt;/code&gt; to ask the LLM to correct its inputs and retry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return Pydantic models&lt;/strong&gt; and in general structured self-documented models, not dictionaries or strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let tools do the heavy lifting&lt;/strong&gt; — simulations, calculations, comparisons, and business logic belong in deterministic tool code, not in LLM reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Output Type
&lt;/h3&gt;

&lt;p&gt;Two output patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;str&lt;/code&gt;&lt;/strong&gt; (default): Free-form text output. Used usually by conversational agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pydantic &lt;code&gt;BaseModel&lt;/code&gt;&lt;/strong&gt;: Structured output validated by Pydantic AI. Used when you need typed, parseable results (e.g., an extraction agent returning a model with &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;scope&lt;/code&gt;, &lt;code&gt;adjustments&lt;/code&gt;). Pydantic has become the standard for structured LLM output in Python — even &lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;OpenAI's own SDK uses Pydantic for structured outputs&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Progressive Complexity
&lt;/h2&gt;

&lt;p&gt;Not every agent needs the full kitchen sink. We think about agent complexity in three levels, and we've found it helpful to start at Level 1 and graduate upward only when needed. We believe the "art" of building AI agents is built on "less is more" — adding complexity only when absolutely necessary (you usually need to scrap half of your agent code and instructions to realise that first :D ).&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 1: Simple Structured Output
&lt;/h3&gt;

&lt;p&gt;The simplest pattern. The agent has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No tools&lt;/strong&gt; — the LLM processes input and returns structured data directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal deps&lt;/strong&gt; — just a company identifier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic prompt&lt;/strong&gt; — via a compilation system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output&lt;/strong&gt; — a Pydantic model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's invoked with &lt;a href="https://ai.pydantic.dev/agent/#running-agents" rel="noopener noreferrer"&gt;&lt;code&gt;agent.run()&lt;/code&gt;&lt;/a&gt; (no streaming needed) and returns a validated Pydantic object. This pattern works well for classification, extraction, and transformation tasks where the LLM doesn't need external data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Level 2: Tools + Dynamic Context + Post-Processing
&lt;/h3&gt;

&lt;p&gt;Builds on Level 1 by adding tools, rich dynamic prompts, and output post-processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's new:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; that fetch and process domain data for the LLM to infer and decide upon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic context injection&lt;/strong&gt;: The system prompt is constructed at runtime by loading contextual data and interpolating it into the compiled base prompt. This means the prompt is finely curated for each particular task — minimal and on point — leading to better and faster solutions with lower costs. Only keep the absolutely necessary instructions for the LLM to solve the problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output post-processing&lt;/strong&gt;: After the LLM returns its result, a deterministic function applies business rules that can override the LLM's reasoning when critical conditions are met. That balances arbitrary decisions powered by the LLM with more deterministic guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Richer deps&lt;/strong&gt;: Computed fields and factory methods for construction. The reason we do that is to have the agent do as little calculation as possible and make the data interpretation seamless. Your deps could have a bunch of pre-calculated fields. &lt;strong&gt;Do not let the LLM do math&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
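&lt;p&gt;The post-processing idea can be sketched as a deterministic guardrail over the LLM's structured output (the stockout rule here is an invented example of a "critical condition"):&lt;/p&gt;

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Recommendation:
    action: str
    reason: str

def apply_guardrails(rec, critical_stockout):
    """Deterministic business rule that can override the LLM's decision."""
    if critical_stockout and rec.action != "expedite":
        # Hard rule: a critical stockout always wins over the LLM's choice.
        return replace(
            rec,
            action="expedite",
            reason=rec.reason + " [overridden: critical stockout]",
        )
    return rec
```

&lt;p&gt;The LLM stays free to reason within normal conditions, while the rule guarantees behavior when the critical condition is met.&lt;/p&gt;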

&lt;h3&gt;
  
  
  Level 3: Full Agentic Workflow with Streaming
&lt;/h3&gt;

&lt;p&gt;The full kitchen sink — multi-turn conversations, streaming, precomputed reasoning, company-specific tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's new:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application-specific tool mapping&lt;/strong&gt;: Different applications can get different tool implementations, all registered via a dynamic configuration pattern. Only keep the absolutely necessary tools for variants of the same Agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool renaming&lt;/strong&gt;: Long function names can be aliased to shorter LLM-friendly names. This is a great pattern for keeping it simple for the LLM while also maintaining code clarity for your fellow developers. It can also be used for overloading tool variants.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ai.pydantic.dev/api/agent/#pydantic_ai.agent.AbstractAgent.iter" rel="noopener noreferrer"&gt;&lt;strong&gt;Streaming entry point&lt;/strong&gt;&lt;/a&gt;: An async generator that manages the full lifecycle: session creation, deps initialization, agent streaming, conversation persistence. The streaming is done via WebSockets, so users can see the LLM output, tool calls, and even the reasoning process live. Remember the last time you used any LLM application that didn't have streaming? Thats how it feels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;i18n in the agent itself&lt;/strong&gt;: Since we have customers all over the world, internationalization cannot be left as an afterthought: everything we do is already internationalized — this has become especially easy with the use of LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deps factory with caching&lt;/strong&gt;: Deps are optionally hydrated from a cache to avoid re-fetching data in multi-turn conversations.&lt;/li&gt;
&lt;/ul&gt;
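&lt;p&gt;The tool-renaming idea reduces to a small mapping from long developer-facing function names to short LLM-facing aliases (both names below are invented for illustration; the framework-specific registration is elided):&lt;/p&gt;

```python
def build_toolset(implementations, aliases):
    """Register tools under short, LLM-friendly aliases of their Python names."""
    return {aliases.get(fn.__name__, fn.__name__): fn for fn in implementations}

def fetch_supplier_lead_time_statistics():
    """Long, descriptive name for developers; aliased for the LLM below."""
    return {"mean_days": 12}

TOOLS = build_toolset(
    [fetch_supplier_lead_time_statistics],
    aliases={"fetch_supplier_lead_time_statistics": "lead_times"},
)
```

&lt;p&gt;The same mapping mechanism also covers application-specific variants: different configurations can register different implementations under the same alias.&lt;/p&gt;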

&lt;h2&gt;
  
  
  Modular Prompts
&lt;/h2&gt;

&lt;p&gt;As your agent count grows, prompt management becomes a real challenge. You want to reuse common sections across agents, override specific parts per customer, and support multiple languages — without copy-pasting prompts everywhere.&lt;/p&gt;

&lt;p&gt;The key insight is to treat prompts like code: break them into modular, composable sections. Pydantic AI gives you two mechanisms for this — &lt;a href="https://ai.pydantic.dev/agent/#system-prompts" rel="noopener noreferrer"&gt;static &lt;code&gt;system_prompt&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://ai.pydantic.dev/agent/#instructions" rel="noopener noreferrer"&gt;dynamic &lt;code&gt;instructions&lt;/code&gt;&lt;/a&gt;. With &lt;code&gt;instructions&lt;/code&gt;, you can build a compilation layer on top that resolves sections with fallback chains (config-specific → agent default → global) and handles language variants automatically.&lt;/p&gt;
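&lt;p&gt;The fallback chain itself is simple; one way to sketch it (the three layers mirror the config-specific → agent default → global order described above):&lt;/p&gt;

```python
def resolve_section(name, config_sections, agent_defaults, global_sections):
    """Resolve a prompt section through a fallback chain of three layers."""
    for layer in (config_sections, agent_defaults, global_sections):
        if name in layer:
            return layer[name]
    raise KeyError(f"prompt section not found: {name}")
```

&lt;p&gt;A customer-specific override wins when present; otherwise the agent's default applies, and the global section is the last resort.&lt;/p&gt;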

&lt;h3&gt;
  
  
  XML Over Markdown in Prompts
&lt;/h3&gt;

&lt;p&gt;We use &lt;strong&gt;XML tags heavily&lt;/strong&gt; in our system prompts instead of markdown headers. XML provides clearer semantic boundaries that LLMs parse more reliably — especially for nested, structured instructions. This is also &lt;a href="https://developers.openai.com/cookbook/examples/gpt-5/gpt-5-2_prompting_guide" rel="noopener noreferrer"&gt;recommended by OpenAI in their GPT-5.2 prompting guide&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;mission&amp;gt;&lt;/span&gt;
  Analyze the given data and recommend optimal parameters:
  &lt;span class="nt"&gt;&amp;lt;parameters&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;parameter&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"threshold"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"int"&lt;/span&gt; &lt;span class="na"&gt;min=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;parameter&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"buffer_size"&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"int"&lt;/span&gt; &lt;span class="na"&gt;min=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/parameters&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/mission&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;actions&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;general&amp;gt;&lt;/span&gt;Always run the analysis tool first to get an overview.&lt;span class="nt"&gt;&amp;lt;/general&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;evaluation_steps&amp;gt;&lt;/span&gt;
    Use the selected data point with its parameters and costs...
  &lt;span class="nt"&gt;&amp;lt;/evaluation_steps&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/actions&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;XML works particularly well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured data definitions&lt;/strong&gt; — parameters, component descriptions, tool declarations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nested instructions&lt;/strong&gt; — actions containing sub-steps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic boundaries&lt;/strong&gt; — the LLM clearly sees where one section ends and another begins&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Translation/terminology blocks&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Constants and Dynamic Sections
&lt;/h3&gt;

&lt;p&gt;Sections can support &lt;code&gt;{CONSTANT_NAME}&lt;/code&gt; placeholders that are injected from a centralized constants dict. This keeps magic values out of prompt text; otherwise you end up updating the same value in five different places and inevitably missing a sixth.&lt;/p&gt;
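&lt;p&gt;A sketch of the injection step (the constant name and value are illustrative):&lt;/p&gt;

```python
# Single source of truth for values referenced across many prompt sections.
PROMPT_CONSTANTS = {"EFFICIENCY_THRESHOLD": "0.8"}

def compile_section(template, constants=PROMPT_CONSTANTS):
    """Inject {CONSTANT_NAME} placeholders from the central constants dict."""
    return template.format(**constants)
```

&lt;p&gt;Changing the threshold in one dict updates every section that references it at compile time.&lt;/p&gt;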

&lt;h2&gt;
  
  
  Internationalization for LLM Agents
&lt;/h2&gt;

&lt;p&gt;This is a topic we don't see discussed enough. When your customers speak different languages and use different terminology, your agents need to handle that at two levels: &lt;strong&gt;prompt-level&lt;/strong&gt; (the terminology the LLM uses in its reasoning) and &lt;strong&gt;runtime-level&lt;/strong&gt; (labels in code-generated output like tables and snippets).&lt;/p&gt;

&lt;p&gt;The idea is to maintain a base translation set per language, let each customer override specific terms (because one company's "delivery date" is another's "ship date"), and dynamically inject the resolved terminology into the system prompt at compile time. The LLM then uses consistent, customer-specific vocabulary without any extra prompting effort per request.&lt;/p&gt;
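&lt;p&gt;The override mechanism can be sketched in a few lines (the terms are illustrative):&lt;/p&gt;

```python
# Base terminology per language; customers override individual terms.
BASE_TERMS = {"en": {"delivery_date": "delivery date", "demand": "demand"}}

def resolve_terms(language, customer_overrides=None):
    """Layer per-customer vocabulary on top of the base translation set."""
    terms = dict(BASE_TERMS[language])       # copy so the base stays untouched
    terms.update(customer_overrides or {})
    return terms
```

&lt;p&gt;The resolved dict is then interpolated into the system prompt at compile time, so the same agent speaks each customer's vocabulary.&lt;/p&gt;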




&lt;h2&gt;
  
  
  Agents Reason, Tools Compute
&lt;/h2&gt;

&lt;p&gt;A core principle we follow: &lt;strong&gt;agents reason, tools compute&lt;/strong&gt;. The LLM decides &lt;em&gt;what&lt;/em&gt; to do and &lt;em&gt;interprets&lt;/em&gt; the results — the actual computation lives in deterministic tool code. This aligns well with &lt;a href="https://www.anthropic.com/engineering/writing-tools-for-agents" rel="noopener noreferrer"&gt;Anthropic's thinking on writing effective tools for agents&lt;/a&gt;. Pydantic AI's &lt;a href="https://ai.pydantic.dev/tools/#retrying" rel="noopener noreferrer"&gt;&lt;code&gt;ModelRetry&lt;/code&gt;&lt;/a&gt; pattern is also worth mentioning here — when a tool receives invalid input, it tells the LLM what went wrong so it can correct and retry, instead of failing hard.&lt;/p&gt;
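&lt;p&gt;Without pulling in Pydantic AI itself, the retry contract can be sketched with a stand-in exception (in real code you would raise Pydantic AI's &lt;code&gt;ModelRetry&lt;/code&gt; inside the registered tool; the &lt;code&gt;set_threshold&lt;/code&gt; tool here is invented):&lt;/p&gt;

```python
class ModelRetry(Exception):
    """Stand-in for Pydantic AI's ModelRetry; the message goes back to the LLM."""

def set_threshold(deps, value):
    """Tool input guard: tell the model what to fix instead of failing hard."""
    clamped = max(0.0, min(value, 1.0))
    if value != clamped:
        raise ModelRetry(f"threshold must be between 0.0 and 1.0, got {value}")
    deps["threshold"] = value   # tools may mutate deps to share state
    return value
```

&lt;p&gt;The framework catches the exception, feeds the message back to the LLM, and lets it retry with corrected input.&lt;/p&gt;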

&lt;h3&gt;
  
  
  Feature Flags
&lt;/h3&gt;

&lt;p&gt;Every major feature or improvement we ship is behind a feature flag, and the code remains backwards compatible with its previous state. Once the feature is fully consolidated and proven in production, any superfluous code paths can be removed. This means incomplete features may reach the main branch before they're enabled, keeping merge conflicts and long-lived branches at bay. This has long been common practice in the software world, where companies expose new features via experimental flags.&lt;/p&gt;
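&lt;p&gt;A stripped-down illustration of the pattern (the flag name and scoring logic are invented):&lt;/p&gt;

```python
FLAGS = {"new_scoring_model": False}  # flipped per environment or customer

def score(order, flags=FLAGS):
    """Old path stays intact until the flag is proven in production."""
    if flags.get("new_scoring_model"):
        return order["value"] + 10   # experimental path behind the flag
    return order["value"]            # backwards-compatible default path
```

&lt;p&gt;Both paths live side by side; removing the old one is a deliberate cleanup step once the flag graduates.&lt;/p&gt;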




&lt;h2&gt;
  
  
  Streaming &amp;amp; Observability
&lt;/h2&gt;

&lt;p&gt;We stream all agent output to the frontend via &lt;a href="https://www.decodingai.com/p/deploying-agents-as-real-time-apis" rel="noopener noreferrer"&gt;WebSockets&lt;/a&gt; — users see tokens appear in real time, tool calls execute live, and the whole experience feels conversational. Pydantic AI supports this out of the box with &lt;a href="https://ai.pydantic.dev/agent/#streaming" rel="noopener noreferrer"&gt;&lt;code&gt;agent.run_stream()&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
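&lt;p&gt;Reduced to its essence, the streaming lifecycle is an async generator that yields chunks as they arrive (the WebSocket plumbing and the actual Pydantic AI calls are elided; this is a toy sketch):&lt;/p&gt;

```python
import asyncio

async def stream_reply(chunks):
    """Yield tokens one by one, as a real token stream would deliver them."""
    for chunk in chunks:
        await asyncio.sleep(0)   # hand control back to the event loop
        yield chunk              # in production: push over the WebSocket

async def collect(chunks):
    """Consume the stream (stand-in for the frontend receiving tokens)."""
    return [c async for c in stream_reply(chunks)]
```

&lt;p&gt;The same generator shape is where session creation, deps initialization, and conversation persistence hook in around the agent's stream.&lt;/p&gt;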

&lt;p&gt;For observability, we integrate with &lt;a href="https://langfuse.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Langfuse&lt;/strong&gt;&lt;/a&gt; via &lt;a href="https://opentelemetry.io/blog/2024/llm-observability/" rel="noopener noreferrer"&gt;OpenTelemetry&lt;/a&gt;, giving us full traces of LLM calls, tool executions, token usage, and latency. When an agent makes a questionable recommendation, being able to trace back through its entire reasoning chain is invaluable for debugging and for building trust with domain experts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Problems
&lt;/h2&gt;

&lt;p&gt;These are topics that we feel are very relevant to us and to the broader future of AI-based solutions. They're also the kind of problems that get us excited to come to work in the morning (or keep us awake at night 😅).&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure Code Execution by Agents
&lt;/h3&gt;

&lt;p&gt;Having agents generate and run arbitrary code can become a really powerful tool, but at the same time a security nightmare — it can literally be &lt;strong&gt;any&lt;/strong&gt; code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pydantic/monty" rel="noopener noreferrer"&gt;Pydantic's Monty&lt;/a&gt; is a newly released library that solves this exact problem. It allows for low-latency, secure code execution designed for AI agents — essentially a sandboxed Python interpreter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Context
&lt;/h3&gt;

&lt;p&gt;AI agents don't operate in a vacuum. Planners bring domain expertise: they know their suppliers, their materials, their constraints. Our &lt;strong&gt;context engine&lt;/strong&gt; is a dynamically adapting system that incorporates user-defined business rules and feedback directly into how agents reason. &lt;/p&gt;

&lt;p&gt;When a planner rejects a suggestion or defines a new rule, the system evolves and adjusts, so that future optimizations reflect those decisions. Exactly how we assemble, layer, and adapt that context across agents and companies is something we keep under the hood.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Non-Deterministic Flows
&lt;/h3&gt;

&lt;p&gt;LLMs do not generate deterministic outputs. This makes testing complex agent flows especially challenging, as the usual convention of &lt;code&gt;assert a == b&lt;/code&gt; simply doesn't apply anymore.&lt;/p&gt;

&lt;p&gt;One approach we've experimented with is using a large sample of test cases combined with statistical metrics to assess whether the agent's solutions are in line with verified target solutions. By keeping the test set large enough to absorb unexpected variability, we can measure the global deviation of agent outputs from known-good results.&lt;/p&gt;

&lt;p&gt;But agents don't only produce numeric results — they generate explanations, analyses, and recommendations in natural language. For these cases, we'd like to experiment with evaluation agents that validate the alignment between generated and expected explanations; once the generated text is longer than a sentence, there is usually no practical way to check it deterministically.&lt;/p&gt;

&lt;p&gt;Some relevant tools and frameworks in this space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/confident-ai/deepeval" rel="noopener noreferrer"&gt;&lt;strong&gt;DeepEval&lt;/strong&gt;&lt;/a&gt; — open-source LLM evaluation with metrics like G-Eval and faithfulness scoring&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide" rel="noopener noreferrer"&gt;&lt;strong&gt;Confident AI's guide on agent evaluation&lt;/strong&gt;&lt;/a&gt; — covers task success, tool usage quality, and reasoning coherence&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.braintrust.dev/articles/ai-agent-evaluation-framework" rel="noopener noreferrer"&gt;&lt;strong&gt;Braintrust's evaluation framework&lt;/strong&gt;&lt;/a&gt; — practical patterns for testing multi-step agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We would love to hear your ideas on this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;These patterns have emerged through iterative development and experimentation. We continuously refine this structure via brainstorming, experimentation, and bridging gaps as they arise. The goal is a system for building agents that can handle a wide range of complexity while remaining maintainable, testable, and adaptable to new use cases.&lt;/p&gt;

&lt;p&gt;This is still a field in its infancy, and along with the rapid LLM and tooling iterations, one can be a pioneer in defining how to best build agents that solve real-world problems, introducing workflows and solutions that were simply not possible less than two years ago. There is still a lot of room for innovation and improvement, especially in treating agents as first-class citizens with proper software engineering practices, testing methodologies, and design patterns, and not just another MVP.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PS: We are not sponsored by Pydantic, we simply love open source tools like the ones from Astral.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
