At Dryft, we build systems that replicate human decisions in industrial operations through a combination of AI agents, mathematical optimization, and simulation. Our agents analyze data and enterprise context to provide actionable recommendations, all in real-time conversations with domain experts.
All of our agents are built on Pydantic AI (we are big fans of Pydantic). In this article we will use Pydantic AI as our point of reference for building agents, but the techniques and concepts mentioned here should apply to any agentic framework.
The agents live under their own domain, as we follow Domain Driven Design (DDD). That means all core agent logic is implemented concretely in its own distinct domain, and not mixed with API routes, database models, or other concerns. This keeps the codebase clean and maintainable.
This post will explore some agent structuring patterns, from the simplest pattern to the most complex, along with the reasoning behind when to use each.
Table of Contents
- Agent Anatomy
- Progressive Complexity
- Modular Prompts
- Internationalization for LLM Agents
- Agents Reason, Tools Compute
- Streaming & Observability
- Open Problems
- Closing Thoughts
Agent Anatomy
Every agent we build is composed of at least these five building blocks. Understanding these makes it straightforward to go from "I need an agent that does X" to a working implementation.
1. Model & Settings
We centralize model settings in a single config. Each model has predefined settings for temperature, max tokens, and (for reasoning models) reasoning effort level. These can of course be overridden at the agent level depending on the needs.
The LLM provider is defined in a single factory method, and we fully manage its lifecycle — we do a lot of heavy asynchronous workflows and we want our agents to be thread safe. The idea is to be able to experiment with other providers via a single line of code.
We distinguish between reasoning models (like gpt-5.4, or o3, which is old news by now), which have a reasoning_effort parameter, and standard models (like gpt-4.1), which use traditional temperature control. Choose reasoning models for complex analysis and cheaper models for simpler tasks like matching or classification.
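As an illustration, a centralized settings registry might look like the following sketch. The registry contents, task names, and the `settings_for` helper are hypothetical, not our actual config:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class ModelSettings:
    """Per-model defaults, centralized so every agent reads from one place."""
    model_name: str
    temperature: Optional[float] = None      # standard models only
    max_tokens: int = 4096
    reasoning_effort: Optional[str] = None   # reasoning models only: "low" / "medium" / "high"

# Hypothetical registry: one dict to tune every model we use.
MODEL_REGISTRY = {
    "analysis": ModelSettings("openai:o3", max_tokens=8192, reasoning_effort="medium"),
    "classification": ModelSettings("openai:gpt-4.1", temperature=0.0, max_tokens=1024),
}

def settings_for(task: str, **overrides) -> ModelSettings:
    """Single factory entry point; agent-level overrides are applied on top."""
    return replace(MODEL_REGISTRY[task], **overrides)
```

Because the provider string lives in one place, experimenting with another provider really is a one-line change.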
For a gentle introduction to temperature, max tokens, and LLM decoding we recommend the HuggingFace blog post. Reasoning effort is inspired by the concept of Chain-of-Thought prompting, which each LLM provider implements in their own way and describes in their docs.
2. Dependencies (Deps)
Every agent declares a dependency type — a dataclass or Pydantic model that gets injected into tools at runtime via Pydantic AI's RunContext. Deps are the agent's "working memory" across tool calls.
They range in complexity depending on what the agent needs to do:
- Minimal: A simple dataclass with just basic identifiers for the agent to run (e.g., a classification agent)
- Rich: A Pydantic model with computed fields and a factory method (e.g., an agent that pre-computes deltas so the LLM doesn't have to do arithmetic)
- Full: A dataclass with 30+ fields, different factory methods for hydration, and caching.
The key pattern: deps start sparse and get hydrated by tools when their contents depend on LLM inference. For example, we don't know which entity the user is asking about until we parse their prompt, so the first tool call resolves that, and subsequent tools reuse what's already loaded. Other agents can hydrate their deps entirely upon initialization.
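A framework-free sketch of this hydration pattern. `PlannerDeps` and `resolve_entity` are illustrative names; with Pydantic AI the tool would receive the deps through `RunContext` rather than directly:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PlannerDeps:
    """Deps start sparse; tools hydrate them once LLM inference resolves what is needed."""
    company_id: str
    entity_id: Optional[str] = None           # unknown until the user's prompt is parsed
    entity_data: dict = field(default_factory=dict)
    _cache: dict = field(default_factory=dict)

def resolve_entity(deps: PlannerDeps, parsed_name: str) -> dict:
    """First tool call: resolve the entity the user asked about and hydrate the deps."""
    if parsed_name in deps._cache:            # later tool calls reuse what's loaded
        return deps._cache[parsed_name]
    data = {"name": parsed_name, "company": deps.company_id}  # stand-in for a DB fetch
    deps.entity_id = parsed_name
    deps.entity_data = data
    deps._cache[parsed_name] = data
    return data
```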
Caching the state of an agent's deps and its interactions (what is usually referred to as the "conversation") is also a powerful pattern for multi-turn conversations. From a user's perspective, it ensures they can resume their tasks without delays. From a developer's perspective, it provides an easy way to debug and analyze agent interactions.
3. System Prompt & Instructions
There are two mechanisms for injecting prompts into the agent:
- system_prompt: A static string set at construction time. Used when the prompt doesn't need runtime data.
- instructions: A list of callables that receive RunContext and return strings. Evaluated at runtime with full access to deps. This is the preferred pattern for dynamic prompts.
The reason we prefer instructions is that we can dynamically inject context into the prompt to keep relationships between data and instructions as close as possible. We found that this makes the agent much more likely to take the correct context into account, and it leads to more understandable instructions (e.g., instead of saying "Given the following data, do X", we can say "Given that the efficiency_ratio is 0.65, which is below the acceptable threshold of 0.8, analyze the potential causes and recommend improvements"). It also avoids placing context too far away from the instructions, which can lead to the LLM forgetting or ignoring it.
As Donald Hebb said, "Neurons that fire together wire together": the closer the context and instructions are in the prompt, the more likely the LLM is to associate them correctly. This is also the main reason we write elaborate tool docstrings and model Field descriptions, on both input and output tool fields.
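A minimal sketch of such a dynamic instruction, kept framework-free for brevity. With Pydantic AI the callable would receive a `RunContext[AnalysisDeps]`; the deps class and threshold here are made up:

```python
from dataclasses import dataclass

@dataclass
class AnalysisDeps:
    efficiency_ratio: float
    threshold: float = 0.8

def efficiency_instruction(deps: AnalysisDeps) -> str:
    """Build an instruction that keeps the data and the ask right next to each other."""
    if deps.efficiency_ratio < deps.threshold:
        return (
            f"Given that the efficiency_ratio is {deps.efficiency_ratio:.2f}, "
            f"which is below the acceptable threshold of {deps.threshold}, "
            "analyze the potential causes and recommend improvements."
        )
    return f"The efficiency_ratio {deps.efficiency_ratio:.2f} is within the acceptable range."
```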
4. Tools
Tools are async functions that do the heavy deterministic work and return Pydantic models. This is a core design principle: the agent decides what to do, the tools do the actual computation. Essentially we believe that the LLM works best as the magic glue that connects data and decisions, while deterministic tools should be responsible for doing the actual computations and the math.
A tool receives RunContext[DepsType] as its first argument (auto-injected by Pydantic AI). The function's docstring and parameter annotations become the tool description the LLM sees.
Why Pydantic models as return types? By annotating return model fields with Field(description=...), the LLM gets self-documenting data. Each field carries its own explanation — what it means, its unit, its range. This is far more effective than returning raw dicts or strings, because the LLM can reason about the data accurately without needing extra prompt instructions, and it significantly reduces the chances of misinterpretation.
A second benefit, not directly related to the LLM, is a much better developer experience. The more effort we put into understandable data structures, the easier it is to keep our colleagues happy and productive.
For example, imagine a tool that returns a cost analysis model with fields like efficiency_ratio described as "Fraction of demand fulfilled on time (0.0 to 1.0)" and performance_breakdown described as "Detailed statistics including delays and demand type breakdown". The LLM reads these descriptions and understands exactly what it's looking at. That, along with data constraints (e.g., ge=0.0 and le=1.0) makes it much more likely the LLM will interpret the results correctly and make informed decisions.
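A hedged sketch of what such a return model could look like with Pydantic. The field names, units, and constraints are illustrative, not a model from our codebase:

```python
from pydantic import BaseModel, Field, ValidationError

class CostAnalysis(BaseModel):
    """Tool return type: every field documents itself for the LLM."""
    efficiency_ratio: float = Field(
        ..., ge=0.0, le=1.0,
        description="Fraction of demand fulfilled on time (0.0 to 1.0)",
    )
    total_cost: float = Field(..., ge=0.0, description="Total cost for the period, in EUR")
    performance_breakdown: dict = Field(
        default_factory=dict,
        description="Detailed statistics including delays and demand type breakdown",
    )
```

The constraints double as guardrails: a tool bug that produces a ratio of 1.5 fails validation immediately instead of silently confusing the LLM.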
Key conventions:
- Tools can mutate deps to share state (e.g., caching fetched data for later tools).
- Use ModelRetry to ask the LLM to correct its inputs and retry.
- Return Pydantic models (structured, self-documenting types), not dictionaries or strings.
- Let tools do the heavy lifting — simulations, calculations, comparisons, and business logic belong in deterministic tool code, not in LLM reasoning.
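The retry convention can be sketched framework-free like this. `ModelRetrySignal` is a stand-in for Pydantic AI's `ModelRetry`, and the tool itself is hypothetical:

```python
class ModelRetrySignal(Exception):
    """Stand-in for Pydantic AI's ModelRetry; the message is fed back to the LLM."""

def simulate_schedule(horizon_days: int) -> dict:
    """Hypothetical deterministic tool: validate LLM-provided input before heavy work."""
    if not 1 <= horizon_days <= 90:
        # With Pydantic AI this would be `raise ModelRetry(...)`; the framework
        # returns the message to the model so it can correct its arguments and retry.
        raise ModelRetrySignal(
            f"horizon_days must be between 1 and 90, got {horizon_days}. "
            "Call the tool again with a valid horizon."
        )
    return {"horizon_days": horizon_days, "slots": horizon_days * 24}
```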
5. Output Type
Two output patterns:
- str (default): Free-form text output, usually used by conversational agents.
- Pydantic BaseModel: Structured output validated by Pydantic AI. Used when you need typed, parseable results (e.g., an extraction agent returning a model with title, category, scope, and adjustments). Pydantic has become the standard for structured LLM output in Python; even OpenAI's own SDK uses Pydantic for structured outputs.
Progressive Complexity
Not every agent needs the full kitchen sink. We think about agent complexity in three levels, and we've found it helpful to start at Level 1 and graduate upward only when needed. We believe that the "art" of building AI agents is built on "less is more": adding complexity only when absolutely necessary (you usually need to scrap half of your agent code and instructions to realize that :D).
Level 1: Simple Structured Output
The simplest pattern. The agent has:
- No tools — the LLM processes input and returns structured data directly
- Minimal deps — just a company identifier
- Dynamic prompt — via a compilation system
- Structured output — a Pydantic model
It's invoked with agent.run() (no streaming needed) and returns a validated Pydantic object. This pattern works well for classification, extraction, and transformation tasks where the LLM doesn't need external data.
Level 2: Tools + Dynamic Context + Post-Processing
Builds on Level 1 by adding tools, rich dynamic prompts, and output post-processing.
What's new:
- Tools that fetch and process domain data for the LLM to infer and decide upon.
- Dynamic context injection: The system prompt is constructed at runtime by loading contextual data and interpolating it into the compiled base prompt. This means the prompt is finely curated for each particular task — minimal and on point — leading to better and faster solutions with lower costs. Only keep the absolutely necessary instructions for the LLM to solve the problem.
- Output post-processing: After the LLM returns its result, a deterministic function applies business rules that can override the LLM's reasoning when critical conditions are met. That balances arbitrary decisions powered by the LLM with more deterministic guardrails.
- Richer deps: Computed fields and factory methods for construction. The reason we do that is to have the agent do as little calculation as possible and make the data interpretation seamless. Your deps could have a bunch of pre-calculated fields. Do not let the LLM do math.
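The output post-processing step from the list above can be sketched as a pure function over the LLM's structured result. The `Recommendation` model and the quantity cap are invented for illustration:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Recommendation:
    action: str
    order_quantity: int
    rationale: str

MAX_ORDER_QUANTITY = 10_000  # hypothetical hard business limit

def apply_guardrails(rec: Recommendation) -> Recommendation:
    """Deterministic post-processing: business rules override the LLM when critical."""
    if rec.order_quantity > MAX_ORDER_QUANTITY:
        return replace(
            rec,
            order_quantity=MAX_ORDER_QUANTITY,
            rationale=rec.rationale + " (capped by policy: exceeded maximum order quantity)",
        )
    return rec
```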
Level 3: Full Agentic Workflow with Streaming
The full kitchen sink — multi-turn conversations, streaming, precomputed reasoning, company-specific tools.
What's new:
- Application-specific tool mapping: Different applications can get different tool implementations, all registered via a dynamic configuration pattern. Only keep the absolutely necessary tools for variants of the same Agent.
- Tool renaming: Long function names can be aliased to shorter LLM-friendly names. This is a great pattern for keeping it simple for the LLM while also maintaining code clarity for your fellow developers. It can also be used for overloading tool variants.
- Streaming entry point: An async generator that manages the full lifecycle: session creation, deps initialization, agent streaming, conversation persistence. The streaming is done via WebSockets, so users can see the LLM output, tool calls, and even the reasoning process live. Remember the last time you used any LLM application that didn't have streaming? That's how it feels.
- i18n in the agent itself: Since we have customers all over the world, internationalization cannot be left as an afterthought; everything we do is internationalized from the start, which has become especially easy with the help of LLMs.
- Deps factory with caching: Deps are optionally hydrated from a cache to avoid re-fetching data in multi-turn conversations.
Modular Prompts
As your agent count grows, prompt management becomes a real challenge. You want to reuse common sections across agents, override specific parts per customer, and support multiple languages — without copy-pasting prompts everywhere.
The key insight is to treat prompts like code: break them into modular, composable sections. Pydantic AI gives you two mechanisms for this — static system_prompt and dynamic instructions. With instructions, you can build a compilation layer on top that resolves sections with fallback chains (config-specific → agent default → global) and handles language variants automatically.
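One possible shape for that fallback chain, using a hypothetical three-layer section store (layer names and section texts are invented):

```python
# Hypothetical three-layer section store: config-specific -> agent default -> global.
PROMPT_SECTIONS = {
    "global": {"safety": "Never fabricate data.", "tone": "Be concise."},
    "agent:planner": {"tone": "Be concise and cite the tool outputs you used."},
    "config:acme-planner": {"safety": "Never fabricate data; flag low-confidence results."},
}

def resolve_section(name: str, config_id: str, agent_id: str) -> str:
    """Walk the fallback chain and return the first layer that defines the section."""
    for layer in (f"config:{config_id}", f"agent:{agent_id}", "global"):
        text = PROMPT_SECTIONS.get(layer, {}).get(name)
        if text is not None:
            return text
    raise KeyError(f"no definition for prompt section {name!r}")
```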
XML Over Markdown in Prompts
We use XML tags heavily in our system prompts instead of markdown headers. XML provides clearer semantic boundaries that LLMs parse more reliably — especially for nested, structured instructions. This is also recommended by OpenAI in their GPT-5.2 prompting guide.
<mission>
Analyze the given data and recommend optimal parameters:
<parameters>
<parameter name="threshold" type="int" min="0"/>
<parameter name="buffer_size" type="int" min="0"/>
</parameters>
</mission>
<actions>
<general>Always run the analysis tool first to get an overview.</general>
<evaluation_steps>
Use the selected data point with its parameters and costs...
</evaluation_steps>
</actions>
XML works particularly well for:
- Structured data definitions — parameters, component descriptions, tool declarations
- Nested instructions — actions containing sub-steps
- Semantic boundaries — the LLM clearly sees where one section ends and another begins
- Translation/terminology blocks
Constants and Dynamic Sections
Sections can support {CONSTANT_NAME} placeholders that are injected from a centralized constants dict. This keeps magic values out of prompt text; otherwise you end up updating them in five different places and inevitably missing one.
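A minimal sketch of the placeholder injection, using `str.format_map` with a strict lookup so a typo in a constant name fails loudly (the constant names are examples):

```python
PROMPT_CONSTANTS = {
    "MAX_DELAY_DAYS": 3,
    "EFFICIENCY_THRESHOLD": 0.8,
}

class _StrictConstants(dict):
    """Raise on unknown placeholders instead of silently leaving them blank."""
    def __missing__(self, key):
        raise KeyError(f"unknown prompt constant: {key}")

def inject_constants(section: str, constants: dict = PROMPT_CONSTANTS) -> str:
    """Fill {CONSTANT_NAME} placeholders in a prompt section from the central dict."""
    return section.format_map(_StrictConstants(constants))
```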
Internationalization for LLM Agents
This is a topic we don't see discussed enough. When your customers speak different languages and use different terminology, your agents need to handle that at two levels: prompt-level (the terminology the LLM uses in its reasoning) and runtime-level (labels in code-generated output like tables and snippets).
The idea is to maintain a base translation set per language, let each customer override specific terms (because one company's "delivery date" is another's "ship date"), and dynamically inject the resolved terminology into the system prompt at compile time. The LLM then uses consistent, customer-specific vocabulary without any extra prompting effort per request.
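A simplified sketch of that two-layer terminology resolution; the languages, customers, and terms are invented:

```python
# Hypothetical terminology store: per-language base terms plus customer overrides.
BASE_TERMS = {
    "en": {"delivery_date": "delivery date", "order": "order"},
    "de": {"delivery_date": "Liefertermin", "order": "Auftrag"},
}
CUSTOMER_OVERRIDES = {
    "acme": {"en": {"delivery_date": "ship date"}},  # one company's "delivery date"...
}

def resolve_terms(language: str, customer: str) -> dict:
    """Base terms for the language, with customer-specific overrides applied on top."""
    terms = dict(BASE_TERMS[language])
    terms.update(CUSTOMER_OVERRIDES.get(customer, {}).get(language, {}))
    return terms

def terminology_block(language: str, customer: str) -> str:
    """Render the resolved vocabulary as an XML prompt section for the LLM."""
    terms = resolve_terms(language, customer)
    lines = [f'<term key="{k}">{v}</term>' for k, v in sorted(terms.items())]
    return "<terminology>\n" + "\n".join(lines) + "\n</terminology>"
```

The rendered block is then injected into the system prompt at compile time, so the LLM picks up the customer's vocabulary without per-request prompting effort.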
Agents Reason, Tools Compute
A core principle we follow: agents reason, tools compute. The LLM decides what to do and interprets the results — the actual computation lives in deterministic tool code. This aligns well with Anthropic's thinking on writing effective tools for agents. Pydantic AI's ModelRetry pattern is also worth mentioning here — when a tool receives invalid input, it tells the LLM what went wrong so it can correct and retry, instead of failing hard.
Feature Flags
Every major feature or improvement we ship is behind a feature flag, and the code remains backwards compatible to the state before it. Once the feature is fully consolidated and has been proven in production, any superfluous code paths can be removed. This means that incomplete features may reach the main branch before they're enabled, keeping merge conflicts and long-lived branches at bay. This has been a common practice in the software world, where for instance companies expose new features via experimental flags.
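A minimal sketch of an environment-based flag check. Real deployments often use a dedicated flag service; the `FEATURE_` prefix and helper name are assumptions:

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a feature flag from the environment, falling back to the shipped default."""
    value = os.environ.get(f"FEATURE_{name.upper()}", "")
    if not value:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}
```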
Streaming & Observability
We stream all agent output to the frontend via WebSockets — users see tokens appear in real time, tool calls execute live, and the whole experience feels conversational. Pydantic AI supports this out of the box with agent.run_stream().
For observability, we integrate with Langfuse via OpenTelemetry, giving us full traces of LLM calls, tool executions, token usage, and latency. When an agent makes a questionable recommendation, being able to trace back through its entire reasoning chain is invaluable for debugging and for building trust with domain experts.
Open Problems
These are topics that we feel are very relevant to us and to the broader future of AI-based solutions. They're also the kind of problems that get us excited to come to work in the morning (or keep us awake at night 😅).
Secure Code Execution by Agents
Having agents generate and run arbitrary code can become a really powerful tool, but at the same time a security nightmare — it can literally be any code.
Pydantic's Monty is a newly released library that solves this exact problem. It allows for low-latency, secure code execution designed for AI agents — essentially a sandboxed Python interpreter.
Dynamic Context
AI agents don't operate in a vacuum. Planners bring domain expertise: they know their suppliers, their materials, their constraints. Our context engine is a dynamically adapting system that incorporates user-defined business rules and feedback directly into how agents reason.
When a planner rejects a suggestion or defines a new rule, the system evolves and adjusts, so that future optimizations reflect those decisions. How exactly we assemble, layer, and adapt that context across agents and companies is something we keep under the hood.
Testing Non-Deterministic Flows
LLMs do not generate deterministic outputs. This makes testing complex agent flows especially challenging, as the usual convention of assert a == b simply doesn't apply anymore.
One approach we've experimented with is using a large enough sample of test cases combined with statistical metrics to assess whether the agent's solutions are in line with verified target solutions. By maintaining a big enough test set to account for unexpected variability, we can measure the global deviation of agent outputs from known-good results.
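For numeric outputs, the statistical acceptance idea can be sketched as a tolerance-plus-pass-rate check (the thresholds are illustrative):

```python
from statistics import mean

def within_tolerance(agent_outputs, targets, rel_tol=0.05, pass_rate=0.9) -> bool:
    """Accept the agent if enough samples land close to verified target values."""
    assert targets and len(agent_outputs) == len(targets)
    hits = [
        abs(got - want) <= rel_tol * abs(want)   # per-sample relative deviation check
        for got, want in zip(agent_outputs, targets)
    ]
    return mean(hits) >= pass_rate               # fraction of samples within tolerance
```

Individual outliers are tolerated; the test suite only fails when the agent drifts from the verified solutions on too large a fraction of the sample.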
But agents don't only produce numeric results; they generate explanations, analyses, and recommendations in natural language. For these cases, ideas we'd like to experiment with include evaluation agents that validate the alignment between generated and expected explanations, since for anything longer than a sentence there is usually no other practical way to compare them.
Some relevant tools and frameworks in this space:
- DeepEval — open-source LLM evaluation with metrics like G-Eval and faithfulness scoring
- Confident AI's guide on agent evaluation — covers task success, tool usage quality, and reasoning coherence
- Braintrust's evaluation framework — practical patterns for testing multi-step agents
We would love to hear your ideas on this.
Closing Thoughts
These patterns have evolved through iterative development and experimentation, and we continuously refine them via brainstorming, experimentation, and bridging gaps as they arise. The goal is a system for building agents that can handle a wide range of complexity while remaining maintainable, testable, and adaptable to new use cases.
This is still a field in its infancy, and along with the rapid LLM and tooling iterations, one can be a pioneer in defining how to best build agents that solve real-world problems, introducing workflows and solutions that were simply not possible less than two years ago. There is still a lot of room for innovation and improvement, especially in treating agents as first-class citizens with proper software engineering practices, testing methodologies, and design patterns, and not just another MVP.
PS: We are not sponsored by Pydantic; we simply love open source tools like the ones from Astral.
