<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Peng Qian</title>
    <description>The latest articles on DEV Community by Peng Qian (@qtalen).</description>
    <link>https://dev.to/qtalen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg</url>
      <title>DEV Community: Peng Qian</title>
      <link>https://dev.to/qtalen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/qtalen"/>
    <language>en</language>
    <item>
      <title>The hardest part of enterprise AI agents is wiring into real workflows.

I built a setup where:
✅Agent Skills load in real time from a database
✅Scripts run safely in containers
✅A “skills agent” acts as a tool to keep the main agent’s context clean</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Thu, 19 Mar 2026 10:34:25 +0000</pubDate>
      <link>https://dev.to/qtalen/the-hardest-part-of-enterprise-ai-agents-is-wiring-into-real-workflows-i-built-a-setup-where-2492</link>
      <guid>https://dev.to/qtalen/the-hardest-part-of-enterprise-ai-agents-is-wiring-into-real-workflows-i-built-a-setup-where-2492</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2" class="crayons-story__hidden-navigation-link"&gt;How to Use Agent Skills in Enterprise LLM Agent Systems&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/qtalen" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" alt="qtalen profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/qtalen" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Peng Qian
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Peng Qian
                
              
              &lt;div id="story-author-preview-content-3367444" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/qtalen" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Peng Qian&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Mar 19&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2" id="article-link-3367444"&gt;
          How to Use Agent Skills in Enterprise LLM Agent Systems
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/tutorial"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;tutorial&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;3&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              3&lt;span class="hidden s:inline"&gt; comments&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            11 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>ai</category>
      <category>programming</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Use Agent Skills in Enterprise LLM Agent Systems</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Thu, 19 Mar 2026 10:32:10 +0000</pubDate>
      <link>https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2</link>
      <guid>https://dev.to/qtalen/how-to-use-agent-skills-in-enterprise-llm-agent-systems-15h2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Enterprise-grade agentic systems have fallen way behind the desktop agent apps that everyone's been buzzing about lately.&lt;/p&gt;

&lt;p&gt;After spending the better part of a year building enterprise agent applications, I came to one conclusion: &lt;strong&gt;if your agent system can't plug into your company's existing business processes, it won't bring real value to your organization.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Desktop systems like OpenClaw and Claude Cowork solved this problem. They don't change their agent setup at all. Instead, they use Agent Skills to capture human business processes, then share those skills between desktop agent systems through the file system. That's how they tackle one business problem after another.&lt;/p&gt;

&lt;p&gt;But enterprise users write their skills through a web interface and save them to a database. There's a good chance the process involves complex approval and security audit steps, too. So how does your agent load these skills in real time without any downtime?&lt;/p&gt;

&lt;p&gt;The latest version of Microsoft Agent Framework finally makes this possible with its Agent Skills feature.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;With Agent Skills in Microsoft Agent Framework, enterprise agent systems can load user-defined business process skills from a database in real time, and run the scripts and generated code that come with those skills safely inside containers.&lt;/p&gt;

&lt;p&gt;Your agent system stays secure and stable, while gaining the same flexible business process orchestration that desktop agents enjoy.&lt;/p&gt;

&lt;p&gt;All the source code in this tutorial is available at the end of the article.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before We Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install the latest Microsoft Agent Framework
&lt;/h3&gt;

&lt;p&gt;To use Agent Skills, install the latest version of Microsoft Agent Framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-framework &lt;span class="nt"&gt;--pre&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, like me, you can pin the version of &lt;code&gt;agent-framework&lt;/code&gt; in your &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;dependencies = [
    "agent-framework&amp;gt;=1.0.0rc4",
    "agent-framework-ag-ui&amp;gt;=1.0.0b260311",
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then tell &lt;code&gt;uv&lt;/code&gt; to allow prerelease versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--prerelease&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;allow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install Tavily Agent Skills
&lt;/h3&gt;

&lt;p&gt;My end goal is to show you how to share and load Agent Skills between agents deployed across distributed nodes. But I think we should start simple. First, let me show you how to load and use skills from the community.&lt;/p&gt;

&lt;p&gt;Let's start with Tavily Agent Skills. We'll only load the &lt;code&gt;tavily-best-practices&lt;/code&gt; skill. It guides my agent to generate Tavily search code tailored to the task at hand, instead of calling a hardcoded function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add tavily-ai/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don't worry. After the initial demo, I'll walk you through how to load skills from a database in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Load Agent Skills from Disk
&lt;/h2&gt;

&lt;p&gt;Let's start with the most basic approach.&lt;/p&gt;

&lt;p&gt;In Microsoft Agent Framework, context operations are handled by a base class called &lt;code&gt;ContextProvider&lt;/code&gt;. The latest version of MAF ships a &lt;code&gt;SkillsProvider&lt;/code&gt; class. Use it directly, pass the location of your skills through the &lt;code&gt;skill_paths&lt;/code&gt; attribute, and you're done. Unlike the fixed &lt;code&gt;.claude/skills&lt;/code&gt; convention, &lt;code&gt;skill_paths&lt;/code&gt; isn't tied to a default directory, and you can pass in multiple paths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;skills_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SkillsProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;skill_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;get_current_directory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.agents/skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create your agent and pass the &lt;code&gt;skills_provider&lt;/code&gt; instance through &lt;code&gt;context_providers&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;skills_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SkillsAssistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a helpful assistant, and you&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ll respond to user requests according to your skills.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;skills_provider&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the Python code the agent writes from the Tavily skill instructions, you need to pass a code-interpreter tool (the &lt;code&gt;code_tool&lt;/code&gt; above) to the agent and let the code run inside a container environment. I'll cover that in detail later.&lt;/p&gt;

&lt;p&gt;Write a &lt;code&gt;main&lt;/code&gt; method to test the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;code_executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;skills_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check how gold ETFs performed in February 2026 and give some investment advice.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Microsoft Agent Framework provides an OpenTelemetry-based telemetry tool. I hooked it up to MLflow. Let's run the agent once and see what happens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwpu0eb19fqkya3s0wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitwpu0eb19fqkya3s0wb.png" alt="Through MLflow, you can see that the agent successfully loaded and executed the Agent Skills. Image by Author" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that once the agent decided it needed Tavily to search, it loaded the full &lt;code&gt;SKILL.md&lt;/code&gt; document, wrote Tavily search code following the instructions, then sent it to the code interpreter for execution. Exactly what we expected.&lt;/p&gt;

&lt;p&gt;You can learn how to use MLflow in this article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/monitoring-qwen-3-agents-with-mlflow-3-x-end-to-end-tracking-tutorial/" rel="noopener noreferrer"&gt;Monitoring Qwen 3 Agents with MLflow 3.x: End-to-End Tracing Tutorial&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How Agent Skills Work
&lt;/h2&gt;

&lt;p&gt;Now let's talk about how to get the most out of Agent Skills in enterprise systems. That means loading external skills in real time, containerizing the code interpreter, and managing context more carefully.&lt;/p&gt;

&lt;p&gt;But before we go there, let's dig into how Agent Skills actually work inside MAF, so the rest of this tutorial makes more sense.&lt;/p&gt;

&lt;p&gt;As I mentioned, &lt;code&gt;SkillsProvider&lt;/code&gt; extends &lt;code&gt;BaseContextProvider&lt;/code&gt;, which means it works by operating on the agent's context.&lt;/p&gt;

&lt;p&gt;When you initialize &lt;code&gt;SkillsProvider&lt;/code&gt;, you pass one or more search paths to the &lt;code&gt;skill_paths&lt;/code&gt; attribute. Take the &lt;code&gt;.agents/skills&lt;/code&gt; directory as an example. On startup, &lt;code&gt;SkillsProvider&lt;/code&gt; recursively searches this directory and finds every subdirectory that contains a &lt;code&gt;SKILL.md&lt;/code&gt; file. Then it extracts the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; fields from each &lt;code&gt;SKILL.md&lt;/code&gt; file, along with the file content, and stores everything in a &lt;code&gt;Skill&lt;/code&gt; object.&lt;/p&gt;
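
&lt;p&gt;To make this concrete, here's a minimal sketch of what that discovery pass might look like. The frontmatter parsing is deliberately naive and illustrative, not MAF's actual implementation; it only assumes the same &lt;code&gt;Skill&lt;/code&gt; class used later in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

from agent_framework import Skill


def discover_skills(root: Path) -&amp;gt; list[Skill]:
    """Recursively collect SKILL.md files and turn them into Skill objects (illustrative)."""
    skills = []
    for skill_md in sorted(root.rglob("SKILL.md")):
        text = skill_md.read_text(encoding="utf-8")
        meta = {}
        # Naive YAML-frontmatter parsing: read "name:" and "description:"
        # from between the leading "---" markers.
        if text.startswith("---"):
            frontmatter, _, _ = text[3:].partition("---")
            for line in frontmatter.splitlines():
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip()
        skills.append(Skill(
            name=meta.get("name", skill_md.parent.name),
            description=meta.get("description", ""),
            content=text,
        ))
    return skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;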

&lt;p&gt;&lt;code&gt;SkillsProvider&lt;/code&gt; loops through these &lt;code&gt;Skill&lt;/code&gt; objects, formats the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; fields like this, and merges them into the agent's system prompt. This keeps the agent aware of available skills without loading their full content upfront.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &amp;lt;skill&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &amp;lt;name&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;xml_escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    &amp;lt;description&amp;gt;&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;xml_escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &amp;lt;/skill&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SkillsProvider&lt;/code&gt; also adds two methods to the agent through context: &lt;code&gt;load_skill&lt;/code&gt; and &lt;code&gt;read_skill_resource&lt;/code&gt;. When the agent decides which skill it needs based on the user's request, it calls &lt;code&gt;load_skill&lt;/code&gt; to look up the matching &lt;code&gt;Skill&lt;/code&gt; object by name and loads its full content into the context.&lt;/p&gt;

&lt;p&gt;If a skill's content references extra resource files like &lt;code&gt;references/search.md&lt;/code&gt;, the agent can call &lt;code&gt;read_skill_resource&lt;/code&gt; to load those files.&lt;/p&gt;
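
&lt;p&gt;Conceptually, the two tools boil down to something like this. The registry shape below is an assumption for illustration; in MAF this state lives inside &lt;code&gt;SkillsProvider&lt;/code&gt; itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# Illustrative sketch, not MAF's real code: map skill name to (content, skill directory).
SKILL_REGISTRY: dict[str, tuple[str, Path]] = {}


def load_skill(name: str) -&amp;gt; str:
    """Return the full SKILL.md content once the agent has picked a skill by name."""
    content, _ = SKILL_REGISTRY[name]
    return content


def read_skill_resource(name: str, relative_path: str) -&amp;gt; str:
    """Return an extra file a skill refers to, e.g. references/search.md."""
    _, directory = SKILL_REGISTRY[name]
    return (directory / relative_path).read_text(encoding="utf-8")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;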

&lt;p&gt;Here's the full workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev66hdgmrajgccddrswy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fev66hdgmrajgccddrswy.png" alt="A diagram illustrating the workflow of SkillsProvider. Image by Author" width="771" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This design follows the progressive disclosure principle defined by &lt;a href="https://agentskills.io/home?ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;agentskills.io&lt;/a&gt;. Skill content loads into the agent's context gradually, only when needed. No context explosion, no wasted tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  Agent Skills for Enterprise Systems
&lt;/h2&gt;

&lt;p&gt;Alright, enough theory. Let's get into today's main topic: &lt;strong&gt;how to use Agent Skills in enterprise-grade agentic systems.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Load skills from external systems in real time
&lt;/h3&gt;

&lt;p&gt;What if business users write their skills through a cloud-based web page and save them to a database? How do you handle that?&lt;/p&gt;

&lt;p&gt;We need a new approach to sync and apply Agent Skills in real time.&lt;/p&gt;

&lt;p&gt;As I covered earlier, when &lt;code&gt;SkillsProvider&lt;/code&gt; initializes, it loads all &lt;code&gt;SKILL.md&lt;/code&gt; files from the input paths into an in-memory list of &lt;code&gt;Skill&lt;/code&gt; objects.&lt;/p&gt;

&lt;p&gt;Besides the file system approach, &lt;code&gt;SkillsProvider&lt;/code&gt; also supports Code Defined Skills, where you write skill content directly in code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SkillsProvider&lt;/span&gt;

&lt;span class="n"&gt;my_skill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-code-skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A code-defined skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Instructions for the skill.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pass it to &lt;code&gt;SkillsProvider&lt;/code&gt; through the &lt;code&gt;skills&lt;/code&gt; attribute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;skills_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SkillsProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;skill_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;my_skill&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This opens the door to managing and loading skills from a database. But the original &lt;code&gt;SkillsProvider&lt;/code&gt; class only accepts skills at initialization time. We want to load skills dynamically while the agent system is running, so we need to extend &lt;code&gt;SkillsProvider&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After reading the source code, I found that every class extending &lt;code&gt;BaseContextProvider&lt;/code&gt; has a &lt;code&gt;before_run&lt;/code&gt; method that gets called when the agent calls &lt;code&gt;run&lt;/code&gt;. We can load the latest skills from the database at the start of &lt;code&gt;before_run&lt;/code&gt;, then update &lt;code&gt;SkillsProvider&lt;/code&gt;'s &lt;code&gt;self._skills&lt;/code&gt; list and refresh the skills description in &lt;code&gt;instructions&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What I need is a hook method that fetches the latest skills each time &lt;code&gt;before_run&lt;/code&gt; runs. All I need to do is put the database-fetching logic inside this hook.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq33zflqmrchfw1awav4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faq33zflqmrchfw1awav4.png" alt="The workflow for loading skills from the database in real time. Image by Author" width="698" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The simplest way to give &lt;code&gt;SkillsProvider&lt;/code&gt; this hook is to build an &lt;code&gt;UpdatableSkillsProvider&lt;/code&gt; subclass. This subclass accepts a &lt;code&gt;skills_updater&lt;/code&gt; parameter at initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UpdatableSkillsProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SkillsProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;skill_paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;skills_updater&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;]]]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;skill_paths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;skill_paths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skills_updater&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skills_updater&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;UpdatableSkillsProvider&lt;/code&gt; calls the hook through a private &lt;code&gt;_update&lt;/code&gt; method, which also updates &lt;code&gt;self._skills&lt;/code&gt; and the agent's system prompt. Then &lt;code&gt;before_run&lt;/code&gt; calls &lt;code&gt;_update&lt;/code&gt; to keep skills fresh in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UpdatableSkillsProvider&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SkillsProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skills_updater&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_skills&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_skills_updater&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_skills&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skills&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;skill&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;skill&lt;/span&gt;

            &lt;span class="n"&gt;has_scripts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scripts&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skills&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_create_instructions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_instruction_template&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_skills&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;include_script_runner_instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;has_scripts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;include_script_runner_tool&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;has_scripts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;require_script_approval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_require_script_approval&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to update skills: %s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@override&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;before_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;before_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's write a &lt;code&gt;get_latest_skills&lt;/code&gt; hook to simulate loading the latest skills from a database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_latest_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Pseudocode. In this hook method, you can read the skills text from the database 
    and dynamically build Skill objects.
    :return: 
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;code_style_skill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Skill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code-style&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Coding style guidelines and conventions for the team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dedent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;&lt;span class="s"&gt;            Use this skill when answering questions about coding style,
            conventions, or best practices for the team.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;code_style_skill&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
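
&lt;p&gt;In a real deployment, the hook would query your skills table instead. Here's one way that could look with &lt;code&gt;asyncpg&lt;/code&gt;; the connection string, table, and column names are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncpg

from agent_framework import Skill

DATABASE_URL = "postgresql://agent:secret@db-host/agents"  # illustrative


async def get_latest_skills() -&amp;gt; list[Skill]:
    """Fetch approved skills from a (hypothetical) skills table and build Skill objects."""
    conn = await asyncpg.connect(DATABASE_URL)
    try:
        rows = await conn.fetch(
            "SELECT name, description, content FROM agent_skills WHERE approved = TRUE"
        )
    finally:
        await conn.close()

    return [
        Skill(name=row["name"], description=row["description"], content=row["content"])
        for row in rows
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;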



&lt;p&gt;Call the agent's &lt;code&gt;run&lt;/code&gt; method, then check in MLflow whether the skills loaded by &lt;code&gt;get_latest_skills&lt;/code&gt; show up in the agent's system prompt:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68h00cw37l6aqouzechh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68h00cw37l6aqouzechh.png" alt="The skills loaded from the database have been updated into the agent's system prompt. Image by Author" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hook method works. We can now load skills from a database in real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run scripts from skills safely inside containers
&lt;/h3&gt;

&lt;p&gt;As of the latest version, Microsoft Agent Framework can't run Python scripts locally or inside containers. But most skills guide the agent through business logic using scripts, so we need to give the agent the ability to run those scripts in a code interpreter.&lt;/p&gt;

&lt;p&gt;As the predecessor to MAF, Autogen provided a way to run Python scripts inside Docker containers. You can learn about that in this article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/exclusive-reveal-code-sandbox-tech-behind-manus-and-claude-agent-skills/" rel="noopener noreferrer"&gt;Exclusive Reveal: Code Sandbox Tech Behind Manus and Claude Agent Skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need something like Autogen's &lt;code&gt;DockerCommandLineCodeExecutor&lt;/code&gt; for Agent Framework. With the help of AI coding tools, building a code executor for Agent Framework isn't hard. (You can find it in the source code repo at the end of the article.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;code_executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DockerCommandLineCodeExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python-code-sandbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;work_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;work_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delete_tmp_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TAVILY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TAVILY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
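
&lt;p&gt;For reference, here's a stripped-down sketch of how such an executor could be put together with the &lt;code&gt;docker&lt;/code&gt; SDK. The names mirror the snippet above, but the body is illustrative only (the real executor also returns a result object and honors the cancellation token); see the source code repo for the full version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import uuid
from dataclasses import dataclass
from pathlib import Path

import docker


@dataclass
class CodeBlock:
    code: str
    language: str = "python"


class DockerCommandLineCodeExecutor:
    """Illustrative sketch: runs code blocks in throwaway containers via the docker SDK."""

    def __init__(self, image: str, work_dir: Path, delete_tmp_files: bool = True,
                 environment: dict | None = None) -&amp;gt; None:
        self._image = image
        self._work_dir = Path(work_dir)
        self._delete_tmp_files = delete_tmp_files
        self._environment = environment or {}
        self._client = docker.from_env()

    async def __aenter__(self):
        return self

    async def __aexit__(self, *exc):
        return None

    async def execute_code_blocks(self, blocks: list[CodeBlock], cancellation_token=None) -&amp;gt; str:
        outputs = []
        for block in blocks:
            suffix = "py" if block.language == "python" else "sh"
            file_name = f"tmp_{uuid.uuid4().hex}.{suffix}"
            (self._work_dir / file_name).write_text(block.code, encoding="utf-8")
            command = ["python", file_name] if suffix == "py" else ["sh", file_name]

            # Run the script inside a disposable container with the work dir mounted.
            logs = await asyncio.to_thread(
                self._client.containers.run,
                self._image,
                command,
                working_dir="/workspace",
                volumes={str(self._work_dir.resolve()): {"bind": "/workspace", "mode": "rw"}},
                environment=self._environment,
                remove=True,
            )
            outputs.append(logs.decode("utf-8", errors="replace"))

            if self._delete_tmp_files:
                (self._work_dir / file_name).unlink(missing_ok=True)
        return "\n".join(outputs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;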



&lt;p&gt;To keep the tool call exposed to the LLM simple, we also wrap the executor in a &lt;code&gt;CodeExecutionTool&lt;/code&gt; class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CodeExecutionTool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Tool for executing code using a CodeExecutor.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CodeExecutor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Literal&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute_code_blocks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;CodeBlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
            &lt;span class="nc"&gt;CancellationToken&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, initialize an &lt;code&gt;execute_code&lt;/code&gt; tool and wire it up to the agent at initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;code_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodeExecutionTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_executor&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;execute_code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In MLflow, you can see that when the agent needs to search the web, it generates Python code based on the skill's instructions and sends it to the container for execution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa34kkyrhxca4f14aqvwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa34kkyrhxca4f14aqvwe.png" alt="The agent generated code based on the skill's instructions and executed it in a container. Image by Author" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach not only lets the agent run code defined in skills, but also keeps that execution safe inside a container.&lt;/p&gt;

&lt;p&gt;Of course, in a server-side deployment, you'd send code to a centralized Jupyter kernel environment for execution. But that's a whole other story. You can dig into that in my other articles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/how-i-crushed-advent-of-code-and-solved-hard-problems-using-autogen-jupyter-executor-and-qwen3/" rel="noopener noreferrer"&gt;How I Crushed Advent of Code And Solved Hard Problems Using Autogen Jupyter Executor and Qwen3&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduce context length even further
&lt;/h3&gt;

&lt;p&gt;Agent Skills uses progressive disclosure to keep irrelevant skill content from eating up your context window. But as the conversation or task moves forward, skill content that was loaded into earlier messages will still pile up in the context over time.&lt;/p&gt;

&lt;p&gt;Agent systems today have several context pruning techniques available. Context trimming and context compression, both common in desktop agents, work really well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr62ga60f8gi0rzcupi5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr62ga60f8gi0rzcupi5b.png" alt="The difference between context pruning and context compression. Image by Author" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond those two, today I want to share a context engineering technique I discovered at work that fits Agent Skills even better.&lt;/p&gt;

&lt;p&gt;As you know, in enterprise scenarios, loading a skill usually means running one atomic workflow: researching a topic through web search? Sure. Running a SWOT analysis on a company and writing a report? No problem.&lt;/p&gt;

&lt;p&gt;These workflows all share one thing in common. You give the agent the right input, then wait for it to return an output. Which skill the agent loaded, and how it worked through the task — I honestly don't care. I wouldn't even mind if the agent unloaded the skill after finishing to save tokens.&lt;/p&gt;

&lt;p&gt;That sounds a lot like how a function works. So, can we use an agent with skills loaded as a tool for another agent? Absolutely. That's exactly what I do.&lt;/p&gt;

&lt;p&gt;Microsoft Agent Framework has a method on Agent called &lt;code&gt;as_tool&lt;/code&gt;. It turns an agent into a function-callable tool.&lt;/p&gt;

&lt;p&gt;So I designed a main agent. The main agent takes user requests and generates the right response to return. The agent with Agent Skills loading capability turns itself into a tool for the main agent using &lt;code&gt;as_tool&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dedent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a smart little helper who, for each user request, 
    picks the right task description to call a tool, gets the answer, 
    and then delivers the final result.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;skills_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_tool&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skills agent's workflow stays the same. It loads the right skill based on the task description, generates and runs code, then returns the result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuco2dlm4p81kbyrhe91b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuco2dlm4p81kbyrhe91b.png" alt="The skill agent is provided as a tool for the main agent to call. Image by Author" width="800" height="478"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But the main agent is different. Its context only holds user messages, the tool call to the skills agent, and the final response. No skill-related content at all. The main agent's context stays clean, so even in long-running conversations it won't crowd the LLM's context window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyczyc9ercjrderiz25de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyczyc9ercjrderiz25de.png" alt="Keep the main agent's context clean by loading skills into the sub-agent. Image by Author" width="402" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a nice bonus too. An LLM is often better than a human at spelling out exactly what it needs, so before the main agent calls the skills agent, it rewrites the user's request into a more precise task description. This helps the skills agent execute more accurately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;That's everything I have for you today on Agent Skills for enterprise agent systems.&lt;/p&gt;

&lt;p&gt;Unlike desktop agents, enterprise agent systems run on cloud servers. There's no way to update an agent's skills through the file system in real time without downtime.&lt;/p&gt;

&lt;p&gt;So I went with a targeted approach. This approach lets users write skill content through a web interface and save it to a database, while agents read the latest skills in real time and sync them across server nodes.&lt;/p&gt;

&lt;p&gt;I used the latest version of Microsoft Agent Framework to build this, but you can use any other framework. The principles are the same.&lt;/p&gt;

&lt;p&gt;I also covered how to run scripts the agent generates from skills inside containers, which is much safer than running scripts directly on a desktop system.&lt;/p&gt;

&lt;p&gt;And I shared a context management approach from my day-to-day work that suits skills-based agents especially well.&lt;/p&gt;

&lt;p&gt;The Microsoft Agent Framework API is still a bit unstable. If anything is unclear, feel free to leave a comment, and I'll get back to you as soon as I can.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Share this with your friends if you think it might help someone else.&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/how-to-use-agent-skills-in-enterprise-llm-agent-systems/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Most “agent memory” setups just save everything and hope semantic search will fix it.

I share a pattern that works better:
✅ Use an LLM to decide what is worth remembering
✅ Store memories in RedisVL
✅ Run memory extraction in parallel with asyncio</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Tue, 10 Feb 2026 11:49:54 +0000</pubDate>
      <link>https://dev.to/qtalen/most-agent-memory-setups-just-save-everything-and-hope-semantic-search-will-fix-it-i-share-a-38m5</link>
      <guid>https://dev.to/qtalen/most-agent-memory-setups-just-save-everything-and-hope-semantic-search-will-fix-it-i-share-a-38m5</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc" class="crayons-story__hidden-navigation-link"&gt;Advanced RedisVL Long-term Memory Tutorial: Using an LLM to Extract Memories&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/qtalen" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" alt="qtalen profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/qtalen" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Peng Qian
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Peng Qian
                
              
              &lt;div id="story-author-preview-content-3234678" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/qtalen" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Peng Qian&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Feb 10&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc" id="article-link-3234678"&gt;
          Advanced RedisVL Long-term Memory Tutorial: Using an LLM to Extract Memories
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/agents"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;agents&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/redis"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;redis&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              1&lt;span class="hidden s:inline"&gt; comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            10 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>redis</category>
    </item>
    <item>
      <title>Advanced RedisVL Long-term Memory Tutorial: Using an LLM to Extract Memories</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Tue, 10 Feb 2026 11:49:27 +0000</pubDate>
      <link>https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc</link>
      <guid>https://dev.to/qtalen/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories-35kc</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this weekend note, we continue exploring how to build long-term memory for an agent with RedisVL.&lt;/p&gt;

&lt;p&gt;When we build a long-term memory module for an agent, two questions matter most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After running for a long time, will the saved memories grow too large and cause context explosion?&lt;/li&gt;
&lt;li&gt;How do we recall the memories that matter most to the current context?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will solve these two problems today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLDR:&lt;/strong&gt; In this hands-on tutorial, we first use an LLM to extract information from user messages that has value for later chats. Then we store that as long-term memory in RedisVL. When needed, we search related memories with semantic search. With this setup, the agent understands the past context of the user and gives more accurate answers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfkidrk7fzb6y54ql7x8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjfkidrk7fzb6y54ql7x8.png" alt="After using an LLM to extract memories, how my long‑term memory module runs. Image by Author" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this kind of long-term memory, we no longer worry about memory explosion after long runs, or about unrelated memories hurting the LLM's responses.&lt;/p&gt;

&lt;p&gt;You can get all the source code at the end of this post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we do this?
&lt;/h3&gt;

&lt;p&gt;In the last hands-on post, I shared how to build short-term and long-term memory for an agent with RedisVL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/build-long-term-and-short-term-memory-for-agents-using-redisvl/" rel="noopener noreferrer"&gt;Build Long-Term and Short-Term Memory for Agents Using RedisVL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The short-term part works very well. The RedisVL API feels much simpler than the raw Redis API, and I can write that code with little effort.&lt;/p&gt;

&lt;p&gt;The long-term part does not work as well. Following the official example, we store user queries and LLM responses in RedisVL. As the chat continues, semantic search keeps pulling back repeated queries or unrelated answers, which confuses the LLM and stalls the conversation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can we avoid RedisVL?
&lt;/h3&gt;

&lt;p&gt;Your boss will not agree. Redis is already in your stack, so why install mem0 or another open-source tool and take on the extra cost? This is real life.&lt;/p&gt;

&lt;p&gt;So we still need to make RedisVL work. But I do not want to flail around blindly, so before diving in, let's look at how humans handle memory.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Humans Handle Memory
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What deserves memory
&lt;/h3&gt;

&lt;p&gt;First, we need to know one thing: only information that centers on me and is closely tied to me deserves a sticky note.&lt;/p&gt;

&lt;p&gt;So what information about me do I want to write down?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preference settings such as tools I like, languages I use, my schedule, and my tone when I talk&lt;/li&gt;
&lt;li&gt;Stable personal info such as my role, my time zone, and my daily habits&lt;/li&gt;
&lt;li&gt;Goals and decisions, such as chosen options, plans, and sentences that start with “I decide to...”&lt;/li&gt;
&lt;li&gt;Key milestones such as job change, moving, deadlines, and product launches&lt;/li&gt;
&lt;li&gt;Work and project context, such as project names, stakeholders, requirements, and status like the “done/next step”&lt;/li&gt;
&lt;li&gt;Repeated pain points or strong views that will change LLM advice later&lt;/li&gt;
&lt;li&gt;Things I say with “remember this ...” or “do not forget ...”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What does not deserve memory
&lt;/h3&gt;

&lt;p&gt;I do not plan to store any LLM answers. An LLM's answer to the same question changes with context, so keeping LLM answers in long-term memory does not help much.&lt;/p&gt;

&lt;p&gt;Besides LLM answers, I also do not want to keep these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-time small things that likely will not matter later&lt;/li&gt;
&lt;li&gt;Very sensitive personal data such as health diagnoses, exact addresses, government IDs, passwords, and bank account numbers&lt;/li&gt;
&lt;li&gt;Things I clearly ask not to remember&lt;/li&gt;
&lt;li&gt;Things I already wrote down on the sticky note&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Design a Prompt for LLM Memory Extraction
&lt;/h2&gt;

&lt;p&gt;Now we know how humans handle memory. Next, I want to build an agent that follows the same rules and extracts memories from my daily chats.&lt;/p&gt;

&lt;p&gt;The key lies in the &lt;code&gt;system prompt&lt;/code&gt;. I describe all the rules there, then ask the agent to follow them with very high consistency.&lt;/p&gt;

&lt;p&gt;In the past, I might have taken on a “write a 1000-line prompt” challenge. Now I do not need to. I just open any LLM client, paste these rules, and ask the LLM to help me write a &lt;code&gt;system prompt&lt;/code&gt;. This takes less than a minute.&lt;/p&gt;

&lt;p&gt;After a few tries, I pick one I like. Here is that &lt;code&gt;system prompt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Your job:
Based only on the user’s current input and the existing related memories, decide whether you need to add a new “long-term memory,” and if needed, **extract just one fact**. You do not talk to the user. You only handle memory extraction and deduplication.

---

### 1. Core principles

1. Only save information that **will likely be useful in the future**.
2. **At most one fact per turn**, and it must clearly appear in the current input.
3. **Never invent or infer anything**. You can only restate or lightly rephrase what the user has explicitly said.
4. If the current input has nothing worth keeping, or the information is already in the related memories, then do not add a new memory.

---

### 2. What counts as “long-term memory”

Only consider the categories below, and decide whether the information has long-term value:

...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Due to space, I only show part of the prompt here. You can get the full prompt from the source code at the end.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build a ContextProvider for Long-term Memory
&lt;/h2&gt;

&lt;p&gt;With the memory extraction rules in place, we can start building the long-term memory module for the agent.&lt;/p&gt;

&lt;p&gt;As in the last post, I pick the Microsoft Agent Framework (MAF). It provides a &lt;code&gt;ContextProvider&lt;/code&gt; feature that lets us plug long-term memory into the agent in a simple way.&lt;/p&gt;

&lt;p&gt;Of course, the principle of long-term memory stays the same. You can use any agent framework you like and build your own memory module, or skip frameworks entirely, build memory storage and retrieval first, and expose them through function calls (see the sketch below). That is fine too.&lt;/p&gt;
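
&lt;p&gt;For illustration only, here is a minimal, framework-free sketch of that idea: two hypothetical functions, &lt;code&gt;save_memory&lt;/code&gt; and &lt;code&gt;search_memory&lt;/code&gt;, wrapping whatever store you use, which an agent could then register as ordinary function-call tools. The &lt;code&gt;MemoryStore&lt;/code&gt; interface is an assumption made for the sketch, not code from this article.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical, framework-free sketch: expose memory storage and retrieval
# as plain functions that any agent framework can register as tools.
from typing import Protocol


class MemoryStore(Protocol):
    """Assumed interface of whatever vector store you use (e.g. RedisVL)."""
    def add(self, text: str, session_tag: str) -&gt; None: ...
    def search(self, query: str, session_tag: str, top_k: int = 5) -&gt; list[str]: ...


def save_memory(store: MemoryStore, fact: str, session_tag: str) -&gt; str:
    """Tool: persist one extracted fact as a long-term memory."""
    store.add(fact, session_tag)
    return "saved"


def search_memory(store: MemoryStore, query: str, session_tag: str) -&gt; str:
    """Tool: return related memories as a newline-separated bullet list."""
    memories = store.search(query, session_tag)
    return "\n".join(f"* {m}" for m in memories)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;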

&lt;h3&gt;
  
  
  Run memory extraction in sequence
&lt;/h3&gt;

&lt;p&gt;In the last post, I already built a long-term memory module with &lt;code&gt;ContextProvider&lt;/code&gt;. The new version looks similar. But this time, we use an LLM to extract memories. So after we set up &lt;code&gt;ContextProvider&lt;/code&gt;, we first use the &lt;code&gt;system prompt&lt;/code&gt; to build a memory extraction agent.&lt;/p&gt;

&lt;p&gt;If you do not know how to use &lt;code&gt;ContextProvider&lt;/code&gt; yet, I suggest reading my last post, which explains &lt;code&gt;ChatMessageStore&lt;/code&gt; and &lt;code&gt;ContextProvider&lt;/code&gt; in the Microsoft Agent Framework in detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/build-long-term-and-short-term-memory-for-agents-using-redisvl/" rel="noopener noreferrer"&gt;Build Long-Term and Short-Term Memory for Agents Using RedisVL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To avoid pulling too much unrelated data from Redis, I set the &lt;code&gt;distance_threshold&lt;/code&gt; value fairly small, but not so small that retrieval loses its meaning. You can pick a value that suits you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEFAULT_CONTEXT_PROMPT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;llm_base_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_extractor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_extractor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_extractor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_llm_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;as_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;default_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExtractResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extra_body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enable_thinking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
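
&lt;p&gt;The &lt;code&gt;ExtractResult&lt;/code&gt; model passed as &lt;code&gt;response_format&lt;/code&gt; above is not shown in this excerpt. Judging from how the result is consumed later (&lt;code&gt;should_write_memory&lt;/code&gt; and &lt;code&gt;memory_to_write&lt;/code&gt;), a minimal Pydantic sketch might look like this; treat the field descriptions as my assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: a structured-output model matching how ExtractResult is used.
from pydantic import BaseModel, Field


class ExtractResult(BaseModel):
    """Structured output of the memory extraction agent."""
    should_write_memory: bool = Field(
        description="Whether the current input contains a fact worth saving."
    )
    memory_to_write: str = Field(
        default="",
        description="The single extracted fact; empty when nothing should be saved."
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;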



&lt;p&gt;Next we implement the &lt;code&gt;invoking&lt;/code&gt; method, which runs before the chat agent calls the LLM. In this method, we extract and store long-term memory.&lt;/p&gt;

&lt;p&gt;To keep the logic clear, I first implement the &lt;code&gt;invoking&lt;/code&gt; method sequentially, as in this diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nwk96ayu5xi64e3138n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nwk96ayu5xi64e3138n.png" alt="Run the memory extraction logic step by step in order. Image by Author" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a new user request comes into &lt;code&gt;ContextProvider&lt;/code&gt;, we first search RedisVL with semantic search for the most similar memories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;line_sep_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_line_sep_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_line_sep_memories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_semantic_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_relevant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;line_sep_memories&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;* &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;line_sep_memories&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we send these existing memories plus the user request to the memory extraction agent. That agent first checks whether anything is worth saving according to the rules, then extracts a new, useful memory from the user request and saves it into RedisVL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_save_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_sep_memories&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_save_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;relevant_memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;detect_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;USER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Existing related memories：&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;relevant_memory&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;relevant_memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_extractor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detect_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;extract_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ExtractResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ExtractResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extract_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;should_write_memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_semantic_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;extract_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory_to_write&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we put the memories retrieved from RedisVL into &lt;code&gt;Context&lt;/code&gt; as extra context. These memory messages get merged into the chat agent's history and give it extra background for producing answers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_context_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;line_sep_memories&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_sep_memories&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we build a simple chat agent to test the new long-term memory module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MAX&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;as_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_new_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we chat with the agent and see how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcnctscvwqhjh6w2zn5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcnctscvwqhjh6w2zn5n.png" alt="The yellow parts are memories fetched from RedisVL, and the green parts are memories extracted by the LLM. Image by Author" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From MLflow we can see that the retrieved memories go into the chat history as a separate message:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h8qiasa2ggguho0xm7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9h8qiasa2ggguho0xm7b.png" alt="The retrieved memory will be added to the chat history as a separate message. Image by Author" width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we check Redis and see what memories we saved:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrir4vlqejr2vekvt8eg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrir4vlqejr2vekvt8eg.png" alt="The info that the LLM pulls out from the user’s past conversations will be saved into Redis as valuable memories. Image by Author" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the chat goes on, the new long-term memory module no longer stores and retrieves the entire chat history indiscriminately. It keeps and retrieves only the memories that matter to the user, and those memories provide real help in later chats.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use concurrency to speed up
&lt;/h3&gt;

&lt;p&gt;Everything looks fine except for the part where we use an LLM to extract memories that deserve saving.&lt;/p&gt;

&lt;p&gt;The largest delays in an agent usually come from LLM calls, and we have just added one more. Worse, we have to wait for the LLM to decide whether to save a memory before the real chat can continue.&lt;/p&gt;

&lt;p&gt;We can add some logs and see how much delay we add:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopspxtr21l7dmgrpqrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgopspxtr21l7dmgrpqrp.png" alt="Every time we use the LLM to pull out memories, it takes a bit more than one second. Image by Author" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We add more than one second per chat turn.&lt;/p&gt;

&lt;p&gt;One way to optimize is to use a smaller model like &lt;code&gt;qwen3-8b&lt;/code&gt;, but the gain is limited: we save little time, and the smaller model hurts memory quality.&lt;/p&gt;

&lt;p&gt;Today I take a different route: concurrent programming, so that the LLM call for memory extraction and the LLM call for the user's reply run at the same time.&lt;/p&gt;

&lt;p&gt;Let us see the result after that change:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jeq2os0151pg3sta11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72jeq2os0151pg3sta11.png" alt="The memory retrieval step basically doesn’t take any time at all, so how did I pull that off? Image by Author" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The time cost of extracting and storing memories drops to almost nothing while the effect stays the same. How did we get there?&lt;/p&gt;

&lt;p&gt;If you have built multi-agent workflows with LangGraph or LlamaIndex, you have likely seen the fan-out pattern: many nodes run at the same time, and you collect the final result at the end.&lt;/p&gt;

&lt;p&gt;The underlying mechanism is Python's &lt;code&gt;asyncio&lt;/code&gt; module; you see &lt;code&gt;async&lt;/code&gt; and &lt;code&gt;await&lt;/code&gt; all over agent code. I have written several posts about &lt;code&gt;asyncio&lt;/code&gt; and concurrency:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/use-these-methods-to-make-your-python-concurrent-tasks-perform-better/" rel="noopener noreferrer"&gt;Use These Methods to Make Your Python Concurrent Tasks Perform Better&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, when delays come from long IO calls, concurrent programming is the tool to reach for. The sketch below illustrates the idea.&lt;/p&gt;
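
&lt;p&gt;To make the idea concrete, here is a tiny, self-contained illustration (not code from this project) of how &lt;code&gt;asyncio&lt;/code&gt; lets two slow IO-bound calls overlap instead of running back to back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import time


async def slow_io_call(name: str, seconds: float) -&gt; str:
    # Stand-in for an LLM or network call: sleeping yields control to the event loop.
    await asyncio.sleep(seconds)
    return f"{name} done"


async def main() -&gt; None:
    start = time.perf_counter()
    # Both calls run concurrently; total time is roughly max(1.0, 1.2), not the sum.
    results = await asyncio.gather(
        slow_io_call("extract_memory", 1.0),
        slow_io_call("chat_reply", 1.2),
    )
    print(results, f"elapsed: {time.perf_counter() - start:.1f}s")


if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;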

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you use &lt;code&gt;mlflow.openai.autolog()&lt;/code&gt; to trace LLM calls, you may find that concurrent execution stops working. I still do not know why, so I suggest commenting out the MLflow parts before moving on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jdd3pzzalnd29zmmc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57jdd3pzzalnd29zmmc2.png" alt="By running the memory retrieval process with concurrent programming. Image by Author" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back to our case: &lt;code&gt;_save_memory&lt;/code&gt; is an &lt;code&gt;async&lt;/code&gt; method, which means we can run it concurrently.&lt;/p&gt;

&lt;p&gt;How do we do that? Very simple: instead of awaiting &lt;code&gt;_save_memory&lt;/code&gt;, we wrap the call in &lt;code&gt;asyncio.create_task&lt;/code&gt; so it runs as a separate concurrent task. That is all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LongTermMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_save_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line_sep_memories&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since a real user reply usually takes longer than memory extraction, we do not need to await that task in code. We only need to create it (though it is worth holding a reference to the task, as sketched below).&lt;/p&gt;
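
&lt;p&gt;One caveat worth knowing: &lt;code&gt;asyncio&lt;/code&gt; keeps only a weak reference to tasks created with &lt;code&gt;create_task&lt;/code&gt;, so a pure fire-and-forget task can in principle be garbage-collected before it finishes. A common precaution (a sketch, not from this article's code) is to hold the task in a set until it completes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

# Keep strong references to fire-and-forget tasks so they cannot be
# garbage-collected before they finish.
_background_tasks: set[asyncio.Task] = set()


def fire_and_forget(coro) -&gt; None:
    # Must be called from inside a running event loop.
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    # Drop the reference once the task is done.
    task.add_done_callback(_background_tasks.discard)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;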

&lt;p&gt;With that, we add a memory extraction module that does not bring much extra delay to the agent system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Redis is already standard infrastructure at many companies. With RedisVL, it can cache and search information semantically, which makes it easier to build short-term and long-term memory on top of Redis.&lt;/p&gt;

&lt;p&gt;But if you build long-term memory with the RedisVL API directly, you may not see good results: the system has no “brain” to judge which information deserves long-term storage and keeps its value over time.&lt;/p&gt;

&lt;p&gt;So in this tutorial, I first used an agent to extract useful memories and only then wrote them into RedisVL. This improves the value of the saved information. Long-term memory now works much better and fills the gap left in my last post.&lt;/p&gt;

&lt;p&gt;I also shared a short guide on using concurrent programming so that multiple LLM calls run at the same time, which cuts system latency significantly. If you are interested in concurrency, my earlier posts cover it in more depth.&lt;/p&gt;

&lt;p&gt;Thanks for reading. If you have any questions or ideas, leave me a note. I will reply as soon as I can.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories/#/portal/signup/" rel="noopener noreferrer"&gt;&lt;strong&gt;Do not forget to subscribe to my blog&lt;/strong&gt;&lt;/a&gt; and follow my new work in AI applications.&lt;/p&gt;

&lt;p&gt;Also share this post with your friends. It may help more people.&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/strong&gt;&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/advanced-redisvl-long-term-memory-tutorial-using-an-llm-to-extract-memories/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>redis</category>
    </item>
    <item>
      <title>Tried RedisVL as a memory layer for AI agents:

✅ Short-term memory with MessageHistory: works great
😒 Long-term semantic memory with SemanticMessageHistory: not so much

If you are thinking about RedisVL for agent memory, this will save you time:</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Thu, 29 Jan 2026 02:49:49 +0000</pubDate>
      <link>https://dev.to/qtalen/tried-redisvl-as-a-memory-layer-for-ai-agents-short-term-memory-with-messagehistory-works-hd3</link>
      <guid>https://dev.to/qtalen/tried-redisvl-as-a-memory-layer-for-ai-agents-short-term-memory-with-messagehistory-works-hd3</guid>
      <description>&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m" class="crayons-story__hidden-navigation-link"&gt;Build Long-Term and Short-Term Memory for Agents Using RedisVL&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/qtalen" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" alt="qtalen profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/qtalen" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Peng Qian
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Peng Qian
                
              
              &lt;div id="story-author-preview-content-3193063" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/qtalen" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Peng Qian&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;Jan 29&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m" id="article-link-3193063"&gt;
          Build Long-Term and Short-Term Memory for Agents Using RedisVL
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/datascience"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;datascience&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/redis"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;redis&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;1&lt;span class="hidden s:inline"&gt; reaction&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            9 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;




</description>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
      <category>redis</category>
    </item>
    <item>
      <title>Build Long-Term and Short-Term Memory for Agents Using RedisVL</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Thu, 29 Jan 2026 02:48:18 +0000</pubDate>
      <link>https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m</link>
      <guid>https://dev.to/qtalen/build-long-term-and-short-term-memory-for-agents-using-redisvl-4h8m</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For this weekend note, I want to share some experiments I ran using RedisVL to add short-term and long-term memory to my agent system.&lt;/p&gt;

&lt;p&gt;TLDR: RedisVL works pretty well for short-term memory. It feels a bit simpler than using the traditional Redis API. For long-term memory with semantic search, the experience is not good. I do not recommend it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why RedisVL?
&lt;/h3&gt;

&lt;p&gt;Big companies like to use mature infrastructure to build new features.&lt;/p&gt;

&lt;p&gt;We know mem0 and Graphiti are good open-source options for long-term agent memory. But companies want to stay safe: building new infrastructure costs money, can be unstable, and needs people who know how to run it.&lt;/p&gt;

&lt;p&gt;So when Redis launched RedisVL with vector search, we naturally wanted to try it first. You can connect it to an existing Redis cluster and start using it right away. That sounds nice, but is it really? We need to try it for real.&lt;/p&gt;

&lt;p&gt;Today I will cover how to use &lt;code&gt;MessageHistory&lt;/code&gt; and &lt;code&gt;SemanticMessageHistory&lt;/code&gt; from RedisVL to add short-term and long-term memory to agents built on the Microsoft Agent Framework.&lt;/p&gt;

&lt;p&gt;You can find the source code at the end of this article.&lt;/p&gt;




&lt;p&gt;📫 &lt;a href="https://www.dataleadsfuture.com/build-long-term-and-short-term-memory-for-agents-using-redisvl/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Don’t forget to follow my blog to stay updated on my latest progress in AI application practices.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Preparation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install Redis
&lt;/h3&gt;

&lt;p&gt;If you want to try it locally, you can install a Redis instance with Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; redis &lt;span class="nt"&gt;-p&lt;/span&gt; 6379:6379 &lt;span class="nt"&gt;-p&lt;/span&gt; 8001:8001 redis/redis-stack:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cannot use Docker Desktop? See my other article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/a-quick-guide-to-containerizing-agent-applications-with-podman/?source=post_page-----b6919f293d16---------------------------------------" rel="noopener noreferrer"&gt;A Quick Guide to Containerizing Agent Applications with Podman&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Redis instance will listen on ports 6379 and 8001. Your RedisVL client should connect to &lt;code&gt;redis://localhost:6379&lt;/code&gt;. You can visit &lt;code&gt;http://localhost:8001&lt;/code&gt; in the browser to open the Redis console.&lt;/p&gt;
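
&lt;p&gt;If you want to confirm the container is reachable before wiring it into the agent, a quick ping from Python is enough. This sketch assumes the &lt;code&gt;redis&lt;/code&gt; Python package (a RedisVL dependency) is installed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import redis

# Connect to the container started above and send a PING.
client = redis.Redis.from_url("redis://localhost:6379")
print(client.ping())  # True if the instance is up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;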

&lt;h3&gt;
  
  
  Install RedisVL
&lt;/h3&gt;

&lt;p&gt;Install RedisVL with pip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;redisvl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installation, you can use the RedisVL CLI to manage your indexes and keep your testing neat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rvl index listall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Implement Short-Term Memory Using MessageHistory
&lt;/h2&gt;

&lt;p&gt;There are lots of “How to” RedisVL articles online, so let’s start straight from Microsoft Agent Framework and see how to use &lt;code&gt;MessageHistory&lt;/code&gt; for short-term memory.&lt;/p&gt;

&lt;p&gt;As in the official tutorial, you should implement a &lt;code&gt;RedisVLMessageStore&lt;/code&gt; based on &lt;code&gt;ChatMessageStoreProtocol&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLMessageStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatMessageStoreProtocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;common_thread&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_top_k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;top_k&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_tag&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_message_history&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;__init__&lt;/code&gt;, you should pay attention to two parameters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;thread_id&lt;/code&gt; is used for the &lt;code&gt;name&lt;/code&gt; parameter when creating &lt;code&gt;MessageHistory&lt;/code&gt;. I like to bind it to the agent. Each agent gets a unique &lt;code&gt;thread_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;session_tag&lt;/code&gt; lets you set a tag for each user so different sessions do not mix.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The protocol asks us to implement two methods &lt;code&gt;list_messages&lt;/code&gt; and &lt;code&gt;add_messages&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;list_messages&lt;/code&gt; runs before the agent calls the LLM. It gets all available chat messages from the message store. It takes no parameters, so it cannot support long-term memory. More on that later.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;add_messages&lt;/code&gt; runs after the agent gets the LLM’s reply. It stores new messages into the message store.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how the message store works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtw2bb5qyugnkn5hrm0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgtw2bb5qyugnkn5hrm0k.png" alt="The calling order of message store in the agent. Image by Author" width="686" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So in &lt;code&gt;list_messages&lt;/code&gt; and &lt;code&gt;add_messages&lt;/code&gt;, we just use RedisVL’s &lt;code&gt;MessageHistory&lt;/code&gt; to do the job.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;list_messages&lt;/code&gt; below uses &lt;code&gt;get_recent&lt;/code&gt; to get &lt;code&gt;top_k&lt;/code&gt; recent messages and turns them into &lt;code&gt;ChatMessage&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLMessageStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatMessageStoreProtocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_message_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_recent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_top_k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_back_to_chat_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our &lt;code&gt;add_messages&lt;/code&gt; turns each &lt;code&gt;ChatMessage&lt;/code&gt; into a Redis message and calls &lt;code&gt;MessageHistory&lt;/code&gt;'s &lt;code&gt;add_messages&lt;/code&gt; to store them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLMessageStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatMessageStoreProtocol&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_to_redis_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_message_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is short-term memory done with RedisVL. You may also implement &lt;code&gt;deserialize&lt;/code&gt;, &lt;code&gt;serialize&lt;/code&gt; and &lt;code&gt;update_from_state&lt;/code&gt; for saving and loading the memory, but it is not important now. See the full code at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test RedisVLMessageStore
&lt;/h3&gt;

&lt;p&gt;Let’s build an agent and test the message store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NEXT&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a little helper who answers my questions in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chat_message_store_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RedisVLMessageStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now a console loop for multi-turn dialog. Remember, Microsoft Agent Framework does not support short-term memory unless you use an &lt;code&gt;AgentThread&lt;/code&gt; and pass it to &lt;code&gt;run&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_new_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the &lt;code&gt;AgentThread&lt;/code&gt; is created, it calls the factory method to build the &lt;code&gt;RedisVLMessageStore&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To check if the store works, we can use &lt;code&gt;mlflow.openai.autolog()&lt;/code&gt; to see if messages sent to the LLM contain historical messages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracking_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLFLOW_TRACKING_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autolog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wyl7jsqr7yjpa0dsesu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wyl7jsqr7yjpa0dsesu.png" alt="You can see that the conversation comes with a complete history of messages. Image by Author" width="720" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See my other article on using MLflow to trace LLM calls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/monitoring-qwen-3-agents-with-mlflow-3-x-end-to-end-tracking-tutorial/" rel="noopener noreferrer"&gt;Monitoring Qwen 3 Agents with MLflow 3.x: End-to-End Tracing Tutorial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s open the Redis console to see the cache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhxotsxih8tzwnl4mj0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhxotsxih8tzwnl4mj0w.png" alt="How the cache is stored in Redis. Image by Author" width="720" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, after using &lt;code&gt;MessageHistory&lt;/code&gt; as MAF's message store, we can implement multi-turn conversations with historical messages.&lt;/p&gt;

&lt;p&gt;With the &lt;code&gt;thread_id&lt;/code&gt; and &lt;code&gt;session_tag&lt;/code&gt; parameters, we can also let users switch between multiple conversation sessions, as in popular LLM chat applications.&lt;/p&gt;
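
&lt;p&gt;For example, here is a minimal sketch of handing each user their own store so histories never mix; &lt;code&gt;make_store_factory&lt;/code&gt; is just an illustrative helper of mine, not part of the framework.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def make_store_factory(user_id: str):
    # Each user gets a distinct session_tag so conversations never mix;
    # the agent itself keeps a single thread_id.
    return lambda: RedisVLMessageStore(
        thread_id="assistant_thread",
        session_tag=f"session_{user_id}",
    )

# Pass it to create_agent exactly as before:
# chat_message_store_factory=make_store_factory("user_abc")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;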

&lt;p&gt;Feels simpler than the official &lt;code&gt;RedisMessageStore&lt;/code&gt; solution, right?&lt;/p&gt;




&lt;h2&gt;
  
  
  Implement Long-Term Memory Using SemanticMessageHistory
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SemanticMessageHistory&lt;/code&gt; is a subclass of &lt;code&gt;MessageHistory&lt;/code&gt;. It adds a &lt;code&gt;get_relevant&lt;/code&gt; method for vector search.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what have I learned about the size of England?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;semantic_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_distance_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;semantic_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_relevant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Batches: 100&lt;span class="c"&gt;%|██████████| 1/1 [00:00&amp;lt;00:00, 56.30it/s]&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;'role': 'user', 'content': 'what is the size of England compared to Portugal?'&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to &lt;code&gt;MessageHistory&lt;/code&gt;, the big difference here is that we can retrieve the most relevant historical messages based on the user's request.&lt;/p&gt;

&lt;p&gt;You might think that if &lt;code&gt;MessageStore&lt;/code&gt; short-term memory is nice, then &lt;code&gt;SemanticMessageHistory&lt;/code&gt; with semantic search must be even better.&lt;/p&gt;

&lt;p&gt;From my experience and test results, that is not the case. Let’s now build a long-term memory adapter for Microsoft Agent Framework using &lt;code&gt;SemanticMessageHistory&lt;/code&gt; and see the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use SemanticMessageHistory in Microsoft Agent Framework
&lt;/h3&gt;

&lt;p&gt;Earlier I said &lt;code&gt;list_messages&lt;/code&gt; in &lt;code&gt;ChatMessageStoreProtocol&lt;/code&gt; has no parameters, so we cannot search history. Thus, we cannot use &lt;code&gt;MessageStore&lt;/code&gt; for long-term memory.&lt;/p&gt;

&lt;p&gt;Microsoft Agent Framework has a &lt;code&gt;ContextProvider&lt;/code&gt; class. As the name suggests, it is meant for context engineering.&lt;/p&gt;

&lt;p&gt;So we should build long-term memory on this class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BAAI/bge-m3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;embedding_endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_thread&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_tag&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_distance_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;distance_threshold&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis_url&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_endpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;embedding_endpoint&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EMBEDDING_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_semantic_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ContextProvider&lt;/code&gt; has two methods &lt;code&gt;invoked&lt;/code&gt; and &lt;code&gt;invoking&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;invoked&lt;/code&gt; runs after the LLM call. It stores the latest messages in RedisVL. It receives both &lt;code&gt;request_messages&lt;/code&gt; and &lt;code&gt;response_messages&lt;/code&gt; parameters, and each message is stored as a separate entry.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;invoking&lt;/code&gt; runs before the LLM call. It uses the user’s current input to search for relevant history in RedisVL and returns a &lt;code&gt;Context&lt;/code&gt; object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;Context&lt;/code&gt; object has three fields, shown in the list below and in the small sketch that follows it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;instructions&lt;/code&gt; string. The agent adds this to the system prompt.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;messages&lt;/code&gt; list. Put history messages found in long-term memory here.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools&lt;/code&gt; list for functions. The agent adds these tools to its &lt;code&gt;ChatOptions&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
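
&lt;p&gt;To make the three fields concrete, here is a small sketch of building such a &lt;code&gt;Context&lt;/code&gt;. The field names come from the list above; treating all three as constructor keyword arguments is my assumption (the implementation later in this article only passes &lt;code&gt;messages&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A hedged sketch: relevant_messages stands for the ChatMessage list
# retrieved from long-term memory (see the invoking implementation later).
relevant_messages = []  # placeholder for the retrieved history

context = Context(
    instructions="Prefer the user's stored preferences when answering.",  # appended to the system prompt
    messages=relevant_messages,  # history found in long-term memory
    tools=[],                    # extra tools added to the agent's ChatOptions
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;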

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanoep9ppsgwxkvqrpf5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanoep9ppsgwxkvqrpf5h.png" alt="The purpose of the three types of messages retrieved. Image by Author" width="720" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since we want to use vector search to get relevant history, we put those messages in &lt;code&gt;messages&lt;/code&gt;. The ordering between &lt;code&gt;MessageStore&lt;/code&gt; messages and &lt;code&gt;ContextProvider&lt;/code&gt; messages matters; here is the order in which they are called.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbiqyfu1xp0kqbrrikkv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftbiqyfu1xp0kqbrrikkv.png" alt="The calling order of long-term and short-term memory in the agent. Image by Author" width="720" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up a TextVectorizer
&lt;/h3&gt;

&lt;p&gt;Semantic vector search needs embeddings. We must set up a vectorizer.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;__init__&lt;/code&gt;, besides &lt;code&gt;thread_id&lt;/code&gt; and &lt;code&gt;session_tag&lt;/code&gt;, we set the embedding model info.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_semantic_store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HFTextVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;vectorizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAITextVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;api_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_embedding_endpoint&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_semantic_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticMessageHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_thread_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_distance_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;redis_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_redis_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vectorizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can choose a server-hosted embedding model with OpenAI API or a local HuggingFace model, depending on whether &lt;code&gt;embedding_api_key&lt;/code&gt; is set.&lt;/p&gt;
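
&lt;p&gt;To make that concrete, here is a minimal sketch of both configurations using the constructor defined above; the API key and endpoint values are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# No API key set: falls back to a local HuggingFace embedding model.
local_memory = RedisVLSemanticMemory(
    session_tag="user_abc",
    embedding_model="BAAI/bge-m3",
)

# API key set: uses a server-hosted, OpenAI-compatible embedding endpoint.
hosted_memory = RedisVLSemanticMemory(
    session_tag="user_abc",
    embedding_model="BAAI/bge-m3",
    embedding_api_key="sk-...",                        # placeholder
    embedding_endpoint="https://api.example.com/v1",   # placeholder
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;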

&lt;h3&gt;
  
  
  Implement invoked and invoking methods
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;invoked&lt;/code&gt; is easy. As mentioned, the request and response messages arrive as separate parameters, so I merge them into one list and then call &lt;code&gt;add_messages&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoked&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;request_messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;response_messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Sequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;invoke_exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;request_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;request_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;response_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;chat_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request_messages&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;response_messages&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_to_redis_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_semantic_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is &lt;code&gt;invoking&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ContextProvider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# 1
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
                            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_semantic_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_relevant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_session_tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;# 2
&lt;/span&gt;        &lt;span class="n"&gt;relevant_messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_back_to_chat_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                             &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;relevant_messages&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;relevant_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 3
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Points to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;messages&lt;/code&gt; parameter may be a list for multi-modal input, so I merge all the text into one prompt.&lt;/li&gt;
&lt;li&gt;Since messages are stored separately, I sort them by timestamp to keep the original order.&lt;/li&gt;
&lt;li&gt;The retrieved messages go into &lt;code&gt;Context.messages&lt;/code&gt;, so they are appended after the current chat messages.&lt;/li&gt;
&lt;/ul&gt;
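
&lt;p&gt;The &lt;code&gt;_back_to_chat_message&lt;/code&gt; helper simply converts a raw entry from the semantic store back into a &lt;code&gt;ChatMessage&lt;/code&gt;. Here is a minimal sketch (the &lt;code&gt;role&lt;/code&gt;/&lt;code&gt;content&lt;/code&gt; key names are assumptions about what RedisVL returns):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    def _back_to_chat_message(self, message: dict) -&amp;gt; ChatMessage:
        # Rough sketch: rebuild a ChatMessage from the raw dict the semantic
        # store returns. The "role" and "content" keys are assumptions.
        role = Role.USER if message.get("role") == "user" else Role.ASSISTANT
        return ChatMessage(role=role, text=message.get("content", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;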

&lt;h3&gt;
  
  
  Test semantic memory
&lt;/h3&gt;

&lt;p&gt;Unlike the message store, we can set the &lt;code&gt;ContextProvider&lt;/code&gt; directly on the agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NEXT&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re a little helper who answers my questions in one sentence.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;context_providers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;memory_provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's write a &lt;code&gt;main&lt;/code&gt; that uses a &lt;code&gt;thread&lt;/code&gt; instance to keep short-term memory while testing a multi-turn dialog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_new_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;memory_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clear&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dl4vlwzwl7cxkypuya8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dl4vlwzwl7cxkypuya8.png" alt="The  raw `distance_threshold` endraw  is too high, causing irrelevant messages to be retrieved. Image by Author" width="622" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It seems the default &lt;code&gt;distance_threshold&lt;/code&gt; of 0.3 is too high. Let's lower it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;memory_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisVLSemanticMemory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session_tag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_abc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;distance_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test again:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbxjjkc9a0eplhdu1kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbxjjkc9a0eplhdu1kn.png" alt="Only request messages were retrieved, not response messages. Image by Author" width="720" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lower threshold keeps unrelated messages out. But because requests and responses are stored separately, only the requests are retrieved. The ContextProvider appends the retrieved messages to the end of the message list, so the LLM may think the user asked two similar questions. The MLflow trace confirms it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhznpnvj5gky9o7nunjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmhznpnvj5gky9o7nunjj.png" alt="Two similar questions were both added to the message list, but without attaching the already provided answers. Image by Author" width="720" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is bad. We care more about the LLM’s answers than the requests, but vector search usually matches the questions, not the answers. So this only adds redundant questions and does not help the LLM answer.&lt;/p&gt;

&lt;p&gt;It’s hard to say whether the fault lies with Microsoft Agent Framework or with RedisVL.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;ContextProvider&lt;/code&gt; finds related long-term chat messages, they are placed after the ones from the message store. If long-term and short-term messages overlap, they can confuse the LLM.&lt;/p&gt;

&lt;p&gt;Also, the fact that RedisVL does not store requests and responses together is a design choice I don’t like. LLM responses are the expensive part: in production, a response may involve web search, RAG retrieval, or running code. Yet vector search finds only the request, not the answer. That is a waste.&lt;/p&gt;
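
&lt;p&gt;One possible workaround (not something RedisVL provides out of the box) is to write each request and its response into the semantic store as a single combined entry, so a semantic hit on the question also brings back the answer. A rough sketch of such a helper on the provider, assuming RedisVL's &lt;code&gt;add_message(dict)&lt;/code&gt; API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    def _store_exchange(self, request: ChatMessage, response: ChatMessage) -&amp;gt; None:
        # Hypothetical workaround: save the question and its answer as one entry,
        # so retrieving the question also surfaces the answer it already got.
        combined = f"User: {request.text}\nAssistant: {response.text}"
        self._semantic_store.add_message(
            {"role": "assistant", "content": combined},
            session_tag=self._session_tag,
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;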




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Today, we tried using RedisVL for short-term and long-term memory in Microsoft Agent Framework and checked the results.&lt;/p&gt;

&lt;p&gt;RedisVL is very handy for short-term agent memory. It is simpler than using the Redis API.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;SemanticMessageHistory&lt;/code&gt;, used for semantic search over the user's chat history, did not perform well, for the reasons explained above.&lt;/p&gt;

&lt;p&gt;Thanks to the solid Redis infrastructure, building a semantic cache with RedisVL is simpler than with other vector solutions.&lt;/p&gt;

&lt;p&gt;Next time, I will show you how to build a semantic cache with RedisVL that can save your company significant costs.&lt;/p&gt;

&lt;p&gt;Share your thoughts in the comments.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.dataleadsfuture.com/build-long-term-and-short-term-memory-for-agents-using-redisvl/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Subscribe to my blog&lt;/strong&gt;&lt;/a&gt; to follow my latest agent app work.&lt;/p&gt;

&lt;p&gt;And share this article with friends. Maybe it will help more people.😁&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/strong&gt;&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/build-long-term-and-short-term-memory-for-agents-using-redisvl/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
      <category>redis</category>
    </item>
    <item>
      <title>🎯As companies keep rolling out agent systems internally, issues like prompt injection attacks and tricking agents into breaking the rules are starting to pop up. 

It’s becoming urgent to build compliance safeguards for these agents.</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Fri, 09 Jan 2026 11:12:51 +0000</pubDate>
      <link>https://dev.to/qtalen/as-companies-keep-rolling-out-agent-systems-internally-issues-like-prompt-injection-attacks-and-24j</link>
      <guid>https://dev.to/qtalen/as-companies-keep-rolling-out-agent-systems-internally-issues-like-prompt-injection-attacks-and-24j</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/qtalen" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2927908%2F838177c1-066e-4986-a3ba-d764afa88632.jpg" alt="qtalen"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/qtalen/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent-37o0" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Microsoft Agent Framework (MAF) Middleware Basics: Add Compliance Fences to Your Agent&lt;/h2&gt;
      &lt;h3&gt;Peng Qian ・ Jan 9&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#programming&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#agentaichallenge&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>programming</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>Microsoft Agent Framework (MAF) Middleware Basics: Add Compliance Fences to Your Agent</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Fri, 09 Jan 2026 10:22:10 +0000</pubDate>
      <link>https://dev.to/qtalen/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent-37o0</link>
      <guid>https://dev.to/qtalen/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent-37o0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Microsoft added a middleware feature to their Agent Framework (MAF). This means you can use the chain-of-responsibility design pattern to add extra logic before or after the agent runs, a function tool is called, or the LLM is invoked, without changing the original business logic.&lt;/p&gt;

&lt;p&gt;This feature matters a lot.&lt;/p&gt;

&lt;p&gt;When building enterprise-level agent applications, teams typically collaborate across departments. Besides your own part, you might need to dynamically include permissions, logs, finance checks, and compliance reviews from other teams.&lt;/p&gt;

&lt;p&gt;These parts shouldn’t affect the agent’s core ability, but should be easy to install or remove. Like middleware in FastAPI or other web frameworks, MAF middleware enables this capability for agents as well.&lt;/p&gt;

&lt;p&gt;In today’s guide, I’ll show how I use MAF’s middleware and AG-UI to add a compliance review that checks user input before sending it to the agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjks1u0jkkizzwsnitsy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjks1u0jkkizzwsnitsy.png" alt="Inducing an agent will be blocked by compliance rules specific to certain business scenarios. Image by Author" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will teach you how to use middleware in enterprise agent applications and give you a first look at using AG-UI for microservice distributed agent development. Let’s start.&lt;/p&gt;

&lt;p&gt;You can get all the source code at the end. 👇&lt;/p&gt;




&lt;p&gt;📫 &lt;a href="https://www.dataleadsfuture.com/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Don’t forget to follow my blog to stay updated on my latest progress in AI application practices.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  System Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install the latest Microsoft Agent Framework
&lt;/h3&gt;

&lt;p&gt;MAF is still evolving quickly and its APIs change often, so this guide uses the newest version. It’s best to install the prerelease version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;agent-framework &lt;span class="nt"&gt;--pre&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add the dependency in your &lt;code&gt;pyproject.toml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;"agent-framework-ag-ui&amp;gt;=1.0.0b251223"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install Microsoft Agent Framework AG-UI
&lt;/h3&gt;

&lt;p&gt;MAF works with AG-UI to support distributed agent development. You’ll need this capability today, so install the latest version of ag-ui; otherwise, APIs won’t match up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;"agent-framework-ag-ui&amp;gt;=1.0.0b251223"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installing the needed Python packages, we can move on. First, let's get a quick background on what middleware is and what it can do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Intro to MAF Middleware
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is middleware
&lt;/h3&gt;

&lt;p&gt;According to the MAF documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Middleware in the Agent Framework intercepts, changes, and enhances agent behavior at different execution points. You can use it for logging, security checks, error handling, and result transformation without changing the agent’s or function’s core logic.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s what we’ll learn today.&lt;/p&gt;

&lt;h3&gt;
  
  
  How middleware works
&lt;/h3&gt;

&lt;p&gt;As I said before, MAF middleware uses the chain-of-responsibility pattern. Each piece of logic lives in its own node. Every node knows the next one. When a node finishes running, it passes control to the next node.&lt;/p&gt;

&lt;p&gt;Here’s a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRunContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;AgentRunContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Starting execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Execution completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next parameter points to the next node. You can run code before or after calling it.&lt;/p&gt;

&lt;p&gt;The actual agent logic acts as the last node. After all middleware nodes finish, the agent runs.&lt;/p&gt;

&lt;p&gt;In MAF, middleware can run in three stages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before or after run or run_stream.&lt;/li&gt;
&lt;li&gt;Before or after a function call.&lt;/li&gt;
&lt;li&gt;Before or after calling the LLM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ahstals97y7enjrmbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn0ahstals97y7enjrmbj.png" alt="The Microsoft Agent Framework middleware works at different stages of agent execution. Image by Author" width="735" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s look at how different middleware types work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Function-Based middleware
&lt;/h3&gt;

&lt;p&gt;If your middleware is simple, like just logging agent runs, use a function-based middleware.&lt;/p&gt;

&lt;p&gt;You only need a function with two parameters: context and next. The context keeps your runtime info, and next calls the next node.&lt;/p&gt;

&lt;p&gt;MAF uses the type annotation of context to tell which stage this code belongs to. For example, if it runs at the agent stage, the type should be AgentRunContext:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRunContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;AgentRunContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Starting execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Execution completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a function call stage, use FunctionInvocationContext:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_function_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FunctionInvocationContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;FunctionInvocationContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Function] Calling &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Function] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for the chat stage, use ChatContext:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_chat_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] Sending &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; messages to AI.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] AI response received.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you dislike type annotations, you can use decorators.&lt;/p&gt;

&lt;p&gt;@agent_middleware runs at the agent stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@agent_middleware&lt;/span&gt;    
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Starting execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Execution completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you don’t need to add type annotations anymore.&lt;/p&gt;

&lt;p&gt;There are also @function_middleware and @chat_middleware for function calls and chat calls.&lt;/p&gt;
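
&lt;p&gt;For example, the earlier function-logging middleware could be written with the decorator instead of the type annotation, something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;@function_middleware
async def logging_function_middleware(context, next) -&amp;gt; None:
    # Same logic as the annotated version above; the stage is chosen by the decorator.
    print(f"[Function] Calling {context.function.name}")
    await next(context)
    print(f"[Function] {context.function.name} completed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;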

&lt;p&gt;If your middleware needs to save state or handle more complex logic, function-based won’t be enough. Use class-based middleware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Class-Based middleware
&lt;/h3&gt;

&lt;p&gt;Class-based middleware organizes code with object-oriented methods. That lets middleware remember state and handle tricky logic.&lt;/p&gt;

&lt;p&gt;A class-based middleware must meet two rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inherit from the right base class: AgentMiddleware, FunctionMiddleware, or ChatMiddleware.&lt;/li&gt;
&lt;li&gt;Have a process method with the same parameters as the function-based ones. They use the same contexts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s an example for a middleware class that runs at the function call stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoggingFunctionMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FunctionMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FunctionInvocationContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;FunctionInvocationContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Function Class] Calling &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Function Class] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; completed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just make sure to pair the right base class with the right context type. The others follow the same rule.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use middleware
&lt;/h3&gt;

&lt;p&gt;There are three stages for middleware and three ways to build it. Let’s put that in one grid chart to see how they connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g7tn6xumfcl4srx01ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3g7tn6xumfcl4srx01ej.png" alt="Use a grid chart to describe the implementations of different middleware. Image by Author" width="771" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For now, the framework only supports passing middleware when creating the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;LoggingFunctionMiddleware&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;  &lt;span class="n"&gt;logging_chat_middleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;blocking_middleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging_function_middleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
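
&lt;p&gt;The &lt;code&gt;blocking_middleware&lt;/code&gt; in the list above is simply a middleware that short-circuits the chain: it skips &lt;code&gt;next&lt;/code&gt; and sets a result directly. A minimal sketch, assuming the agent-stage context exposes &lt;code&gt;messages&lt;/code&gt; and &lt;code&gt;result&lt;/code&gt; the same way the chat-stage context does (the trigger word and reply text are just placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def blocking_middleware(
    context: AgentRunContext,
    next: Callable[[AgentRunContext], Awaitable[None]],
) -&amp;gt; None:
    # Sketch: if the request should be blocked, never call next() and set a
    # canned result so the caller still receives a response.
    last_text = context.messages[-1].text if context.messages else ""
    if "guarantee" in last_text.lower():
        context.result = AgentRunResponse(
            messages=[ChatMessage(role=Role.ASSISTANT, text="Sorry, I can't help with that.")]
        )
        return
    await next(context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;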



&lt;p&gt;You can mix all nine types freely.&lt;/p&gt;

&lt;p&gt;But note that &lt;strong&gt;only the last function middleware you add actually works right now. I’m not sure if that’s a bug, but we’ll find out later.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Practice: Add Compliance Check to Your Agent
&lt;/h2&gt;

&lt;p&gt;Now let’s get hands-on. I’ll show how to use MAF middleware to add compliance checking to an agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why add compliance checks
&lt;/h3&gt;

&lt;p&gt;Every LLM already ships with basic compliance safeguards based on local laws. When companies self-host LLMs, they also add custom checks in serving frameworks like vLLM. But those only watch the model’s raw input or output.&lt;/p&gt;

&lt;p&gt;Now that agents are everywhere, we also need checks at the agent level: preventing prompt injection, checking MCP permissions, and so on. Middleware makes this possible.&lt;/p&gt;

&lt;p&gt;In today’s demo, we’ll review every user message to make sure no one tries to make our finance assistant promise investment returns.&lt;/p&gt;

&lt;p&gt;In the end, the agent will refuse to answer questions like “Will I lose money?” or “Can you guarantee profit?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2uk14uy0lkxmavhew86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2uk14uy0lkxmavhew86.png" alt="Inducing an agent will be blocked by compliance rules specific to certain business scenarios. Image by Author" width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How will you do it
&lt;/h3&gt;

&lt;p&gt;Why use compliance checks as an example? Because in real web apps, product teams don’t manage compliance themselves. The compliance department creates the rules and sends them as microservices to each product.&lt;/p&gt;

&lt;p&gt;That way, teams don’t touch those rules. They just plug them in using framework middleware. It’s common in normal web apps.&lt;/p&gt;

&lt;p&gt;We’ll do the same with MAF agents, using middleware to insert compliance logic.&lt;/p&gt;

&lt;p&gt;To simulate real setups, this project has two parts: one server and one client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kyxv1p1e9x322w4yq8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kyxv1p1e9x322w4yq8i.png" alt="The compliance check middleware will include both server and client modules. Image by Author" width="671" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the compliance department side, we’ll deploy a separate agent that reviews messages. It uses an LLM to check user inputs for prompt injections or non-compliant content.&lt;/p&gt;

&lt;p&gt;On the business side, we’ll have a middleware that intercepts user requests and sends them to that server. It decides whether the agent should respond.&lt;/p&gt;

&lt;p&gt;The two parts communicate using the AG-UI protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  Server implementation
&lt;/h3&gt;

&lt;p&gt;Let’s build the compliance-checking agent server.&lt;/p&gt;

&lt;p&gt;Since it only checks user requests, I’ll use the Qwen3-30b-a3b-instruct-2507 model for speed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Q30B_A3B&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;dedent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a compliance review officer. You will review user requests or system-generated text for compliance.
    Your main task is to check user requests and determine whether they are trying to induce the system to produce content that guarantees investment returns or similar topics.

    You should output a JSON text, like {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: 1, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="s"&gt;}

    Here, is_compliance being 1 means compliant, and 0 means non-compliant.

    reason should state the reason for compliance or non-compliance.

    Only output the JSON text without any markdown formatting, and do not add any introduction or explanation.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll make the output structured as JSON for clarity and speed.&lt;/p&gt;

&lt;p&gt;Although MAF supports structured output when using Qwen models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/make-microsoft-agent-frameworks-structured-output-work-with-qwen-and-deepseek-models/" rel="noopener noreferrer"&gt;Make Microsoft Agent Framework’s Structured Output Work With Qwen and DeepSeek Models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For some reason, it doesn’t work when used as an AG-UI server.&lt;/p&gt;

&lt;p&gt;So we have to specify the format in the prompt.&lt;/p&gt;
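
&lt;p&gt;On the client side, the middleware later parses that JSON into a &lt;code&gt;ReviewResults&lt;/code&gt; Pydantic model. A minimal definition matching the schema in the prompt could look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pydantic import BaseModel


class ReviewResults(BaseModel):
    # Mirrors the JSON the compliance agent is instructed to return.
    is_compliance: int  # 1 = compliant, 0 = non-compliant
    reason: str
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;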

&lt;p&gt;Next, use add_agent_framework_fastapi_endpoint from agent_framework_ag_ui to register it with FastAPI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AG-UI Server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;add_agent_framework_fastapi_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, run it with uvicorn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uvicorn&lt;/span&gt;
    &lt;span class="n"&gt;uvicorn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8888&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Middleware implementation
&lt;/h3&gt;

&lt;p&gt;This middleware is more complex, so we’ll use class-based middleware.&lt;/p&gt;

&lt;p&gt;Here’s the full code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ComplianceCheckMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_compliant_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;check_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ReviewResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_get_compliance_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;check_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_compliance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_output_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;😒We can’t keep providing the service because:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;check_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_output_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_streaming&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;#4
&lt;/span&gt;            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;output_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;AsyncIterable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;AgentRunResponseUpdate&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nc"&gt;AgentRunResponseUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;TextContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;output_stream&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentRunResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ASSISTANT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_compliance_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ReviewResults&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#2
&lt;/span&gt;
        &lt;span class="n"&gt;check_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReviewResults&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_validate_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#3
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;check_result&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_compliant_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AGUIChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  &lt;span class="c1"&gt;#1
&lt;/span&gt;            &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8888/compliance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliance_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You’re a compliance officer, and you review user requests.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few details to watch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;_init_compliant_agent creates the AG-UI client, which you then use just like a normal chat client.&lt;/li&gt;
&lt;li&gt;I send the last five user messages to improve review accuracy. &lt;strong&gt;But the AgentMiddleware context only holds the latest message; to get the full message history, you must use ChatMiddleware.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Since AG-UI doesn’t support response_format, I parse JSON manually.&lt;/li&gt;
&lt;li&gt;_output_result sends a text refusal when a check fails, switching between streaming and non-streaming output based on context.is_streaming (see the sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
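
&lt;p&gt;For reference, here’s a rough sketch of the _output_result helper. How a middleware hands the refusal back to the caller (I write to context.result below) and the streaming update type are assumptions for illustration, not the framework’s exact API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A minimal sketch; context.result and AgentRunResponseUpdate are assumptions
# and may not match the middleware API exactly. It also assumes
# AgentRunResponse, ChatMessage and TextContent are imported from agent_framework.
async def _output_result(self, context, reason: str) -&amp;gt; None:
    text = f"Sorry, I can't help with that request: {reason}"
    if context.is_streaming:
        # streaming callers expect incremental chunks, so wrap the text in an async generator
        async def _stream():
            yield AgentRunResponseUpdate(contents=[TextContent(text=text)])
        context.result = _stream()
    else:
        # non-streaming callers expect one complete response object
        context.result = AgentRunResponse(
            messages=[ChatMessage(role="assistant", text=text)]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;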

&lt;p&gt;Now we can make a business agent. Use a bigger model and a normal system prompt; just remember to load the ComplianceCheckMiddleware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Qwen3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant. Answer the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s question in short and simple words.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ComplianceCheckMiddleware&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s test it with a multi-turn chat client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_new_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Assistant: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see the agent chats normally most of the time.&lt;/p&gt;

&lt;p&gt;If you ask about guaranteed returns, it refuses to answer but continues working fine afterward.&lt;/p&gt;

&lt;p&gt;Task done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this guide, we explored how middleware works in Microsoft Agent Framework.&lt;/p&gt;

&lt;p&gt;Middleware lets us add new logic for logging, permissions, or compliance without touching the main agent code or prompt text.&lt;/p&gt;

&lt;p&gt;In the project section, I used class-based middleware to show how to review user inputs for compliance.&lt;/p&gt;

&lt;p&gt;We also took a quick look at AG-UI for building agent microservices. This helps when multiple teams need their agents to collaborate, and I’ll cover AG-UI and A2A in detail later.&lt;/p&gt;

&lt;p&gt;If you have questions or want to learn more, leave a comment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.dataleadsfuture.com/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent/#/portal/signup" rel="noopener noreferrer"&gt;Don’t forget to subscribe to my blog&lt;/a&gt; and share this article with your friends—maybe it’ll help someone build smarter agents 😁.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/strong&gt;&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/microsoft-agent-framework-maf-middleware-basics-add-compliance-fences-to-your-agent/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>agentaichallenge</category>
    </item>
    <item>
      <title>My Agent System Looks Powerful but Is Just Industrial Trash</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Tue, 30 Dec 2025 06:42:22 +0000</pubDate>
      <link>https://dev.to/qtalen/my-agent-system-looks-powerful-but-is-just-industrial-trash-d10</link>
      <guid>https://dev.to/qtalen/my-agent-system-looks-powerful-but-is-just-industrial-trash-d10</guid>
      <description>&lt;p&gt;This weekend note is a bit late because Phase One of my Deep Data Analyst project failed for now. That means I can’t continue the promised Data Analyst Agent tutorial.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Happened?
&lt;/h2&gt;

&lt;p&gt;I actually built a single-agent data analysis assistant based on the ReAct pattern.&lt;/p&gt;

&lt;p&gt;This assistant could take a user’s analysis request, come up with a reasonable hypothesis, run EDA and modeling on the uploaded dataset, give professional business insights and actionable suggestions, and even create charts to back up its points.&lt;/p&gt;

&lt;p&gt;If you’re curious about how it worked, here’s a screenshot that shows how cool it looked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5l13qs3ahwrox8hbb9m.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa5l13qs3ahwrox8hbb9m.gif" alt="The cool effects of my data analysis agent. Image by Author" width="600" height="522"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After all, this was just a single-agent app. It wasn’t that hard to build. If you remember, I explained how I used a ReAct agent to solve the Advent of Code challenges. Here’s that tutorial:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/how-i-crushed-advent-of-code-and-solved-hard-problems-using-autogen-jupyter-executor-and-qwen3/" rel="noopener noreferrer"&gt;How I Crushed Advent of Code And Solved Hard Problems Using Autogen Jupyter Executor and Qwen3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you tweak that agent’s prompt a bit, you can get the same kind of data analysis ability I’m talking about.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do I Call It a Failure?
&lt;/h2&gt;

&lt;p&gt;Because my agent, like most that AI hobbyists build, is just one of those:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perfect for impressing your boss with a beautiful, powerful prototype, but once real users try it, it suddenly breaks down and becomes industrial trash.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do I Say That?
&lt;/h2&gt;

&lt;p&gt;My agent has two serious problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Very poor robustness
&lt;/h3&gt;

&lt;p&gt;This is the top feedback I got after giving it to analyst users.&lt;/p&gt;

&lt;p&gt;If you try it once, it looks amazing. It uses methods and technical skills beyond a regular analyst to give you a very professional argument. You’d think replacing humans with AI was the smartest move you've ever made.&lt;/p&gt;

&lt;p&gt;But data analysis is about testing cause and effect over time. You must run the same analysis daily or weekly to see if the assistant’s advice actually works.&lt;/p&gt;

&lt;p&gt;Even with the same question, the agent changes its hypotheses and analysis methods each run. It then gives different advice each time.&lt;/p&gt;

&lt;p&gt;That’s what I mean by poor stability and consistency.&lt;/p&gt;

&lt;p&gt;Imagine you ask it to use an RFM model to segment your users and give marketing suggestions. Before a campaign, it uses features A, B, C and makes five levels for each. After the campaign, it suddenly adds a derived metric D and now segments on A, B, C, D.&lt;/p&gt;

&lt;p&gt;You couldn’t even run an A/B test properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. It suffers from context position bias
&lt;/h3&gt;

&lt;p&gt;If you’ve read my earlier posts, you know my Data Analyst agent runs code through a stateful Jupyter Kernel-based interpreter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/exclusive-reveal-code-sandbox-tech-behind-manus-and-claude-agent-skills/" rel="noopener noreferrer"&gt;Exclusive Reveal: Code Sandbox Tech Behind Manus and Claude Agent Skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This lets the agent act like a human analyst, first making a hypothesis, running code in a Jupyter notebook to test it, and then coming up with a new hypothesis based on results — iterating over and over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81pwwirqivgr8yt7itwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81pwwirqivgr8yt7itwg.png" alt="Agents based on the ReAct mode will perform EDA like human analysts. Image by Author" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This gives the agent strong autonomous exploration and error-recovery skills.&lt;/p&gt;

&lt;p&gt;But here’s the problem. In a past post, I mentioned that LLMs have position bias when dealing with long conversation histories:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/fixing-the-agent-handoff-problem-in-llamaindexs-agentworkflow-system/" rel="noopener noreferrer"&gt;Fixing the Agent Handoff Problem in LlamaIndex's AgentWorkflow System&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, LLMs don’t treat each message fairly. They don’t weight importance by recency like you think they would.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviuoux88uoerrz4wg2ph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviuoux88uoerrz4wg2ph.png" alt="LLMs do not assign weights to message history as people might think. Arxiv 2404.01430" width="575" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we keep making and testing hypotheses, the history grows, and every message in it matters: the first describes the data structure, a later one proves a hypothesis wrong so we know to skip it next time, and so on.&lt;/p&gt;

&lt;p&gt;The LLM doesn’t see it this way. As the process goes on, it starts focusing on the wrong messages and ignores corrections that were already made, so it repeats earlier mistakes.&lt;/p&gt;

&lt;p&gt;This either wastes tokens and time or sends the analysis off-track into another topic. Neither is good.&lt;/p&gt;

&lt;p&gt;So Phase One of my data analysis agent ends here.&lt;/p&gt;




&lt;h2&gt;
  
  
  Any Ways to Fix It?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Build a multi-agent system with atomic skills
&lt;/h3&gt;

&lt;p&gt;For robustness, you’d probably think of using a Context Engineer to lock in the plan and metric definitions before analysis starts.&lt;/p&gt;

&lt;p&gt;Also, when an analysis works well, we should save the plan and prior assumptions in long-term memory.&lt;/p&gt;

&lt;p&gt;Both mean giving the agent new skills.&lt;/p&gt;
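
&lt;p&gt;To make this concrete, the artifact I want to lock in and later persist looks roughly like this (a sketch with made-up field names, not a final design):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pydantic import BaseModel

# A sketch of the "locked plan" a context-engineering step would produce and
# long-term memory would store. All field names are illustrative.
class AnalysisPlan(BaseModel):
    question: str                        # the clarified business question
    metric_definitions: dict[str, str]   # metric name -&amp;gt; fixed calculation rule
    segmentation_features: list[str]     # e.g. ["recency", "frequency", "monetary"]
    prior_hypotheses: list[str]          # hypotheses to test, in order
    methods: list[str]                   # analysis methods locked for every run

# Reusing the same validated plan on every run keeps hypotheses, features,
# and methods stable between the pre- and post-campaign analyses.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;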

&lt;p&gt;But remember, my agent is based on ReAct, which means its prompt is already huge — over a thousand lines now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F481wcyxlly81c4mmdvdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F481wcyxlly81c4mmdvdg.png" alt="Agents based on the ReAct pattern are often too complex to debug. Image by Author" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Adding anything risks breaking this fragile system and disrupting prompt-following.&lt;/p&gt;

&lt;p&gt;So a single agent won’t cut it. We should split the system into multiple agents with atomic skills, then use some orchestration to bring them together.&lt;/p&gt;

&lt;p&gt;We can imagine this multi-agent app as a coordinated system with at least these agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Issue Clarification Agent&lt;/strong&gt; — asks the user questions to clarify the problem, confirm metrics, and scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval Agent&lt;/strong&gt; — pulls metric definitions and calculation methods from a knowledge base, plus analysis methods written by real data scientists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Planner Agent&lt;/strong&gt; — proposes prior hypotheses, sets an analysis approach, and makes a full plan to keep later agents on track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analyst Agent&lt;/strong&gt; — breaks the plan into steps, uses Python to execute them, and tests the prior hypotheses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storyteller Agent&lt;/strong&gt; — turns complex technical results into engaging business stories and actionable advice for decision-makers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator Agent&lt;/strong&gt; — ensures the whole process is correct, reliable, and business-compliant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestrator Agent&lt;/strong&gt; — manages all the agents and assigns tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989qofuuums3592yvdsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F989qofuuums3592yvdsm.png" alt="My new design for the multi-agent data analyst. Image by Author" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose the right agent framework
&lt;/h3&gt;

&lt;p&gt;We need an agent framework that supports message passing. When a new task comes up or an agent finishes, a message should go to the orchestrator. The orchestrator should also send tasks by message.&lt;/p&gt;

&lt;p&gt;The framework should also support context state saving. Agents’ intermediate results should go into a shared context, not all into the LLM’s prompt, so position bias doesn’t get in the way.&lt;/p&gt;
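
&lt;p&gt;Here’s a tiny framework-agnostic sketch of what I mean: agents exchange short messages through the orchestrator, while bulky intermediate results live in a shared context store instead of the prompt (all names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass, field
from typing import Any

@dataclass
class TaskMessage:
    sender: str
    recipient: str
    task: str
    context_keys: list[str] = field(default_factory=list)  # pointers to results, not payloads

@dataclass
class SharedContext:
    store: dict[str, Any] = field(default_factory=dict)

    def put(self, key: str, value: Any) -&amp;gt; None:
        self.store[key] = value   # e.g. a full EDA result table

    def get(self, key: str) -&amp;gt; Any:
        return self.store[key]

# An agent drops its big result into the shared context and sends the
# orchestrator only a short message that points at it. The next agent's
# prompt then carries the pointer plus a summary, not the whole history.
def report_done(ctx: SharedContext, result: Any) -&amp;gt; TaskMessage:
    ctx.put("eda_result", result)
    return TaskMessage(sender="analyst", recipient="orchestrator",
                       task="eda_finished", context_keys=["eda_result"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;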

&lt;p&gt;If you ask GPT, it will recommend LangGraph and Autogen.&lt;/p&gt;

&lt;p&gt;I’d skip LangGraph. Even though its workflow is fine, its agents still run on LangChain, which I just don’t like.&lt;/p&gt;

&lt;p&gt;When people compare Autogen with others, they say Autogen is better for research-heavy tasks like data analysis that need more autonomy.&lt;/p&gt;

&lt;p&gt;But Autogen’s Selector Group Chat, while good for orchestrators, can’t manage message history well. You can’t control what goes to the LLM, and orchestration is a black box.&lt;/p&gt;

&lt;p&gt;Autogen’s GraphFlow is also half-baked: its workflow supports only agent nodes and has no context state management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/i-used-autogen-graphflow-and-qwen3-coder-to-solve-math-problems-and-it-worked/" rel="noopener noreferrer"&gt;I Used Autogen GraphFlow and Qwen3 Coder to Solve Math Problems — And It Worked&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The bigger risk: Autogen has stopped development. For a 50k-star agent framework, that’s a shame.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about Microsoft Agent Framework (MAF)?
&lt;/h3&gt;

&lt;p&gt;I like it. Easy to use, takes good ideas from earlier frameworks, and avoids their mistakes.&lt;/p&gt;

&lt;p&gt;I’m ready to use it with Qwen3 and DeepSeek:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/make-microsoft-agent-frameworks-structured-output-work-with-qwen-and-deepseek-models/" rel="noopener noreferrer"&gt;Make Microsoft Agent Framework’s Structured Output Work With Qwen and DeepSeek Models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m studying MAF’s Workflow feature now. It’s nice: multiple node types, context state management, OpenTelemetry observability, and orchestration modes like Switch-Case and Multi-Selection. It has almost everything I want.&lt;/p&gt;

&lt;p&gt;It also feels ambitious. With new abilities like MCP, A2A, AG-UI, and Microsoft backing it, MAF should have a better long-term future than Autogen.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Next Steps
&lt;/h2&gt;

&lt;p&gt;I’m reading MAF’s user guide and source now. I’ll start using it in my agent system.&lt;/p&gt;

&lt;p&gt;I’m still working on Deep Data Analyst. After switching frameworks, I’ll need to adapt things for a while.&lt;/p&gt;

&lt;p&gt;The good news: a multi-agent system lets me add skills step by step, so I can share and show progress anytime instead of waiting until the whole project is done. 😂&lt;/p&gt;

&lt;p&gt;I also want to explore Workflow’s potential in MAF. I’ll see if it can handle different AI agent design patterns. That will help us understand how to use this promising framework.&lt;/p&gt;

&lt;p&gt;What are you interested in? Leave me a comment.&lt;/p&gt;

&lt;p&gt;Don’t forget to subscribe to my newsletter &lt;a href="https://www.dataleadsfuture.com/my-agent-system-looks-powerful-but-is-just-industrial-trash/#/portal/signup/" rel="noopener noreferrer"&gt;&lt;strong&gt;Mr.Q’s Weekend Notes&lt;/strong&gt;&lt;/a&gt; to get my latest agent research in your inbox without waiting.&lt;/p&gt;

&lt;p&gt;And share my blog with your friends — maybe it can help more people.&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/strong&gt;&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/my-agent-system-looks-powerful-but-is-just-industrial-trash/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
      <category>agents</category>
    </item>
    <item>
      <title>Make Microsoft Agent Framework’s Structured Output Work With Qwen and DeepSeek Models</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Mon, 15 Dec 2025 02:33:06 +0000</pubDate>
      <link>https://dev.to/qtalen/make-microsoft-agent-frameworks-structured-output-work-with-qwen-and-deepseek-models-4egj</link>
      <guid>https://dev.to/qtalen/make-microsoft-agent-frameworks-structured-output-work-with-qwen-and-deepseek-models-4egj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Today, we’ll add some extra features to the Microsoft Agent Framework so that Qwen and DeepSeek can also utilize structured output.&lt;/p&gt;

&lt;p&gt;The main reason is that Autogen has stayed on v0.75 for a long time, which makes it necessary to switch to Microsoft Agent Framework soon.&lt;/p&gt;

&lt;p&gt;Every time we switch the agent framework, we have to make it work with some common LLMs. This time is no exception. Luckily, Microsoft Agent Framework is pretty easy to use. We just need to adapt the structured output feature, and we can use it right away.&lt;/p&gt;

&lt;p&gt;As usual, I’ll put the source code at the end of the article for you to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Background On Structured Output
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How does Agent Framework do structured output?
&lt;/h3&gt;

&lt;p&gt;In Microsoft Agent Framework, we set the &lt;code&gt;response_format&lt;/code&gt; parameter to a Pydantic &lt;code&gt;BaseModel&lt;/code&gt; data class to tell the LLM to produce structured output, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PersonInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Information about a person.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;occupation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please provide information about John Smith, who is a 35-year-old software engineer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PersonInfo&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two places to set the &lt;code&gt;response_format&lt;/code&gt; parameter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set it during the &lt;code&gt;ChatAgent&lt;/code&gt; initialization. This becomes a global parameter for the agent, and all later communications with OpenAI-compatible models use it.&lt;/li&gt;
&lt;li&gt;Set it when calling &lt;code&gt;run&lt;/code&gt; or &lt;code&gt;run_stream&lt;/code&gt;. This works only for that single API call.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;code&gt;response_format&lt;/code&gt; set in &lt;code&gt;run&lt;/code&gt; or &lt;code&gt;run_stream&lt;/code&gt; takes priority over the one set at &lt;code&gt;ChatAgent&lt;/code&gt; creation. In other words, the per-call &lt;code&gt;response_format&lt;/code&gt; overrides the agent-level default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1qvy28862asdwqarwck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu1qvy28862asdwqarwck.png" alt="The conversion process of the response_format parameter in Microsoft Agent Framework. Image by Author" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, we use &lt;code&gt;OpenAIChatClient&lt;/code&gt; to call OpenAI’s API. Before the API call, a &lt;code&gt;_prepare_options&lt;/code&gt; method converts the &lt;code&gt;BaseModel&lt;/code&gt; into &lt;code&gt;{"type": "json_schema", "json_schema": &amp;lt;base model schema&amp;gt;}&lt;/code&gt; and passes it to the LLM.&lt;/p&gt;
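
&lt;p&gt;Roughly, that conversion produces something like the following. This is a simplified sketch built with Pydantic’s own &lt;code&gt;model_json_schema&lt;/code&gt;, not the framework’s exact code; the real payload may wrap the schema with a few extra fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pydantic import BaseModel

class PersonInfo(BaseModel):   # same data class as above
    """Information about a person."""
    name: str | None = None
    age: int | None = None
    occupation: str | None = None

# A simplified view of what _prepare_options derives from the BaseModel
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "PersonInfo", "schema": PersonInfo.model_json_schema()},
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;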

&lt;p&gt;So that’s how Agent Framework makes the LLM do structured output. Our extension will go into the &lt;code&gt;_prepare_options&lt;/code&gt; method of &lt;code&gt;OpenAIChatClient&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Qwen and DeepSeek support json_schema settings?
&lt;/h3&gt;

&lt;p&gt;According to the official docs, both Qwen and DeepSeek support structured output. But they only support setting the OpenAI client’s &lt;code&gt;response_format&lt;/code&gt; to &lt;code&gt;{"type": "json_object"}&lt;/code&gt; and require the keyword &lt;code&gt;json&lt;/code&gt; in the prompt to enable structured output. They do not support OpenAI’s API way of setting &lt;code&gt;response_format&lt;/code&gt; to &lt;code&gt;json_schema&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If we don’t extend the Microsoft Agent Framework and force &lt;code&gt;response_format&lt;/code&gt; to be a &lt;code&gt;BaseModel&lt;/code&gt; class, we’ll see errors like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error code: 400 - {'error': {'message': "&amp;lt;400&amp;gt; InternalError.Algo.InvalidParameter: 'messages' must contain the word 'json' in some form, to use 'response_format' of type 'json_object'.", 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_parameter_error'}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So for Qwen and DeepSeek, without modifying the Microsoft Agent Framework, we can’t use the structured output feature.&lt;/p&gt;
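
&lt;p&gt;For reference, what these APIs do accept is the plain &lt;code&gt;json_object&lt;/code&gt; mode on an OpenAI-compatible call, with the word &lt;code&gt;json&lt;/code&gt; somewhere in the messages. The base URL and model name below are placeholders, not exact values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Placeholders: point base_url at your provider's OpenAI-compatible endpoint
# and use a real model name; both values are examples only.
client = OpenAI(base_url="https://your-provider/compatible-mode/v1", api_key="sk-...")

completion = client.chat.completions.create(
    model="qwen-plus",   # example model name
    messages=[
        {"role": "system", "content": "Reply in json with keys name, age, occupation."},
        {"role": "user", "content": "John Smith is a 35-year-old software engineer."},
    ],
    response_format={"type": "json_object"},   # json_schema is not accepted here
)
print(completion.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;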

&lt;h3&gt;
  
  
  How to make Qwen and DeepSeek output using json_schema
&lt;/h3&gt;

&lt;p&gt;Even though Qwen and DeepSeek don’t support &lt;code&gt;{"type": "json_schema"}&lt;/code&gt;, we can still inject &lt;code&gt;json_schema&lt;/code&gt; into the system prompt so the LLM outputs according to our data class.&lt;/p&gt;

&lt;p&gt;The trick is: before calling the OpenAI API, convert the &lt;code&gt;BaseModel&lt;/code&gt; to its &lt;code&gt;json_schema&lt;/code&gt;, attach it to the system prompt, and send it along.&lt;/p&gt;

&lt;p&gt;If you want to know exactly how I made Qwen output according to a Pydantic BaseModel’s rules, read my popular article where I explain multiple methods for this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/build-autogen-agents-with-qwen3-structured-output-thinking-mode/" rel="noopener noreferrer"&gt;Build AutoGen Agents with Qwen3: Structured Output &amp;amp; Thinking Mode&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How I Extended It
&lt;/h2&gt;

&lt;p&gt;Now, let’s see exactly how to extend Microsoft Agent Framework so Qwen and DeepSeek can do structured output.&lt;/p&gt;

&lt;p&gt;I know you want the answer fast, so here’s the modified code you can use right now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;override&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;textwrap&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dedent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;copy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deepcopy&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework.openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIChatClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_framework&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChatOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TextContent&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OpenAILikeChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OpenAIChatClient&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@override&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_prepare_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MutableSequence&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;chat_options_copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deepcopy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 1
&lt;/span&gt;        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt;
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;issubclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;structured_output_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_build_structured_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# 2
&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# 3
&lt;/span&gt;                &lt;span class="n"&gt;first_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# 4
&lt;/span&gt;                    &lt;span class="n"&gt;new_system_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;first_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;structured_output_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;new_system_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:]]&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;new_system_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;# 5
&lt;/span&gt;                        &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;structured_output_prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;new_system_message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="n"&gt;chat_options_copy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;_prepare_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_options_copy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_build_structured_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;json_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_json_schema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;structured_output_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dedent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;
        &amp;lt;output-format&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
        Your output must adhere to the following JSON schema format,
        without any Markdown syntax, and without any preface or explanation:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
        &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
        &amp;lt;/output-format&amp;gt;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;structured_output_prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As I said before, both &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;run_stream&lt;/code&gt; call &lt;code&gt;OpenAIChatClient&lt;/code&gt;’s &lt;code&gt;_prepare_options&lt;/code&gt; method, so it’s the best place to extend.&lt;/p&gt;

&lt;p&gt;I marked each part of the code with numbers in the comments so I can explain in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;chat_options&lt;/code&gt; object is the parameters you pass to the method. We need to &lt;code&gt;deepcopy&lt;/code&gt; it to a new object because we’re going to change &lt;code&gt;response_format&lt;/code&gt; to &lt;code&gt;{"type": "json_object"}&lt;/code&gt; to work with DeepSeek. Agent Framework still needs the original &lt;code&gt;BaseModel&lt;/code&gt; to convert the returned JSON string back to a data class.&lt;/li&gt;
&lt;li&gt;Then we take the &lt;code&gt;json_schema&lt;/code&gt; from the &lt;code&gt;BaseModel&lt;/code&gt;, turn it into part of the system prompt, and wrap it with &lt;code&gt;xml&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;The original &lt;code&gt;_prepare_options&lt;/code&gt; checks whether &lt;code&gt;messages&lt;/code&gt; is empty. We only handle the case where &lt;code&gt;messages&lt;/code&gt; is not empty, i.e. the caller sends at least one user message.&lt;/li&gt;
&lt;li&gt;If the first message in &lt;code&gt;messages&lt;/code&gt; is a system message, we attach the structured output prompt to the system message, replacing the old system message.&lt;/li&gt;
&lt;li&gt;If the first message is a user message, we create a new system message with just the structured output prompt and put it at the front of the &lt;code&gt;messages&lt;/code&gt; list.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this change, Microsoft Agent Framework now supports structured output for Qwen and DeepSeek. Next, let’s test some common cases to make sure it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Testing the Extension
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prepare an MLflow server to observe
&lt;/h3&gt;

&lt;p&gt;Before testing, we need a monitoring tool to check the messages Agent Framework sends to the LLM API.&lt;/p&gt;

&lt;p&gt;Agent Framework supports logging platforms based on &lt;code&gt;opentelemetry&lt;/code&gt;, but it doesn’t log system messages by default, so that won’t work for our case today.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopfygvdqwf51gtatlbju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopfygvdqwf51gtatlbju.png" alt="Agent Framework's OpenTelemetry output doesn't log the system message used when calling the LLM. Image by Author" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a previous article, I showed how I use MLflow to see the messages sent to OpenAI’s API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/monitoring-qwen-3-agents-with-mlflow-3-x-end-to-end-tracking-tutorial/" rel="noopener noreferrer"&gt;Monitoring Qwen 3 Agents with MLflow 3.x: End-to-End Tracing Tutorial&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So today we’ll still use MLflow’s &lt;code&gt;openai.autolog&lt;/code&gt; API, because it can record system messages sent to the LLM.&lt;/p&gt;

&lt;p&gt;You just need to start a &lt;code&gt;server&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mlflow server &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 5000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in the test code, add a call to &lt;code&gt;openai.autolog&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_tracking_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MLFLOW_TRACKING_URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_experiment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autolog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Test single-turn conversation
&lt;/h3&gt;

&lt;p&gt;First, let’s follow the official docs to test normal structured output.&lt;/p&gt;

&lt;p&gt;Set up a data class, then set it in the &lt;code&gt;run&lt;/code&gt; method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PersonInfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Information about a person.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;occupation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please provide information about John Smith, who is a 35-year-old software engineer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PersonInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on MLflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flk1jadc81nko1ne9jvte.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flk1jadc81nko1ne9jvte.png" alt="The json_schema prompt has already been appended to the system prompt. Image by Author" width="768" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see the data class has been turned into a &lt;code&gt;json_schema&lt;/code&gt; prompt, attached to the system prompt. Also, we can get the structured object directly through &lt;code&gt;response.value&lt;/code&gt;.&lt;/p&gt;
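
&lt;p&gt;In code, that looks like this (a small usage sketch that reuses the &lt;code&gt;PersonInfo&lt;/code&gt; class from earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    # inside main(), right after the run() call above
    person = response.value          # already parsed into a PersonInfo instance
    if person is not None:
        print(person.name, person.age, person.occupation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;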

&lt;h3&gt;
  
  
  Test multi-turn conversation
&lt;/h3&gt;

&lt;p&gt;Now let’s test Microsoft Agent Framework’s multi-turn example.&lt;/p&gt;

&lt;p&gt;First, set a &lt;code&gt;response_format&lt;/code&gt; at &lt;code&gt;create_agent&lt;/code&gt;, without setting it in &lt;code&gt;run&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OutText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a good assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;OutText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many kilometers is the highway from Wuhan to Beijing?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use &lt;code&gt;run_stream&lt;/code&gt; for the second turn and set another &lt;code&gt;response_format&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ETA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="n"&gt;final_response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;AgentRunResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_agent_response_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How long would it take to drive there at 120 km/h?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ETA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;output_format_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ETA&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check on MLflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpbnxxcc1mkc7cqes38d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpbnxxcc1mkc7cqes38d.png" alt="The first round of conversation used the default response_format parameter. Image by Author" width="794" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpl9izfn2brpup7kx404.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpl9izfn2brpup7kx404.png" alt="The second round of conversation switched to the response_format parameter passed into the run_stream method. Image by Author" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No problems at all. The &lt;code&gt;response_format&lt;/code&gt; in &lt;code&gt;run_stream&lt;/code&gt; overrides the one set in &lt;code&gt;create_agent&lt;/code&gt; as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With AutoGen no longer being actively maintained, we’ve started migrating to the Microsoft Agent Framework.&lt;/p&gt;

&lt;p&gt;During this migration, we extended the Microsoft Agent Framework so Qwen and DeepSeek can use structured output.&lt;/p&gt;

&lt;p&gt;I hope Qwen and DeepSeek’s APIs will one day support setting &lt;code&gt;response_format&lt;/code&gt; to &lt;code&gt;{"type": "json_schema"}&lt;/code&gt; directly, so that we no longer have to adapt the framework every time we switch models.&lt;/p&gt;

&lt;p&gt;Under the hood, structured output is just a matter of adding a &lt;code&gt;json_schema&lt;/code&gt; description to the system prompt so the LLM produces content in the shape we define. So even if you’re not using the Microsoft Agent Framework, you can adapt your own stack in a similar way.&lt;/p&gt;
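&lt;p&gt;If you want to reproduce this outside the framework, here’s a minimal sketch of the idea. It isn’t the framework’s actual code; it only uses Pydantic’s standard &lt;code&gt;model_json_schema()&lt;/code&gt; and &lt;code&gt;model_validate_json()&lt;/code&gt; to build the prompt and parse the reply, and the prompt wording is just an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from pydantic import BaseModel


class ETA(BaseModel):
    hours: int


# Append the JSON schema to whatever system prompt you already use.
schema = json.dumps(ETA.model_json_schema(), ensure_ascii=False)
system_prompt = (
    "You are a helpful assistant.\n"
    "Respond ONLY with a JSON object that conforms to this JSON Schema:\n"
    f"{schema}"
)

# After calling your chat API with `system_prompt`, validate the raw reply.
def parse_reply(reply_text: str) -&gt; ETA:
    return ETA.model_validate_json(reply_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;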

&lt;p&gt;That’s it for today’s journey. If you find this tutorial useful, please share it with your friends.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;&lt;strong&gt;And feel free to follow my blog so you can keep up with my latest progress in AI Agents.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/make-microsoft-agent-frameworks-structured-output-work-with-qwen-and-deepseek-models/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>qwen</category>
    </item>
    <item>
      <title>A Quick Guide to Containerizing Agent Applications with Podman</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Mon, 08 Dec 2025 02:13:24 +0000</pubDate>
      <link>https://dev.to/qtalen/a-quick-guide-to-containerizing-agent-applications-with-podman-344p</link>
      <guid>https://dev.to/qtalen/a-quick-guide-to-containerizing-agent-applications-with-podman-344p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;For enterprise-level agent applications, the best way to safely run code generated by agents is to use containerization. This isolates the code execution environment from your server’s operating system.&lt;/p&gt;

&lt;p&gt;In a previous article, we built a code interpreter sandbox based on a Jupyter container. We proved that once an agent has access to a stateful code runtime, it gains the ability to solve complex problems and perform data analysis:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/exclusive-reveal-code-sandbox-tech-behind-manus-and-claude-agent-skills/" rel="noopener noreferrer"&gt;Exclusive Reveal: Code Sandbox Tech Behind Manus and Claude Agent Skills&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, Docker Desktop is off-limits in most enterprises due to its commercial license restrictions.&lt;/p&gt;

&lt;p&gt;Yet our multi-agent development work absolutely depends on a containerized environment. So we must find a suitable alternative. Ideally, one fully compatible with Docker, so it works seamlessly with existing agent frameworks that rely on the Docker client.&lt;/p&gt;

&lt;p&gt;Podman is exactly what we need. Developed by Red Hat, it’s an open-source container management tool that runs on both Mac and Windows. It can fully replace Docker Desktop in your development setup.&lt;/p&gt;

&lt;p&gt;This short post will help you quickly set up Podman so you can start building agent applications with code interpreters right away. If you’d like deeper background knowledge and a more thorough introduction to Podman, I highly recommend the &lt;a href="https://trk.udemy.com/c/4744570/3193861/39854?couponCode=KEEPLEARNING&amp;amp;ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;Podman for the Absolute Beginners - Hands-On DevOps course on Udemy&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;This guide assumes you’re developing on Windows 11. From what I know, setup on Mac is even simpler.&lt;/p&gt;

&lt;p&gt;On Windows, Podman runs its container host inside a Linux system on WSL2. So before proceeding, make sure WSL2 is already installed on your machine.&lt;/p&gt;

&lt;p&gt;If your company requires a VPN to access the internet, you’ll also need to configure WSL2. To do this, create a &lt;code&gt;.wslconfig&lt;/code&gt; file in your Windows &lt;code&gt;%USERPROFILE%&lt;/code&gt; directory (your user folder) with the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[experimental]
autoMemoryReclaim=gradual  
networkingMode=mirrored
dnsTunneling=true
firewall=true
autoProxy=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Install Podman
&lt;/h3&gt;

&lt;p&gt;As mentioned, Podman’s host runs in a WSL2-based Linux environment called &lt;code&gt;podman-machine-default&lt;/code&gt;. So when installing Podman, you actually need to set up this &lt;code&gt;podman-machine-default&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;One approach is to install Podman Desktop first, then create the &lt;code&gt;podman-machine-default&lt;/code&gt; through its interface. But for some reason, this always failed for me—the installer would just hang.&lt;/p&gt;

&lt;p&gt;So I went the other way: I installed standalone Podman (which provides the &lt;code&gt;podman machine&lt;/code&gt; command) first, then installed Podman Desktop. Go to Podman’s GitHub Releases page, download the latest &lt;code&gt;.msi&lt;/code&gt; installer, and run it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mdqjrrg6l8o7ca38ldw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mdqjrrg6l8o7ca38ldw.png" alt="Find and download the .msi installer under the latest release. Image by Author" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After installation, create your &lt;code&gt;podman-machine-default&lt;/code&gt; host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman machine init &lt;span class="nt"&gt;--rootful&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important note: if your agent framework uses the Docker SDK to launch containers, you must set &lt;code&gt;--rootful=true&lt;/code&gt;. I’ll explain why later.&lt;/p&gt;

&lt;p&gt;Next, install Podman Desktop from &lt;a href="https://podman-desktop.io/?ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Just click “Next” all the way through.&lt;/p&gt;

&lt;p&gt;Once done, open Podman Desktop from your taskbar, go to the &lt;strong&gt;Settings&lt;/strong&gt; tab, and check &lt;strong&gt;Resources&lt;/strong&gt;. You should see the &lt;code&gt;podman-machine&lt;/code&gt; you just created. That means Podman Desktop is ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1oc5ozal6qcxw3e9nii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1oc5ozal6qcxw3e9nii.png" alt="If you see this screen, it means your Podman Desktop is already installed. Image by Author" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But since your agents will use the Docker SDK to manage containers, you need to enable Docker compatibility in Podman Desktop, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ih2q9p3o83m2y1frboa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ih2q9p3o83m2y1frboa.png" alt="We use the Docker SDK to manage containers in our agent framework, so make sure Docker Compatibility is turned on. Image by Author" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This setting also enables Podman Compose, which lets you route &lt;code&gt;docker compose&lt;/code&gt; commands to &lt;code&gt;podman compose&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up proxy configuration
&lt;/h3&gt;

&lt;p&gt;If your system uses a system-wide proxy, Podman Desktop will automatically pick it up. But here’s the catch: Podman Desktop applies these settings inside the &lt;code&gt;podman-machine&lt;/code&gt; Linux system. Unlike Windows, where bypass domains are separated by semicolons, Linux uses commas. You’ll need to manually adjust this, or your &lt;code&gt;podman-machine&lt;/code&gt; might lose external network access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx1s82ad3m14vl58cjvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx1s82ad3m14vl58cjvh.png" alt="You should use the Linux path separator here. Image by Author" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
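&lt;p&gt;For example (the hostnames here are made up), a bypass list copied from Windows like the first one below needs to be rewritten with commas before the &lt;code&gt;podman-machine&lt;/code&gt; can use it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Windows-style bypass list (semicolons):
localhost;127.0.0.1;internal.example.com

# Linux-style bypass list (commas) for the podman-machine:
localhost,127.0.0.1,internal.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;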

&lt;h3&gt;
  
  
  Create a soft link for the storage path
&lt;/h3&gt;

&lt;p&gt;By default, Podman stores data in &lt;code&gt;%USERPROFILE%\.local\share\containers\podman&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I really dislike keeping data files on the C drive. I prefer moving them to D drive—for example, &lt;code&gt;D:\Documents\AppData\podman&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here’s how: copy everything from the C-drive &lt;code&gt;podman&lt;/code&gt; folder to your new D-drive location, delete the original C-drive folder, then create a junction link (replace the paths in the example below with your own):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mklink /j C:&lt;span class="se"&gt;\U&lt;/span&gt;sers&lt;span class="se"&gt;\q&lt;/span&gt;ianpeng&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="nb"&gt;local&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;hare&lt;span class="se"&gt;\c&lt;/span&gt;ontainers&lt;span class="se"&gt;\p&lt;/span&gt;odman D:&lt;span class="se"&gt;\D&lt;/span&gt;ocuments&lt;span class="se"&gt;\A&lt;/span&gt;ppData&lt;span class="se"&gt;\p&lt;/span&gt;odman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now everything works normally, but your data lives safely on D drive—no risk of losing it during a system reinstall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test your Podman installation
&lt;/h3&gt;

&lt;p&gt;If you followed my &lt;a href="https://www.dataleadsfuture.com/exclusive-reveal-code-sandbox-tech-behind-manus-and-claude-agent-skills/" rel="noopener noreferrer"&gt;earlier article&lt;/a&gt; and created a &lt;code&gt;Dockerfile&lt;/code&gt; for a &lt;code&gt;jupyter-server&lt;/code&gt;, you can now test whether Podman builds images correctly. Navigate to your &lt;code&gt;Dockerfile&lt;/code&gt; directory and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman build &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="nt"&gt;-t&lt;/span&gt; jupyter-server &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember, the &lt;code&gt;podman-machine&lt;/code&gt; host is network-isolated from your Windows system. The &lt;code&gt;--network=host&lt;/code&gt; flag lets the build process access your Windows network, making it easier to pull base images and install Python packages from PyPI.&lt;/p&gt;

&lt;p&gt;Note: this flag only affects image building—it has no impact when running containers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Make Docker CLI commands work
&lt;/h3&gt;

&lt;p&gt;If you’re used to typing &lt;code&gt;docker&lt;/code&gt; commands, here’s a neat trick: create a &lt;code&gt;docker.bat&lt;/code&gt; file in &lt;code&gt;%USERPROFILE%\.local\bin&lt;/code&gt; (make sure this directory is in your system &lt;code&gt;PATH&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Put this inside &lt;code&gt;docker.bat&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;@echo off
setlocal EnableDelayedExpansion
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s2"&gt;"%~1"&lt;/span&gt;&lt;span class="o"&gt;==&lt;/span&gt;&lt;span class="s2"&gt;"build"&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;shift
    &lt;/span&gt;rem Default to &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host
    podman build &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host %2 %3 %4 %5 %6 %7 %8 %9
&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;
    podman %&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="o"&gt;)&lt;/span&gt;
endlocal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can keep using &lt;code&gt;docker&lt;/code&gt; commands as usual. Since Podman’s CLI is mostly compatible with Docker’s, everything feels the same—except when you run &lt;code&gt;docker build&lt;/code&gt;, it automatically adds &lt;code&gt;--network=host&lt;/code&gt; to the underlying &lt;code&gt;podman build&lt;/code&gt; command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call Docker SDK from inside a containerized app
&lt;/h3&gt;

&lt;p&gt;If your agent app runs inside a container and needs to use the Docker SDK to spin up Python code interpreter containers (a common pattern across agent frameworks), you’ll need to let the SDK talk to Podman.&lt;/p&gt;

&lt;p&gt;Do this by mounting the &lt;code&gt;podman.sock&lt;/code&gt; socket into your app container so the Docker SDK can reach it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-v&lt;/span&gt; /run/podman/podman.sock:/var/run/docker.sock &lt;span class="nt"&gt;--rm&lt;/span&gt; app-test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, your &lt;code&gt;podman-machine-default&lt;/code&gt; must be initialized with &lt;code&gt;--rootful=true&lt;/code&gt;.&lt;/p&gt;
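&lt;p&gt;For reference, here’s a minimal sketch of what the app inside the container can do once the socket is mounted. It uses the standard Docker SDK for Python (the &lt;code&gt;docker&lt;/code&gt; package); the &lt;code&gt;jupyter-server&lt;/code&gt; image is the one built earlier, and the port mapping is just an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import docker

# The mounted podman.sock shows up at Docker's default location,
# so the SDK talks to Podman exactly as if it were Docker.
client = docker.DockerClient(base_url="unix:///var/run/docker.sock")
print(client.ping())  # True if the Podman engine is reachable

# Spin up a code-interpreter container the same way you would with Docker.
container = client.containers.run(
    "jupyter-server",          # image built earlier in this guide
    detach=True,
    ports={"8888/tcp": 8888},  # example port mapping
)
print(container.short_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;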

&lt;h3&gt;
  
  
  Fix Docker SDK timeout issues
&lt;/h3&gt;

&lt;p&gt;When your agent generates code and sends it via the Docker SDK to a code interpreter container, there’s often a long idle gap between calls while the LLM produces tokens. On the second SDK call, you might then hit an error like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;requests.exceptions.ConnectionError: &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Connection aborted.'&lt;/span&gt;, RemoteDisconnected&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Remote end closed connection without response'&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This happens because &lt;code&gt;podman-machine&lt;/code&gt; sets a very short &lt;code&gt;service_timeout&lt;/code&gt; for its engine by default.&lt;/p&gt;

&lt;p&gt;Fix it by SSHing into your &lt;code&gt;podman-machine&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman machine ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you’re in the Linux VM. Edit &lt;code&gt;/etc/containers/containers.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;vi /etc/containers/containers.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;service_timeout&lt;/code&gt; to 0 (meaning “never time out”) under the [engine] section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;engine]
cgroup_manager &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cgroupfs"&lt;/span&gt;
service_timeout &lt;span class="o"&gt;=&lt;/span&gt; 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a dev environment, security isn’t a concern here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adjust Podman container stop timeout
&lt;/h3&gt;

&lt;p&gt;During development, you often start and stop your agent program from the command line. But you might notice it takes forever to exit after the program finishes.&lt;/p&gt;

&lt;p&gt;That’s because Podman waits by default (10 seconds!) for containers to gracefully shut down after receiving a stop signal.&lt;/p&gt;

&lt;p&gt;Shorten this by editing the same config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;engine]
stop_timeout &lt;span class="o"&gt;=&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now containers stop in 2 seconds—much snappier!&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Ever since Gemini 3.0 accidentally deleted 800GB of user data, running agent-generated code inside containers has become essential. And yes—stateful code interpreters truly boost an agent’s ability to tackle complex problems.&lt;/p&gt;

&lt;p&gt;Most agent frameworks rely on the Docker SDK to manage containers. But Docker Desktop’s licensing blocks its use in enterprise dev environments.&lt;/p&gt;

&lt;p&gt;This guide shows you how to use Podman Desktop as a drop-in replacement. Following these steps, you can quickly build a Docker-compatible container dev environment. I’ve used this exact setup for over six months with zero issues.&lt;/p&gt;

&lt;p&gt;This article covers only the minimal setup needed for agent development. If you want to dive deeper into Podman’s internals and advanced usage, check out the official tutorials.&lt;/p&gt;

&lt;p&gt;Our journey toward &lt;strong&gt;Deep Data Analyst&lt;/strong&gt; Agents continues—stay tuned for the next episode!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>containers</category>
      <category>devops</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>Share My LLM Prompts and Tips That Make Work and Learning Super Efficient</title>
      <dc:creator>Peng Qian</dc:creator>
      <pubDate>Fri, 28 Nov 2025 08:17:28 +0000</pubDate>
      <link>https://dev.to/qtalen/share-my-llm-prompts-and-tips-that-make-work-and-learning-super-efficient-2oj3</link>
      <guid>https://dev.to/qtalen/share-my-llm-prompts-and-tips-that-make-work-and-learning-super-efficient-2oj3</guid>
      <description>&lt;p&gt;A lot of friends ask me how I manage to stay busy with work every day, yet still find time to learn and write blog posts. The answer is simple: I use AI to help me learn about AI.&lt;/p&gt;

&lt;p&gt;Today I’m sharing all the AI tools, tricks, and prompts I’ve used over the past two years at work. No fluff—just straight-up useful stuff.&lt;/p&gt;




&lt;h2&gt;
  
  
  Find a Good AI Client
&lt;/h2&gt;

&lt;p&gt;If you use LLMs to boost your daily productivity, chatting with the model is still the main way most people interact with it. That means having a solid AI client app is essential.&lt;/p&gt;

&lt;p&gt;My favorite AI client right now is &lt;a href="https://www.cherry-ai.com/?ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;Cherry Studio Community Edition&lt;/a&gt;. It supports multiple languages, is completely open source and free, lets you connect to all kinds of model services, and even lets you add your own System Prompt and MCP tools. These features form the foundation for all the tips I’ll share next.&lt;/p&gt;

&lt;p&gt;You can pick any model service you like. I recommend &lt;a href="https://openrouter.ai/models?ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;—with one API key, you get access to Gemini, GPT, Claude, Qwen, and many other commercial or open-source models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1jo2u8srx67hmcc4nje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1jo2u8srx67hmcc4nje.png" alt="Configure the model service in Cherry Studio's settings interface. Image by Author" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then there’s the agent interface. You can fill in different System Prompts based on your use case. The prompts I use daily go right here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk40ispj2t0v49x1l1u5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk40ispj2t0v49x1l1u5t.png" alt="Enter the prompt you want to share today in the agent interface. Image by Author" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, in the chat interface, you can add the agent you just set up as your assistant, tweak your model settings, and start chatting with the LLM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpsqdoszarfwr4ccsaho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpsqdoszarfwr4ccsaho.png" alt="Just add the agent you just set up as an assistant, and you're good to go! Image by Author" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Best of all, using an AI desktop client lets you keep your knowledge base local and connect to self-hosted LLM services, which gives you and your company much tighter control over data security.&lt;/p&gt;




&lt;h2&gt;
  
  
  Some Suggestions on Models and Settings
&lt;/h2&gt;

&lt;p&gt;Next up are the models and parameters I use daily. These aren’t “correct” answers—just my personal experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model choices
&lt;/h3&gt;

&lt;p&gt;I follow a simple rule: if I’m using it myself, I go with the best. So I stick to commercial models unless I’m building agents, where I might pick an open-source model based on need. Here’s what I use regularly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.0&lt;/strong&gt; Pro is my top pick for vibe coding. Right now, code generated by Gemini 3.0 has the highest accuracy, which saves me tons of debugging time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-5&lt;/strong&gt; is the classic reliable choice—a great balance between cost and expertise. I use it for everything except coding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 Max&lt;/strong&gt;… well, I really dislike its overly encouraging tone. It always tells me I’m doing great, no matter what, and that drives me nuts. But I have to admit—Qwen3 shines in localization and language handling. I use it whenever I need translation or proofreading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameter settings
&lt;/h3&gt;

&lt;p&gt;If you’ve read my articles before, you probably already know what each LLM parameter does. Here’s how I personally set them when using models myself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Temperature.&lt;/strong&gt; Lower values make the model more predictable; higher values make it more creative. I adjust based on context. For coding, I set temperature to 0.01 to keep responses consistent across chats. For everything else, I stick with the default 0.7—it feels like talking to a real person. Later, I’ll show you how to make the LLM write creative motivational text—I crank temperature up to 0.8–0.9 for that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context length.&lt;/strong&gt; Few people pay attention to this setting, but besides saving token costs, it can offer unexpected benefits. For my translation agent, I set context length to 2—meaning no chat history is kept. That way, the LLM only translates what I input right now, without interference from past translations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max tokens.&lt;/strong&gt; This controls the maximum number of tokens per response. I always adjust it. For reasoning tasks, I keep it low—otherwise wait times and token costs explode. For writing, I set it high to avoid cutting off long articles due to default limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Try using MCP
&lt;/h3&gt;

&lt;p&gt;I’ve always felt that MCP was built exactly for making LLM clients more powerful. With MCP, your personal chat interface can unlock all kinds of agent capabilities. Here’s an example:&lt;/p&gt;

&lt;p&gt;We all know that due to design choices and legal restrictions, the built-in web search in LLMs keeps getting worse. Default search rarely gives useful info, especially over multi-turn conversations.&lt;/p&gt;

&lt;p&gt;But you can install a &lt;code&gt;tavily-mcp&lt;/code&gt; service for better web search. Just sign up on their site, get an API key, and add the tavily-mcp JSON config to your client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5n9uxob2ftui0qquaiy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5n9uxob2ftui0qquaiy.png" alt="You can find the JSON for configuring MCP on the official websites of the various tools. Image by Author" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;
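&lt;p&gt;As a rough illustration, an MCP entry for tavily usually looks something like the snippet below. Treat the command, arguments, and key name here as assumptions and copy the exact JSON from tavily’s official page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "tavily-mcp": {
      "command": "npx",
      "args": ["-y", "tavily-mcp"],
      "env": {
        "TAVILY_API_KEY": "your-api-key-here"
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;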

&lt;p&gt;Traditional search works like this: take your keywords, search the web, then build an answer from the results.&lt;/p&gt;

&lt;p&gt;MCP-based search is different.&lt;/p&gt;

&lt;p&gt;During the conversation, the LLM dynamically decides if it needs to search—and generates its own search keywords based on context. This leads to much more accurate results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fssjcqecq7eu8rwgwvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fssjcqecq7eu8rwgwvq.png" alt="When using tavily-mcp, the agent will try searching with multiple keywords until it finds the answer. Image by Author" width="800" height="645"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other MCP tools are great too. &lt;code&gt;fetch&lt;/code&gt; grabs webpage content from any URL you give it. &lt;code&gt;memory&lt;/code&gt; uses a knowledge graph to remember key info from your chats, helping you build your own custom AI agent.&lt;/p&gt;

&lt;p&gt;Now that we’ve covered LLM setup tips, let’s move on to my prompt-writing tricks.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Prompt-Writing Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is structured JSON prompting really necessary?
&lt;/h3&gt;

&lt;p&gt;After Gemini 3.0 launched, people noticed that using JSON-formatted prompts seemed to help LLMs follow instructions better. That sparked debate: should we always use JSON for clearer, more precise structure?&lt;/p&gt;

&lt;p&gt;I’ve discussed this several times with the brilliant engineers at Qwen. Their answer was clear:&lt;/p&gt;

&lt;p&gt;It depends on the format of the data the model was trained on. LLMs essentially memorize knowledge, including input formats, through their parameters. If most of the training text was in Markdown, then Markdown is naturally the best fit.&lt;/p&gt;

&lt;p&gt;That’s why LLMs output in Markdown by default—they were trained on Markdown-heavy data. So their most familiar format is still Markdown.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qjrhz6wnhyv9opwwbcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qjrhz6wnhyv9opwwbcu.png" alt="What you think is a clearer format doesn't mean the LLM sees it that way too. Image by Author" width="660" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s like someone who eats bread every day telling a rice-eating kid that bread is the real staple food—completely forgetting that for the kid, rice is the staple.&lt;/p&gt;

&lt;p&gt;So my conclusion? Just stick with Markdown—it’s plenty structured. And with modern LLMs, even plain-text instructions work fine as long as you’re clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  A universal template for system prompts
&lt;/h3&gt;

&lt;p&gt;Just like I usually write articles in a three-part structure (introduction, body, conclusion), having a prompt template saves you from staring at a blank screen.&lt;/p&gt;

&lt;p&gt;What template works for system prompts? I use the “Who, Can, Do” framework. Every good prompt includes: role, what to (not) do, and how to do it. I define each part with a subheading. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are a data analyst skilled at breaking complex tasks into Python-solvable subtasks.

## Tasks
1. **Task breakdown**: Split the user request into substeps, each solvable with Python.
2. **Code generation**: Turn the current substep into Python code.
3. **Code execution**: Run the code using a tool and get the result.
4. **Iterate**: Use the result to decide the next step. Repeat steps 1–3 until you have a final answer.
5. **Insight &amp;amp; advice**: Add thoughtful, practical insights based on the analysis.

## Requirements
- Execute one step at a time. No skipping or combining.

## Output
- Use Markdown with a clear structure.
- Keep tone friendly but authoritative.
- Add emojis for warmth.
- Format numbers with commas (e.g., 1,000).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s what each section means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Role&lt;/strong&gt; tells the LLM “who I am and what I can do.” “Who I am” shapes output style—serious or playful—based on the role you assign. “What I can do” sets initial boundaries. For MoE models, it can even influence which expert module activates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tasks&lt;/strong&gt; are your specific instructions. Since I recommend Markdown, use ordered lists if steps must run in sequence; otherwise, use unordered lists. Lists make your intent crystal clear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requirements&lt;/strong&gt; remind the LLM of its limits. Model makers train their LLMs to answer everything—but reality isn’t like that. Explicitly stating what it can’t do reduces hallucinations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; guides output format and tone. Use this section when you care about style, structure, or voice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can add extra sections based on your needs.&lt;/p&gt;

&lt;p&gt;For a coding agent, add a “Code Style” section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Code Style
- Code runs in Jupyter. Reuse existing variables.
- Write incrementally and leverage kernel state to avoid repetition.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you use RAG or want to show the LLM how to think, add an “Examples” section.&lt;/p&gt;
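&lt;p&gt;Here’s a made-up illustration of what an “Examples” section could look like for the data analyst prompt above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Examples
User: What were the total sales per month in 2024?
Assistant (step 1): Load the orders table and parse the order dates.
Assistant (step 2): Group by month and sum the sales amount.
Assistant (step 3): Return a Markdown table, one row per month, with numbers formatted with commas.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;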

&lt;h3&gt;
  
  
  Let the LLM help you debug your prompt
&lt;/h3&gt;

&lt;p&gt;Tweaking prompts is expensive—especially when a perfectly tuned prompt stops working after switching models. What can you do besides starting over?&lt;/p&gt;

&lt;p&gt;Use the LLM itself!&lt;/p&gt;

&lt;p&gt;After setting your System Prompt, just ask: “&lt;strong&gt;Please repeat in detail what you understand my instructions to be.&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;It’s like asking your friends to repeat back a task before they start—to confirm understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffja7fddysudkcnhon2rv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffja7fddysudkcnhon2rv.png" alt="Ask the LLM to repeat the instructions I gave it. Image by Author" width="800" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the LLM instantly return its interpretation of your system prompt. Compare it to your original intent and spot gaps.&lt;/p&gt;

&lt;p&gt;Or go further: ask “&lt;strong&gt;Please use Markdown to repeat in detail what you understand my instructions to be.&lt;/strong&gt;”&lt;/p&gt;

&lt;p&gt;The LLM will then write a clean Markdown version of its understanding. You can copy the good parts straight back into your own prompt. Trust me—it will follow what it writes. I’ve tested this countless times.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learn LLM fundamentals systematically
&lt;/h3&gt;

&lt;p&gt;Knowing what works (and what doesn’t) with LLMs requires understanding how they work. This guide can’t cover everything—you need proper training.&lt;/p&gt;

&lt;p&gt;I highly recommend Coursera’s &lt;a href="https://imp.i384100.net/e1VAx6?ttd=c_1&amp;amp;ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Google AI Essentials Specialization&lt;/strong&gt;&lt;/a&gt;. &lt;strong&gt;It teaches you how to use LLM tools, write effective prompts, handle hallucinations, and more.&lt;/strong&gt; In just 4 hours, you’ll go from beginner to AI power user.&lt;/p&gt;

&lt;p&gt;I took this course when I started. It provided me with a rock-solid foundation for agent development—and many of the tricks in this post came straight from it.&lt;/p&gt;

&lt;p&gt;Beyond theory, I’ll now share some of my go-to prompt examples. You can use them directly at work or as inspiration for your own prompts. Let’s dive in.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Work Prompt Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prompt for blog cover images
&lt;/h3&gt;

&lt;p&gt;Let’s start with the prompt I use to generate blog cover art.&lt;/p&gt;

&lt;p&gt;If you’ve read my posts before, you’ll notice a consistent style: a cute little rabbit busy doing various things. I had DeepSeek generate the image prompt, then used DALL·E 3 to create the picture.&lt;/p&gt;

&lt;p&gt;Even though OpenAI says DALL·E 3 boosts prompts automatically, I still get better results by first using an LLM to write a full image prompt. Here’s what I give DeepSeek:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are a visual artist skilled at writing DALL·E 3-friendly prompts.

## Task
Rewrite my [scene description] into a detailed English prompt perfect for DALL·E 3.

## Length
Describe in 5 bullet points. Only the prompt—no intro or explanation.

## Style
Colorful illustration on slightly yellowed parchment paper, filled with tech elements.

## Visuals
Include impressive details like camera angle and lighting.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used this exact prompt in a previous post about generating ink-wash style illustrations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.dataleadsfuture.com/use-llamaindex-workflow-to-create-an-ink-painting-style-image-generation-workflow/" rel="noopener noreferrer"&gt;Use LLamaIndex Workflow to Create an Ink Painting Style Image Generation Workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though that post automated the whole workflow, you can still manually generate the image prompt first, then create the picture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Daily translation assistant
&lt;/h3&gt;

&lt;p&gt;Since I started blogging, I have often chatted with readers from around the world and answered their questions.&lt;/p&gt;

&lt;p&gt;I want my English to sound natural and conversational—not stiff like machine translation. So I use an LLM with this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are an expert Chinese-English translator in computer science and programming.

## Task
- Detect the language of the user’s message.
- If it’s not Chinese, translate it into Chinese.
- If it’s Chinese, translate it to English.

## Style
- Keep translations simple and clear. Avoid complex words.
- Use vocabulary a middle schooler would understand.
- Sound conversational—like a good friend chatting with you.

## Rules
1. Only translate—never do anything else.
2. Output only the translation—no intro or notes.

## Special Terms
Translate these terms as follows:
[Chinese phrase]: [English translation]
大模型: LLM
大语言模型: LLM
私有化部署: self-hosted

-----------------------------------------------
Now translate this:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prompt automatically detects the input language and translates the text into the other language:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusxb3ljrehhq9hapmcbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusxb3ljrehhq9hapmcbw.png" alt="Automatically detect the language and translate it. Image by Author" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few notes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I set context length to 2—only the current text and translation are kept. This prevents interference from past chats.&lt;/li&gt;
&lt;li&gt;I lock down translations for special terms because models often disagree on them.&lt;/li&gt;
&lt;li&gt;I end with “Now translate this:”. LLMs predict the next token from the tokens that came before, so if the input is a reader’s question and I don’t add this line, the LLM answers the question instead of translating it. This phrase makes the task unambiguous.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Article translation prompt
&lt;/h3&gt;

&lt;p&gt;English isn’t my first language, so I always need to translate my new articles.&lt;/p&gt;

&lt;p&gt;At first, I used DeepL plus Grammarly for polishing. Paying for two subscriptions every month got expensive—and let’s be honest, machine translation isn’t great.&lt;/p&gt;

&lt;p&gt;So as soon as GPT-3.5 came out, I switched to using LLMs for translation. Even the article you’re reading now was translated by an LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are a senior data science expert and blog editor.

## Task
I wrote a Chinese data science article. Translate it into English.

## Rules
- Keep my original paragraph breaks.
- Never add code that wasn’t there.
- Avoid adverbs, prepositions, and passive voice.
- **Only translate—don’t rewrite or change my content.**

## Style
- Conversational, light, and cheerful.
- Don’t bold the first word or phrase in list items.
- Use Title Case for headings and Sentence case for subheadings.

## Tone
- Keep it simple and easy to understand.
- Use words any US 12th grader would know.
- Sound like you’re chatting with a good friend.

## Audience
Beginners in data science and people curious about the field.

## Special Terms
Translate these terms as follows:
[Chinese phrase]: [English translation]
大模型: LLM
大语言模型: LLM
私有化部署: self-hosted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data science research assistant
&lt;/h3&gt;

&lt;p&gt;My day job is data science, so I set up a specialized assistant rather than a general-purpose one. I want the code it writes to match my habits, and over time those preferences settled into the “Requirements” section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are a data scientist acting as my programming assistant. Help me improve my data science coding and algorithm skills.

## Requirements
1. Be truthful and precise. Never make things up.
2. Ensure all code runs correctly.
3. Use Python 3.12+ features, syntax, APIs, and best practices.
4. Always use the latest versions and APIs of third-party libraries.
5. Write clean, efficient, readable code.
6. Comment only when necessary.
7. Use vertical bar | for type unions.
8. Prefer `with` statements.
9. Use `pathlib` for file and directory operations.

## Tool Use
* tavily-search: Always use `"search_depth": 'advanced'`.

## Response
Be truthful, thorough, well-organized, and accurate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember the MCP tip earlier? Here I tell the LLM to use tavily’s advanced mode for deep web searches when it lacks info.&lt;/p&gt;

&lt;p&gt;Honestly, with this setup plus Gemini 3.0 Pro, the LLM has massively boosted my data science learning—high efficiency and minimal hallucinations.&lt;/p&gt;

&lt;h3&gt;
  
  
  General-purpose daily assistant
&lt;/h3&gt;

&lt;p&gt;This one’s simple. I just want honest, helpful answers, without the fake positivity Qwen3-Max usually pours on. With this prompt it handles everyday questions well, and Qwen3-Max is cheaper too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Role
You are my personal assistant. Give me sincere, useful advice for life and work.

## Requirements
* All knowledge, sources, and text must be real and accurate. No fabrications.
* Never lie. If you don’t know something, say so.
* Don’t just agree with me—point out problems or suggest improvements.

## Output
* Never use em dashes in your replies.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;That’s all the LLM tips and prompt examples I’ve gathered over the past two years. They’re not perfect, but they’ve seriously boosted my daily productivity. I hope they help you, too.&lt;/p&gt;

&lt;p&gt;I’ll keep updating this post with more practical tricks. If there’s something you’d like to see, leave me a comment!&lt;/p&gt;

&lt;p&gt;One last plug: take the &lt;a href="https://imp.i384100.net/e1VAx6?ttd=c_2&amp;amp;ref=dataleadsfuture.com" rel="noopener noreferrer"&gt;&lt;strong&gt;Google AI Essentials Specialization&lt;/strong&gt;&lt;/a&gt; on Coursera. Its glowing reviews prove it’s worth your time. For a small fee, you’ll quickly master LLM usage—and earn a Google AI certificate to boot.&lt;/p&gt;




&lt;p&gt;Enjoyed this read? &lt;a href="https://www.dataleadsfuture.com/#/portal/signup" rel="noopener noreferrer"&gt;Subscribe now to get more cutting-edge data science tips straight to your inbox!&lt;/a&gt; Your feedback and questions are welcome — let’s discuss in the comments below!&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.dataleadsfuture.com/share-my-llm-prompts-and-tips-that-make-work-and-learning-super-efficient/" rel="noopener noreferrer"&gt;Data Leads Future&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>promptengineering</category>
      <category>chatgpt</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
