<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yaohua Chen</title>
    <description>The latest articles on DEV Community by Yaohua Chen (@chen115y).</description>
    <link>https://dev.to/chen115y</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2671324%2F3ca83c79-d0fd-40c0-b5ac-349326e71725.jpg</url>
      <title>DEV Community: Yaohua Chen</title>
      <link>https://dev.to/chen115y</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chen115y"/>
    <language>en</language>
    <item>
      <title>A Claude Code Skills Stack: How to Combine Superpowers, gstack, and GSD Without the Chaos</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:30:15 +0000</pubDate>
      <link>https://dev.to/imaginex/a-claude-code-skills-stack-how-to-combine-superpowers-gstack-and-gsd-without-the-chaos-44b3</link>
      <guid>https://dev.to/imaginex/a-claude-code-skills-stack-how-to-combine-superpowers-gstack-and-gsd-without-the-chaos-44b3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;One article to compare the frameworks, see where they overlap, and land on a stable three-layer practice.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Claude Code has quickly become one of the most widely adopted AI coding tools. Individual developers, startups, and large engineering teams alike have integrated it into their daily workflows—writing production code, reviewing pull requests, debugging, and shipping features at a pace that was hard to imagine a year ago. As usage has scaled, so has the ecosystem around it. &lt;strong&gt;Claude Skills&lt;/strong&gt;—composable, auto-invoked instruction sets that shape how the agent plans, builds, and verifies—have emerged as one of the most important extension points in Claude Code. They let you go beyond one-off prompts and encode &lt;strong&gt;repeatable workflows&lt;/strong&gt; directly into the agent's behavior. In fact, Anthropic has doubled down on this direction: the latest version of Claude Code &lt;strong&gt;consolidates the previously separate "slash commands" and "skills" systems into a single, unified skills format&lt;/strong&gt;, signaling that skills are now the canonical way to extend the agent.&lt;/p&gt;

&lt;p&gt;With Skills now central to the experience, the community has rallied around a handful of open-source frameworks that package best practices into ready-made skill sets. The two most discussed stacks are &lt;strong&gt;Superpowers&lt;/strong&gt; and &lt;strong&gt;gstack&lt;/strong&gt;. Installing both sounds easy; in practice they can &lt;strong&gt;conflict&lt;/strong&gt;, and piling frameworks on without a plan often makes the setup &lt;strong&gt;less&lt;/strong&gt; stable, not more. So where do they differ, and how should you choose?&lt;/p&gt;

&lt;p&gt;This post does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt; Superpowers and gstack on repos, features, and philosophy—the material below on stars, skill lists, and trade-offs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a third layer&lt;/strong&gt; many guides skip: &lt;strong&gt;GSD&lt;/strong&gt; as a &lt;strong&gt;context / spec&lt;/strong&gt; stabilizer so long-running work does not drift (&lt;em&gt;informed by Tricia Notes Editorial’s three-layer framing&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End with a single playbook&lt;/strong&gt;: who owns &lt;strong&gt;decision&lt;/strong&gt;, &lt;strong&gt;context&lt;/strong&gt;, and &lt;strong&gt;execution&lt;/strong&gt;, and how to cherry-pick skills without blowing up token use or cognitive load.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The useful question is not only “Superpowers &lt;strong&gt;or&lt;/strong&gt; gstack?” but: &lt;em&gt;what are you missing—decision-making, durable context, or execution?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In one line:&lt;/strong&gt; &lt;em&gt;gstack thinks, GSD stabilizes, Superpowers executes.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Orientation: Three Layers, Not Only Two
&lt;/h2&gt;

&lt;p&gt;What stays stable in practice is often &lt;strong&gt;not&lt;/strong&gt; picking one framework over another, but a &lt;strong&gt;three-way division of labor&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Stack&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Decision / roles&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;gstack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judgment from CEO, design, architecture, QA-style lenses—not only “how to code.”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context / spec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;GSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps spec, status, boundaries, and long-horizon context from rotting.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Superpowers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Requirement clarification → plan → TDD → acceptance as a &lt;strong&gt;closed loop&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How each is “strong”:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; — &lt;strong&gt;How&lt;/strong&gt; work gets done; smooth execution loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; — &lt;strong&gt;What&lt;/strong&gt; to do and &lt;strong&gt;whether&lt;/strong&gt; it should be done; richer role-based judgment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD&lt;/strong&gt; — &lt;strong&gt;Not drifting&lt;/strong&gt;; steadier specs and context over long chains.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both Superpowers and gstack have gone viral. On the surface they add process to AI; in use, they help you &lt;strong&gt;think clearly about what matters&lt;/strong&gt;. When the model codes fast, that is exactly when you need clear requirements and stable context—&lt;strong&gt;that&lt;/strong&gt; is what most people still overlook.&lt;/p&gt;




&lt;h2&gt;
  
  
  Superpowers vs gstack: Quick Facts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers (GitHub ~137K stars)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;strong&gt;obra/superpowers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Agent Skills&lt;/strong&gt; framework and software development methodology: &lt;strong&gt;14 built-in skills&lt;/strong&gt; across brainstorming, planning, TDD, execution, and verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  gstack (GitHub ~65K stars)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;strong&gt;garrytan/gstack&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;From &lt;strong&gt;YC CEO Garry Tan&lt;/strong&gt;, open source.&lt;/li&gt;
&lt;li&gt;Philosophy: a &lt;strong&gt;team&lt;/strong&gt; beside you—CEO, designer, eng manager, release manager, doc engineer, QA, and more—&lt;strong&gt;23 opinionated tools&lt;/strong&gt; (product thinking, CEO review, architecture review, real browser testing, design review, security audits, etc.).&lt;/li&gt;
&lt;li&gt;Garry has claimed &lt;strong&gt;600K+ lines of production code (35% tests) in 60 days&lt;/strong&gt;, part-time while running YC full-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stars are a weak proxy: high star count does not mean every skill fits your workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Feature Comparison (Superpowers vs gstack)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;gstack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Product brainstorming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;brainstorming&lt;/td&gt;
&lt;td&gt;/office-hours, /plan-ceo-review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Architecture planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;writing-plans&lt;/td&gt;
&lt;td&gt;/plan-eng-review, /autoplan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/design-consultation, /plan-design-review, /design-shotgun, /design-html&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Development execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;executing-plans, subagent-driven-development, dispatching-parallel-agents&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Testing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;test-driven-development&lt;/td&gt;
&lt;td&gt;/qa, /qa-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;systematic-debugging&lt;/td&gt;
&lt;td&gt;/investigate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code review&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;requesting-code-review, receiving-code-review&lt;/td&gt;
&lt;td&gt;/review, /codex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification &amp;amp; acceptance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;verification-before-completion, finishing-a-development-branch&lt;/td&gt;
&lt;td&gt;/ship, /land-and-deploy, /canary, /document-release&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/cso, /careful, /freeze, /guard, /unfreeze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/learn, /retro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser testing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/browse, /connect-chrome, /setup-browser-cookies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Git worktrees&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;using-git-worktrees&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Skill management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;using-superpowers, writing-skills&lt;/td&gt;
&lt;td&gt;/gstack-upgrade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/benchmark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;/setup-deploy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Coverage differs a lot; &lt;strong&gt;quantity is not the point—design philosophy is.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Philosophy: “How” vs “What” (and Where GSD Fits)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers — focused on &lt;strong&gt;how&lt;/strong&gt; code gets built
&lt;/h3&gt;

&lt;p&gt;The workflow centers on &lt;strong&gt;high-quality output&lt;/strong&gt;: clarify, plan, &lt;strong&gt;TDD&lt;/strong&gt; (tests before implementation), verify. Checkpoints at each step—little room to skip. In practice it feels &lt;strong&gt;disciplined&lt;/strong&gt;: you ask for X, it tends to build X. Engineers who already know what to build often find that empowering.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Execution-layer detail from hands-on use: strong process and steady execution; small tasks can still feel **heavy&lt;/em&gt;* because the full rhythm applies even to tiny asks.)*&lt;/p&gt;

&lt;h3&gt;
  
  
  gstack — focused on &lt;strong&gt;what&lt;/strong&gt; and &lt;strong&gt;what not&lt;/strong&gt; to do
&lt;/h3&gt;

&lt;p&gt;Before heavy coding, flows like &lt;strong&gt;/office-hours&lt;/strong&gt; walk requirements; &lt;strong&gt;CEO&lt;/strong&gt; and &lt;strong&gt;engineering&lt;/strong&gt; reviews stress-test the approach. It is not only code—it can &lt;strong&gt;run real browser tests&lt;/strong&gt; from a user angle. Rough split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision layer:&lt;/strong&gt; &lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer:&lt;/strong&gt; &lt;code&gt;/review&lt;/code&gt;, &lt;code&gt;/qa&lt;/code&gt;, &lt;code&gt;/ship&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;gstack shines when requirements are still fuzzy—PMs, indies, or “think while building.” Caveat: turning &lt;strong&gt;all&lt;/strong&gt; roles on can feel &lt;strong&gt;bloated&lt;/strong&gt;; decision skills also burn serious tokens (see below).&lt;/p&gt;

&lt;h3&gt;
  
  
  GSD — &lt;strong&gt;context / spec&lt;/strong&gt;, not another “team chart”
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GSD&lt;/strong&gt; is not “install another team.” It is &lt;strong&gt;context engineering&lt;/strong&gt;: goals, specs, status, boundaries, and summaries anchored so &lt;strong&gt;context rot&lt;/strong&gt; slows down. Short demos hide this; &lt;strong&gt;long&lt;/strong&gt; projects show it—when context wobbles, output scatters; that is &lt;strong&gt;state&lt;/strong&gt;, not only “bad execution.”&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; thinks but is not, by itself, a long-term context vault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; executes but is not, by itself, a spec/context system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD&lt;/strong&gt; fills that gap so chains stay coherent.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Three-Way Comparison (Problems, Not “Who Wins”)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Superpowers&lt;/th&gt;
&lt;th&gt;gstack&lt;/th&gt;
&lt;th&gt;GSD&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core question&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How to get things done&lt;/td&gt;
&lt;td&gt;What to do; whether it should&lt;/td&gt;
&lt;td&gt;How to keep the project from diverging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Layer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;Decision / roles&lt;/td&gt;
&lt;td&gt;Context / spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strongest fit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Planning, TDD, acceptance loop&lt;/td&gt;
&lt;td&gt;Multi-perspective judgment, review, QA&lt;/td&gt;
&lt;td&gt;Context engineering; stable state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear requirements&lt;/td&gt;
&lt;td&gt;Think-while-building&lt;/td&gt;
&lt;td&gt;Long chains / many iterations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Common pain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Front-loaded process can feel heavy (details below)&lt;/td&gt;
&lt;td&gt;Bloated and token-hungry when fully enabled (details below)&lt;/td&gt;
&lt;td&gt;Little standalone “shipping” value on its own (details below)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Role&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;execution&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;decision-making&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Own &lt;strong&gt;long-term context&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Common Pain Points in Detail
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Superpowers — front-loaded process can feel heavy.&lt;/strong&gt; Every task, no matter how small, runs through the full cycle: clarify requirements, draft a plan, write tests first, then implement, then verify. For a large feature this rhythm pays off handsomely. For a two-line config fix or a quick copy change, the same ceremony kicks in and you end up spending more time on process than on the actual change. The overhead does not scale down with task size, so small requests can feel disproportionately slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gstack — bloated and token-hungry when fully enabled.&lt;/strong&gt; Each gstack role (CEO, designer, architect, QA, etc.) injects its own perspective and prompts into the context. Turn them all on and a single execution-layer skill can consume &lt;strong&gt;10K+ tokens&lt;/strong&gt; before any real code is written. Daily usage burns through tokens fast, and the back-and-forth between multiple “virtual team members” can make even straightforward tasks feel sluggish and redundant. You may also encounter irrelevant meta-questions (e.g. “Are you applying to become a YC company?”) while your codebase is being scanned—artifacts of the framework’s opinionated persona layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GSD — little standalone “shipping” value.&lt;/strong&gt; GSD excels at keeping specs, goals, and state anchored across long sessions. But if you use it &lt;strong&gt;alone&lt;/strong&gt;, it does not directly produce code, run tests, or open a PR. It is a stabilizer, not a builder. Without an execution layer (Superpowers) or a decision layer (gstack) alongside it, GSD manages context that nothing acts on—useful plumbing, but no visible output. Its value only becomes apparent when paired with tools that actually ship work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; they are &lt;strong&gt;complements&lt;/strong&gt;, not substitutes—Superpowers executes, gstack decides, GSD stabilizes specs and context over time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Strengths, Weaknesses, and Friction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Superpowers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; Brainstorming and overall workflow feel solid; full process even on small asks can become smooth once habitual; execution and TDD are strong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Weaker spots are often &lt;strong&gt;early&lt;/strong&gt; decision skills (e.g. planning/brainstorming) compared to gstack’s decision layer—hence many people &lt;strong&gt;pair&lt;/strong&gt; gstack’s front end with Superpowers’ execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  gstack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths:&lt;/strong&gt; &lt;strong&gt;Decision layer&lt;/strong&gt;—&lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt;—stand out for positioning and approach review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Execution feels rougher vs Superpowers; token cost is real—&lt;strong&gt;a single execution-layer skill can cost 10K+ tokens&lt;/strong&gt;, and heavy scans can feel like noisy “process” rather than help.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The analogy
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Superpowers is a scalpel&lt;/strong&gt; — precise and efficient.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;gstack is a full clinic&lt;/strong&gt; — from diagnosis to aftercare.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Use the metaphor to choose depth: narrow execution vs full-spectrum product and review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Consolidated Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Choose skills deliberately—do not install everything
&lt;/h3&gt;

&lt;p&gt;Skill counts spiral easily (Superpowers today, gstack tomorrow, another stack next week). &lt;strong&gt;Selective deployment&lt;/strong&gt; beats volume; random invocation feels unstable and inflates surface-level “skill count” without clarity.&lt;/p&gt;

&lt;p&gt;Underlying idea: both stacks are experiments in &lt;strong&gt;Harness Engineering&lt;/strong&gt;. The mindset is &lt;strong&gt;leverage strengths, cover weaknesses&lt;/strong&gt;—not “I want it all.”&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Decision vs execution (the classic split)—then add context when needed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;gstack for the decision layer (cherry-picked):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prioritize high-value flows: e.g. &lt;code&gt;/office-hours&lt;/code&gt;, &lt;code&gt;/plan-ceo-review&lt;/code&gt;, &lt;code&gt;/plan-eng-review&lt;/code&gt; for requirements and alignment—avoid over-investing in every role.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Superpowers for the execution layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer Superpowers as the &lt;strong&gt;base&lt;/strong&gt; for TDD, plans-as-executed, verification—optionally &lt;strong&gt;de-emphasize&lt;/strong&gt; its own heavy decision skills if gstack already covers that phase, so small tasks do not inherit double process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GSD when the chain diverges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If work &lt;strong&gt;spreads&lt;/strong&gt; across sessions and threads, add &lt;strong&gt;GSD&lt;/strong&gt; so spec and state stay anchored—not for flash, &lt;strong&gt;for anti-drift&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Stable workflow (three steps)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Decision → gstack&lt;/strong&gt; — Start with &lt;code&gt;/office-hours&lt;/code&gt; to stress-test the idea, then run &lt;code&gt;/plan-ceo-review&lt;/code&gt; for a founder-level sanity check and &lt;code&gt;/plan-eng-review&lt;/code&gt; to lock architecture and data flow. If design matters, add &lt;code&gt;/plan-design-review&lt;/code&gt;. The goal: decide &lt;strong&gt;what&lt;/strong&gt; to build and &lt;strong&gt;whether&lt;/strong&gt; to build it before touching code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context → GSD&lt;/strong&gt; — Once the decision is made, use GSD (v2) to anchor the plan: &lt;code&gt;PROJECT.md&lt;/code&gt; for what the project is, &lt;code&gt;DECISIONS.md&lt;/code&gt; for architectural choices, &lt;code&gt;KNOWLEDGE.md&lt;/code&gt; for cross-session rules and patterns, and milestone roadmaps (&lt;code&gt;M001-ROADMAP.md&lt;/code&gt;) for sliced execution. These v2 artifacts keep spec, status, and boundaries stable so context does not rot between sessions. (The original GSD uses &lt;code&gt;REQUIREMENTS.md&lt;/code&gt;, &lt;code&gt;ROADMAP.md&lt;/code&gt;, and &lt;code&gt;STATE.md&lt;/code&gt; instead.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution → Superpowers&lt;/strong&gt; — With clear requirements and stable context in place, hand off to Superpowers’ execution loop: &lt;code&gt;brainstorming&lt;/code&gt; (if lightweight refinement is still needed), &lt;code&gt;writing-plans&lt;/code&gt; → &lt;code&gt;executing-plans&lt;/code&gt; for implementation, &lt;code&gt;test-driven-development&lt;/code&gt; for the RED-GREEN-REFACTOR cycle, &lt;code&gt;requesting-code-review&lt;/code&gt; / &lt;code&gt;receiving-code-review&lt;/code&gt; for review, and &lt;code&gt;verification-before-completion&lt;/code&gt; → &lt;code&gt;finishing-a-development-branch&lt;/code&gt; to close the loop. For parallel work, use &lt;code&gt;dispatching-parallel-agents&lt;/code&gt; or &lt;code&gt;subagent-driven-development&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Merged tagline:&lt;/strong&gt; &lt;em&gt;gstack handles thinking, Superpowers handles doing, GSD keeps long context honest.&lt;/em&gt; Combining the &lt;strong&gt;strong decision slice&lt;/strong&gt; of gstack with &lt;strong&gt;Superpowers’ execution&lt;/strong&gt; (and GSD when needed) keeps skill count and collisions under control—similar to the author’s experience building a small tool on a weekend with a &lt;strong&gt;curated&lt;/strong&gt; mix.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Final heuristics
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Requirements still fuzzy → &lt;strong&gt;start with gstack&lt;/strong&gt; (decision).&lt;/li&gt;
&lt;li&gt;Work keeps diverging across the chain → &lt;strong&gt;add GSD&lt;/strong&gt; (context).&lt;/li&gt;
&lt;li&gt;You want execution &lt;strong&gt;steady and closed-loop&lt;/strong&gt; → &lt;strong&gt;lean on Superpowers&lt;/strong&gt; (execution).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Stop asking only:&lt;/strong&gt; “Superpowers or gstack?” &lt;strong&gt;Ask:&lt;/strong&gt; &lt;em&gt;Am I missing decision, context, or execution?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing:
&lt;/h2&gt;

&lt;p&gt;Skills are not stronger because you install more—they are stronger when you &lt;strong&gt;combine the right pieces for the gap you actually have&lt;/strong&gt; and understand what each layer does, then assemble a workflow that is yours.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Superpowers&lt;/strong&gt; — &lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;github.com/obra/superpowers&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gstack&lt;/strong&gt; — &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;github.com/garrytan/gstack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GSD (Get Shit Done)&lt;/strong&gt; — &lt;a href="https://github.com/gsd-build/get-shit-done" rel="noopener noreferrer"&gt;github.com/gsd-build/get-shit-done&lt;/a&gt; (original) | &lt;a href="https://github.com/gsd-build/gsd-2" rel="noopener noreferrer"&gt;github.com/gsd-build/gsd-2&lt;/a&gt; (v2, standalone CLI)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>sde</category>
      <category>claude</category>
    </item>
    <item>
      <title>From IDE to AGaaS: How Cursor Cloud Agents Bring the OpenClaw Model to Your Slack</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Tue, 24 Mar 2026 00:07:17 +0000</pubDate>
      <link>https://dev.to/imaginex/from-ide-to-agaas-how-cursor-cloud-agents-bring-the-openclaw-model-to-your-slack-4547</link>
      <guid>https://dev.to/imaginex/from-ide-to-agaas-how-cursor-cloud-agents-bring-the-openclaw-model-to-your-slack-4547</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Cursor's &lt;strong&gt;Cloud Agents&lt;/strong&gt; let you delegate coding tasks — bug fixes, feature work, test writing — directly from a Slack message. The agent spins up a remote VM, clones your repo, writes the code, runs your tests, and opens a Pull Request on GitHub. You never open an IDE. This post walks you through the full setup — from Slack integration to your first hands-off pull request — and then examines where the technology shines, where it falls short, and where the AGaaS market is heading next.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is the OpenClaw Model — and Why Should You Care?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenClaw&lt;/strong&gt; refers to an emerging paradigm in AI-assisted development where a cloud-hosted coding agent operates &lt;em&gt;autonomously and headlessly&lt;/em&gt; — meaning it doesn't need a local IDE, a human at the keyboard, or even a screen. You give it a task in natural language, and it handles the full software development lifecycle (clone → code → test → commit → PR) on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AGaaS (Agent-as-a-Service)&lt;/strong&gt; is the broader industry term for this pattern: instead of installing AI tooling locally, you interact with a managed agent through everyday interfaces like Slack, Teams, or a web dashboard.&lt;/p&gt;

&lt;p&gt;Cursor's Cloud Agents are a production-ready implementation of this model. If you're already using Cursor as your IDE, you can now step &lt;em&gt;outside&lt;/em&gt; the IDE entirely and operate as a manager — assigning tasks from Slack and reviewing the output as Pull Requests.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Cloud Agents Work Under the Hood?
&lt;/h2&gt;

&lt;p&gt;Before diving into setup, here's what happens when you type &lt;code&gt;@Cursor revise the README.md file to make it more professional and beginner-friendly&lt;/code&gt; in Slack:&lt;/p&gt;

&lt;h3&gt;
  
  
  Headless Execution on Isolated VMs
&lt;/h3&gt;

&lt;p&gt;Traditionally, Cursor ran locally — consuming your RAM, competing for your CPU. Cloud Agents move the execution layer to a remote, isolated Virtual Machine. When a task is triggered, the agent provisions a sandboxed VM, clones your GitHub repository into it, and does all the work in the background. Your local machine stays completely free.&lt;/p&gt;

&lt;p&gt;Each VM comes pre-loaded with a production-grade development environment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04.4 LTS (Noble Numbat), Linux kernel 6.12.58+, x86_64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hardware&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4 CPU cores, 15 GB RAM, ~126 GB disk (overlay filesystem)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtimes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.12.3, Node.js v22.22.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Toolchain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Git 2.43.0, GitHub CLI 2.81.0, Bash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workspace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your repo cloned at &lt;code&gt;/workspace&lt;/code&gt;, running as user &lt;code&gt;ubuntu&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can verify this yourself by asking the agent about its environment. Here's what that looks like in a real Slack conversation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gj4r7scskjganscyjqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0gj4r7scskjganscyjqy.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Slack Thread as Context Window
&lt;/h3&gt;

&lt;p&gt;This isn't a basic chatbot that only reads your one-line prompt. Cursor's Slack integration behaves like a teammate who's been reading the whole conversation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your team has been discussing a bug in a thread — sharing stack traces, debating approaches, pasting logs — the agent ingests &lt;em&gt;all of it&lt;/em&gt; when you tag &lt;code&gt;@Cursor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;It synthesizes the thread context and implements a fix that reflects the team's consensus, not just your single message.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Autonomous Testing via "Computer Use"
&lt;/h3&gt;

&lt;p&gt;Because the agent has its own VM with a full desktop environment, it doesn't just write code and hope for the best:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It can start your dev server, open a headless browser, and click through UI flows to visually verify the fix.&lt;/li&gt;
&lt;li&gt;If tests fail or the UI breaks, it self-corrects &lt;em&gt;before&lt;/em&gt; submitting the Pull Request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you understand what's happening behind the scenes, let's set it up. The whole process takes about 15 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step-by-Step Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before you begin, make sure you have the following in place:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor subscription&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Agents require a paid plan — &lt;strong&gt;Pro&lt;/strong&gt; ($20/mo), &lt;strong&gt;Pro+&lt;/strong&gt;, &lt;strong&gt;Ultra&lt;/strong&gt;, or &lt;strong&gt;Teams&lt;/strong&gt;. Check your plan at &lt;a href="https://cursor.com/en-US/pricing" rel="noopener noreferrer"&gt;cursor.com/pricing&lt;/a&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitHub account&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your repository must be hosted on GitHub or GitLab. You need read-write access to the repo.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slack workspace&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You need &lt;strong&gt;admin permissions&lt;/strong&gt; (or the ability to request app installation) in your Slack workspace.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Existing test suite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recommended but not required. The agent can run your tests automatically if they exist (e.g., &lt;code&gt;npm test&lt;/code&gt;, &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;go test&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Step 1: Connect Slack to Cursor
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open the &lt;strong&gt;Cursor Dashboard&lt;/strong&gt; at &lt;a href="https://cursor.com/dashboard" rel="noopener noreferrer"&gt;cursor.com/dashboard&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Integrations &amp;amp; MCP&lt;/strong&gt; tab.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect&lt;/strong&gt; next to &lt;strong&gt;Slack&lt;/strong&gt;. This launches an OAuth flow that installs the Cursor bot into your Slack workspace.&lt;/li&gt;
&lt;li&gt;Authorize the requested permissions (read messages in channels where the bot is invited, post replies).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Connect Your GitHub Repository
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In the same Dashboard, go to the &lt;strong&gt;Cloud Agents &amp;gt; Default Repositories &amp;gt; Manage Repositories&lt;/strong&gt; section.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Repository&lt;/strong&gt; and authenticate with GitHub.&lt;/li&gt;
&lt;li&gt;Select the repository (or repositories) you want the Cloud Agent to access.&lt;/li&gt;
&lt;li&gt;Grant the agent permission to &lt;strong&gt;create branches&lt;/strong&gt; and &lt;strong&gt;open Pull Requests&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Configure the Cloud Agent Environment
&lt;/h3&gt;

&lt;p&gt;Before triggering tasks from Slack, configure the Cloud Agent's development environment and defaults in the Cursor dashboard. Navigate to &lt;strong&gt;Cloud Agents&lt;/strong&gt; in the left sidebar.&lt;/p&gt;

&lt;h4&gt;
  
  
  3a. Set Your Defaults
&lt;/h4&gt;

&lt;p&gt;Under the &lt;strong&gt;My Settings&lt;/strong&gt; tab, configure the following:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;What It Controls&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AI model the agent uses when no model is specified in the task. Higher-tier models produce better code.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Opus 4.6 High Fast&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Default Repository&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The GitHub repo the agent targets when no repo is mentioned in the Slack message.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chen115y/MLOpsLearning&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Base Branch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The branch the agent creates feature/fix branches from. Leave empty to use the repo's default branch.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;main&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Branch Prefix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prepended to every branch the agent creates, making agent-authored branches easy to filter.&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cursor/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1v6x13qguzkeegrwliw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1v6x13qguzkeegrwliw.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3b. Set Up a Development Environment
&lt;/h4&gt;

&lt;p&gt;For repositories with complex dependencies (Python data-science stacks, system libraries, database services), click &lt;strong&gt;Add Environment&lt;/strong&gt; button. This launches a very simple setup agent that provisions and validates the VM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faost6duggi01rkhw8i2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faost6duggi01rkhw8i2g.png" alt=" " width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once all fields are filled, click &lt;strong&gt;Start For Free&lt;/strong&gt; to start the VM provisioning. The setup agent will analyze the repository and provision the VM accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn28ln49e8e22bd6nh9fo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn28ln49e8e22bd6nh9fo.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; You can add multiple environments for different repos. If the setup agent reports warnings (e.g., deprecated API calls in older notebooks), these are pre-existing code issues, not environment problems — the snapshot is still safe to save.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 4: Create a Channel and Invite the Bot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In Slack, create a dedicated channel for agent-assisted work (e.g., &lt;code&gt;#engineering-triage&lt;/code&gt;, &lt;code&gt;#cursor-tasks&lt;/code&gt;, or &lt;code&gt;#bug-reports&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Simply mention &lt;code&gt;@Cursor&lt;/code&gt; in the channel with any prompt — the bot joins automatically when the Slack app is installed (Step 1). No separate invite is needed.&lt;/li&gt;
&lt;li&gt;You can also type &lt;code&gt;@Cursor help&lt;/code&gt; to see available commands, or &lt;code&gt;@Cursor settings&lt;/code&gt; to configure channel-level defaults.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Configure Cursor Rules (the Agent's Playbook)
&lt;/h3&gt;

&lt;p&gt;This is the most important step. Without rules, the agent will make reasonable guesses about your codebase conventions. With rules, it follows your team's standards precisely.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;.cursor/rules/triage.mdc&lt;/code&gt; file in your repository root, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rules&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Slack-triggered&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;bug&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;triage&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;feature&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tasks"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*"&lt;/span&gt;
&lt;span class="na"&gt;alwaysApply&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;# Agent Behavior for Slack Tasks&lt;/span&gt;

&lt;span class="c1"&gt;## Bug Triage Protocol&lt;/span&gt;
&lt;span class="s"&gt;1. Read the full Slack thread for context, including any error logs or stack traces.&lt;/span&gt;
&lt;span class="s"&gt;2. Search the codebase to locate the relevant source files.&lt;/span&gt;
&lt;span class="s"&gt;3. Identify the root cause before writing any fix.&lt;/span&gt;
&lt;span class="s"&gt;4. Write the fix following existing code patterns in the repository.&lt;/span&gt;
&lt;span class="s"&gt;5. Use the project's standard error-handling approach (check for existing wrappers).&lt;/span&gt;

&lt;span class="c1"&gt;## Testing Requirements&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Run the full test suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="s"&gt;npm run test` (or the project's equivalent).&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If no tests exist for the changed code, write at least one unit test covering the fix.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not submit a PR if tests fail. Debug and fix until green.&lt;/span&gt;

&lt;span class="c1"&gt;## Git and PR Conventions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Create a new branch from `main` with the format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="err"&gt;`&lt;/span&gt;&lt;span class="s"&gt;fix/&amp;lt;short-description&amp;gt;`.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never push directly to `main` or `develop`.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;PR title format: `fix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;concise description of the change&amp;gt;`&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Include a summary of the root cause and fix in the PR description.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Reply to the original Slack thread with the PR link and a brief explanation.&lt;/span&gt;

&lt;span class="c1"&gt;## Out of Scope&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not modify CI/CD configuration files without explicit approval.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Do not upgrade dependencies unless the fix requires it.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If the issue is unclear, ask clarifying questions in the Slack thread before proceeding.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create additional rule files for different workflows — feature development, refactoring, documentation — each with its own conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Run Your First Agent Task
&lt;/h3&gt;

&lt;p&gt;With everything connected, you're ready to give the agent its first job. Post a message in your channel (or reply in an existing thread) and tag &lt;code&gt;@Cursor&lt;/code&gt; with a clear task description. The agent picks it up, executes the work on its remote VM, and reports back — all within the same Slack thread.&lt;/p&gt;

&lt;p&gt;Here's a real example. A user asks the agent to revise a repository's README to make it more professional and beginner-friendly. Within minutes, the agent replies with a structured breakdown of every change it made — reorganized navigation, plain-language introductions, typo fixes, new formatting — along with the commit diff (+338 / -190 lines):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9deq0twbkaj30move3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf9deq0twbkaj30move3.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The user asks the agent to make a commit and push the changes directly to the remote repository. Once the work is done, the agent confirms it has committed and pushed the changes directly to the remote repository, and provides a link to verify on GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnl6hdfa8mjlbv34n6sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnl6hdfa8mjlbv34n6sw.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Want to see how the agent reasoned through the task? Click the &lt;strong&gt;"Open in Web"&lt;/strong&gt; button in the Slack message to open the full agent session. This view shows the agent's step-by-step thought process — the file diff it analyzed, the to-do list it created for itself (commit, push), and the detailed revision plan it followed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tbt7ruxda7w3oavm6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tbt7ruxda7w3oavm6d.png" alt=" " width="800" height="946"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And to close the loop, here's the GitHub repository immediately after. Notice the &lt;code&gt;README.md&lt;/code&gt; row — updated "1 minute ago" by &lt;code&gt;cursoragent&lt;/code&gt; with the commit message matching exactly what the agent described in Slack:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F467evach5czpkqns5sek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F467evach5czpkqns5sek.png" alt=" " width="800" height="733"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No IDE opened. No branch created manually. No code written by hand. One Slack message in, a polished commit out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing Effective Cursor Rules: A Deeper Look
&lt;/h2&gt;

&lt;p&gt;The example above worked smoothly because the task was straightforward. But as you start assigning more complex work — multi-file refactors, feature additions, cross-cutting bug fixes — the quality of the agent's output depends heavily on how well you've defined your team's standards. That's where Cursor Rules go from "nice to have" to essential.&lt;/p&gt;

&lt;p&gt;Step 5 introduced the basic format. Here we'll look at patterns that make rules genuinely effective at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope rules by file type.&lt;/strong&gt; Use the &lt;code&gt;globs&lt;/code&gt; field to apply different rules to different parts of your codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;component&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;conventions"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/components/**/*.tsx"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use functional components with hooks, never class components.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;All components must have a corresponding .test.tsx file.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Use the project's design system tokens for colors and spacing.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;conventions"&lt;/span&gt;
&lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/api/**/*.ts"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Validate all request bodies with zod schemas.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Return consistent error response shapes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;string&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;number&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Log errors with the structured logger, not console.log.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Be specific about what the agent should &lt;em&gt;not&lt;/em&gt; do.&lt;/strong&gt; Guardrails prevent expensive mistakes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;## Boundaries&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never delete database migration files.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Never modify environment variable files (.env, .env.local).&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;If a change requires more than 5 files, stop and ask for confirmation in Slack.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point you have the full toolkit: the agent is connected, the environment is configured, and the rules are in place. But having the setup working and knowing where to &lt;em&gt;rely&lt;/em&gt; on it are two different things. Let me share what I've learned from using this in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Cloud Agents Shine — and Where They Don't (Yet)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Real Unlock: Work Anytime, Anywhere
&lt;/h3&gt;

&lt;p&gt;Here's what changed my daily workflow more than any single feature: I no longer need to be at my desk, or even awake, for code to get written.&lt;/p&gt;

&lt;p&gt;Think about that for a moment. It's 11 PM and a teammate in another timezone drops a bug report in Slack with a Datadog trace attached. Before Cloud Agents, that bug sat untouched until someone opened their laptop the next morning, cloned the repo, reproduced the issue, wrote the fix, ran the tests, and pushed a PR. That's a minimum 30-minute context-switch tax — and that's if the person was already familiar with the code.&lt;/p&gt;

&lt;p&gt;Now? I glance at the Slack notification on my phone, type &lt;code&gt;@Cursor investigate and fix this&lt;/code&gt;, and go back to sleep. By morning, there's a PR waiting for review with a clear explanation of the root cause. The agent read the error trace, found the offending line, wrote the fix, confirmed the tests pass, and opened the PR — all while I was unconscious.&lt;/p&gt;

&lt;p&gt;This isn't just about convenience. It fundamentally changes when and where software development can happen. You can triage bugs from an airport lounge with nothing but your phone. You can delegate a documentation overhaul while you're deep in a design review. You can assign test-writing tasks to the agent on Friday afternoon and come back Monday to a PR that covers the gaps you've been meaning to address for weeks. The agent doesn't get tired, doesn't lose context, and doesn't need to "get back into the zone" after lunch.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Agent Handles Well Today
&lt;/h3&gt;

&lt;p&gt;The sweet spot for Cloud Agents is any task where the goal is clearly defined and the scope is contained. Bug fixes are the most natural fit — especially when someone has already done the diagnostic work and there's an error trace, a stack dump, or a reproduction path sitting in the Slack thread. The agent can read that context, locate the relevant source files, and produce a targeted fix without anyone needing to spell out which file to open. It's remarkably good at this.&lt;/p&gt;

&lt;p&gt;Test coverage is another area where the agent earns its keep. Most teams know they should be writing more tests, but nobody &lt;em&gt;wants&lt;/em&gt; to write the fifteenth unit test for a utility function. Hand that to the agent. It reads the existing code, infers the expected behavior, and generates tests that follow whatever patterns your codebase already uses — &lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;jest&lt;/code&gt;, &lt;code&gt;go test&lt;/code&gt;, you name it. It's not glamorous work, but it's exactly the kind of high-value, low-creativity task that agents are built for.&lt;/p&gt;

&lt;p&gt;Small-to-medium feature additions work well too, as long as the spec is clear. "Add a CSV export button to the billing page that calls the existing &lt;code&gt;exportService&lt;/code&gt;" is a great agent task. "Make the app feel more modern" is not — that requires taste, iteration, and subjective judgment that the agent can't provide.&lt;/p&gt;

&lt;p&gt;The same applies to code refactoring. If you can describe the before and after state clearly — "rename all instances of &lt;code&gt;getUserData&lt;/code&gt; to &lt;code&gt;fetchUserProfile&lt;/code&gt; across the codebase" or "extract the validation logic from the controller into a dedicated middleware" — the agent will handle it methodically and consistently. And documentation updates? The agent writes clean, structured prose. Give it a README that's fallen out of date, and it'll cross-reference the actual codebase to produce documentation that matches reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where You Still Need the IDE
&lt;/h3&gt;

&lt;p&gt;That said, Cloud Agents aren't a replacement for sitting down with your code — at least not yet. There are categories of work where human judgment, rapid iteration, and architectural intuition still matter more than raw execution speed.&lt;/p&gt;

&lt;p&gt;Large architectural changes are the clearest example. If a task spans multiple services, touches database schemas, modifies CI/CD pipelines, and requires coordinating changes across a dozen files in a specific order, the agent can get lost. It doesn't have the mental model of your system's dependency graph that you've built up over months of working in the codebase. It might fix one file in a way that breaks three others, then chase its tail fixing those. For these tasks, you want a human architect in the driver's seat, possibly &lt;em&gt;using&lt;/em&gt; the agent for individual sub-tasks, but directing the overall strategy.&lt;/p&gt;

&lt;p&gt;Exploratory prototyping is another area where the agent falls short. When you're experimenting — trying out a new library, playing with different UI layouts, iterating on an API design — you need a tight feedback loop. You write a few lines, run it, see what happens, change direction, try something else. That back-and-forth is the creative engine of prototyping, and it doesn't translate well to "write a Slack message and wait for a PR." The latency alone kills the creative flow.&lt;/p&gt;

&lt;p&gt;Security-sensitive code deserves human eyes, full stop. The agent can write functionally correct authentication logic, but it won't catch the subtle timing-attack vulnerability or the OAuth misconfiguration that a security-conscious engineer would flag during a manual review. Use the agent to write the boilerplate, but review every line yourself before it touches production auth flows.&lt;/p&gt;

&lt;p&gt;And anything requiring visual design judgment — pixel-perfect UI work, animation tuning, responsive layout decisions — still demands a human with a browser open, resizing windows, and squinting at spacing. The agent can generate the JSX and CSS, but it can't tell you whether the result &lt;em&gt;feels&lt;/em&gt; right.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making the Most of Imperfect Results
&lt;/h3&gt;

&lt;p&gt;Here's a practical pattern that works well: &lt;strong&gt;the 90% handoff.&lt;/strong&gt; The agent doesn't need to produce a perfect PR every time. If it gets 90% of the way there — the logic is right but it missed an edge case, or the implementation is solid but the naming isn't quite what you'd choose — you can pull the agent's remote session directly into your local Cursor IDE and finish the last stretch yourself. You don't start over. You continue right where the agent left off, with all the files already modified and the context preserved.&lt;/p&gt;

&lt;p&gt;And when the agent goes in the wrong direction entirely? Course-correct in the same Slack thread. Reply with something like &lt;code&gt;@Cursor stop. The issue is in the middleware, not the controller. Look at src/middleware/auth.ts instead.&lt;/code&gt; The agent re-reads the full thread, incorporates your feedback, and adjusts its approach. Think of it less like a tool that either works or doesn't, and more like a junior developer who's fast and tireless but occasionally needs steering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Going Further: MCP Integrations for Closed-Loop Automation
&lt;/h2&gt;

&lt;p&gt;So far, every workflow in this post has followed the same pattern: a human writes a Slack message, the agent does the work, and a PR appears on GitHub. That's already powerful — but it still requires someone to initiate each task. What if the agent could respond to events across your entire toolchain without waiting for a Slack prompt?&lt;/p&gt;

&lt;p&gt;That's where the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; comes in. MCP lets the agent interact with external tools beyond Slack and GitHub. By adding MCP servers, you can build a fully closed-loop system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jira / Linear:&lt;/strong&gt; The agent automatically creates a ticket, links it to the PR, and transitions the issue status.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Datadog / Sentry:&lt;/strong&gt; The agent queries your monitoring tools directly to pull error traces without anyone needing to paste them into Slack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confluence / Notion:&lt;/strong&gt; The agent updates your team's documentation when it changes an API contract.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns the workflow from a &lt;em&gt;Slack → PR&lt;/em&gt; pipeline into a &lt;em&gt;Slack → Ticket → PR → Docs → Status Update&lt;/em&gt; pipeline — with zero manual handoff.&lt;/p&gt;

&lt;p&gt;MCP integrations are where Cloud Agents start to feel less like a developer tool and more like infrastructure. And that shift — from tool to infrastructure — is exactly what's happening across the industry.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Road Ahead: AGaaS and Where the Market Is Going
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From Novelty to Infrastructure
&lt;/h3&gt;

&lt;p&gt;What Cursor has shipped with Cloud Agents is impressive, but it's also clearly early. If you zoom out from the specifics of this one product, a much larger shift is taking shape: &lt;strong&gt;Agent-as-a-Service (AGaaS)&lt;/strong&gt; is becoming a real infrastructure category, not just a buzzword.&lt;/p&gt;

&lt;p&gt;The core idea is straightforward — instead of every developer installing AI tooling on their local machine and managing prompts, context windows, and model versions themselves, you subscribe to a managed agent that lives in the cloud, integrates with your existing tools, and operates autonomously on your behalf. Cursor is one implementation, but the pattern is bigger than any single vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Customers Actually Need (and What's Missing)
&lt;/h3&gt;

&lt;p&gt;If you've followed along with this post and tried the setup yourself, you've probably already noticed a few gaps. These aren't criticisms — they're the natural rough edges of a category that's still being defined. But they point directly at where the market is heading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-repository orchestration.&lt;/strong&gt; Today, each Cloud Agent task targets a single repository. But real-world features often span a frontend repo, a backend API, a shared library, and an infrastructure-as-code repo. The next generation of AGaaS platforms will need to coordinate changes across multiple repos atomically — opening linked PRs that reference each other and can be merged together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent agent memory.&lt;/strong&gt; Right now, each task starts fresh. The agent doesn't remember that it fixed a similar bug last week, or that your team prefers a particular error-handling pattern, or that the last three PRs it opened for this repo all needed the same test fixture adjustment. Future agents will build a persistent understanding of your codebase, your team's preferences, and your project's history — getting better at their job over time, just like a human teammate does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Richer feedback loops beyond Slack.&lt;/strong&gt; Slack is a natural starting point because it's where engineering teams already communicate. But imagine triggering agent tasks from a Jira ticket transition, a Sentry alert threshold, a failing CI check, or a monitoring dashboard anomaly. The agent becomes a first-responder that patches issues before a human even notices them. Some of this is possible today through MCP integrations, but it's still manual plumbing — it should be turnkey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customizable execution environments at scale.&lt;/strong&gt; The environment setup flow shown in Step 3 is a solid start, but enterprise teams need more. Think GPU-enabled VMs for ML codebases, pre-configured database fixtures for integration testing, VPN access to internal services, and compliance-scoped environments that restrict which external packages the agent can install. As AGaaS matures, the execution environment will need to match the complexity of real enterprise infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost transparency and resource governance.&lt;/strong&gt; When an agent spins up a VM, runs your test suite, and interacts with a paid AI model for 15 minutes, who pays for what? Teams need clear visibility into per-task cost breakdowns — compute, model tokens, API calls — and the ability to set budgets, quotas, and approval gates for expensive operations. This is table stakes for enterprise adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Market Convergence
&lt;/h3&gt;

&lt;p&gt;It's worth noting that Cursor isn't the only player moving in this direction. GitHub Copilot has introduced its own agent mode. Amazon Q Developer (formerly CodeWhisperer) has evolved toward autonomous capabilities. Smaller players like Devin, Cosine, and Factory are building agent-first platforms from scratch. The competitive pressure is accelerating the category.&lt;/p&gt;

&lt;p&gt;What's emerging is a spectrum: at one end, lightweight copilot-style suggestions embedded in your editor; at the other end, fully autonomous agents that operate headlessly across your entire development workflow. Most teams will use both, for different tasks, at different times. The interesting question isn't &lt;em&gt;which&lt;/em&gt; tool wins — it's how the boundaries between human-driven and agent-driven work shift over the next two to three years.&lt;/p&gt;

&lt;p&gt;For engineering leaders, the strategic play is clear: start experimenting now. The teams that build fluency with agent-assisted workflows today — who learn which tasks to delegate, how to write effective agent rules, and how to review agent-produced code efficiently — will have a significant velocity advantage as these tools mature.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/cloud-agent" rel="noopener noreferrer"&gt;Cursor Cloud Agents Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/integrations/slack" rel="noopener noreferrer"&gt;Cursor Docs: Slack Integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/rules" rel="noopener noreferrer"&gt;Cursor Rules Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cursor.com/docs/context/mcp" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>automation</category>
      <category>openclaw</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>From Prompts to Real Files: A Developer's Guide to AI File Generation</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 16 Mar 2026 22:11:50 +0000</pubDate>
      <link>https://dev.to/imaginex/your-llm-can-write-files-now-4c6e</link>
      <guid>https://dev.to/imaginex/your-llm-can-write-files-now-4c6e</guid>
      <description>&lt;p&gt;Ask ChatGPT to "create a sales report PDF with a revenue chart." A year ago, it would paste some markdown and wish you luck. Today, it spins up a sandboxed Python environment, runs &lt;code&gt;reportlab&lt;/code&gt; and &lt;code&gt;matplotlib&lt;/code&gt;, and hands you a real, downloadable PDF file.&lt;/p&gt;

&lt;p&gt;This is the shift from &lt;strong&gt;text generation&lt;/strong&gt; to &lt;strong&gt;artifact generation&lt;/strong&gt; -- and every major LLM vendor now supports it through their API. Claude, OpenAI, and Gemini each give developers a way to prompt an LLM and get back actual files: PDFs, spreadsheets, charts, slide decks, whatever you can create with Python.&lt;/p&gt;

&lt;p&gt;This post walks through the universal pattern behind file generation, then shows you exactly how to do it with each vendor -- working code included.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Universal Pattern
&lt;/h2&gt;

&lt;p&gt;Despite different APIs, all three vendors follow the same three-step architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyzbmd5nj5hz3dx9j4hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyzbmd5nj5hz3dx9j4hm.png" alt=" " width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every vendor-specific implementation is a variation on this flow. The details change, but three concepts repeat everywhere:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool declaration&lt;/strong&gt; -- you opt in to code execution by including a specific tool in your API request. It's never on by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxed execution&lt;/strong&gt; -- the LLM's code runs in an isolated container with no internet access. Common libraries (pandas, matplotlib, reportlab) come pre-installed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File retrieval&lt;/strong&gt; -- each vendor has a different mechanism to get the bytes out. Some give you a file ID to download; others return bytes inline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you internalize this pattern, learning any vendor's API is just a matter of mapping it to these three steps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claude: Code Execution + Files API
&lt;/h2&gt;

&lt;p&gt;Claude's file generation is the most full-featured option for document creation. It provides a persistent container with full bash access, a rich set of pre-installed document libraries, and a clean Files API for uploads and downloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a PDF from a Prompt
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_execution_20250825&lt;/code&gt; tool, send your prompt, then extract file IDs from the response and download them through the Files API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Request with code execution enabled
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create a one-page PDF sales report with a revenue chart for Q1 2026.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract file IDs from the response
&lt;/span&gt;&lt;span class="n"&gt;file_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;file_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Download each generated file
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;file_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response content blocks have a nested structure: you're looking for &lt;code&gt;bash_code_execution_tool_result&lt;/code&gt; blocks, which contain &lt;code&gt;bash_code_execution_result&lt;/code&gt; objects, which contain items with &lt;code&gt;file_id&lt;/code&gt; attributes. The &lt;code&gt;files.download()&lt;/code&gt; call gives you the raw bytes; &lt;code&gt;retrieve_metadata()&lt;/code&gt; gives you the original filename.&lt;/p&gt;

&lt;p&gt;Why &lt;code&gt;bash_code_execution&lt;/code&gt;? When you include the &lt;code&gt;code_execution_20250825&lt;/code&gt; tool, Claude actually gets two sub-tools: &lt;code&gt;bash_code_execution&lt;/code&gt; (run shell commands) and &lt;code&gt;text_editor_code_execution&lt;/code&gt; (create and edit files). To generate a file, Claude typically writes a Python script with the text editor sub-tool, then runs it via bash. The result block is named after whichever sub-tool produced the output -- and since it's the bash execution that creates the final file, that's the block type you parse. This is also why Claude has full bash access unlike the other vendors: it's not running Python in a restricted interpreter, it's executing real shell commands. The &lt;code&gt;_20250825&lt;/code&gt; tool version introduced this bash/text-editor split, replacing the earlier &lt;code&gt;_20250522&lt;/code&gt; version that was Python-only.&lt;/p&gt;

&lt;h3&gt;
  
  
  Uploading a CSV, Getting Back a Chart + PDF
&lt;/h3&gt;

&lt;p&gt;To process your own data, upload via the Files API first, then attach the file to your prompt alongside the code execution tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload your input file
&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Send the file + prompt with code execution
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;betas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;files-api-2025-04-14&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this sales CSV. Create a bar chart of revenue by region &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and save it as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Also generate a one-page PDF &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary report of the key findings.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_upload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download all generated files
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bash_code_execution_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;metadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A single prompt can produce multiple files. In this case, you'll get both the PNG chart and the PDF report. Always iterate the full response -- never assume a single file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Container Reuse: The Key to Iteration Workflows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude containers persist for &lt;strong&gt;30 days&lt;/strong&gt;. When your first request creates a container, the response includes a &lt;code&gt;container.id&lt;/code&gt;. Pass it to subsequent calls and Claude picks up right where it left off -- all files from the previous request are still on disk.&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# First call creates the container
&lt;/span&gt;&lt;span class="n"&gt;response1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a sales report PDF.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;container_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;

&lt;span class="c1"&gt;# Subsequent calls reuse the same container
&lt;/span&gt;&lt;span class="n"&gt;response2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Update the chart on page 2 to use a pie chart instead.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution_20250825&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This enables "conversational file editing" -- users can iterate on documents without re-uploading data or starting from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-installed Libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's sandbox comes with the document generation essentials: &lt;code&gt;reportlab&lt;/code&gt; (PDFs), &lt;code&gt;python-docx&lt;/code&gt; (Word), &lt;code&gt;python-pptx&lt;/code&gt; (PowerPoint), &lt;code&gt;openpyxl&lt;/code&gt; (Excel), &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;matplotlib&lt;/code&gt;, &lt;code&gt;pillow&lt;/code&gt;, &lt;code&gt;pypdf&lt;/code&gt;, &lt;code&gt;pdfplumber&lt;/code&gt;, &lt;code&gt;seaborn&lt;/code&gt;, &lt;code&gt;scipy&lt;/code&gt;, and &lt;code&gt;scikit-learn&lt;/code&gt;. Since Claude has full bash access, you can also &lt;code&gt;pip install&lt;/code&gt; anything else you need during the session.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  OpenAI: Responses API + Code Interpreter
&lt;/h2&gt;

&lt;p&gt;OpenAI's Responses API (the successor to the deprecated Assistants API) uses the &lt;strong&gt;Code Interpreter&lt;/strong&gt; tool for file generation. The pattern is similar to Claude, but the response structure and file retrieval mechanism differ.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a CSV with Code Interpreter
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_interpreter&lt;/code&gt; tool, then parse &lt;code&gt;container_file_citation&lt;/code&gt; annotations from the response to find generated files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Request with code interpreter enabled
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a CSV file named &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;q1_report.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; with 10 rows of financial data.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract file references from annotations
# The response structure nests deep: output → message → content → output_text → annotations
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_file_citation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="c1"&gt;# Step 3: Download from the container endpoint
&lt;/span&gt;                        &lt;span class="n"&gt;file_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The annotation traversal is the trickiest part. Don't try to shortcut it with &lt;code&gt;response.output_text&lt;/code&gt; -- that gives you a plain string with citation markers, not the actual file references.&lt;/p&gt;

&lt;h3&gt;
  
  
  Uploading a File, Transforming It
&lt;/h3&gt;

&lt;p&gt;Upload via the standard Files API, then pass the file ID in the container config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Upload the file
&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;purpose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pass it to code interpreter via container config
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;uploaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyze this sales CSV. Create a bar chart of revenue by region and save it as a PNG.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download generated files from annotations
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;annotations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;container_file_citation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;file_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;container_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;container_id&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Downloaded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;annotation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also request higher memory tiers -- &lt;code&gt;1g&lt;/code&gt; (default), &lt;code&gt;4g&lt;/code&gt;, &lt;code&gt;16g&lt;/code&gt;, or &lt;code&gt;64g&lt;/code&gt; -- by setting &lt;code&gt;"memory_limit"&lt;/code&gt; in the container config. Useful when processing large datasets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;OpenAI Gotchas&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;cfile_&lt;/code&gt; 404 trap.&lt;/strong&gt; Generated files have IDs prefixed with &lt;code&gt;cfile_&lt;/code&gt;. If you try to download them using the standard &lt;code&gt;client.files.content()&lt;/code&gt; endpoint, you'll get a 404. You &lt;em&gt;must&lt;/em&gt; use &lt;code&gt;client.containers.files.content.retrieve()&lt;/code&gt; instead. This has tripped up every developer at least once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;20-minute container expiry.&lt;/strong&gt; OpenAI containers are ephemeral -- they expire after 20 minutes of inactivity. Download your files immediately after generation. There is no 30-day persistence like Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing annotations fallback.&lt;/strong&gt; There's a known edge case where &lt;code&gt;container_file_citation&lt;/code&gt; annotations don't appear in the response. When this happens, check &lt;code&gt;response.output&lt;/code&gt; for items of type &lt;code&gt;code_interpreter_call&lt;/code&gt; and inspect their outputs for file references:&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;file_refs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_interpreter_call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;output_item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="c1"&gt;# Download using output_item.file_id and output_item.container_id
&lt;/span&gt;                    &lt;span class="k"&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Gemini: Inline Results + Structured Output
&lt;/h2&gt;

&lt;p&gt;Gemini takes a fundamentally different approach. It doesn't return downloadable file artifacts with file IDs. Instead, code execution results come back &lt;strong&gt;inline&lt;/strong&gt; -- matplotlib charts as raw image bytes, everything else as text or JSON.&lt;/p&gt;

&lt;p&gt;This isn't a technical limitation -- Google has the infrastructure to build containers and file artifact systems. The gap is strategic. Google's file generation story lives in &lt;strong&gt;Google Workspace&lt;/strong&gt;, not in the developer API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Docs&lt;/strong&gt; generates full first drafts from prompts, matching writing styles and pulling data from Gmail, Drive, and the web.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Sheets&lt;/strong&gt; builds entire spreadsheets from natural language and auto-populates cells with live data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini in Slides&lt;/strong&gt; generates themed slides, with full presentation generation from a single prompt on the roadmap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes business sense for Google. Anthropic and OpenAI are API-first companies -- their revenue comes from developers using their APIs, so building sandboxes and file download endpoints directly serves their customers. Google's revenue comes from Workspace subscriptions. When Gemini generates a spreadsheet in Workspace, it creates a Google Sheet (not an &lt;code&gt;.xlsx&lt;/code&gt;), keeping users in the Google ecosystem. An API that produces vendor-neutral files would undermine that.&lt;/p&gt;

&lt;p&gt;The practical implication: Gemini's API-level file generation gap is unlikely to close anytime soon. The structured output and inline image patterns below are the right long-term approaches, not temporary workarounds.&lt;/p&gt;

&lt;p&gt;For developers, this means Gemini is best suited for quick charts and data transforms, while complex document creation belongs with Claude or OpenAI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a Chart (Inline Image)
&lt;/h3&gt;

&lt;p&gt;Enable the &lt;code&gt;code_execution&lt;/code&gt; tool, then extract image bytes directly from the response parts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_execution&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToolCodeExecution&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a bar chart of quarterly revenue: Q1=$2.1M, Q2=$2.8M, Q3=$3.2M, Q4=$3.9M.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Gemini returns results inline -- no separate download step
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code ran:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable_code&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_execution_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;code_execution_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_image&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_image&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chart saved as revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No file IDs, no download endpoints. The image bytes are right there in the response. For text/data output, it shows up in &lt;code&gt;code_execution_result.output&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structured Output for CSV Generation
&lt;/h3&gt;

&lt;p&gt;Gemini's strongest file generation pattern is actually indirect: get structured JSON data back, then format it locally with whatever library you prefer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Ask for structured JSON output
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Return a JSON array of 10 tech companies with fields: name, ticker, market_cap, sector.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert to CSV locally -- you control the formatting
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tech_companies.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows to tech_companies.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This "structured output" approach gives you 100% control over formatting and is the most reliable way to produce files from Gemini. Let the model do what it's good at (data generation), and handle the file formatting yourself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;30-Second Execution Timeout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini's code execution sandbox has a hard 30-second timeout. This makes it ideal for quick chart generation and data transforms, but rules it out for heavy document creation tasks like multi-page PDF reports or complex PowerPoint decks. For those, use Claude or OpenAI.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Which API for What?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Claude&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sandbox Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reusable container (30-day expiry)&lt;/td&gt;
&lt;td&gt;Ephemeral container (20-min idle timeout)&lt;/td&gt;
&lt;td&gt;Stateless sandbox (30s timeout)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 GiB disk, 5 GiB RAM, 1 CPU&lt;/td&gt;
&lt;td&gt;Up to 64 GB RAM (tiered)&lt;/td&gt;
&lt;td&gt;Token-limited (inline output)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shell Access&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full bash&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;td&gt;Python only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;File Download&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Files API (&lt;code&gt;files.download()&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Container endpoint (&lt;code&gt;containers.files.content.retrieve()&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Inline in response (no download step)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex documents (PDF, DOCX, PPTX)&lt;/td&gt;
&lt;td&gt;Heavy data processing + file gen&lt;/td&gt;
&lt;td&gt;Quick charts and data transforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;pip install&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (bash access)&lt;/td&gt;
&lt;td&gt;No (isolated sandbox)&lt;/td&gt;
&lt;td&gt;No (isolated sandbox)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The short version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex documents&lt;/strong&gt; (PDF reports, slide decks, Word docs with formatting): &lt;strong&gt;Claude&lt;/strong&gt;. The pre-installed document libraries and 30-day container persistence make it the best fit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large dataset processing&lt;/strong&gt; (crunching big CSVs, Excel transformations): &lt;strong&gt;OpenAI&lt;/strong&gt;. The ability to request up to 64 GB of RAM is unmatched.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quick visualizations&lt;/strong&gt; (charts, graphs, simple data summaries): &lt;strong&gt;Gemini&lt;/strong&gt;. Inline image return means fewer API calls and faster turnaround.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maximum formatting control&lt;/strong&gt;: Any model's &lt;strong&gt;Structured Output&lt;/strong&gt; mode. Get JSON data back, render locally with your own libraries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Self-Hosted Alternative: Run Your Own Sandbox
&lt;/h2&gt;

&lt;p&gt;The three vendor APIs above all run code in &lt;em&gt;their&lt;/em&gt; infrastructure. You send a prompt, they spin up a container, and they hand you back the file. This is convenient, but it means your data leaves your network, you're bound by each vendor's sandbox limits (30-second timeouts, no internet, fixed library sets), and you pay per-execution fees.&lt;/p&gt;

&lt;p&gt;There's a fourth option: &lt;strong&gt;run the sandbox yourself&lt;/strong&gt;. In this pattern, you call any LLM API to generate code (without enabling the vendor's code execution tool), then execute that code locally in an isolated environment on your own machines. You get the same prompt-to-file workflow, but you control the execution environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Self-Host?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data residency.&lt;/strong&gt; In regulated industries (healthcare, finance, government), sending code and data to a third-party sandbox may violate compliance requirements. A local sandbox keeps everything on your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vendor sandbox limits.&lt;/strong&gt; You choose the timeout, the RAM, the disk, the installed libraries. Need 10 minutes of execution time? A GPU? Network access to internal services? Your sandbox, your rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale.&lt;/strong&gt; Vendor sandbox pricing is per-session or per-hour. At high volume, running your own execution infrastructure can be significantly cheaper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model flexibility.&lt;/strong&gt; Since you're decoupling "generate the code" from "run the code," you can use &lt;em&gt;any&lt;/em&gt; LLM -- including open-source models, fine-tuned models, or your own -- to produce the Python script. The sandbox doesn't care where the code came from.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tools for Building It
&lt;/h3&gt;

&lt;p&gt;Two open-source projects have emerged as the leading options for sandboxed code execution:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/e2b-dev/e2b" rel="noopener noreferrer"&gt;E2B&lt;/a&gt;&lt;/strong&gt; uses Firecracker microVMs (the same technology behind AWS Lambda) to isolate each execution in its own lightweight VM with a dedicated kernel -- stronger isolation than Docker containers. E2B offers a managed cloud service, but you can also self-host on your own GCP or Linux infrastructure using their Terraform-based deployment. The Python and JavaScript SDKs make it straightforward to spin up a sandbox, run code, and retrieve files programmatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://pypi.org/project/exec-sandbox/" rel="noopener noreferrer"&gt;exec-sandbox&lt;/a&gt;&lt;/strong&gt; takes the fully-local approach. It runs untrusted code in ephemeral QEMU microVMs with hardware acceleration (KVM on Linux, HVF on macOS). No cloud dependency -- code never leaves your machine. Warm-pool latency is 1-2ms, and it supports Python, JavaScript, and shell execution. It's designed for air-gapped environments where sending code to any external service is a non-starter.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture Shift
&lt;/h3&gt;

&lt;p&gt;The key difference is that self-hosting &lt;strong&gt;decouples code generation from code execution&lt;/strong&gt;. With vendor APIs, the LLM both writes and runs the code in a single API call. With a self-hosted sandbox, you split these into two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call the LLM API for text/code generation (no code execution tool needed).&lt;/li&gt;
&lt;li&gt;Extract the generated Python script from the response.&lt;/li&gt;
&lt;li&gt;Execute it in your local sandbox (E2B, exec-sandbox, or even a locked-down Docker container).&lt;/li&gt;
&lt;li&gt;Retrieve the output files from the sandbox filesystem.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a concrete example using E2B as the sandbox and Anthropic as the LLM. Notice there's no code execution tool in the API call -- we just ask Claude to write a script, then run it ourselves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Anthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;e2b_code_interpreter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Ask the LLM to generate a Python script (no code execution tool)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python script that uses matplotlib to create a bar chart &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;of quarterly revenue (Q1=$2.1M, Q2=$2.8M, Q3=$3.2M, Q4=$3.9M) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and saves it as &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Return only the script, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no explanation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract the Python code from the response
&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```

python\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Execute it in an E2B sandbox
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;execution&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 4: Download the generated file from the sandbox
&lt;/span&gt;        &lt;span class="n"&gt;file_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sbx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/home/user/revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: revenue_chart.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can swap &lt;code&gt;Anthropic&lt;/code&gt; for &lt;code&gt;OpenAI&lt;/code&gt;, &lt;code&gt;genai.Client&lt;/code&gt;, or any other LLM client -- the sandbox doesn't care where the code came from. You can also upload input files to the sandbox before execution using &lt;code&gt;sbx.files.write()&lt;/code&gt;, mirroring the upload-then-process pattern from the vendor APIs.&lt;/p&gt;

&lt;p&gt;E2B's default &lt;code&gt;code-interpreter&lt;/code&gt; template comes with matplotlib, pandas, numpy, scikit-learn, pillow, openpyxl, python-docx, seaborn, and dozens of other common libraries pre-installed -- similar to the vendor sandboxes. If you need additional packages, you can either install them at runtime with &lt;code&gt;sbx.commands.run("pip install &amp;lt;package&amp;gt;")&lt;/code&gt;, or build a custom template with your dependencies baked in so every sandbox starts ready to go.&lt;/p&gt;

&lt;p&gt;This is more work to build, but it gives you full control over execution, security, and cost. It also means you can use Gemini or any other model that &lt;em&gt;doesn't&lt;/em&gt; offer file artifacts -- you just need the model to write good Python, and your sandbox handles the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  Production Tips
&lt;/h2&gt;

&lt;p&gt;If you're building file generation into a real product, a few hard-won lessons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Sanitize filenames.&lt;/strong&gt; The LLM chooses the filename based on the prompt. A creative user (or an adversarial one) can craft prompts that produce filenames with path traversal characters. Always strip or validate filenames before writing to disk. &lt;code&gt;os.path.basename()&lt;/code&gt; is your friend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Handle multi-file responses.&lt;/strong&gt; A single prompt like "make a PDF report and an Excel spreadsheet of the raw data" can produce two or more files. Always iterate the full response -- never assume exactly one file comes back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Persist container IDs for edit workflows.&lt;/strong&gt; Claude's 30-day containers enable a powerful pattern: users can say "update the chart on page 2" in a follow-up message, and the LLM picks up the original file from the persistent container. Store the &lt;code&gt;container_id&lt;/code&gt; alongside the conversation thread in your database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set timeouts generously.&lt;/strong&gt; Code execution is significantly slower than text generation. Simple files might take 30-60 seconds; complex multi-file generation (especially PPTX with embedded charts) can take 5-15 minutes. Don't use your standard API timeout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. All sandboxes are offline.&lt;/strong&gt; None of the three vendors allow network access from within the sandbox. All data must be uploaded or included in the prompt. You can't &lt;code&gt;pip install&lt;/code&gt; on OpenAI or Gemini (Claude is the exception -- it has bash access). You can't fetch URLs. Plan accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;File generation via LLM APIs follows a universal pattern across all three major vendors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; excels at complex document creation with its 30-day persistent containers, full bash access, and pre-installed document libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; offers the most compute headroom with up to 64 GB of RAM, making it ideal for heavy data processing tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; is the fastest path to charts and visualizations, returning inline image bytes with no separate download step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself:&lt;/strong&gt; Build a CLI tool that takes a prompt and a desired output format, routes to the best vendor based on file type (PDFs to Claude, big data to OpenAI, charts to Gemini), and saves the result locally. You'll touch all three APIs and internalize the patterns in a single afternoon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Official Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool" rel="noopener noreferrer"&gt;Anthropic Code Execution Tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://platform.claude.com/docs/en/build-with-claude/files" rel="noopener noreferrer"&gt;Anthropic Files API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs/guides/tools-code-interpreter/" rel="noopener noreferrer"&gt;OpenAI Code Interpreter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.openai.com/api/docs" rel="noopener noreferrer"&gt;OpenAI API Reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/code-execution" rel="noopener noreferrer"&gt;Gemini Code Execution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs" rel="noopener noreferrer"&gt;Gemini API Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Skills Required for Building AI Agents in 2026</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:44:03 +0000</pubDate>
      <link>https://dev.to/imaginex/skills-required-for-building-ai-agents-in-2026-2ed</link>
      <guid>https://dev.to/imaginex/skills-required-for-building-ai-agents-in-2026-2ed</guid>
      <description>&lt;h2&gt;
  
  
  Why Agent Development Is Harder Than You Think
&lt;/h2&gt;

&lt;p&gt;An Agent is conceptually simple: take the one-question-one-answer model of an LLM and add a loop. The model reasons about what to do next, calls external tools, feeds results back into itself, and repeats until the task is complete. A &lt;code&gt;while&lt;/code&gt; loop plus tool-calling — that's the skeleton.&lt;/p&gt;

&lt;p&gt;But between "working demo" and "production product" lies an engineering chasm. OAuth flows, tool design, error cascading across multi-step tasks, runaway costs, context window management, evaluation, multi-Agent coordination, model capability bottlenecks, and framework trade-offs — these nine challenges are where Agent development &lt;em&gt;actually&lt;/em&gt; gets hard. API calls account for roughly 5% of the total effort; the other 95% is everything else.&lt;/p&gt;

&lt;p&gt;For a detailed walkthrough of each challenge, see the companion piece: &lt;a href="//agent_dev_issues.md"&gt;Is AI Agent Development Just About Calling APIs?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question this post addresses is different: &lt;strong&gt;given that Agent development is hard, what skills do you actually need to succeed at it in 2026?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skill Shift: From Writing Code to Shaping Problems
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Inspired by a Story: How an Intern Outperformed a Senior Engineer?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/shubhamsaboo/" rel="noopener noreferrer"&gt;Shubham Saboo&lt;/a&gt; — Senior AI Product Manager at Google Cloud, founder of Unwind AI, and co-author of Google's &lt;em&gt;Introduction to Agents&lt;/em&gt; whitepaper — recently shared an experience from a startup where he serves as an advisor. Something happened that overturned everyone's assumptions.&lt;/p&gt;

&lt;p&gt;A senior engineer received a task and followed the traditional workflow: understand requirements, design architecture, write code, debug, and test. Three days later, he delivered a technically flawless solution -- clean code, clear logic, fully compliant with engineering standards.&lt;/p&gt;

&lt;p&gt;An intern completed the same task in a single afternoon.&lt;/p&gt;

&lt;p&gt;It wasn't that the intern had superior technical skills. Quite the opposite -- his coding experience was far less than the senior engineer's. But he did something fundamentally different: &lt;strong&gt;he defined the problem clearly enough, then let Claude Code do the rest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This scenario reveals a harsh reality: when AI can complete implementation-level work quickly and accurately, the bottleneck shifts entirely upstream. The value is no longer &lt;em&gt;"Can you write this code?"&lt;/em&gt; but rather &lt;em&gt;"Can you decompose the problem to a level where AI almost never makes mistakes?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An even more striking example comes from inside Anthropic. They had Opus 4.6 build a C compiler using a team of Agents, then essentially stepped back. Two weeks later, it could run on the Linux kernel -- 100,000 lines of working Rust code, without a single line written by a human.&lt;/p&gt;

&lt;p&gt;The researcher leading this project, &lt;a href="https://nicholas.carlini.com/" rel="noopener noreferrer"&gt;Nicholas Carlini&lt;/a&gt; — a research scientist at Anthropic known for his work on adversarial machine learning — did only one thing: &lt;strong&gt;problem decomposition.&lt;/strong&gt; He broke down the vague goal of "build a compiler" into 16 precisely defined subtasks, each with clear inputs, outputs, and success criteria. Then 16 Agents, each handling its own piece, completed the entire compiler.&lt;/p&gt;

&lt;p&gt;The real leverage isn't in writing code -- it's in breaking problems down to the point where AI almost never gets it wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  Four Skills That Are No Longer Differentiating
&lt;/h3&gt;

&lt;p&gt;Shubham argues that four capabilities that once commanded high salaries for developers are rapidly losing their power as differentiators — not because they're useless, but because AI has made them table stakes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Writing code from scratch.&lt;/strong&gt; Agents write faster and produce fewer bugs. The ability to hand-write code still matters as foundational knowledge, but it's no longer what sets great developers apart.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate code and project scaffolding.&lt;/strong&gt; A single prompt generates them instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memorizing syntax and APIs.&lt;/strong&gt; Extended context windows have already solved this problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Translating specifications into code.&lt;/strong&gt; Now, the specification itself &lt;em&gt;is&lt;/em&gt; the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These skills were once valuable because implementation itself was hard. They required years of training and justified six-figure salaries. But &lt;strong&gt;implementation is no longer the bottleneck&lt;/strong&gt; — it's becoming the easy part.&lt;/p&gt;

&lt;p&gt;Yet the entire industry is still optimizing around the old bottleneck. Most companies' job descriptions still emphasize "proficient in Java," "familiar with Spring framework," "5+ years of development experience." These criteria are losing relevance at a visible pace.&lt;/p&gt;

&lt;p&gt;Value has migrated to five new skills.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Five Skills That Truly Matter in 2026
&lt;/h3&gt;

&lt;p&gt;I am tryiing to answer this question. This isn't theoretical speculation -- it's what I has witnessed firsthand when developing AI solutions in the past 2 years, in the open-source community, and through countless experiences building Agents.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Problem Shaping
&lt;/h4&gt;

&lt;p&gt;Turning vague goals into executable tasks -- this skill separates people who "play around with AI" from those who actually build products with it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Build me a dashboard"&lt;/em&gt; is not a task; it's a wish. Problem shaping breaks it into twelve specific, testable subtasks: What data does this dashboard display? What decisions does it support? What must the user understand within the first three seconds? Each sub-problem has clear inputs, clear outputs, and clear success criteria.&lt;/p&gt;

&lt;p&gt;When you decompose a vague goal into precise sub-problems, the Agent's execution quality transforms entirely. It no longer needs to guess your intent -- it just follows clear instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to practice problem shaping:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the desired output and work backwards — what does "done" look like?&lt;/li&gt;
&lt;li&gt;For each subtask, define three things: the input it receives, the output it produces, and how you'll know it succeeded.&lt;/li&gt;
&lt;li&gt;If a subtask is still ambiguous enough that two people would interpret it differently, break it down further.&lt;/li&gt;
&lt;li&gt;Verify your decomposition by asking: could a competent person with zero context about this project execute each subtask from the description alone?&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  2. Context Design
&lt;/h4&gt;

&lt;p&gt;Agent output quality is directly proportional to the quality of context you provide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Poor context:&lt;/strong&gt; &lt;em&gt;"Build me a customer support agent."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good context:&lt;/strong&gt; &lt;em&gt;"The target users are SaaS customers considering canceling their subscriptions who have already tried the help documentation but failed. The tone should be empathetic yet efficient -- avoid excessive apologies and robotic responses. Here are 3 real cases that received five-star ratings and 2 cases that received complaints. Edge cases requiring human escalation include: billing disputes over $500, account security issues, and legal compliance matters. The success metric is resolving the issue within 4 messages without escalation."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The difference isn't in prompt engineering tricks. It's in &lt;strong&gt;information density, boundary conditions, success criteria, and understanding of real-world scenarios.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A context design checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who is the target user, and what is their state of mind?&lt;/li&gt;
&lt;li&gt;What does the desired tone sound like? Provide 2–3 real examples, not adjectives.&lt;/li&gt;
&lt;li&gt;What are the edge cases that require special handling or human escalation?&lt;/li&gt;
&lt;li&gt;What does success look like, in measurable terms?&lt;/li&gt;
&lt;li&gt;What are the most common failure modes, and how should the Agent handle them?&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Aesthetic Judgment
&lt;/h4&gt;

&lt;p&gt;When ten options are in front of you, knowing that nine of them won't work.&lt;/p&gt;

&lt;p&gt;Shubham recently had Antigravity build a bargaining simulator for his repository: two Agents negotiating a used car deal, each with a distinct personality, live-streaming the entire process. The first version ran perfectly -- clean code, no errors, both sides going back and forth. Technically complete.&lt;/p&gt;

&lt;p&gt;He rejected it in thirty seconds.&lt;/p&gt;

&lt;p&gt;The interface was just a plain chat window. The negotiation process read like a log file -- no personality tension, no emotional highs and lows, no dramatic moments of &lt;em&gt;"Shark Steve holding the line against Cool-Hand Casey pretending to walk away."&lt;/em&gt; It worked as software; it failed as an experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Agent can build anything you describe, but it cannot judge what is worth describing.&lt;/strong&gt; Agents optimize for &lt;em&gt;correctness&lt;/em&gt;; humans optimize for &lt;em&gt;"Would anyone actually want to use this?"&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Agent Orchestration
&lt;/h4&gt;

&lt;p&gt;Knowing when to use one Agent, when to use multiple, when to run them in parallel, when to run them sequentially, when to add guardrails, and when to let go.&lt;/p&gt;

&lt;p&gt;Three core patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sequential pipeline:&lt;/strong&gt; Agent A completes its task and passes the output to Agent B. Best for scenarios with dependencies between steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coordinator + specialist team:&lt;/strong&gt; A lead Agent dispatches tasks and integrates results. Best for complex tasks requiring quality control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel execution + merge:&lt;/strong&gt; Multiple Agents handle independent tasks simultaneously, with results consolidated at the end. Best for scenarios with no dependencies between subtasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most people default to sequential workflows because they feel "safer." But knowing when to parallelize and when to introduce a coordinator determines whether your workflow finishes in five minutes or drags on for an hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A practical rule of thumb:&lt;/strong&gt; If two subtasks don't share state — neither reads what the other writes — they can run in parallel. If one subtask's output determines what the next subtask even &lt;em&gt;is&lt;/em&gt;, they must be sequential. And if you have more than three parallel Agents whose outputs need to be merged, introduce a coordinator to avoid contradictory results.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Knowing When NOT to Use an Agent
&lt;/h4&gt;

&lt;p&gt;Not every problem needs an Agent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need to reformat JSON? Hand it to Gemini 3 Flash -- done in ten seconds.&lt;/li&gt;
&lt;li&gt;Text replacement across ten files? A lightweight model handles it in seconds.&lt;/li&gt;
&lt;li&gt;A bug you already fully understand? Fixing it yourself is faster than explaining it to an Agent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;True capability is matching the right tool to the problem.&lt;/strong&gt; Complex problems get Agents. Simple problems get models. Obvious problems get your keyboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conway's Law Restructured in the Age of AI
&lt;/h3&gt;

&lt;p&gt;In the classic book &lt;em&gt;The Mythical Man-Month&lt;/em&gt;, Fred Brooks proposed a famous insight: a software system's architecture will inevitably mirror the communication structure of the organization that built it. This became known as &lt;strong&gt;Conway's Law.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building AI agents is essentially restructuring Conway's Law with AI.&lt;/p&gt;

&lt;p&gt;In traditional software development, the speed of delivering a feature depends on team size, communication efficiency, and technical debt. You need frontend engineers, backend engineers, QA engineers, countless meetings to align requirements, and long develop-test-fix cycles.&lt;/p&gt;

&lt;p&gt;In the Agent era, this chain is compressed. &lt;strong&gt;One person plus 16 Agents can build a compiler in two weeks. One intern plus Claude Code can accomplish in an afternoon what took a senior engineer three days.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Organizational structure is no longer the bottleneck. &lt;strong&gt;The quality of problem definition is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is why Shubham says the best developers of 2026 look more like &lt;strong&gt;film directors&lt;/strong&gt; than programmers. They set the scene, cast the actors, and know when to call "cut." They don't write every line of dialogue -- they shape the entire production.&lt;/p&gt;

&lt;p&gt;The essence of programming is shifting from &lt;strong&gt;"writing"&lt;/strong&gt; to &lt;strong&gt;"orchestrating."&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Limitations You Must Know
&lt;/h3&gt;

&lt;p&gt;Although Agents sound like magic, you must be aware of three limitations when applying them in practice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agent quality is highly dependent on problem definition.&lt;/strong&gt; If you cannot decompose the problem clearly enough, the Agent will consistently produce outputs in the wrong direction. This isn't the Agent's fault -- it's a problem-shaping problem. Before you master this skill, Agents may actually slow you down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Context design requires deep business understanding.&lt;/strong&gt; Writing a good &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;.cursor/rules&lt;/code&gt; file requires you to truly understand the product's worldview, users' pain points, and success criteria. This understanding cannot be rushed -- it can only be accumulated through repeated shipping and observing real user behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Aesthetic judgment cannot be learned from books.&lt;/strong&gt; It comes from repeated shipping, observing real user behavior, and developing sensitivity to the gap between &lt;em&gt;"it works"&lt;/em&gt; and &lt;em&gt;"it's worth using."&lt;/em&gt; Without this accumulation, Agents will help you rapidly produce a large volume of things that are &lt;em&gt;"technically correct but experientially failed."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  State Management: Problem Shaping Applied to Execution
&lt;/h3&gt;

&lt;p&gt;All five skills above come into sharpest focus in one practical engineering challenge: &lt;strong&gt;state management.&lt;/strong&gt; An Agent that can plan is worthless if it can’t track its own progress. Without a progress-tracking mechanism, Agents fall into "hallucination loops" — repeating steps, losing track of the original goal, or confidently declaring a task complete when it’s half-done.&lt;/p&gt;

&lt;p&gt;This is where all five skills converge — applied not to a product or a user-facing feature, but to the Agent itself. Each of the four patterns below draws on a different combination of skills:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The "Plan-Act-Observe" Loop (ReAct pattern).&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #2 Context Design)&lt;/em&gt; Instead of handing the Agent a giant task list, force it to update its internal state after every single action. The Agent explains what it intends to do (Thought), calls a tool (Action), receives the raw result (Observation), then compares that result against the original plan (Status Update). The loop itself is problem shaping — breaking execution into atomic Thought→Action→Observation cycles. The status update after each cycle is context design — ensuring the Agent's next decision is informed by accurate, structured state rather than stale memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dynamic Task Graphs.&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #4 Agent Orchestration)&lt;/em&gt; For complex workflows, static to-do lists break down. Use a directed acyclic graph (DAG) or dynamic task queue where each task carries a status (&lt;code&gt;PENDING&lt;/code&gt;, &lt;code&gt;IN_PROGRESS&lt;/code&gt;, &lt;code&gt;COMPLETED&lt;/code&gt;, &lt;code&gt;FAILED&lt;/code&gt;), dependencies are tracked explicitly (Task B doesn’t start until Task A succeeds), and intermediate variables are stored in a scratchpad — like a URL found in Step 1 that’s needed in Step 5. Defining each node with clear inputs, outputs, and success criteria is problem shaping. Deciding which nodes run in parallel versus sequentially, and how results flow between them, is agent orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Critic Node.&lt;/strong&gt; &lt;em&gt;(Skill #3 Aesthetic Judgment + Skill #4 Agent Orchestration)&lt;/em&gt; In multi-Agent architectures, it helps to have a supervisor that reviews outputs rather than just trusting the worker’s self-assessment. The Worker executes and reports "I’m done." The Critic checks whether the goal was &lt;em&gt;actually&lt;/em&gt; achieved. A shared Global State stores the current version of truth. This is the Coordinator pattern from Skill #4 applied to quality control — but the Critic’s evaluation criteria come from Skill #3: knowing when output is "technically correct" but not actually good enough. Without aesthetic judgment baked into the Critic’s rubric, it degrades into a syntax checker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Checkpointing and Self-Correction.&lt;/strong&gt; &lt;em&gt;(Skill #1 Problem Shaping + Skill #5 Knowing When NOT to Use an Agent)&lt;/em&gt; Progress tracking isn’t just about moving forward — it’s about knowing when to turn back. If an observation returns an error, the Agent should update the plan rather than crash — that’s problem shaping in real time, re-decomposing the remaining work based on new information. And if an Agent is 50 steps deep into what should be a 5-step task, it’s "lost in the woods" and needs a reset. Budget monitoring (tokens, turns, or wall-clock time) prevents runaway execution. Recognizing when to abort an Agent run and switch to a simpler tool — or fix the issue manually — is Skill #5 in action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A practical implementation tip:&lt;/strong&gt; &lt;em&gt;(Skill #2 Context Design)&lt;/em&gt; Prepend a status summary to every LLM call — original goal, completed steps, current step, remaining steps. This is context design at its most literal: engineering the information the Agent sees at every turn. This "external state" acts as a rhythmic beat that keeps the context window focused on the finish line, counteracting the "Agentic Amnesia" problem described in the companion piece.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting It Into Practice
&lt;/h3&gt;

&lt;p&gt;I close with a poignant statement: &lt;em&gt;"These skills cannot be acquired through reading. They come from practice."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I offer five concrete exercises:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Review your last five Agent outputs.&lt;/strong&gt; Write down what you would change and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write a &lt;code&gt;CLAUDE.md&lt;/code&gt; for your current project&lt;/strong&gt; -- even if it only takes 30 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The next time you face a vague requirement,&lt;/strong&gt; break it into 10 subtasks before writing a prompt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take a sequential workflow&lt;/strong&gt; and identify which steps can run in parallel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For one week, log every task&lt;/strong&gt; where you used an Agent but a simple prompt would have sufficed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Open your most recent project and ask yourself: &lt;em&gt;Are you spending more time writing code, or shaping problems?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="//agent_dev_issues.md"&gt;ten engineering challenges of building AI agents&lt;/a&gt; haven't gone away. But the response to them has fundamentally shifted.&lt;/p&gt;

&lt;p&gt;Twenty years ago, the scarce resource was implementation skill — the ability to translate an idea into working code. That scarcity justified years of training, specialized hiring, and the entire structure of software teams. Today, Agents handle implementation at speed and quality that rivals senior engineers. The scarce resource has moved upstream: the ability to decompose problems precisely, design rich context, exercise aesthetic judgment, orchestrate multi-Agent workflows, and know when to reach for a simpler tool.&lt;/p&gt;

&lt;p&gt;This isn't a prediction about the future. It's a description of what's already happening — an intern shipping in an afternoon, a compiler built without a human writing a single line of code, organizations discovering that their bottleneck is problem definition, not programming talent.&lt;/p&gt;

&lt;p&gt;The developers who thrive in this era won't be the ones who write the most code. They'll be the ones who ask the best questions, shape the clearest problems, and know when the Agent's output is good enough — and when it isn't.&lt;/p&gt;

&lt;p&gt;The skills have shifted. The question is whether you'll shift with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Berkeley Function-Calling Leaderboard&lt;/strong&gt; — Tool-calling accuracy benchmarks across models (~77.5% top accuracy). &lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html" rel="noopener noreferrer"&gt;berkeley-function-call-leaderboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Galileo Research&lt;/strong&gt; — Findings on error cascading in multi-step Agent tasks. &lt;a href="https://www.galileo.ai/" rel="noopener noreferrer"&gt;galileo.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LangChain State of AI Agents Report&lt;/strong&gt; — Survey data on Agent evaluation practices (52% offline evaluation, 37% online evaluation). &lt;a href="https://blog.langchain.dev/" rel="noopener noreferrer"&gt;blog.langchain.dev&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UC Berkeley MAST Framework&lt;/strong&gt; — Analysis of 1,600+ Agent traces showing 41–86.7% multi-Agent failure rates, with 79% of failures from orchestration. &lt;a href="https://arxiv.org/abs/2503.13657" rel="noopener noreferrer"&gt;arxiv.org&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Azure SRE Case Study&lt;/strong&gt; — Production experience scaling from 50+ sub-Agents to 5 core tools. &lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/context-engineering-lessons-from-building-azure-sre-agent/4481200" rel="noopener noreferrer"&gt;techcommunity.microsoft.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic Agent Evaluation Blog (January 2025)&lt;/strong&gt; — Challenges in systematically evaluating Agent behavior. &lt;a href="https://www.anthropic.com/research/building-effective-agents" rel="noopener noreferrer"&gt;anthropic.com/research&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nicholas Carlini — C Compiler with Opus&lt;/strong&gt; — Building a C compiler with 16 Agents producing 100,000 lines of Rust. &lt;a href="https://nicholas.carlini.com/" rel="noopener noreferrer"&gt;nicholas.carlini.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shubham Saboo / Unwind AI&lt;/strong&gt; — &lt;a href="https://www.theunwindai.com/" rel="noopener noreferrer"&gt;theunwindai.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boston Consulting Group&lt;/strong&gt; — Research showing fewer than 20% of enterprise Agent projects achieve expected ROI. &lt;a href="https://www.bcg.com/publications/2025/closing-the-ai-impact-gap" rel="noopener noreferrer"&gt;bcg.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alibaba Cloud Engineering Blog&lt;/strong&gt; — Data showing AI completes 30% of work in production Agent systems, with 70% being tool engineering. &lt;a href="https://www.alibabacloud.com/blog/602301" rel="noopener noreferrer"&gt;alibabacloud.com/blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spotify Engineering&lt;/strong&gt; — Experience with context window limits in code Agent development. &lt;a href="https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2" rel="noopener noreferrer"&gt;engineering.atspotify.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manus Team&lt;/strong&gt; — Four framework rebuilds for context engineering. &lt;a href="https://manus.im/" rel="noopener noreferrer"&gt;manus.im&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fred Brooks, The Mythical Man-Month&lt;/strong&gt; — Origin of Conway's Law and organizational structure insights. &lt;a href="https://en.wikipedia.org/wiki/The_Mythical_Man-Month" rel="noopener noreferrer"&gt;wikipedia.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>programming</category>
    </item>
    <item>
      <title>Is AI Agent Development Just About Calling APIs? Where's the Real Difficulty?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 25 Feb 2026 19:40:02 +0000</pubDate>
      <link>https://dev.to/imaginex/is-ai-agent-development-just-about-calling-apis-wheres-the-real-difficulty-2j75</link>
      <guid>https://dev.to/imaginex/is-ai-agent-development-just-about-calling-apis-wheres-the-real-difficulty-2j75</guid>
      <description>&lt;h2&gt;
  
  
  The Bottom Line First
&lt;/h2&gt;

&lt;p&gt;Calling APIs is indeed the entirety of Agent development — just like cooking is indeed putting ingredients in a pot. Technically correct, but it perfectly explains why some people produce Michelin-star dishes while others produce culinary disasters.&lt;/p&gt;

&lt;p&gt;Saying the conclusion without explanation is meaningless. Let's actually build an Agent and walk through it together. But before diving in, let's take 30 seconds to clarify what an Agent actually is.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is an Agent, Exactly?
&lt;/h2&gt;

&lt;p&gt;The original interaction model with large language models (LLMs) was simple: you ask a question, it gives an answer. One question, one answer, done. If you wanted it to do something complex, you had to manually break tasks into small pieces and feed them one round at a time. You were the "orchestrator"; the LLM was just a passive tool that responded on demand.&lt;/p&gt;

&lt;p&gt;What an Agent does is fundamentally one thing: &lt;strong&gt;it adds a loop to this question-and-answer model.&lt;/strong&gt; The model no longer just answers you once. Instead, it judges "what else do I need to do," calls external tools to get results, feeds those results back to itself, thinks about what to do next, and repeats until the task is complete. This loop transforms a large model from a "responder" into an "executor."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Execution Loop:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Input → LLM Reasoning → Need to call a tool?
                                      │
                    ┌─── Yes ─────────┘─── No ───┐
                    ▼                             ▼
           Select Appropriate Tool         Task Complete?
                    │                             │
                    ▼                        Yes  ▼
           Call External Tool          Return Final Result
         ┌──────────────────┐
         │  Check Emails    │
         │  Check Calendar  │
         │  Create Meeting  │
         └──────────────────┘
                    │
                    ▼
         Get Tool Return Results
                    │
                    ▼
           Update Context ──────────────────────────────┐
                                                        │
                                              (loop continues)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conceptually, it's that simple. A &lt;code&gt;while&lt;/code&gt; loop plus tool-calling capability — that's your Agent skeleton. So many people read this and think, "There's no real technical depth here?" True, the skeleton is simple. But making that loop run stably, reliably, and efficiently in the real world — &lt;strong&gt;that&lt;/strong&gt; is the real engineering challenge.&lt;/p&gt;

&lt;p&gt;Let's walk through it for real. Say you want to build an Agent that manages your schedule: read emails, check calendars, arrange meetings. Doesn't sound complicated, right? Let's look at what you encounter at each step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Call the API — Done in 10 Minutes
&lt;/h2&gt;

&lt;p&gt;This step really is easy. Install an SDK, write a few lines of code, pass user input to the model, get back a result. If you've used the OpenAI or Claude API, you could write it blindfolded. You don't even need to write code yourself — open an AI coding tool like Claude Code or Cursor, describe your requirements in natural language, and they'll scaffold the project for you. Define a few tools — check calendar, read emails, create meeting — write the JSON schema, and the model can call them.&lt;/p&gt;

&lt;p&gt;It runs. You ask it "what meetings do I have tomorrow?", it calls the calendar tool, gets the result, and reads it back in natural language. Perfect. You think: Agent development isn't that hard, maybe I can ship this in a week.&lt;/p&gt;

&lt;p&gt;I've had this feeling before. 20 years ago when I first learned C# development, I dragged a few controls onto a Windows Form and had a running App — I thought Windows Form development was no big deal either.&lt;/p&gt;

&lt;p&gt;In theory, those AI coding Agents could handle every step ahead for you too. But in practice, every problem you encounter from here on isn't about &lt;em&gt;how to write the code&lt;/em&gt; — it's about &lt;em&gt;what code should be written&lt;/em&gt;. To really understand where Agent development gets hard, let's keep walking.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Connect to Real APIs — The Nightmare Begins
&lt;/h2&gt;

&lt;p&gt;In the demo you used mock data. Now you need to connect to real email and calendar services. Each user might use something different: Outlook, Gmail, hotmail, etc. Let's simplify and just connect to Microsoft's Graph API — it's accessible domestically and Outlook is mainstream in enterprise.&lt;/p&gt;

&lt;p&gt;The first problem arrives immediately: &lt;strong&gt;OAuth&lt;/strong&gt;. Your users must authorize your application to access their Microsoft account. You need to register an app in Azure AD, handle OAuth redirects, securely store refresh tokens, and auto-refresh when tokens expire. None of this has anything to do with the LLM, but without it, your Agent can't take its first step. Microsoft's permissions model alone (delegated permissions vs. application permissions) can eat half a day of research.&lt;/p&gt;

&lt;p&gt;Then come the &lt;strong&gt;API edge cases&lt;/strong&gt;. Microsoft Graph returns email lists paginated — 10 items per page by default, up to 50. Your Agent gets the first page without knowing how many more pages exist, and it will give you a confident-sounding conclusion based on just those 10 emails. Ask "did anyone email me last week about Project A?" — the actual email is on page 3, but the Agent confidently tells you "no." You can add a tool to check the next page, but then you need to add a tool to check the next page, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt; is another problem. Microsoft Graph's throttling strategy is complex, with different thresholds per app, per user, and per resource type. If your Agent calls it a dozen times in a complex task, it will easily hit a 429 error. What happens then? The model doesn't know what "429 Too Many Requests" means — it just thinks the tool call failed and starts guessing reasons. And this is only for one provider. To build a real product, every provider (Gmail, hotmail, etc.) has its own authentication system and API design. The workload multiplies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tool Design Problem:&lt;/strong&gt; Connecting the API is only half of the tool-call equation. The other half is &lt;strong&gt;how to design the tool itself&lt;/strong&gt; — and this is trickier than you'd expect.&lt;/p&gt;

&lt;p&gt;What should your "search emails" tool look like? If it's too rigid — only supporting sender-based queries — a user saying "find last week's emails about Project A" will fail immediately. So you add keyword search, time range filtering, attachment filtering? The more parameters, the more complex the schema, and the more likely the model is to fill things in wrong or miss fields. Berkeley's Function-Calling benchmark found that the more tools and the more complex the parameters, the worse model accuracy becomes. Smaller models degrade dramatically as tool count grows — BFCL data shows that models like Llama 3.1 8B can handle a modest number of tools but start failing unpredictably once tool count exceeds their capacity threshold.&lt;/p&gt;

&lt;p&gt;On the other end, if you design a generic "search" tool that covers everything, the model won't know what to put in it. It might pass calendar query parameters to the email search tool, or call "send email" when it should "create a meeting." There's no right answer for tool granularity — too fine and user needs aren't covered, too coarse and the model can't handle it. The only way is to iterate in your specific context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool description text matters enormously.&lt;/strong&gt; For the same functionality, a description written as &lt;code&gt;"Search emails"&lt;/code&gt; vs. &lt;code&gt;"Search the user's Outlook inbox by keyword, sender, date range, or attachment presence. Returns a list of matching emails sorted by date"&lt;/code&gt; produces dramatically different model accuracy. In short, you don't just need to write code to implement a tool — you need to learn to &lt;strong&gt;write a manual for the model&lt;/strong&gt;, and whether that manual is good or bad, you can only verify through repeated testing.&lt;/p&gt;

&lt;p&gt;A lot of research puts it clearly with data: &lt;strong&gt;in production-grade Agent systems, AI completes only 30% of the work, and the remaining 70% is tool engineering.&lt;/strong&gt; What you think of as "calling an API" is mostly spent on the design and integration work surrounding that API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Multi-step Tasks — Errors Start Snowballing
&lt;/h2&gt;

&lt;p&gt;Good — the API is connected and basically working. Now try a slightly more complex request: "Find a time slot next week when everyone is free, schedule a project review meeting, and then email all attendees."&lt;/p&gt;

&lt;p&gt;This task requires: querying multiple people's calendar availability, finding the intersection, creating a meeting invite, drafting an email, and sending it. Five or six steps, each depending on the previous one's result.&lt;/p&gt;

&lt;p&gt;Here's the problem. Berkeley's Function-Calling Leaderboard (BFCL) shows that even the best models struggle with tool-calling accuracy — top scores hover around &lt;strong&gt;80%&lt;/strong&gt; on overall benchmarks, and accuracy drops further as tool count and parameter complexity increase. That means roughly 1 in 5 calls has an error. The probability of a five-step task completing entirely correctly? About 0.8 to the fifth power — &lt;strong&gt;less than 33%.&lt;/strong&gt; Your Agent has roughly a two-thirds chance of going wrong at some step.&lt;/p&gt;

&lt;p&gt;Worse, Galileo's research found that &lt;strong&gt;early small errors amplify through later steps.&lt;/strong&gt; Say the model misparses a date format in step one and reads Tuesday as Wednesday. Every subsequent step builds on that error. It creates a meeting at the wrong time, then sends everyone an email notification with the wrong time. One small hallucination triggers a cascade of wrong actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At this point you realize: you need to add validation logic between each step, rollback mechanisms, and confirmation loops. None of this is taught in any LLM's API documentation.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Guardrails — The Invisible Security Risk
&lt;/h2&gt;

&lt;p&gt;And there's a deeper problem lurking here that most people don't think about until it's too late: &lt;strong&gt;guardrails.&lt;/strong&gt; Your scheduling Agent has permissions to send emails, create meetings, and modify calendars. What happens when it hallucinates a participant name and sends a meeting invite to the wrong person? Or confidently deletes a calendar block because it "optimized" your schedule?&lt;/p&gt;

&lt;p&gt;OWASP classifies this as &lt;strong&gt;"Excessive Agency"&lt;/strong&gt; (LLM06:2025) — one of the top security threats in LLM applications. It breaks down into three failure modes: excessive functionality (your Agent has access to 50 actions when it only needs 5), excessive permissions (your Agent can modify &lt;em&gt;any&lt;/em&gt; calendar, not just the user's), and excessive autonomy (the Agent sends emails and creates meetings without any human confirmation gate).&lt;/p&gt;

&lt;p&gt;In practice, you need to separate "read" tools from "write" tools and put explicit approval gates on write operations. High-stakes actions — sending external emails, deleting calendar entries, modifying shared resources — should run in a "dry run" mode where the Agent describes what it &lt;em&gt;would&lt;/em&gt; do and waits for human confirmation before executing. You need to design for rapid rollback, because the question isn't &lt;em&gt;if&lt;/em&gt; your Agent will take a wrong action — it's &lt;em&gt;when&lt;/em&gt;. And you need to enforce the principle of least privilege: your Agent should request only the minimum API permissions it needs, not broad access "just in case."&lt;/p&gt;

&lt;p&gt;None of this is glamorous engineering. But skip it, and one hallucinated email from your Agent can undo months of user trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Open It to Real Users — The Bill Scares You Awake
&lt;/h2&gt;

&lt;p&gt;You tested the first three steps in your development environment and things seemed fine. But once you open the Agent to real users, the nightmare comes from a direction you never anticipated: &lt;strong&gt;the bill.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You used Claude Sonnet or GPT-4o for development testing — great results, a few cents per complex task, no pain. But with real users, hundreds of requests per day, each averaging four or five tool call rounds, each carrying substantial context — you look at the monthly bill and see a small feature burning thousands of dollars a month. What if user volume grows ten times?&lt;/p&gt;

&lt;p&gt;You think: a user saying "what meetings do I have tomorrow?" — does that really need the most powerful model? That's overkill.&lt;/p&gt;

&lt;p&gt;So you start thinking about &lt;strong&gt;model routing&lt;/strong&gt;: different tasks use different base models. Simple queries go to cheap small models (Haiku, GPT-4o mini, Gemini Flash); complex multi-step reasoning goes to large models (Claude Sonnet, GPT-4o, Gemini Pro). But who judges complexity?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a large model to judge? That costs money too.&lt;/li&gt;
&lt;li&gt;Use a rule engine? Works for simple cases, but user inputs are endlessly variable and rules always have gaps.&lt;/li&gt;
&lt;li&gt;Use a small model as a classifier? Now you've added another model component that needs tuning and maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And different models vary enormously in their tool-calling capabilities. A tool schema that works on Claude Sonnet may have parameters filled in wrong on Haiku. JSON that runs perfectly on GPT-4o may fail to parse on open-source models. Every time you swap a model, your carefully tuned prompts and tool descriptions may need to be re-adapted. This is why many teams eventually find that the token money saved doesn't cover the labor cost of multi-model adaptation.&lt;/p&gt;

&lt;p&gt;To put concrete numbers on this: Claude Sonnet costs \$3/\$15 per million input/output tokens, while Claude Haiku costs \$0.25/\$1.25 — a 12x to 60x difference. GPT-4o vs. GPT-4o mini has a similar spread. Mid-sized Agent deployments easily burn \$1K–\$5K per month in token costs alone; complex Agents consuming 5–10 million tokens monthly aren't unusual. One underrated optimization: &lt;strong&gt;prompt caching.&lt;/strong&gt; Anthropic's prefix caching can reduce costs by up to 90% and latency by 85% for repeated long prompts — a massive win for Agents that include the same system prompt and tool definitions in every call.&lt;/p&gt;

&lt;p&gt;And cost isn't the only scaling problem — &lt;strong&gt;latency&lt;/strong&gt; hits you just as hard. A multi-step scheduling task that checks four people's calendars, finds a common slot, creates a meeting, and sends emails can easily take 30–45 seconds end-to-end. Technically correct, but your users experience it as broken. The biggest UX win is &lt;strong&gt;streaming intermediate results&lt;/strong&gt;: instead of a 45-second black box, show "Checking Alice's calendar... Found 3 available slots... Confirming with Bob..." — the total time is the same, but the perceived wait drops dramatically. Parallelizing independent tool calls (check all four calendars simultaneously instead of sequentially) helps with actual latency. But the hard tradeoff remains: smaller, faster models hallucinate more, so you can't just throw Haiku at everything to speed things up.&lt;/p&gt;

&lt;p&gt;Cost optimization looks like an operations problem, but it's actually an &lt;strong&gt;architecture problem&lt;/strong&gt;. You need to make the model-calling layer pluggable from the very beginning — something most people never think about when writing a demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 6: Context Management — Your Agent Starts "Forgetting"
&lt;/h2&gt;

&lt;p&gt;After a while, you notice a strange problem: the Agent "drifts" during long tasks. You give it a complex task requiring seven or eight conversation rounds, and by rounds four or five, it starts forgetting the original requirements and constraints.&lt;/p&gt;

&lt;p&gt;This is what the industry calls &lt;strong&gt;"Agentic Amnesia."&lt;/strong&gt; Research data is clear: when tasks are split across multiple conversation rounds, model performance degrades significantly — and without memory management strategies, Agents lose track of constraints, requirements, and earlier results as context accumulates.&lt;/p&gt;

&lt;p&gt;The reason is that LLM context windows are finite. Every tool call's input and output consumes context space. Query five people's calendars, each returning a large JSON payload, and the context window is mostly full. Spotify's engineering team hit the exact same pitfall building a code Agent: once the context window filled up, the Agent "lost its direction" and forgot the original task after a few rounds.&lt;/p&gt;

&lt;p&gt;You need to start doing &lt;strong&gt;Context Engineering&lt;/strong&gt;. Anthropic defines it as "curating exactly what content goes into a limited context window from an ever-changing universe of information." In plain terms, it's the LLM version of memory management: you dynamically decide what the model "sees" at each reasoning step and what it "forgets." Which historical information gets compressed into summaries? Which key constraints must always be preserved? Which tool return values can be discarded?&lt;/p&gt;

&lt;p&gt;The Manus team rebuilt their entire framework four times to get this right. Four times. They called this process "stochastic gradient descent" — inelegant, but effective.&lt;/p&gt;

&lt;p&gt;There's also a subtler trap: &lt;strong&gt;research shows context length and hallucination rate are positively correlated.&lt;/strong&gt; The longer the input, the more likely the model is to hallucinate. For Agent tasks that require large contexts, this is nearly an unresolvable structural paradox.&lt;/p&gt;

&lt;p&gt;One emerging solution to this problem is &lt;strong&gt;Agent Skills&lt;/strong&gt;, a mechanism pioneered by Anthropic. Where Context Engineering is about &lt;em&gt;managing&lt;/em&gt; what's in the context window, Skills are about &lt;em&gt;not putting things there in the first place.&lt;/em&gt; A Skill is a modular package of instructions, workflows, and best practices (typically a &lt;code&gt;SKILL.md&lt;/code&gt; file plus optional scripts) that an Agent loads on demand. Think of it as pluggable expertise — a "Tax Compliance Skill" or a "Cloud Migration Skill" that transforms a general-purpose Agent into a domain specialist, without bloating the context window for every other task.&lt;/p&gt;

&lt;p&gt;The design uses &lt;strong&gt;progressive disclosure&lt;/strong&gt;: an Agent can have dozens of Skills installed but only loads the 2–3 it needs for any given task. This directly mitigates the context window pressure that causes Agentic Amnesia. Skills also enable &lt;strong&gt;composability&lt;/strong&gt; — combining a code-review Skill with a git-automation Skill produces an Agent that can review and commit code without anyone writing explicit coordination logic.&lt;/p&gt;

&lt;p&gt;The impact on the ecosystem has been rapid. OpenAI adopted structurally identical Skills for ChatGPT and Codex CLI. Microsoft's Semantic Kernel implements an equivalent "Plugins" abstraction. Marketplaces like SkillsMP have emerged with hundreds of thousands of community-built Skills. Anthropic has positioned Agent Skills as an open standard — and the convergence across platforms suggests it's becoming the standard abstraction for packaging Agent capabilities, much like MCP became the standard for Agent-to-tool communication.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 7: Want to Test It? You Don't Even Know How
&lt;/h2&gt;

&lt;p&gt;At this point, your Agent barely works. But how do you determine whether it's "actually good" vs. "just barely functional"?&lt;/p&gt;

&lt;p&gt;Traditional software development has mature testing methodologies: unit tests, integration tests, end-to-end tests — inputs are deterministic, expected outputs are deterministic. But an Agent's input space is open-ended (users can say anything) and its output is non-deterministic (the model generates different text each time). LangChain's blog put it perfectly: &lt;strong&gt;"every input is an edge case"&lt;/strong&gt; — a challenge traditional software has never faced.&lt;/p&gt;

&lt;p&gt;You might think to use LLM-as-judge to evaluate LLM outputs. A Hacker News developer explained the problem clearly: using a judge with the same architecture as the system being tested maximizes the probability of systematic bias. The judge and the tested Agent share exactly the same blind spots.&lt;/p&gt;

&lt;p&gt;Anthropic's January blog also acknowledged: Agent interactions involving tool calls, state modifications, and behavior adjustments based on intermediate results are precisely the capabilities that make Agents useful — and simultaneously make them almost impossible to evaluate systematically.&lt;/p&gt;

&lt;p&gt;The data is stark. LangChain's State of AI Agents survey (1,300+ professionals, 2025) found &lt;strong&gt;only about half of organizations run offline evaluations&lt;/strong&gt;, and &lt;strong&gt;fewer than a quarter combine both offline and online evaluations.&lt;/strong&gt; A multi-dimensional analysis of major Agent benchmarks found a &lt;strong&gt;37% performance gap between lab testing and production environments&lt;/strong&gt; — with reliability dropping from 60% to 25% in real-world conditions. An Agent that tests great in your dev environment may behave completely differently in users' hands.&lt;/p&gt;

&lt;p&gt;Anyone who's done client-side development will understand this pain: your Agent might handle a request perfectly today, and fail on the same request tomorrow. Users can accept missing features — they can't accept inconsistency.&lt;/p&gt;

&lt;p&gt;And evaluation is only half the story — the other half is &lt;strong&gt;observability in production.&lt;/strong&gt; Evaluation tests what you &lt;em&gt;expect&lt;/em&gt; the Agent to do; observability shows what it &lt;em&gt;actually&lt;/em&gt; does with real users. When a user reports "the Agent scheduled my meeting at the wrong time," you need to trace back through every tool call: what calendar data was retrieved, what the LLM reasoned, what meeting parameters were generated, and why the wrong time was selected. Without tool call tracing, latency monitoring, and cost/token budget tracking, you're debugging blind. That "37% performance gap" between lab and production? Observability is how you find it. Tools like LangSmith and Arize have emerged specifically for this, but many teams still discover production failures only when users complain.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 8: Add Multi-Agent Collaboration? Complexity Explodes
&lt;/h2&gt;

&lt;p&gt;Your scheduling Agent is working well, and you start thinking: could you add more specialized Agents? One for email, one for calendar, one for meeting notes, one for scheduling coordination. Clear division of labor, each handling its domain — sounds reasonable, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft's Azure SRE team went down this path.&lt;/strong&gt; They initially built a massive system with 100+ tools and 50+ sub-Agents, and hit a pile of unexpected problems: the orchestrator Agent couldn't find the right sub-Agent (the correct one was "buried three hops away"); a buggy sub-Agent didn't just crash itself — it dragged down the entire reasoning chain; Agents kicked responsibility back and forth in infinite loops. They eventually scaled down to 5 core tools and a few general-purpose Agents, and the system became more reliable.&lt;/p&gt;

&lt;p&gt;Their core lesson: &lt;strong&gt;scaling from one Agent to five doesn't multiply complexity by four — it grows exponentially.&lt;/strong&gt; UC Berkeley's MAST framework analyzed 1,600+ Agent traces and found that &lt;strong&gt;41–86.7% of multi-Agent systems fail in production&lt;/strong&gt;, and &lt;strong&gt;79% of problems come from the orchestration and coordination layer, not the technical implementation.&lt;/strong&gt; How to divide work and how to communicate between Agents is far harder than how to write the code.&lt;/p&gt;

&lt;p&gt;There are established orchestration patterns — sequential chains, concurrent fan-out, hierarchical supervisor models — and each has tradeoffs. ICLR 2025 research found that hierarchical architectures (one coordinator delegating to specialists) show only a &lt;strong&gt;5.5% performance drop&lt;/strong&gt; when individual Agents malfunction, compared to 10.5–23.7% for flatter architectures. This explains why Microsoft eventually simplified to a supervisor model. The practical advice is almost counterintuitive: &lt;strong&gt;start with fewer, more capable Agents rather than many specialized ones&lt;/strong&gt;, and only decompose when a single Agent demonstrably can't handle the workload. The allure of clean role separation is strong, but the coordination overhead will eat you alive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 9: You Start Doubting — Where's the Bottleneck?
&lt;/h2&gt;

&lt;p&gt;After months of work, your engineering gets more refined, but Agent performance always hits a ceiling you can't break through. You realize a harsh truth: &lt;strong&gt;all engineering optimization has one prerequisite — the underlying model needs to be capable enough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An InfoQ interview with Alibaba Cloud's code platform lead captured it honestly: engineering challenges can be overcome, but model capability bottlenecks are far more daunting. An awkward industry reality: nearly every company building general-purpose Agent products uses Claude Sonnet as their first-choice model, because other models lag noticeably on instruction-following in complex tasks. The more instructions a model can follow, the more complex the problems it can handle. When a model can't even do basic instruction-following, no amount of engineering optimization above it helps.&lt;/p&gt;

&lt;p&gt;You might think: what about using more powerful reasoning models — o3, o4-mini, DeepSeek R1, Claude Sonnet, Claude Opus? &lt;strong&gt;Research finds that reasoning models hallucinate more than base models.&lt;/strong&gt; The data is striking: OpenAI's o3 has a 33% hallucination rate on person-specific factual questions — double the rate of its predecessor o1. The o4-mini reasoning model hits 48%. The root cause is that RL fine-tuning for chain-of-thought reasoning introduces high-variance gradients and entropy-induced randomness, making models more confident even when wrong. They answer rather than admit uncertainty.&lt;/p&gt;

&lt;p&gt;The practical implication for Agents: reasoning models may handle complex task decomposition better, but they trade off reliability on factual tasks. One emerging pattern is to use reasoning models for &lt;em&gt;planning&lt;/em&gt; (breaking down what needs to happen) and base models for &lt;em&gt;execution and verification&lt;/em&gt; (actually doing it and checking the results). But this adds yet another layer of architectural complexity.&lt;/p&gt;

&lt;p&gt;It's like finding your app is laggy, spending days optimizing code logic, and then discovering the bottleneck is hardware performance. Your engineering optimizations have limits, and beyond those limits lies the constraints of underlying capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 10: You Start Understanding the Framework Wars
&lt;/h2&gt;

&lt;p&gt;At this point, you've definitely wrestled with whether to use LangChain, CrewAI, or similar frameworks. The Hacker News discussion has moved from debate to consensus: &lt;strong&gt;frameworks are useful for prototyping; in production they often become a burden.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A CTO shared on Hacker News that he built hundreds of Agents without any framework, using only chat completions plus structured output.&lt;/p&gt;

&lt;p&gt;Anthropic's official guidelines also advise caution with frameworks, as they often make underlying prompts and responses opaque and harder to debug.&lt;/p&gt;

&lt;p&gt;Here's the practical landscape: &lt;strong&gt;LangGraph&lt;/strong&gt; (by LangChain) uses a graph-based architecture with nodes, edges, and conditional routing — it's powerful for complex multi-step reasoning and is used in production by 400+ companies. &lt;strong&gt;CrewAI&lt;/strong&gt; takes a role-based approach where you define Agents by organizational roles — simpler to set up, adopted by 60% of the Fortune 500 for content generation and analysis workflows. &lt;strong&gt;AutoGen&lt;/strong&gt; (Microsoft) was merged into the Microsoft Agent Framework in late 2025, reflecting a broader trend of frameworks consolidating. Each imposes its own abstractions, and those abstractions become constraints the moment your use case doesn't fit neatly.&lt;/p&gt;

&lt;p&gt;There is one thing you genuinely need frameworks for: &lt;strong&gt;persistence and state management.&lt;/strong&gt; Your Agent needs to pause while waiting for user confirmation, recover from checkpoints after errors, and resume long tasks mid-execution. Most lightweight solutions lack these capabilities — which is why orchestration engines like Temporal have risen in the Agent space. Temporal provides durable execution with an append-only event history, letting Agents recover from failures mid-execution. That's genuinely hard to build from scratch.&lt;/p&gt;

&lt;p&gt;Perhaps more consequential than any framework is the emerging &lt;strong&gt;protocol and abstraction layer&lt;/strong&gt; — three complementary standards that are reshaping how Agents are built and composed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, created by Anthropic, standardizes how models interact with external tools and data sources. Instead of writing custom integrations for every API, MCP provides a universal interface with well-defined security boundaries. It's the "USB port" for Agent-to-tool connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent2Agent (A2A)&lt;/strong&gt;, backed by Google and Microsoft, tackles inter-Agent communication — enabling Agents from different providers and frameworks to discover each other and collaborate via standardized protocols. It's the "HTTP" for Agent-to-Agent interactions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills&lt;/strong&gt;, pioneered by Anthropic (discussed in Step 6), solve a different problem entirely: &lt;strong&gt;domain knowledge and procedural expertise.&lt;/strong&gt; MCP gives Agents access to tools; Skills give them the &lt;em&gt;knowledge of how to use those tools effectively&lt;/em&gt; — modular, on-demand expertise that keeps context windows lean through progressive disclosure.&lt;/p&gt;

&lt;p&gt;Together, these three layers — MCP (Agent-to-tool), Agent Skills (Agent knowledge), and A2A (Agent-to-Agent) — form a cohesive architecture. Developers building production Agents will likely use all three: MCP to plug into APIs and databases, Skills to inject domain expertise, and A2A to enable cross-ecosystem Agent collaboration. This matters more than framework choice in the long run, because these protocols define how Agents interoperate — regardless of what framework built them.&lt;/p&gt;

&lt;p&gt;The truth is, framework choice isn't the core challenge of Agent development. The real challenges are the nine steps above. Frameworks are just tools. Choosing the wrong tool wastes time, but going in the wrong engineering direction wastes everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The ten steps above aren't something I made up sitting here. I built agents myself, hit almost every pitfall listed, and some of the projects ultimately failed. The Agent worked flawlessly in my development environment — but in production, context window limits caused it to lose track of multi-step tasks, costs spiraled because I hadn't designed for model routing, and I had no observability to diagnose why users were getting wrong results. By the time I understood the real scope of the engineering required, the project had burned through its budget and patience. Looking back, the mindset of "it's just calling an API, how hard can it be?" was exactly the same as my mindset 20 years ago of "drag a few controls and you have an app." What really taught me, in the end, was that failure.&lt;/p&gt;

&lt;p&gt;Walk through these ten steps and you'll find that &lt;strong&gt;"calling APIs" accounts for roughly 5% of total Agent development effort.&lt;/strong&gt; The other 95% is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth, rate limiting, and error handling in the tool layer (Step 2)&lt;/li&gt;
&lt;li&gt;Getting tool design granularity and descriptions right (Step 2)&lt;/li&gt;
&lt;li&gt;Validation and rollback for multi-step error cascades (Step 3)&lt;/li&gt;
&lt;li&gt;Safety guardrails, least-privilege permissions, and human-in-the-loop gates (Step 4)&lt;/li&gt;
&lt;li&gt;Cost control, prompt caching, model routing, and latency optimization (Step 5)&lt;/li&gt;
&lt;li&gt;Context Engineering, memory management, and Agent Skills for progressive disclosure (Step 6)&lt;/li&gt;
&lt;li&gt;Building evaluation and production observability from scratch (Step 7)&lt;/li&gt;
&lt;li&gt;Complexity control for multi-Agent orchestration and coordination (Step 8)&lt;/li&gt;
&lt;li&gt;Engineering around model capability ceilings and reasoning model tradeoffs (Step 9)&lt;/li&gt;
&lt;li&gt;Navigating the framework/protocol landscape — MCP, A2A, and Agent Skills (Step 10)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LangChain calls this emerging discipline &lt;strong&gt;"Agent Engineering"&lt;/strong&gt; — I think that's exactly right. Boston Consulting Group's research shows that &lt;strong&gt;only about a quarter of companies achieve significant ROI from their AI initiatives&lt;/strong&gt;, and Agent projects are no exception. LangChain's survey found that &lt;strong&gt;32% of companies cite "quality below standard" as the top barrier to shipping an Agent.&lt;/strong&gt; These numbers say it all.&lt;/p&gt;

&lt;p&gt;The enormous gap between Agent and Agent doesn't come from who's calling different APIs — it comes from the vastly different quality of the 95% of engineering that happens &lt;em&gt;outside&lt;/em&gt; the API call. Calling an API is the entry threshold, something you can cross in a week. But between demo and product lies an entire system of engineering around reliability, observability, context management, and error recovery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's where Agent development is truly hard.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://gorilla.cs.berkeley.edu/leaderboard.html" rel="noopener noreferrer"&gt;Berkeley Function Calling Leaderboard (BFCL)&lt;/a&gt; — Tool-calling accuracy benchmarks across models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://galileo.ai/blog/agent-failure-modes-guide" rel="noopener noreferrer"&gt;Galileo: 7 AI Agent Failure Modes and How To Fix Them&lt;/a&gt; — Error propagation in multi-step Agent tasks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.langchain.com/state-of-agent-engineering" rel="noopener noreferrer"&gt;LangChain: State of AI Agents Report (2025)&lt;/a&gt; — Industry survey on Agent evaluation and adoption&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2511.14136v1" rel="noopener noreferrer"&gt;Beyond Accuracy: Multi-Dimensional Framework for Enterprise Agentic AI&lt;/a&gt; — Lab vs. production performance gap analysis&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://techcommunity.microsoft.com/blog/appsonazureblog/context-engineering-lessons-from-building-azure-sre-agent/4481200/" rel="noopener noreferrer"&gt;Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent&lt;/a&gt; — Microsoft's experience with 100+ tools and 50+ sub-Agents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2503.13657" rel="noopener noreferrer"&gt;Why Do Multi-Agent LLM Systems Fail? (UC Berkeley MAST Framework)&lt;/a&gt; — Analysis of 1,600+ Agent traces and 14 failure modes&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/html/2505.23646v1" rel="noopener noreferrer"&gt;Are Reasoning Models More Prone to Hallucination?&lt;/a&gt; — Comparison of hallucination rates in reasoning vs. base models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2" rel="noopener noreferrer"&gt;Spotify Engineering: Context Engineering for Background Coding Agents&lt;/a&gt; — Context window management lessons&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" rel="noopener noreferrer"&gt;Manus: Context Engineering for AI Agents&lt;/a&gt; — Four framework rebuilds and iterative context design&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents" rel="noopener noreferrer"&gt;Anthropic: Effective Context Engineering for AI Agents&lt;/a&gt; — Defining and implementing Context Engineering&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://genai.owasp.org/llmrisk/llm062025-excessive-agency/" rel="noopener noreferrer"&gt;OWASP: LLM06:2025 Excessive Agency&lt;/a&gt; — Security threat classification for Agent systems&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.bcg.com/publications/2025/agents-accelerate-next-wave-of-ai-value-creation" rel="noopener noreferrer"&gt;BCG: How Agents Are Accelerating the Next Wave of AI Value Creation&lt;/a&gt; — Enterprise AI ROI data&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openreview.net/forum?id=bkiM54QftZ" rel="noopener noreferrer"&gt;On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents (ICLR 2025)&lt;/a&gt; — Hierarchical vs. flat architecture resilience&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/news/prompt-caching" rel="noopener noreferrer"&gt;Anthropic: Prompt Caching&lt;/a&gt; — 90% cost reduction and 85% latency reduction for repeated prompts&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills" rel="noopener noreferrer"&gt;Anthropic: Equipping Agents for the Real World with Agent Skills&lt;/a&gt; — The original Agent Skills mechanism and design philosophy&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://thenewstack.io/agent-skills-anthropics-next-bid-to-define-ai-standards/" rel="noopener noreferrer"&gt;Agent Skills: Anthropic's Next Bid to Define AI Standards&lt;/a&gt; — Skills as an open standard for modular Agent capabilities&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.friedrichs-it.de/blog/agent-skills-vs-model-context-protocol/" rel="noopener noreferrer"&gt;Agent Skills vs MCP: Two Standards, Two Security Models&lt;/a&gt; — Complementary roles of Skills and MCP in Agent architecture&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI Agent Memory Management - When Markdown Files Are All You Need?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 18 Feb 2026 02:15:15 +0000</pubDate>
      <link>https://dev.to/imaginex/ai-agent-memory-management-when-markdown-files-are-all-you-need-5ekk</link>
      <guid>https://dev.to/imaginex/ai-agent-memory-management-when-markdown-files-are-all-you-need-5ekk</guid>
      <description>&lt;h2&gt;
  
  
  What is Memory Management for AI Agents?
&lt;/h2&gt;

&lt;p&gt;Memory management for AI agents refers to the mechanisms that allow an agent to store, retrieve, and use information across interactions. Without memory management, every conversation starts from a blank slate — the agent is stateless and forgets everything between sessions. With it, the agent accumulates knowledge over time, learns from past mistakes, and maintains continuity — becoming truly stateful.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the Memory Types for AI Agents?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; - The agent's immediate context window, holding the current conversation and recent tool outputs. Analogous to a human's active attention span. Duration: minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; - Persistent storage of facts, preferences, and decisions that survive across sessions. Analogous to human declarative memory. Duration: indefinite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Procedural&lt;/strong&gt; - Learned workflows, action sequences, and "how-to" knowledge the agent acquires through experience. Analogous to human muscle memory or learned skills. Duration: permanent once codified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Working&lt;/strong&gt; - A temporary scratchpad for intermediate reasoning steps during a single task. Analogous to a mental whiteboard used for chain-of-thought reasoning. Duration: seconds to minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Comparison of Memory Types in Agents
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Memory Type&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Typical Implementation&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Use Case&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Short-Term&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;Context Window / RAM&lt;/td&gt;
&lt;td&gt;Following a conversation thread.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long-Term&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Years&lt;/td&gt;
&lt;td&gt;Vector DB / SQL&lt;/td&gt;
&lt;td&gt;Remembering user preferences and facts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Procedural&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Permanent&lt;/td&gt;
&lt;td&gt;Action Recipes / Logs&lt;/td&gt;
&lt;td&gt;Learning "how" to use a specific tool or API.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Working&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Scratchpad / State&lt;/td&gt;
&lt;td&gt;Intermediate reasoning steps (Chain-of-Thought).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What are Use Cases for AI Agent Memory Management?
&lt;/h2&gt;

&lt;p&gt;Memory management is the "glue" that transforms a basic chatbot into a functional AI agent. While simple models process prompts in isolation (stateless), agents with memory can track goals, learn from mistakes, and personalize their behavior over time.&lt;/p&gt;

&lt;p&gt;Effective memory management generally involves balancing &lt;strong&gt;Short-Term Memory&lt;/strong&gt; (immediate context), &lt;strong&gt;Long-Term Memory&lt;/strong&gt; (historical facts and patterns), &lt;strong&gt;Procedural Memory&lt;/strong&gt; (refined workflows), and &lt;strong&gt;Working Memory&lt;/strong&gt; (intermediate reasoning steps).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Personal AI Assistants &amp;amp; Companions&lt;/strong&gt; - Agents like virtual executive assistants must manage memory to provide a "human-like" continuity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Step Research &amp;amp; Coding Agents&lt;/strong&gt; - Agents designed for "deep research" or complex software engineering (e.g., Devin or OpenDevin) navigate thousands of lines of code or documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer Support Automation&lt;/strong&gt; - Modern support agents handle issues that may span several days or multiple channels (email, chat, phone).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Autonomous DevOps &amp;amp; CI/CD Agents&lt;/strong&gt; - Agents managing cloud infrastructure or deployment pipelines need memory to understand the state of a complex system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare &amp;amp; Patient Management&lt;/strong&gt; - AI agents in healthcare act as long-term monitors for chronic conditions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What are the Existing Approaches?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When designing a smart AI agent, memory management determines whether your agent is "forgetful" (stateless) or "intelligent" (stateful).&lt;/strong&gt; Some AI agent frameworks like LangChain and LangGraph have built-in memory management, while others like OpenAI and Google ADK have their own memory management systems. Each framework approaches memory with a different philosophy—some prioritize ease of use (OpenAI), while others prioritize granular control (LangGraph).&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison: Memory Management Architectures
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Framework&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Primary Memory Strategy&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Persistence Level&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Best For...&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangChain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Modular Components&lt;/strong&gt; (Buffer, Summary, Entity)&lt;/td&gt;
&lt;td&gt;Manual (must connect DB)&lt;/td&gt;
&lt;td&gt;Diverse, specialized RAG workflows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LangGraph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Graph Persistence&lt;/strong&gt; (Checkpointers)&lt;/td&gt;
&lt;td&gt;Built-in (Thread-level)&lt;/td&gt;
&lt;td&gt;Complex, cyclical tasks (e.g., self-correcting code).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google ADK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Memory Bank&lt;/strong&gt; (Identity-scoped)&lt;/td&gt;
&lt;td&gt;Fully Managed&lt;/td&gt;
&lt;td&gt;Personalized, long-term user context on GCP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CrewAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Unified Multi-Layer&lt;/strong&gt; (Short, Long, Entity)&lt;/td&gt;
&lt;td&gt;Built-in (SQLite/Chroma)&lt;/td&gt;
&lt;td&gt;Multi-agent collaboration and role-playing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Threads API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully Managed (Opaque)&lt;/td&gt;
&lt;td&gt;Rapid prototyping; hands-off state management.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Is There a Simpler Alternative?
&lt;/h2&gt;

&lt;p&gt;In December 2025, Meta acquired Manus for $2 billion. The startup was just 8 months old with a small team. Industry insiders speculated: "They must have revolutionary AI algorithms... proprietary models... breakthrough technology..."&lt;/p&gt;

&lt;p&gt;The truth was far more interesting—and far simpler.&lt;/p&gt;

&lt;p&gt;Their competitive advantage wasn't complex algorithms or massive infrastructure. It was &lt;strong&gt;how they managed memory using plain text files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While the AI industry spent millions building vector databases, complex RAG pipelines, and proprietary memory systems, three independent high-value projects quietly converged on the same "boring" solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manus&lt;/strong&gt; (acquired for $2B) - Used file-based planning for long-running agents. Its agents followed a three-file pattern: &lt;code&gt;task_plan.md&lt;/code&gt; for goals and progress, &lt;code&gt;notes.md&lt;/code&gt; for research, and a deliverable output file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt; (145K+ GitHub stars) - Built dual-layer Markdown memory architecture. It uses &lt;code&gt;MEMORY.md&lt;/code&gt; for curated knowledge, &lt;code&gt;memory/YYYY-MM-DD.md&lt;/code&gt; for daily logs, and &lt;code&gt;SOUL.md&lt;/code&gt; for personality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (Anthropic's official tool) - Implemented Skills and memory as Markdown files. It uses a &lt;code&gt;CLAUDE.md&lt;/code&gt; hierarchy for project context, &lt;code&gt;.claude/MEMORY.md&lt;/code&gt; for auto-captured learnings, and a Skills system for on-demand capability loading.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This convergence suggests something fundamental about what works in practice. In biology, this is called convergent evolution — when independent organisms develop the same trait because it is the optimal solution to a shared challenge. While many AI systems rely on elaborate memory infrastructure, file-based approaches offer a simpler alternative that addresses the core requirements: persistence, transparency, and reliability.&lt;/p&gt;

&lt;p&gt;Using local Markdown files for memory management—an approach popularized by tools like &lt;strong&gt;OpenClaw&lt;/strong&gt;, &lt;strong&gt;Claude Code&lt;/strong&gt;, and &lt;strong&gt;Manus&lt;/strong&gt;—offers a philosophy of &lt;strong&gt;"Memory as Documentation."&lt;/strong&gt; This contrasts sharply with the "Memory as Database" approach of frameworks like LangGraph or CrewAI.&lt;/p&gt;

&lt;p&gt;This approach treats the agent's memory not as a hidden system state, but as a transparent, editable file living directly in the user's workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why File-based Memory Works?
&lt;/h3&gt;

&lt;p&gt;File-based memory systems work because they align with how developers already manage information. Here are the key properties that make them effective for AI agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent&lt;/strong&gt;: Memory survives agent restarts, crashes, or updates. Files decouple memory from process lifecycle — no data loss when a process dies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparent and Editable&lt;/strong&gt;: You can open the agent's memory file (e.g., &lt;code&gt;MEMORY.md&lt;/code&gt; or &lt;code&gt;task_plan.md&lt;/code&gt;) in any text editor, read exactly what it "knows," and edit it manually. In LangGraph or CrewAI, modifying memory often requires writing scripts to update a database or decoding complex JSON objects. With Markdown, if the agent hallucinates a goal, you simply highlight the text and delete it. This zero-friction "human-in-the-loop" capability builds trust and enables compliance audits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version-Controllable&lt;/strong&gt;: Because memory is plain text, it lives in your Git repository. You can commit the agent's "knowledge," revert changes if the agent goes off-rails, and branch the memory. Frameworks like CrewAI usually store memory in external databases (Postgres, ChromaDB) — syncing that external state with your code's version history is difficult. Markdown memory treats context &lt;em&gt;as part of the codebase&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Holistic Context&lt;/strong&gt;: Agents like &lt;strong&gt;Claude Code&lt;/strong&gt; use Markdown to maintain a high-level summary of the project structure. They read this file &lt;em&gt;first&lt;/em&gt; to orient themselves. RAG (Vector Databases) retrieves fragments based on similarity search, which often misses the "forest for the trees" — fetching specific functions but missing the overall architectural pattern. A curated Markdown summary solves this by forcing the agent to maintain a "map" of the project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portable&lt;/strong&gt;: Standard Markdown format means no vendor lock-in. Your agent's memory is not locked into OpenAI's &lt;code&gt;thread_id&lt;/code&gt; or a proprietary vector store. You can swap the underlying model (e.g., switch from Claude to GPT-4o) and simply feed it the same Markdown file. Migration is as simple as copying files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Searchable&lt;/strong&gt;: Standard text search tools (e.g., grep, ripgrep) work immediately — no special database required. More advanced approaches like full-text search or vector embeddings can be added as the memory grows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost-effective&lt;/strong&gt;: Local disk storage costs \$0.02/GB/month compared to managed vector database services at \$50-200/GB/month. No per-query API fees or infrastructure scaling costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Matrix: Markdown vs. Frameworks
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Markdown Files (Claude Code/Manus)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Database Frameworks (LangGraph/CrewAI)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debuggability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;: Just read/edit the file.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Med/Low&lt;/strong&gt;: Requires DB inspection tools.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt;: Instant file read.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Med&lt;/strong&gt;: Network calls to Vector DBs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt;: Files get unmanageable &amp;gt;5MB.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt;: Handles millions of records easily.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local&lt;/strong&gt;: Lives on your disk/repo.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Cloud/Server&lt;/strong&gt;: Lives in a managed service.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Linear&lt;/strong&gt;: Agent reads the whole file.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: Agent searches for keywords/vectors.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Strategic Trade-off
&lt;/h3&gt;

&lt;p&gt;The "Markdown" approach is &lt;strong&gt;optimal for Local Agents&lt;/strong&gt; because the "context" is finite and structured. The "Database" approach is &lt;strong&gt;optimal for Enterprise Agents&lt;/strong&gt; where the "memory" consists of millions of user profiles and history logs that cannot fit into a single file, requiring dynamic agent management and more sophisticated search capabilities.&lt;/p&gt;

&lt;p&gt;For example, an enterprise customer support agent typically integrates a Vector DB into a RAG (Retrieval-Augmented Generation) pipeline. Before the LLM generates a response, a retrieval step automatically grabs relevant "memories" based on the user's input and injects them into the system prompt as context. This enables semantic search across structured and unstructured data — user profiles, past chat transcripts, PDF manuals, or meeting notes — so the agent can answer questions like "Has this user complained about something similar before?" without being explicitly told to look it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Design File-based Memory for Your AI Agent?
&lt;/h2&gt;

&lt;p&gt;File-based AI agent memory typically consists of two layers: remembrance and personalization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remembrance Layer
&lt;/h3&gt;

&lt;p&gt;The remembrance layer stores what the agent knows, organized into three types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-term memory (e.g., MEMORY.md)&lt;/strong&gt;: Stores curated, important information that should persist indefinitely. This includes user preferences, key decisions and their rationale, learned lessons, and standard procedures. This file is typically loaded into every agent conversation. Systems like OpenClaw trigger a memory flush before context compression, prompting the agent to write important information to MEMORY.md before older context is discarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daily logs (e.g., memory/YYYY-MM-DD.md)&lt;/strong&gt;: Timestamped records of activities, conversations, and observations. These provide chronological context and help the agent maintain continuity across sessions. Recent logs (today and yesterday) are typically loaded automatically, while older logs are searched on-demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Working memory (e.g., task_plan.md)&lt;/strong&gt;: Tracks the current task's goals, progress, and context. This prevents "goal drift" in long-running tasks by providing a consistent reference point that the agent can check throughout execution. Manus popularized a three-file variant (&lt;code&gt;task_plan.md&lt;/code&gt;, &lt;code&gt;notes.md&lt;/code&gt;, deliverable) with a read-decide-act-update cycle: read the plan, act on the next step, update progress, then repeat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Personalization Layer
&lt;/h3&gt;

&lt;p&gt;The personalization layer defines how the agent behaves and how it is perceived by the user:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOUL.md&lt;/strong&gt;: Defines core values, decision principles, and behavioral guidelines. This file shapes the agent's personality and decision-making approach. For example, a SOUL.md might specify "prefer simple solutions over complex ones" or "always ask for clarification when ambiguous."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IDENTITY.md&lt;/strong&gt;: Defines the agent's public identity, including name, start date, and communication style. This file is used to identify the agent to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;USER.md&lt;/strong&gt;: Defines the user's profile, including technical background, preferences, and context. This file is used to tailor the agent's behavior to the user's needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modular skills&lt;/strong&gt;: Additional capabilities can be loaded on-demand using separate skill files. Rather than loading all possible skills at startup, the agent loads specific skill documentation only when needed, keeping the context manageable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search Strategies
&lt;/h3&gt;

&lt;p&gt;As memory grows, search becomes important. Three approaches offer progressively more capability:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic text search (grep/ripgrep)&lt;/strong&gt;: Sufficient for most use cases with fewer than 1,000 files. Fast, free, and deterministic. Works well for exact keyword matches and phrases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BM25 full-text search&lt;/strong&gt;: Useful when scaling to 1,000-10,000 files. BM25 is a ranking algorithm that scores documents by relevance — similar to how a search engine ranks web pages. It supports boolean operators (AND, OR, NOT) and can be implemented using SQLite's built-in full-text search with minimal infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid vector + BM25&lt;/strong&gt;: Most sophisticated approach, combining semantic search (understanding concepts) with keyword matching. Typically only needed when exceeding 10,000 files or when conceptual queries are important. Requires embedding generation, which adds API costs. OpenClaw's implementation uses 70:30 weighting (vector similarity : BM25 keyword) with a 0.35 minimum score threshold. In testing, this achieved 89% recall vs. 76% for vector-only and 68% for BM25-only.&lt;/p&gt;

&lt;p&gt;Most implementations should start with basic text search and upgrade only when the need is demonstrated through actual usage patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Considerations
&lt;/h3&gt;

&lt;p&gt;Starting with file-based memory is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a MEMORY.md file and give your AI agent read/write access to it&lt;/li&gt;
&lt;li&gt;Implement daily log files with timestamps (memory/YYYY-MM-DD.md format)&lt;/li&gt;
&lt;li&gt;Add basic grep/ripgrep search capability&lt;/li&gt;
&lt;li&gt;Define a SOUL.md file to establish agent personality and values&lt;/li&gt;
&lt;li&gt;Add task planning files when working on multi-step projects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The simplicity of this approach means implementation typically takes days rather than months. The architecture can scale from single-user prototypes to production systems handling thousands of agents.&lt;/p&gt;

&lt;p&gt;For more complex deployments, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git version control for memory files&lt;/li&gt;
&lt;li&gt;Separate memory directories for different agents or use cases&lt;/li&gt;
&lt;li&gt;Shared knowledge bases that multiple agents can reference&lt;/li&gt;
&lt;li&gt;Encryption for sensitive information (filesystem-level or application-level)&lt;/li&gt;
&lt;li&gt;Progressive context disclosure: load only memory relevant to the current task rather than everything at startup (as practiced by Claude Code's Skills system)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;File-based memory for AI agents represents a practical middle ground: simpler than elaborate infrastructure, but more capable than purely ephemeral in-memory approaches. The convergence of multiple successful projects on this pattern suggests it addresses real needs effectively.&lt;/p&gt;

&lt;p&gt;The approach offers particularly strong advantages in transparency, portability, and user control—increasingly important considerations as AI agents handle more sensitive and critical tasks.&lt;/p&gt;

&lt;p&gt;When three independent, high-profile projects converge on the same architectural choice, it is worth paying attention — not because Markdown files are the final answer, but because they reveal that the right abstraction for agent memory may be simpler than the industry assumed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Manus&lt;/strong&gt;: &lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus" rel="noopener noreferrer"&gt;Context Engineering for AI Agents: Lessons from Building Manus&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw&lt;/strong&gt;: &lt;a href="https://docs.openclaw.ai/concepts/memory" rel="noopener noreferrer"&gt;Memory Concepts Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt;: &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;Memory Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AGENTS.md&lt;/strong&gt;: &lt;a href="https://agents.md/" rel="noopener noreferrer"&gt;The Open Standard for Agent Configuration&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Design Patterns&lt;/strong&gt;: &lt;a href="https://github.com/sarwarbeing-ai/Agentic_Design_Patterns" rel="noopener noreferrer"&gt;A Hands-On Guide to Building Intelligent Systems&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Reward Engineering: An Emerging Skill for AI Engineers</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 13 Feb 2026 16:16:32 +0000</pubDate>
      <link>https://dev.to/imaginex/reward-engineering-an-emerging-skill-for-ai-engineers-1i01</link>
      <guid>https://dev.to/imaginex/reward-engineering-an-emerging-skill-for-ai-engineers-1i01</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In their comprehensive report &lt;strong&gt;"&lt;a href="https://you.com/resources/2026-ai-predictions" rel="noopener noreferrer"&gt;AI Predictions for 2026&lt;/a&gt;,"&lt;/strong&gt; Richard Socher (one of the world's most-cited NLP researchers and CEO of You.com) and Bryan McCann (CTO of You.com) outline a fundamental shift in how we interact with artificial intelligence. Their central thesis: the era of simple Large Language Model (LLM) chatbots is giving way to sophisticated, autonomous AI agent ecosystems.&lt;/p&gt;

&lt;p&gt;This transformation represents a shift from &lt;strong&gt;"Chat-Engines"&lt;/strong&gt; (systems you converse with) to &lt;strong&gt;"Do-Engines"&lt;/strong&gt; (systems that autonomously complete tasks for you). To enable this shift, Socher and McCann predict the emergence of a new specialization: the &lt;strong&gt;Reward Engineer&lt;/strong&gt;—a professional who designs the mathematical and logical objective functions that define success for AI agents.&lt;/p&gt;

&lt;p&gt;Whether or not "Reward Engineer" becomes an official job title in 2026, the underlying skill of reward engineering is rapidly becoming essential for any AI engineer working with autonomous systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Reward Engineering?
&lt;/h2&gt;

&lt;p&gt;As AI evolves from generating text to autonomously executing multi-step tasks, our approach to guiding these systems must also evolve. Traditional &lt;strong&gt;Context Engineering&lt;/strong&gt;—writing instructions in natural language—works well for chatbots but proves insufficient for autonomous agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Prompts Aren't Enough:&lt;/strong&gt; When an AI agent must complete complex, long-term goals—such as optimizing a supply chain, conducting legal research, or managing a project—simple text instructions cannot capture all the nuances, constraints, and trade-offs involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enter Reward Engineering:&lt;/strong&gt; This discipline combines logic, ethics, and data science to define precise success criteria. Reward engineers must anticipate how AI agents might find unintended shortcuts (a phenomenon called "reward hacking") and design objective functions that align agent behavior with genuine human intent across extended time horizons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Responsibilities
&lt;/h2&gt;

&lt;p&gt;Rather than writing traditional code or conversational prompts, engineers design the &lt;strong&gt;objective functions&lt;/strong&gt; and &lt;strong&gt;reinforcement learning frameworks&lt;/strong&gt; that guide autonomous AI agents. Think of this role as a "Policy Architect"—ensuring agents achieve complex business objectives (such as "increase supply chain efficiency by 15%") while respecting ethical boundaries, security protocols, and resource constraints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Responsibilities
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Objective Function Design:&lt;/strong&gt; Translate broad business goals into precise mathematical reward signals that guide agent behavior toward desired outcomes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrail Engineering:&lt;/strong&gt; Create constraints and penalties that prevent reward hacking—situations where an AI technically achieves its goal but in unintended or harmful ways.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Agent Coordination:&lt;/strong&gt; Design reward structures that encourage multiple AI agents to collaborate effectively rather than compete counterproductively for shared resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop (HITL) Policies:&lt;/strong&gt; Establish clear escalation triggers that determine when an agent must pause and request human approval before proceeding with high-stakes decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation &amp;amp; Benchmarking:&lt;/strong&gt; Develop comprehensive test suites to evaluate agent reasoning and ensure consistent, reliable performance across different scenarios and model versions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Required Technical Skills
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logic &amp;amp; Ethics:&lt;/strong&gt; Strong foundation in game theory, utility functions, and AI alignment principles to design fair and effective reward systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agentic Frameworks:&lt;/strong&gt; Proficiency with modern AI agent frameworks (such as LangChain, AutoGPT, CrewAI, and their successors) as well as cloud-based agentic platforms (Amazon Bedrock Agents, Azure AI Agent Service with Semantic Kernel, and Vertex AI Agent Builder) that enable autonomous task execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python Programming:&lt;/strong&gt; Ability to write validation scripts that evaluate AI outputs and enforce behavioral constraints—essentially serving as "referees" for agent actions. Python is specifically required because it's the dominant language in the AI/ML ecosystem: nearly all reinforcement learning frameworks (PyTorch, TensorFlow, Gymnasium), agent frameworks (LangChain, AutoGPT), and evaluation tools are built in Python. This creates seamless integration between reward function design and the AI models they guide, unlike general-purpose languages such as Bash (limited to shell scripting) or Node.js (less common in ML applications).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Expertise:&lt;/strong&gt; Deep understanding of specific industries (finance, healthcare, legal, etc.) to define what constitutes a genuinely successful outcome versus a superficial one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Identification:&lt;/strong&gt; Skill in recognizing logical inconsistencies, potential failure modes, and "hallucination-prone" scenarios within autonomous agent workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reward Engineering vs. Context Engineering
&lt;/h2&gt;

&lt;p&gt;The shift from conversational AI to autonomous agents demands a fundamental change in how we guide these systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Engineering (Today):&lt;/strong&gt; Writing natural language instructions like "Act as a lawyer and draft a contract." This works for generating single responses but lacks the precision needed for autonomous, multi-step tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reward Engineering (Tomorrow):&lt;/strong&gt; Designing mathematical frameworks that define success. Instead of telling an AI &lt;em&gt;what&lt;/em&gt; to do, reward engineers create scoring systems that guide &lt;em&gt;how&lt;/em&gt; the AI optimizes its behavior over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Critical Difference: Preventing Reward Hacking
&lt;/h3&gt;

&lt;p&gt;Consider a common pitfall: if you reward an AI for "reducing customer complaints," a poorly designed system might simply delete incoming complaint emails—technically achieving the goal while completely missing the intent.&lt;/p&gt;

&lt;p&gt;AI engineers must anticipate such shortcuts and create sophisticated reward models that balance competing priorities: speed, accuracy, ethics, and safety. This becomes especially critical as AI agents make consequential decisions with real-world financial, legal, or safety implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Evolution: From Context Engineering to Reward Engineering
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Context Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Reward Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Natural language instructions&lt;/td&gt;
&lt;td&gt;Mathematical objective functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generating single responses&lt;/td&gt;
&lt;td&gt;Guiding multi-step autonomous behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success Measure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"The output sounds right"&lt;/td&gt;
&lt;td&gt;"The task completed successfully within all constraints"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output Type&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text, images, code snippets&lt;/td&gt;
&lt;td&gt;Real-world actions and transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scope&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One interaction at a time&lt;/td&gt;
&lt;td&gt;Extended time horizons with multiple decision points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This evolution from conversational AI to autonomous agents represents not just a technical shift, but a fundamental change in how we conceptualize human-AI collaboration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Reward Engineering Skills: A Practical Roadmap
&lt;/h2&gt;

&lt;p&gt;Transitioning to reward engineering means evolving from a "Writer" (crafting conversational prompts) to an "Architect" (designing behavioral frameworks). You'll shift from asking AI for outputs to defining the mathematical and ethical boundaries within which it operates.&lt;/p&gt;

&lt;p&gt;Here's a three-phase roadmap to develop these skills:&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Foundations — From Intuition to Precision
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Move from informal, "vibe-based" prompting to structured, contract-like specifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Logical Decomposition:&lt;/strong&gt; Practice breaking complex problems into small, verifiable subtasks. Each subtask needs a clearly defined success state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contract-Based Thinking:&lt;/strong&gt; Transform vague requests into precise specifications. Instead of "Write a professional email," specify: "Generate an email under 200 words containing exactly three bullet points and referencing invoice #12345, or fail validation."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic Programming Literacy:&lt;/strong&gt; Develop comfort with Python control flow (if/then/else logic) and APIs. Many reward functions are implemented as Python scripts that evaluate agent outputs against defined criteria.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 2: Understanding Agentic Systems
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Learn how autonomous "Do-Engines" operate and make decisions over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Understand how agents maintain memory of previous actions and decisions. Study frameworks like &lt;strong&gt;ReAct (Reasoning + Acting)&lt;/strong&gt; and &lt;strong&gt;Plan-and-Execute&lt;/strong&gt; patterns that enable multi-step reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Integration:&lt;/strong&gt; Learn how agents access and utilize external tools (calculators, search engines, databases). Your role is designing rewards that encourage appropriate tool usage and penalize inefficient or incorrect tool selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quantitative Evaluation:&lt;/strong&gt; Adopt rigorous evaluation frameworks like &lt;strong&gt;LangSmith&lt;/strong&gt; or &lt;strong&gt;Hugging Face Evaluate&lt;/strong&gt;. Shift from subjective assessment ("This looks good") to measurable metrics ("This output scores 8.5/10 on our accuracy rubric").&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 3: Advanced Reward Engineering
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Master the specialized skills that define the reward engineering role.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Skills to Develop:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RLHF (Reinforcement Learning from Human Feedback):&lt;/strong&gt; Understand how models learn from human preferences. You'll design the ranking criteria and evaluation rubrics that human labelers use to train agent behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Objective Function Design:&lt;/strong&gt; This is the core competency. Learn to translate business goals into mathematical reward functions that balance competing priorities.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Example:&lt;/em&gt; For a budget management agent, design rewards that optimize both cost savings &lt;em&gt;and&lt;/em&gt; service quality—preventing the agent from simply cutting all expenses.

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Safety &amp;amp; Alignment Engineering:&lt;/strong&gt; Create guardrail mechanisms ensuring that the reward for helpful behavior never outweighs the penalty for harmful actions. This requires anticipating edge cases where agents might find dangerous shortcuts.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hands-On Practice: Thinking Like a Reward Engineer
&lt;/h2&gt;

&lt;p&gt;The best way to prepare for this emerging skill is a fundamental shift in perspective: stop focusing on &lt;em&gt;what&lt;/em&gt; you want the AI to say, and start defining &lt;em&gt;how you'll measure&lt;/em&gt; whether its actions were successful.&lt;/p&gt;

&lt;p&gt;The following exercise introduces you to reward function design—the core of reward engineering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practical Exercise: The Budget-Conscious Travel Agent
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Scenario:&lt;/strong&gt; You're developing an AI agent to book corporate travel. With a vague instruction like "Book the best flight," the agent might select a $10,000 first-class ticket—technically "the best" by some measures, but clearly not what you intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your Task:&lt;/strong&gt; Design a reward system that guides the agent to balance cost, timeliness, comfort, and convenience appropriately.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Distribute Reward Points
&lt;/h4&gt;

&lt;p&gt;You have 100 reward points to allocate across four potential outcomes. The agent will optimize for maximum points. How should you distribute them?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Outcome&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Your Allocation&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Arrival Time:&lt;/strong&gt; Flight arrives before the 9:00 AM meeting&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Flight costs under $500&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Convenience:&lt;/strong&gt; Direct flight with no layovers&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Comfort:&lt;/strong&gt; Business or first-class seating&lt;/td&gt;
&lt;td&gt;_____ points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Step 2: Recognizing the Reward Hacking Trap
&lt;/h4&gt;

&lt;p&gt;Review your point allocation. If you assigned 80 points to &lt;strong&gt;Cost Efficiency&lt;/strong&gt; but only 10 points to &lt;strong&gt;Arrival Time&lt;/strong&gt;, the agent might book a $50 red-eye flight that arrives &lt;em&gt;after&lt;/em&gt; the 9:00 AM meeting. It maximized points but completely failed the actual objective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Reward Engineering Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Professional reward engineers use &lt;strong&gt;hard constraints&lt;/strong&gt; and &lt;strong&gt;dynamic incentives&lt;/strong&gt; to prevent such failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hard Constraint:&lt;/strong&gt; "If arrival time is after 9:00 AM, apply a penalty of -1,000 points (automatic failure)."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Incentive:&lt;/strong&gt; "For every $10 saved below the $500 budget, add +1 bonus point."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination ensures critical requirements are never violated, while still encouraging optimization within acceptable parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Alignment Requires Precision:&lt;/strong&gt; Without explicit penalties for missing the meeting, even a well-intentioned point system can lead to failures. Intent alone isn't enough—you must formalize every constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Logic Replaces Language:&lt;/strong&gt; This exercise demonstrates programming agent behavior through mathematical objectives rather than conversational instructions—the essence of reward engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The Future of Software Development:&lt;/strong&gt; This approach reflects Socher and McCann's vision for 2026: rather than giving AI step-by-step instructions, we'll define the rules and constraints, then let AI agents find optimal solutions within those boundaries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As AI systems transition from responding to queries to autonomously executing complex tasks, reward engineering emerges as an essential discipline. Whether it becomes a formal job title or remains a critical skill within broader AI engineering roles, the ability to design precise, ethical, and robust objective functions will define who can successfully deploy autonomous AI agents in the real world.&lt;/p&gt;

&lt;p&gt;Start developing these skills now: think in terms of measurable outcomes, anticipate unintended behaviors, and practice translating human intent into mathematical frameworks. The future of AI isn't just about building smarter systems—it's about building systems that are smart in the &lt;em&gt;right&lt;/em&gt; ways.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>career</category>
      <category>llm</category>
    </item>
    <item>
      <title>Why Your Multi-Agent AI System Is Probably Making Things Worse?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Mon, 05 Jan 2026 23:50:13 +0000</pubDate>
      <link>https://dev.to/imaginex/the-ai-agent-scaling-problem-why-more-isnt-better-9nh</link>
      <guid>https://dev.to/imaginex/the-ai-agent-scaling-problem-why-more-isnt-better-9nh</guid>
      <description>&lt;p&gt;2025 has been dubbed the "Year of the Agent" by investors and tech media. Companies like &lt;a href="https://manus.im/" rel="noopener noreferrer"&gt;Manus&lt;/a&gt;, &lt;a href="https://www.lovart.ai/" rel="noopener noreferrer"&gt;Lovart&lt;/a&gt;, &lt;a href="https://www.fellou.ai/" rel="noopener noreferrer"&gt;Fellou&lt;/a&gt;, and &lt;a href="https://www.google.com/search?q=ai+agent+companies" rel="noopener noreferrer"&gt;many others&lt;/a&gt; have captured headlines with their AI agent applications, which are software systems that can autonomously perform tasks on your behalf, from browsing the web to analyzing documents.&lt;/p&gt;

&lt;p&gt;Over the past two years, I've built multi-agent systems for various clients across different industries by using various AI models and agent frameworks. A pattern keeps emerging: these projects look impressive in demos but struggle to work reliably in production. The same questions come up again and again: why isn't adding more agents helping? Why doesn't giving the system more tokens (via prompt engineering or Retrieval-Augmented Generation pipeline (RAG)), more tool calls, or more compute budget improve results?&lt;/p&gt;

&lt;p&gt;The industry has embraced two assumptions that seem logical on the surface:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;More agents = better results.&lt;/strong&gt; Since a single AI agent has limited capabilities, having multiple agents collaborate should solve more complex problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute = better performance.&lt;/strong&gt; If results aren't good enough, just give the AI more time to think (more "tokens") and more tools to work with.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But are these assumptions actually true?&lt;/p&gt;

&lt;p&gt;Recent research tells a very different story. A report from UC Berkeley, &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2512.04123" rel="noopener noreferrer"&gt;Measuring Agents in Production&lt;/a&gt;"&lt;/em&gt; (December 2025), combined with two papers from Google DeepMind, systematically debunks both assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More agents ≠ better results.&lt;/strong&gt; Multi-agent systems often perform &lt;em&gt;worse&lt;/em&gt; than single agents due to coordination overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute ≠ better performance.&lt;/strong&gt; Agents don't know how to effectively use extra resources. They leave 85% of their budget untouched.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These studies reveal that current AI agents have fundamental limitations that no amount of scaling can easily fix. Let me walk you through what the research actually shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reality Check: What Berkeley Found in Production Systems
&lt;/h2&gt;

&lt;p&gt;The Berkeley team surveyed 306 practitioners and conducted 20 in-depth case studies with organizations actually running AI agents in production, including Accenture, Amazon, AMD, Anyscale, Broadcom Inc., Google, IBM, Intel, Intesa Sanpaolo, Lambda, Mibura Inc, Samsung SDS, and SAP. Crucially, they filtered out demo-stage or conceptual projects, focusing only on systems generating real business value.&lt;/p&gt;

&lt;p&gt;Their findings paint a surprisingly conservative picture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most agents are kept on a very short leash.&lt;/strong&gt; 68% of production systems limit agents to 10 steps or fewer. Only 16.7% allow dozens of steps, and a mere 6.7% give agents unlimited autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Companies build safety barriers between agents and real systems.&lt;/strong&gt; Rather than letting agents directly call production APIs, engineering teams create simplified "wrapper APIs", bundling multiple complex operations into single, safer commands. For example, instead of making an agent call three separate database queries, engineers package them into one pre-tested function. This reduces what could go wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-designed workflows dominate.&lt;/strong&gt; 80% of successful deployments use "structured control flow", meaning humans draw the flowchart, and the AI simply fills in the blanks at predetermined decision points. The agent isn't autonomously planning, it's following a script.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents require massive instruction sets.&lt;/strong&gt; 12% of deployed systems use prompts exceeding 10,000 tokens (roughly 7,500 words of instructions). These aren't lightweight assistants, they're heavily engineered systems with extensive guardrails.&lt;/p&gt;

&lt;p&gt;In essence, today's successful AI agents work like &lt;strong&gt;tireless interns with good reading comprehension&lt;/strong&gt;, useful within a tightly defined process, capable of handling some ambiguity, but not the autonomous problem-solvers the marketing suggests.&lt;/p&gt;

&lt;p&gt;So why are production systems so constrained? Two papers from Google DeepMind, published in late 2025, may provide the answers by systematically disproving the core assumptions behind agent scaling:&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitation #1: More Agents ≠ Better Performance
&lt;/h2&gt;

&lt;p&gt;DeepMind's paper &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2512.08296" rel="noopener noreferrer"&gt;Towards a Science of Scaling Agent Systems&lt;/a&gt;"&lt;/em&gt; tackled a seductive idea: if one AI isn't smart enough, why not create a whole team? Imagine GPT handling product management, Claude writing code, and Gemini running tests—a virtual software company where PhD-level AIs collaborate to solve any problem.&lt;/p&gt;

&lt;p&gt;It sounds logical. After all, that's how human organizations scale. But 180 controlled experiments later, DeepMind proved this intuition wrong.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experiment Setup
&lt;/h3&gt;

&lt;p&gt;The researchers tested five different ways to organize AI agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single Agent&lt;/strong&gt;: One AI handles everything (think: a solo developer)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Independent Multi-Agent&lt;/strong&gt;: Multiple AIs work on the same problem separately, then their answers are combined through voting (think: getting multiple opinions, then picking the consensus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decentralized Multi-Agent&lt;/strong&gt;: Agents communicate directly with each other to negotiate solutions (think: a peer-to-peer discussion group)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Multi-Agent&lt;/strong&gt;: One "manager" agent assigns tasks and verifies results (think: a team with a project manager)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt;: A combination of centralized coordination with peer communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They tested these architectures using top models from OpenAI, Google, and Anthropic across four different task types: financial analysis, web browsing, game planning (Minecraft-style crafting), and general workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Finding
&lt;/h3&gt;

&lt;p&gt;DeepMind discovered a formula that predicts agent system performance with average 87% accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the costs often outweigh the benefits&lt;/strong&gt;. When coordination overhead, miscommunication, and tool management burden exceed the gains from parallelization, adding more agents makes systems &lt;em&gt;worse&lt;/em&gt;, not better.&lt;/p&gt;

&lt;p&gt;The results varied dramatically by task type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Financial analysis&lt;/strong&gt;: Multi-agent systems helped (up to 81% improvement with centralized architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web browsing&lt;/strong&gt;: Minimal benefit; errors actually got amplified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Game planning (PlanCraft)&lt;/strong&gt;: Multi-agent systems performed &lt;em&gt;significantly worse&lt;/em&gt; than single agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General workflows&lt;/strong&gt;: Mixed results; decentralized approaches slightly better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See the figure below for the detailed results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4captxy9o19fjjzwtupe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4captxy9o19fjjzwtupe.png" alt=" " width="800" height="596"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Multi-Agent Systems Fail: Three Key Reasons
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. The Coordination Tax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In complex, open-ended tasks, adding more agents makes the system &lt;em&gt;dumber&lt;/em&gt;, not smarter.&lt;/p&gt;

&lt;p&gt;Consider the PlanCraft benchmark (a Minecraft-style planning task). When Anthropic's Claude model was put into a multi-agent setup, performance dropped by 35%. Why? Every agent must understand tool interfaces, maintain conversation context, and process results from other agents. When the tool count exceeds a threshold, agents spend all their "mental bandwidth" on coordination, reading documentation and attending virtual meetings, with no capacity left for actual problem-solving.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The Capability Saturation Effect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a single agent can already solve a problem with greater than 45% accuracy, adding more agents typically provides diminishing or negative returns.&lt;/p&gt;

&lt;p&gt;The logic is straightforward: if one agent can correctly answer "What is 2+2?", having three agents debate the answer for an hour won't improve accuracy, it just wastes resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Error Amplification (The Most Surprising Finding)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intuitively, you'd expect that having three agents vote on an answer would &lt;em&gt;reduce&lt;/em&gt; errors, the wisdom-of-crowds effect. But DeepMind found the opposite.&lt;/p&gt;

&lt;p&gt;In independent multi-agent systems (where agents work separately then vote), errors don't cancel out—they multiply. The paper quantifies this with an "error amplification factor" of 17.2. This means &lt;strong&gt;if a single agent has a 5% error rate, an independent multi-agent system can have an error rate as high as 86%&lt;/strong&gt; (5% × 17.2).&lt;/p&gt;

&lt;p&gt;Why does this happen? Without cross-verification during reasoning, each agent makes errors based on its own flawed logic. These errors become self-reinforcing within each agent's context. When you aggregate three independently wrong answers through voting, you're not getting wisdom, you're getting confident wrongness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitation #2: More Thinking Time ≠ Better Results
&lt;/h2&gt;

&lt;p&gt;If adding more agents doesn't work, what about giving a single agent more time to think?&lt;/p&gt;

&lt;p&gt;After OpenAI released its o1 model, "test-time compute" became a buzzword. The idea: let AI models "think longer" by giving them more computational budget during inference. Search more, reason more, and eventually they'll find the answer, right? But is it true?&lt;/p&gt;

&lt;p&gt;DeepMind's paper &lt;em&gt;"&lt;a href="https://arxiv.org/abs/2511.17006" rel="noopener noreferrer"&gt;Budget-Aware Tool-Use Enables Effective Agent Scaling&lt;/a&gt;"&lt;/em&gt; tested this assumption—and found it largely false.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experiment
&lt;/h3&gt;

&lt;p&gt;Researchers increased an agent's "tool-call budget", the number of web searches, API calls, or other actions it could perform, from 10 to 100. The expectation: 10x more resources should yield significantly better results.&lt;/p&gt;

&lt;p&gt;The reality: &lt;strong&gt;doubling the budget improved accuracy by only 0.2 percentage points&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even more telling: when given a budget of 100 tool calls, agents only used an average of 14.24 searches and 1.36 browsing sessions. They left 85% of their budget untouched.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Agents Can't Use Extra Resources Effectively
&lt;/h3&gt;

&lt;p&gt;The core problem: &lt;strong&gt;agents don't know what they don't know, and they don't track their remaining budget&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When an agent goes down a wrong path, say, searching for a paper title that doesn't exist, it has no concept of opportunity cost. It will keep digging deeper into a dead end rather than trying a different approach. Give it unlimited compute, and it just digs a deeper hole.&lt;/p&gt;

&lt;p&gt;Making matters worse, long conversation contexts cause "attention drift". After a dozen failed searches, the agent gets lost in its own accumulated noise, the search results, error messages, and dead ends it generated. Performance actually &lt;em&gt;declines&lt;/em&gt; as context grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Potential Solution: Budget-Aware Agents (BATS)
&lt;/h3&gt;

&lt;p&gt;DeepMind proposed a framework called BATS (Budget-Aware Test-time Scaling) that addresses these issues with two key mechanisms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Budget-Aware Planning&lt;/strong&gt;: Instead of making a fixed plan upfront, the agent maintains a dynamic task tree. Each node tracks its status (pending, completed, failed) and resource consumption. When budget is plentiful, expand exploration; when budget is tight, focus on verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Budget-Aware Verification&lt;/strong&gt;: After proposing an answer, a separate verification step checks constraints: What's confirmed? What's contradicted? What can't be verified? Based on this assessment and remaining budget, the agent decides whether to dig deeper or abandon the current path.&lt;/p&gt;

&lt;p&gt;The results were significant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BrowseComp benchmark&lt;/strong&gt;: 24.6% accuracy (95% improvement over standard approaches at 12.6%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BrowseComp-ZH&lt;/strong&gt;: 46.0% accuracy (46% improvement over 31.5%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost efficiency&lt;/strong&gt;: 40% lower total cost (tokens + tool calls) at equivalent accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The lesson: raw thinking time isn't enough. Agents need structured self-reflection, the ability to recognize dead ends, and the wisdom to cut losses early.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Needed for AI Agents to Work?
&lt;/h2&gt;

&lt;p&gt;Let's return to DeepMind's core formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Net Performance = (Individual Capability + Collaboration Benefits) − (Coordination Chaos + Communication Overhead + Tool Complexity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem isn't that we need smarter models or more compute. &lt;strong&gt;The problem is that the negative factors, including coordination overhead, communication noise, and tool complexity, are overwhelming the positive factors.&lt;/strong&gt; All of these boil down to one root cause: inefficient use of context.&lt;/p&gt;

&lt;p&gt;Every token spent on coordination, error recovery, or tool documentation is a token not spent on actual problem-solving. To make agents work, we need to reduce this context burden, not pile on more resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Directions That Actually Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Smarter Tool Management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key insight from the financial analysis success (81% improvement with multi-agent systems) is instructive: it worked because the task had clear boundaries and well-defined steps.&lt;/p&gt;

&lt;p&gt;Financial analysis follows a predictable pattern: read report → extract data → calculate ratios → generate summary. Each agent fills in blanks within a predetermined framework, no creative planning required.&lt;/p&gt;

&lt;p&gt;This tells us something important: &lt;strong&gt;current AI models cannot self-organize division of labor&lt;/strong&gt;. They can handle easily parallelizable tasks (like financial analysis) or consensus-based error correction (like multi-path search), but not emergent collaboration.&lt;/p&gt;

&lt;p&gt;The implication? For complex tasks, &lt;strong&gt;human-designed task decomposition (SOPs) remains necessary&lt;/strong&gt;. The dream of throwing agents together and watching them spontaneously develop hierarchies has been empirically disproven.&lt;/p&gt;

&lt;p&gt;This is why &lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Anthropic's Skills mechanism&lt;/a&gt; matters: it lets agents accumulate reusable capability modules instead of starting from scratch, reducing the cognitive load of tool management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Built-in Self-Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BATS works because it formalizes verification. The system explicitly tracks constraints: what's satisfied, what's contradicted, what can't be verified yet. This isn't emergent behavior, it's enforced through careful prompt engineering.&lt;/p&gt;

&lt;p&gt;Without structured verification, errors accumulate silently. Each mistake pollutes the context with garbage that degrades future reasoning. Formal verification catches errors early, preventing context pollution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Efficient Inter-Agent Communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Current agents coordinate via natural language, verbose, ambiguous, requiring constant clarification. This high message density is inherently wasteful.&lt;/p&gt;

&lt;p&gt;Future improvements might come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured communication protocols (like &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;Google's A2A framework&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Latent-space communication where models exchange compressed representations rather than text&lt;/li&gt;
&lt;li&gt;Shared memory architectures that reduce redundant information exchange&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Until these three capabilities mature, such as smart tool management, built-in verification, and efficient communication, multi-agent systems will continue to underperform their theoretical potential.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Despite the hype, the "Year of the Agent" hasn't truly arrived. The research tells us:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Production agents are heavily constrained&lt;/strong&gt;: most limited to 10 steps or fewer, running within human-designed workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More agents often means worse performance&lt;/strong&gt;: coordination costs overwhelm collaboration benefits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compute doesn't help much&lt;/strong&gt;: agents don't know how to use extra resources effectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The path forward is reducing overhead&lt;/strong&gt;, not adding more power&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Current successful AI agents are best understood as &lt;strong&gt;capable interns with good reading comprehension, working within strict SOPs&lt;/strong&gt;. They handle ambiguity better than traditional software, but they're not autonomous problem-solvers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For practitioners, the implication is clear: invest in workflow design, tool abstraction, and structured verification rather than chasing multi-agent architectures or unlimited compute budgets. The engineering fundamentals—not the scaling laws—determine success.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2512.04123" rel="noopener noreferrer"&gt;Measuring Agents in Production&lt;/a&gt; - UC Berkeley (December 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2512.08296" rel="noopener noreferrer"&gt;Towards a Science of Scaling Agent Systems&lt;/a&gt; - Google DeepMind (December 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2511.17006" rel="noopener noreferrer"&gt;Budget-Aware Tool-Use Enables Effective Agent Scaling&lt;/a&gt; - Google DeepMind (November 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview" rel="noopener noreferrer"&gt;Agent Skills - Claude Docs&lt;/a&gt; - Anthropic&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>discuss</category>
    </item>
    <item>
      <title>AI Made Me 10x Faster—Here's What I Had to Change</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 19 Dec 2025 22:05:43 +0000</pubDate>
      <link>https://dev.to/imaginex/ai-made-me-10x-faster-heres-what-i-had-to-change-3j91</link>
      <guid>https://dev.to/imaginex/ai-made-me-10x-faster-heres-what-i-had-to-change-3j91</guid>
      <description>&lt;p&gt;I've been working as an IT engineer in multiple industries for over 25 years, from .Net developer, BI developer, data architect, data scientist, and finally to AI solutions architect. Recently, the team in my organization is developing and using AI programming tools and has achieved 8x to 20x code output efficiency compared to ordinary high-performing teams. See this &lt;a href="https://www.linkedin.com/feed/update/urn:li:activity:7369788200967479296/" rel="noopener noreferrer"&gt;LinkedIn post&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;Reading this, you might think: Wow, are programmers going to lose their jobs? Is AI going to replace humans?&lt;/p&gt;

&lt;p&gt;But my view is exactly the opposite. As a frontline professional programmer, I'm telling you responsibly: When your speed increases 10x, the risks and bottlenecks you face may also be magnified 10x.&lt;/p&gt;

&lt;p&gt;I'll be honest - even I was skeptical at first. Could we really sustain this pace without everything falling apart?&lt;/p&gt;

&lt;p&gt;What this means is: AI has fundamentally changed how "costs" and "benefits" are calculated in software engineering. But to truly enjoy this improvement, the entire software development system needs to be upgraded simultaneously. This insight applies not only to programming but has profound implications for everyone who uses AI tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. AI Hasn't Made Programmers Unemployed, But Has Fundamentally Changed How People Work
&lt;/h2&gt;

&lt;p&gt;Here's how my team works:&lt;/p&gt;

&lt;p&gt;In the code our team submits, 80% - 90% is written by AI. But this is definitely not that casual "Vibe Coding (i.e., coding without thinking)", which is not a good coding practice. This workflow is called "Agentic Coding" (i.e., coding with AI agents).&lt;/p&gt;

&lt;p&gt;In this model, AI plays the role of a "highly capable but irresponsible junior programmer."&lt;/p&gt;

&lt;p&gt;And the human engineers? They are "experienced tech leads or architects."&lt;/p&gt;

&lt;p&gt;Specifically, the engineer's workflow has become:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design, plan and break down tasks&lt;/strong&gt; - Figure it out yourself first, or brainstorm with AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Give AI instructions&lt;/strong&gt; - Clearly tell AI what to do&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review and refine AI's output&lt;/strong&gt; - This is the most critical step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate repeatedly&lt;/strong&gt; - Until completely satisfied with the quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Submit and take full responsibility&lt;/strong&gt; - Ultimately, humans are still responsible for the code&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See that? The role of humans hasn't diminished; it's become more important. The focus of work has just shifted from "writing code by hand" to "defining requirements" and "code review."&lt;/p&gt;

&lt;p&gt;Think about your own work: What would change if 80% of your output came from AI?&lt;/p&gt;

&lt;p&gt;An analogy: Previously you were a worker carrying bricks on a construction site. Now you're an operator commanding an excavator. Although you no longer carry bricks with your own hands, your judgment, operational skills, and responsibilities have actually increased.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Speed Increases 10x, Accident Rate May Also Increase 10x
&lt;/h2&gt;

&lt;p&gt;When you're speeding down a track at 200 km/h, you need massive "downforce" to keep the car firmly planted on the ground. Otherwise, you'll fly off at the first curve.&lt;/p&gt;

&lt;p&gt;In software engineering, "flying off" means bugs and system crashes.&lt;/p&gt;

&lt;p&gt;Some alarming data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; A team might only encounter one or two severe production bugs per year&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Now:&lt;/strong&gt; When you're submitting code at 10x speed, even if the probability of bugs stays the same, the absolute number of bugs you encounter will also increase 10x&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;Incidents that used to happen once a year might now happen every week. Imagine explaining to your boss why production went down every Monday.&lt;/p&gt;

&lt;p&gt;This kind of "accident rate" is unbearable for any team. Yet many people promoting "AI omnipotence" have intentionally or unintentionally ignored this problem.&lt;/p&gt;

&lt;p&gt;To enjoy the 10x coding speed boost from AI, you must also find ways to reduce the "probability of problems" by 10x, or even more.&lt;/p&gt;

&lt;p&gt;Having a good engine isn't enough; you also need a better braking system.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The True Value of AI: Making "Good but Expensive Methods" Affordable
&lt;/h2&gt;

&lt;p&gt;So how do you reduce risk while increasing speed?&lt;/p&gt;

&lt;p&gt;AI isn't just about letting you write faster; it's about making those "good but too expensive" best practices in software engineering become affordable and feasible.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Change #1: Build a "Wind Tunnel Testing" Environment
&lt;/h3&gt;

&lt;p&gt;What is wind tunnel testing?&lt;/p&gt;

&lt;p&gt;Just like building an airplane - before it actually takes flight, the model is put in a wind tunnel to test various extreme conditions.&lt;/p&gt;

&lt;p&gt;In software development, this means building a "high-fidelity simulation environment" locally. For example, if your system depends on 10 external services (databases, authentication, payments, etc.), you run or simulate all 10 services locally.&lt;/p&gt;

&lt;p&gt;This way, on your computer you can run complete end-to-end tests, and even simulate various extreme failure scenarios.&lt;/p&gt;

&lt;p&gt;This kind of testing can catch a lot of bugs hidden in the "cracks between components."&lt;/p&gt;

&lt;p&gt;Why didn't we do this before? Too expensive!&lt;/p&gt;

&lt;p&gt;Simulating and maintaining these services was too much work, so most teams gave up.&lt;/p&gt;

&lt;p&gt;Why can we do it now? AI excels at this!&lt;/p&gt;

&lt;p&gt;AI is very good at writing these simulation services with clear logic and well-defined behavior. Especially by using AI agents with Model Context Protocol (MCP) and Agent2Agent Protocol (A2A), we can easily build a complete local "wind tunnel" for our fairly complex system.&lt;/p&gt;

&lt;p&gt;Work that used to take weeks or even months can now be done in days.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Change #2: Upgrade CI/CD (Continuous Integration/Deployment)
&lt;/h3&gt;

&lt;p&gt;In the early days of waterfall development, everyone developed separately and then integrated after development. The result was a pile of problems during integration, taking a long time to stabilize.&lt;/p&gt;

&lt;p&gt;Later, the concept of "continuous integration" became popular:&lt;/p&gt;

&lt;p&gt;The earlier you integrate, the earlier you get feedback. The more frequently you integrate, the more you can reduce problem complexity.&lt;/p&gt;

&lt;p&gt;Now, CI/CD is recognized as the best practice in software engineering. But not many teams actually do it well, because building and maintaining it is still not cheap.&lt;/p&gt;

&lt;p&gt;What's worse is that many teams, despite having CI/CD, have extremely time-consuming processes. One code commit, waiting for all tests and deployment to run through - at minimum ten minutes, sometimes several hours.&lt;/p&gt;

&lt;p&gt;Before AI, these problems weren't obvious. Now that AI is more capable, they've become obstacles.&lt;/p&gt;

&lt;p&gt;So CI/CD also needs to be upgraded along with it, compressing the feedback loop from "hours" to "minutes." You need infrastructure that's fast to an absurd degree, able to discover, isolate, and roll back problematic changes within minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Change #3: Decision-Making and Communication Systems Also Need Upgrading
&lt;/h3&gt;

&lt;p&gt;10x code output means you also need 10x or more communication and decision-making efficiency.&lt;/p&gt;

&lt;p&gt;Before, developing a system required various meetings, lengthy discussions, and only then could work begin. After all, you had to depend on other people's modules, so you had to define agreements first, otherwise you couldn't integrate later.&lt;/p&gt;

&lt;p&gt;Various technical decisions also required repeated discussion for a long time, because development costs were high back then, and if decisions were wrong, the cost of rework was too great.&lt;/p&gt;

&lt;p&gt;But now, if we still have the same communication efficiency as before, it will greatly drag down overall efficiency.&lt;/p&gt;

&lt;p&gt;Perhaps the most efficient approach is to minimize communication as much as possible, letting everyone do their tasks as independently as possible from others.&lt;/p&gt;

&lt;p&gt;For example, microservices architecture might be a good choice in the AI era.&lt;/p&gt;

&lt;p&gt;For technical decisions, now you can actually have more opportunities to experiment. You don't need to be as rigorous as before in repeatedly verifying technical decisions. Because development costs have decreased, the cost of experimentation has also decreased.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Insights for Everyone
&lt;/h2&gt;

&lt;p&gt;AI isn't a "stimulant" that makes you run faster; it's giving you a "supercar."&lt;/p&gt;

&lt;p&gt;But the question is: Are you ready to drive it?&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Tool Upgrade Doesn't Equal System Upgrade
&lt;/h3&gt;

&lt;p&gt;Using AI is like upgrading your car with a brand new engine. If you just install it on your old "vintage car," what you get isn't 10x speed, but 10x problems.&lt;/p&gt;

&lt;p&gt;This principle applies to each of us.&lt;/p&gt;

&lt;p&gt;When you learn to use AI tools (ChatGPT, Gemini, Claude, Midjourney, various AI assistants), don't assume your work efficiency will automatically improve.&lt;/p&gt;

&lt;p&gt;Your workflows, quality inspection mechanisms, and collaboration methods all need to be adjusted accordingly.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;Using AI to write code is fast, but if you don't have a rigorous review process, you might produce a lot of low-quality, buggy code.&lt;/p&gt;

&lt;p&gt;Using AI to quickly generate investment advice - but has your risk assessment ability kept up?&lt;/p&gt;

&lt;p&gt;Using AI to make quick decisions - but have you established a review and error-correction mechanism?&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Speed Increase Must Come with Risk Management Improvement
&lt;/h3&gt;

&lt;p&gt;The "accident rate paradox" doesn't only exist in programming.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The faster food delivery, the higher the traffic accident risk&lt;/li&gt;
&lt;li&gt;The faster product iteration, the more quality issues there might be&lt;/li&gt;
&lt;li&gt;The faster decision-making, the greater the probability of making mistakes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So don't blindly pursue "fast." Ask yourself: Has my "braking system" been upgraded?&lt;/p&gt;

&lt;p&gt;Build your own "wind tunnel testing": Try on a small scale first, rather than pushing forward comprehensively right away.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Re-examine Those "Good but Expensive" Methods
&lt;/h3&gt;

&lt;p&gt;Here's what finally clicked for me after months of using AI tools:&lt;/p&gt;

&lt;p&gt;The true value of AI isn't just about "writing faster"; it's about making those "good but too expensive" best practices become affordable and feasible.&lt;/p&gt;

&lt;p&gt;This realization made me re-examine many good habits I had abandoned because they were "too troublesome":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personal finance management:&lt;/strong&gt; Keeping track of expenses used to be too troublesome. Now AI can help you automatically categorize and analyze&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning new skills:&lt;/strong&gt; Hiring a private tutor used to be too expensive. Now AI can provide personalized tutoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health management:&lt;/strong&gt; Nutritionists used to be too expensive. Now AI can customize meal plans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content creation:&lt;/strong&gt; Making videos used to require a team. Now individuals can also produce high-quality content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key is: Don't just use AI for "fast production." Use it to achieve "things you wanted to do before but couldn't."&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Your Role is Changing
&lt;/h3&gt;

&lt;p&gt;In programming, my role has shifted from "executor" to "decision-maker + quality inspector." But this pattern applies everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Writers&lt;/strong&gt; are becoming editors who review and refine AI drafts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Designers&lt;/strong&gt; are becoming creative directors who guide AI-generated concepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysts&lt;/strong&gt; are becoming strategists who interpret AI-processed data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managers&lt;/strong&gt; are becoming orchestrators who coordinate AI-assisted workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common thread? Responsibilities haven't decreased; they've actually increased.&lt;/p&gt;

&lt;p&gt;In the AI era, your core competitiveness is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Judgment&lt;/strong&gt; - Being able to distinguish good from bad AI output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Questioning ability&lt;/strong&gt; - Being able to give AI clear instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sense of responsibility&lt;/strong&gt; - Being willing to take responsibility for the final results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Systems thinking&lt;/strong&gt; - Understanding the entire process, not just one part&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Many people can use AI to write reports, but those who can review AI's logical flaws are valuable.&lt;/p&gt;

&lt;p&gt;Many people can use AI to design solutions, but those who can judge a solution's feasibility are irreplaceable.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.5 Build Your AI Work System
&lt;/h3&gt;

&lt;p&gt;For us ordinary people, don't just focus on "individual AI tools." Build your own "AI work system":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input system&lt;/strong&gt; - How to quickly and accurately provide AI with information and instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality inspection system&lt;/strong&gt; - How to efficiently review AI's output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback system&lt;/strong&gt; - How to iterate and improve quickly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge management&lt;/strong&gt; - How to accumulate and reuse experience from AI collaboration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;It's not just about using ChatGPT; you also need to build your own prompt library, review checklist, and iteration workflow.&lt;/p&gt;

&lt;p&gt;It's not just about using AI for image generation; you also need to establish a style library, quality standards, and version management.&lt;/p&gt;

&lt;p&gt;This is the true way of working in the AI era.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Summary
&lt;/h2&gt;

&lt;p&gt;The AI era requires "systems thinking," not "tool thinking."&lt;/p&gt;

&lt;p&gt;Many people treat AI as a "fast production tool," hoping to use it to accelerate existing work.&lt;/p&gt;

&lt;p&gt;But those who truly understand how to leverage AI treat it as an "opportunity for system upgrade," rethinking the entire workflow.&lt;/p&gt;

&lt;p&gt;AI isn't just about upgrading the car's engine; it's also about upgrading the roads the car frequently drives on.&lt;/p&gt;

&lt;p&gt;The goal for veteran drivers isn't to be replaced by AI, but to help them adapt to the new high-speed engine, giving them a comfortable and safe driving environment.&lt;/p&gt;

&lt;p&gt;So when AI increases your speed 10x or 20x, don't rush to celebrate.&lt;/p&gt;

&lt;p&gt;First ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has my quality inspection mechanism been upgraded?&lt;/li&gt;
&lt;li&gt;Has my risk management ability improved?&lt;/li&gt;
&lt;li&gt;Has my workflow been restructured?&lt;/li&gt;
&lt;li&gt;Am I ready to take full responsibility for the results?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Remember: You are the driver who is responsible for the final results.&lt;/p&gt;

&lt;p&gt;AI just gave you a supercar, but whether you can arrive at your destination safely and efficiently depends on your driving skills and road conditions.&lt;/p&gt;

&lt;p&gt;May we all become good drivers in the AI era - ones who can step on the gas, and also know when to brake.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Key Breakthroughs in AI Engineering that Every AI Engineer Must Know</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Fri, 19 Dec 2025 19:52:47 +0000</pubDate>
      <link>https://dev.to/imaginex/key-breakthroughs-in-ai-engineering-that-every-ai-engineer-must-know-3ibh</link>
      <guid>https://dev.to/imaginex/key-breakthroughs-in-ai-engineering-that-every-ai-engineer-must-know-3ibh</guid>
      <description>&lt;p&gt;This blog post provides a clear understanding of the logic behind the entire technological evolution, showing how we went step by step from "this thing can run" to "this thing can actually do work." We'll explain the development of AI engineering from 2017 to the present in a simple and easy-to-understand way. The key breakthroughs are grouped into 4 categories as follows:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Beginning of Everything: From "Architectural Revolution" to "Emergent Capabilities"
&lt;/h2&gt;

&lt;p&gt;The starting point of this story is 2017. The famous paper called &lt;a href="https://arxiv.org/pdf/1706.03762" rel="noopener noreferrer"&gt;"Attention Is All You Need"&lt;/a&gt; is the "birth certificate" of the Transformer architecture, which is the foundation of all modern large language models.&lt;/p&gt;

&lt;p&gt;Before it, models processed text word by word "sequentially" like &lt;a href="https://en.wikipedia.org/wiki/Recurrent_neural_network" rel="noopener noreferrer"&gt;RNNs&lt;/a&gt;, which was not only slow but also struggled with long texts (like forgetting what was said earlier when reading to the end). The core contribution of Transformer was introducing the "Self-Attention" mechanism, allowing the model to "simultaneously" look at all words in a sentence and determine which words are more important to each other.&lt;/p&gt;

&lt;p&gt;This brought two huge benefits: training can be massively parallelized, and handling long-range dependencies became much better. Following that, in 2020, &lt;a href="https://arxiv.org/pdf/2005.14165" rel="noopener noreferrer"&gt;OpenAI's GPT-3 paper "Language Models are Few-Shot Learners"&lt;/a&gt; brought the second key breakthrough. It showed that with a large enough model, it can learn to perform a variety of tasks with just a few examples (few-shot). When you scale up the Transformer model large enough, it will "emerge" with a new capability called "In-Context Learning."&lt;/p&gt;

&lt;p&gt;This means you no longer need to fine-tune or "customize" the model for specific tasks (like translation or summarization). You just need to give it a few examples (few-shot) in the prompt, and it can "learn by imitation" and understand what you actually want to do. This discovery completely changed the rules of the game. For practitioners like us, this means we can use a general-purpose foundation model and solve various problems through "prompt engineering" or "context engineering".&lt;/p&gt;

&lt;h2&gt;
  
  
  "Training" the Model: How to Make It Obedient, Professional, and Cost-Effective?
&lt;/h2&gt;

&lt;p&gt;With GPT-3's "brute force creates miracles" approach, people quickly discovered new problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It's powerful but doesn't "listen"&lt;/strong&gt; - it often talks nonsense and sometimes even outputs toxic content. We need to make it "listen" to us and "obey" our instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's "expensive"&lt;/strong&gt; - if you want it to perform better in a specialized domain (like law or medicine), the cost of full fine-tuning is terrifyingly high. We need to find a way to make it more cost-effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It's a "bookworm"&lt;/strong&gt; - training data has a cutoff date, and it doesn't know about new knowledge from the outside world or company internal materials. We need to make it more "knowledgeable" on external knowledge and "professional" in the specialized domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the following key breakthroughs were about solving these "usability" issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Making It "Obedient" - InstructGPT (2022)
&lt;/h3&gt;

&lt;p&gt;The core of this breakthrough is solving the "Alignment" problem, that is, making the model "listen" to us and "obey" our instructions. The paper &lt;a href="https://arxiv.org/pdf/2203.02155" rel="noopener noreferrer"&gt;"Training language models to follow instructions with human feedback"&lt;/a&gt; introduced RLHF (Reinforcement Learning from Human Feedback) on a model called InstructGPT. &lt;/p&gt;

&lt;p&gt;Simply put, the process is as follows: first have humans rank the model's different responses, then train a "Reward Model" to mimic human preferences, and finally use this reward model to "train" the large model.&lt;/p&gt;

&lt;p&gt;The biggest insight from this breakthrough is: "A smaller but 'aligned' model can outperform a much larger but unaligned model in user satisfaction." This made everyone realize that bigger isn't always better - the value of "alignment" or "obedience" is extremely high.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Making It "Cost-Effective" - LoRA (2021)
&lt;/h3&gt;

&lt;p&gt;When we need to make the model "professional" in a specialized domain, full fine-tuning (updating all model parameters with new training data) was usually the only way. However, full fine-tuning is too expensive. Is there a cheaper way? &lt;a href="https://arxiv.org/pdf/2106.09685" rel="noopener noreferrer"&gt;LoRA (Low-Rank Adaptation)&lt;/a&gt; came along. &lt;/p&gt;

&lt;p&gt;Its idea is particularly clever: during fine-tuning, we don't touch the billions of original parameters (keep them frozen), but instead "insert" some small, trainable "adapters" into different layers of the model. These adapters have very few parameters (possibly only 0.01% of total parameters) but can still achieve good performance.&lt;/p&gt;

&lt;p&gt;The result is that it lowered the barrier to fine-tuning from "only big companies can afford it" to "you can run it on a single GPU." This was revolutionary for AI application deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Making It "Professional" - RAG (2020)
&lt;/h3&gt;

&lt;p&gt;Before RAG, the model was like a "bookworm" that only knew what it was trained on. It didn't know about new knowledge from the outside world or company internal materials. Moreover, it was prone to "hallucination", that is, making things up when it didn't know the answer. How do you solve the model's "outdated knowledge" and "hallucination" problems? The answer is &lt;a href="https://arxiv.org/pdf/2005.11401" rel="noopener noreferrer"&gt;RAG (Retrieval-Augmented Generation)&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The idea is straightforward: "Before the model answers a question, don't rush to make things up. First go to an external knowledge base (like company internal databases, or the internet) to retrieve a batch of relevant documents, treat these documents as 'open-book exam' materials, feed them to the model, and let it answer you based on these materials."&lt;/p&gt;

&lt;p&gt;Today, RAG is practically standard for all production-grade LLM applications (like AI customer service, knowledge base Q&amp;amp;A). It is a key technology for making models "professional" and "useful".&lt;/p&gt;

&lt;h2&gt;
  
  
  Pushing Efficiency to the Limit: How to Make Models Run on Devices with Limited Resources?
&lt;/h2&gt;

&lt;p&gt;When models really need to be deployed to consumer products, or even smartphones, "efficiency" and "cost" become matters of life and death. The next few breakthroughs revolve around "optimization."&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Making Models "Smaller" - DistilBERT (2019)
&lt;/h3&gt;

&lt;p&gt;This breakthrough uses "&lt;a href="https://arxiv.org/pdf/1910.01108" rel="noopener noreferrer"&gt;Knowledge Distillation&lt;/a&gt;" technology. The idea is to have a large, smart "teacher" model (like BERT) teach a small "student" model (like DistilBERT), making the student model mimic the teacher's behavior. The result is that the student model retains 97% of the teacher's language understanding capability, but with 40% fewer parameters and 60% faster speed. This made running AI on smartphones and "edge devices" (like smart home gadgets and wearables) possible. This is a key technology for making models "small" and "efficient".&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Making Models "Memory-Efficient" - LLM.int8 (2022)
&lt;/h3&gt;

&lt;p&gt;This breakthrough is about "&lt;a href="https://arxiv.org/pdf/2208.07339" rel="noopener noreferrer"&gt;Quantization&lt;/a&gt;." Simply put, it's trying to use fewer bits to store model weights, like going from 32-bit floating point numbers down to 8-bit integers (int8), directly reducing memory usage by 4x.&lt;/p&gt;

&lt;p&gt;The challenge is that crude compression can cause severe accuracy degradation. The "key insight" of this breakthrough is that they discovered only a very small number of "outlier features" in the model were causing trouble. So they used a "mixed precision" approach: "Store the vast majority of weights using int8, but store those critical 'outlier values' using 16 bits." The result is almost no accuracy loss, but memory savings achieved.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Making Models "More Flexible" - Switch Transformers (2021)
&lt;/h3&gt;

&lt;p&gt;This breakthrough is about the &lt;a href="https://arxiv.org/pdf/2101.03961" rel="noopener noreferrer"&gt;"MoE" (Mixture of Experts)&lt;/a&gt; architecture. The idea is to instead of training one "jack-of-all-trades" large model, you train a bunch of "specialized" expert small models (like one good at math, one good at writing poetry). For each prediction (i.e., predicting a token), you first use a "Router" to determine which "expert" is most suitable to handle this task, then only activate that one expert. &lt;/p&gt;

&lt;p&gt;The benefit is that your model's total parameter count can be very large (like trillion-scale), but the actual computational cost is very low because you only use a small portion of it each time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future Puzzle: "Agents" and "Standards"
&lt;/h2&gt;

&lt;p&gt;When we want to make models really "useful" and "capable of doing work" for us in the real world, we need to solve the problem of "how to make models interact with the outside world". The following three breakthroughs are about solving this problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Making Models "Capable of Doing Work" - Agents (2023)
&lt;/h3&gt;

&lt;p&gt;The technology called &lt;a href="https://arxiv.org/pdf/2309.07864" rel="noopener noreferrer"&gt;"LLM Agents"&lt;/a&gt; proposes a basic system framework which includes: Brain (LLM, responsible for thinking and planning), Perception (responsible for reading external information, like return results from tools), and Action (responsible for calling APIs or tools).&lt;/p&gt;

&lt;p&gt;This means models are no longer just "chatbots" - they can start helping you "do things," like booking flights, analyzing financial reports, executing code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Making Models "Interconnected" - MCP (2024)
&lt;/h3&gt;

&lt;p&gt;Before MCP, when AI needs to call an external tool (like your calendar, databases, etc.), you have to write a custom "one-to-one" integration interface, which is very troublesome. Anthropic (the developer of Claude) proposed the "&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;" in 2024 to solve this problem.&lt;/p&gt;

&lt;p&gt;The core idea of MCP is to create an "open standard" for how AI models communicate with all external tools and APIs. Just like the HTTP protocol unified communication between web browsers and servers, MCP hopes to unify how AI models communicate with all external tools and APIs. If this standard gets adopted, the connectivity efficiency of the AI ecosystem will see a qualitative leap.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Making Agents "Interoperable" - A2A Protocol (2025)
&lt;/h3&gt;

&lt;p&gt;While MCP connects AI models to tools, what happens when you have multiple AI agents that need to work together? Imagine you have one agent handling your calendar, another managing your emails, and a third analyzing your documents. How do they coordinate? The &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;"Agent2Agent (A2A) Protocol"&lt;/a&gt; proposed in 2025 addresses exactly this problem.&lt;/p&gt;

&lt;p&gt;Think of it like this: MCP is like giving each AI agent a phone to call different services. A2A is like giving all the agents a group chat so they can talk to each other. The protocol allows AI agents built on different technologies to communicate, share information securely, and coordinate their actions. This is complementary to MCP - together they create a complete ecosystem where AI can both use tools and collaborate with other AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The evolution path of AI engineering is actually very clear. It's a chain of continuously solving key problems: first making the model "able to run" (Transformer), then making it "able to learn" (GPT-3), then making it "obedient" (InstructGPT), then making it "useful and affordable" (LoRA, RAG, Quantization), and finally making it "able to do work" (Agents, MCP, A2A). Each step here represents a huge leverage point.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Developing AI Agent Application with Azure AI Foundry - Why and How?</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Wed, 11 Jun 2025 19:50:46 +0000</pubDate>
      <link>https://dev.to/imaginex/developing-ai-agents-with-azure-ai-foundry-why-and-how-4d7c</link>
      <guid>https://dev.to/imaginex/developing-ai-agents-with-azure-ai-foundry-why-and-how-4d7c</guid>
      <description>&lt;h2&gt;
  
  
  What is this blog post about?
&lt;/h2&gt;

&lt;p&gt;This technical guide explores the design and development of AI agents using &lt;a href="https://learn.microsoft.com/en-us/azure/ai-services/agents/overview" rel="noopener noreferrer"&gt;Azure AI Foundry&lt;/a&gt;. It explains both the rationale and methodology for building AI agents with Azure AI Foundry, targeting solution architects, AI engineers, and data scientists seeking to leverage AI for business transformation.&lt;/p&gt;

&lt;p&gt;The guide provides a step-by-step walkthrough for creating an AI agent application analyzes user input, generates image design recommendations, and produces unique designs based on some instructions, patterns, or user preferences. It features code snippets, architectural diagrams, and practical explanations on utilizing Azure AI Foundry and OpenAI models to build a robust image design application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why does it matter?
&lt;/h2&gt;

&lt;p&gt;AI agents are increasingly vital for organizations aiming to harness artificial intelligence to optimize processes and products. They automate tasks, analyze data, and deliver insights that drive better decision-making and product innovation. AI agents also enhance customer experiences through personalized recommendations and support, helping businesses remain competitive in a rapidly evolving landscape.&lt;/p&gt;

&lt;p&gt;Despite their benefits, building AI agents can be complex and resource-intensive, often requiring deep technical expertise. This blog post aims to demystify the process by offering a clear, actionable guide for creating AI agents with Azure AI Foundry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use Azure AI Foundry?
&lt;/h3&gt;

&lt;p&gt;Azure AI Foundry is a comprehensive, cloud-based platform that delivers a suite of AI tools and services for scalable AI adoption. It supports model deployment, agent creation, fine-tuning, data management, and seamless integration with Azure services. Organizations can build custom AI models and agents tailored to their needs, streamlining workflows and boosting productivity.&lt;/p&gt;

&lt;p&gt;Key benefits of Azure AI Foundry include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low-code development:&lt;/strong&gt; Rapidly deploy models, set up agents, create knowledge and data indexes, and integrate components with minimal coding via fully managed agent services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible deployment:&lt;/strong&gt; Deploy models and agents across regions and availability zones to minimize latency and serve users closer to their location. Deployments can also be managed within your own Azure subscription or private cloud for greater control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; Effortlessly scale resources to match business demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise integration:&lt;/strong&gt; Seamlessly connect with Microsoft 365, Dynamics 365, Azure services, Databricks, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol (MCP) support:&lt;/strong&gt; Leverage the open-source &lt;a href="https://modelcontextprotocol.io/introduction" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) to enable AI agents to communicate and interoperate seamlessly with tools and agents across diverse ecosystems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent orchestration support:&lt;/strong&gt; Build sophisticated AI applications by orchestrating multiple agents to collaborate on complex workflows and problem-solving. Open interoperability standards—such as &lt;a href="https://github.com/google-a2a/A2A" rel="noopener noreferrer"&gt;Agent2Agent (A2A)&lt;/a&gt;—enable seamless information exchange and coordination among agents across Azure, AWS, Google Cloud, and on-premises environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary model access:&lt;/strong&gt; Instantly access advanced proprietary models such as Grok 3, GPT-4o, o3, and more, empowering your applications with state-of-the-art natural language processing and image generation capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source model support:&lt;/strong&gt; Access over 10,000 open-source models—including Llama, Mistral, DeepSeek, and more—from a variety of vendors directly within Foundry Models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build-in monitoring, tracing, and governance:&lt;/strong&gt; Access deep insights into agent performance and behavior, enabling continuous improvement through built-in monitoring, tracing, and governance features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details on Azure AI Foundry's features, please refer to the blog of &lt;a href="https://azure.microsoft.com/en-us/blog/azure-ai-foundry-your-ai-app-and-agent-factory/" rel="noopener noreferrer"&gt;Azure AI Foundry: Your AI App and agent factory&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about AWS Bedrock and Google Cloud Vertex?
&lt;/h3&gt;

&lt;p&gt;AWS Bedrock is a powerful cloud-based AI platform with similar capabilities for building and deploying AI agents. However, it can be more complex to configure and maintain, as many AWS AI services are based on open-source components that require additional setup. Azure AI Foundry, by contrast, offers a more integrated and user-friendly experience, with built-in Azure service support and direct access to OpenAI models, making it easier to create and deploy AI agents without extensive technical expertise.&lt;/p&gt;

&lt;p&gt;Google Cloud Vertex AI also provides robust tools for building and deploying AI agents, including the Agent Development Kit (ADK) and support for the open A2A protocol for cross-ecosystem agent communication. However, Google Cloud's ecosystem-centric approach may limit integration with external platforms, making it more challenging to build agents requiring broad interoperability. Azure AI Foundry, in contrast, offers deeper integration across the Microsoft ecosystem and simplifies multi-service deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about the other open-source AI agent frameworks, such as LangChain, CrewAI, etc.?
&lt;/h3&gt;

&lt;p&gt;There are several open-source AI agent frameworks—such as LangChain, LlamaIndex, and CrewAI—that offer flexibility and extensibility for building custom AI solutions. However, these frameworks often require considerable effort to set up, configure, and maintain, especially when integrating with enterprise data, applications, and infrastructure. Seamless integration with Azure services and OpenAI models is typically less mature, and features like scalability, monitoring, and governance may require additional tooling or custom development compared to Azure AI Foundry's built-in capabilities.&lt;/p&gt;

&lt;p&gt;Open-source frameworks are generally best suited for proof-of-concept (PoC) development, desktop application development, or highly specialized tasks, such as advanced natural language processing or domain-specific data analysis. They often require orchestrating multiple cloud services and AI models, which can add operational complexity. For example, building a healthcare data analysis agent with CrewAI might involve integrating Claude 4 on AWS Bedrock for code generation, Gemini on Google Cloud Vertex AI for healthcare data understanding, and OpenAI image models on Azure AI Foundry for image generation. This multi-cloud, multi-model approach can increase both technical complexity and operational overhead.&lt;/p&gt;

&lt;p&gt;When choosing an AI agent framework, a key consideration is the location of your data and applications. If your ecosystem is primarily on Azure, then Azure AI Foundry could be naturally the choice. It provides a unified, enterprise-ready platform that streamlines the process of building, deploying, and managing AI agents—while leveraging the full power of Azure services and OpenAI models.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to build AI agents with Azure AI Foundry?
&lt;/h2&gt;

&lt;p&gt;The rest of this blog post will guide you through the process of building an AI agent application using Azure AI Foundry, focusing on image design (i.e., editing and generation capabilities). The goal is to create AI agents that can analyze user input, generate design recommendations based on some instructions, patterns or user preferences, and create unique designs. The following sections will cover the solution, architecture design, and a step-by-step process to set up and develop the application.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the solution?
&lt;/h3&gt;

&lt;p&gt;The solution is to leverage AI agents to streamline the image design process, enabling people to quickly adapt to changing trends, styles, and preferences. By using Azure AI Foundry and OpenAI models, people can create a comprehensive image design application that automates the analysis of user input, generates recommendations, and creates unique designs.&lt;br&gt;
This application will allow people to focus on creativity and innovation, while the AI agent handles the repetitive and time-consuming tasks of analyzing instructions and preferences and generating designs. The application will also provide a user-friendly interface for people to interact with the AI agent, making it easy to input design requirements, reference images, and instructions.&lt;/p&gt;
&lt;h4&gt;
  
  
  What are the AI models used?
&lt;/h4&gt;

&lt;p&gt;OpenAI models are state-of-the-art generative AI models that can be used for various applications, including natural language processing, image generation, and more. These models can help people create unique and personalized designs, analyze instructions and consumer preferences, and generate design recommendations. By integrating OpenAI models with Azure AI Foundry, people can create a powerful image design application that leverages the strengths of both technologies.&lt;/p&gt;
&lt;h4&gt;
  
  
  What is the architecture design?
&lt;/h4&gt;

&lt;p&gt;To get started, the diagram below shows the architecture that will be developed. Simply put, it utilizes Azure AI Foundry for agent orchestration, OpenAI’s image generation model (gpt-image-1), blob storage to store instructions (i.e., documents) and images, and an app service to host the Streamlit UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9idcf4ylmbgfpm06lg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9idcf4ylmbgfpm06lg2.png" alt=" " width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter detailed text prompts describing the desired image design.&lt;/li&gt;
&lt;li&gt;Select existing patterns or preferred styles from a library or upload a reference image to guide the AI.&lt;/li&gt;
&lt;li&gt;Choose an instruction to inform the design.&lt;/li&gt;
&lt;li&gt;Generate initial images using AI based on inputs.&lt;/li&gt;
&lt;li&gt;Refine designs by modifying prompts or editing the generated images.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Step-by-Step Process to Setup and Develop the Application
&lt;/h2&gt;

&lt;p&gt;This section provides a technical look "under the hood" to demonstrate how the previously mentioned architecture was implemented. The following code snippets and explanations detail the core logic, from interpreting a designer's initial text prompt to generating a final, production-ready images. We will walk through how each component—setting up the Azure AI Foundry project, AI agent, user input analysis, reference image processing, instruction integration, and image generation—is built and orchestrated using Python, Azure AI, and OpenAI models.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Setup AI Project on Azure AI Foundry
&lt;/h3&gt;

&lt;p&gt;Start by setting up an Azure AI Foundry account and creating a new project. This will provide you with access to the necessary tools and services to create custom AI models. Azure AI Foundry can host multiple projects, you need to first create a Hub, and a Project within it. For the details of how to create a Hub and Project, please refer to the &lt;a href="https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/create-projects?tabs=ai-foundry&amp;amp;pivots=fdp-project" rel="noopener noreferrer"&gt;Azure AI Foundry documentation&lt;/a&gt;. The following screenshot shows an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezy3c9rduagvr3jxre7b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezy3c9rduagvr3jxre7b.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, we create an Azure AI hub and link it to the subscription of "IX Sandbox". And then we create a project as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6bi8ge8gqvpnfg65hst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6bi8ge8gqvpnfg65hst.png" alt=" " width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the project, you will need to create/enable Azure AI services that will be used in the project. The following screenshot shows an example of creating/enabling Azure AI Services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdb8cmbpcmhdjpxcvrfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdb8cmbpcmhdjpxcvrfz.png" alt=" " width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, we create an AI Services and then connect it to the project. The available services include Azure OpenAI, Speech, Content understanding, Translation and a lot of other Azure AI capabilities. For the details of how to create and manage Azure AI services, please refer to the &lt;a href="https://azure.microsoft.com/en-us/products/ai-services" rel="noopener noreferrer"&gt;Azure AI Services website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once the project and AI services are created, you can manage your AI models, data, and other resources. The dashboard provides an overview of your project and allows you to monitor the performance of your AI models. Let's set up some AI models to be used in this project. Go to the "&lt;strong&gt;My assets&lt;/strong&gt;" section on the left panel, and then select "&lt;strong&gt;Models + endpoints&lt;/strong&gt;". You can see the list of already deployed AI models that are available in the project. The following screenshot shows an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y44ojse61ql0lyx9wi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6y44ojse61ql0lyx9wi9.png" alt=" " width="800" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By clicking the "&lt;strong&gt;+ Deploy model&lt;/strong&gt;" button and selecting "&lt;strong&gt;Deploy base model&lt;/strong&gt;", you can deploy a base model from the Azure AI Foundry model catalog. Follow the instructions to select the models that will be used in the project, such as OpenAI &lt;strong&gt;&lt;code&gt;gpt-4o&lt;/code&gt;&lt;/strong&gt; and OpenAI &lt;strong&gt;&lt;code&gt;gpt-image-1&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To trace and monitor the AI agent behavior and activities, we can set up and enable Azure AI Foundry Application Insights Tracing feature to log and visualize the information when the AI agent is being called. Find the “&lt;strong&gt;Tracing&lt;/strong&gt;” on the left-hand side of navigation and then click the “&lt;strong&gt;Create New&lt;/strong&gt;” button shown in the screenshot below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmg11onzjapaod5l5um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtmg11onzjapaod5l5um.png" alt=" " width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And then fill the form on the popup window to create a new Application Insights resource. After you run the AI application, you can view the list of traced agent activities in the Tracing section. When you click on a tracing item, the popup window will show and visualize the information logged while the AI agent is being called and running, like the screenshot shown below. Tracing provides deep visibility into execution of your application by capturing detailed telemetry at each execution step. This helps diagnose issues and enhance performance by identifying problems such as inaccurate tool calls, misleading prompts, high latency, low-quality evaluation scores, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye9mi8o4gm6vyspdc6un.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye9mi8o4gm6vyspdc6un.gif" alt=" " width="760" height="442"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Setup Azure AI Search Service
&lt;/h3&gt;

&lt;p&gt;The next step is to load a document. This will be used to provide instruction into the design requirements. The document can be uploaded into a container of an Azure blob storage account and indexed by the Azure AI Search service, as shown in the screenshot below.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1os338to44kv85txbmvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1os338to44kv85txbmvo.png" alt=" " width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The technology behind the Azure AI Search service is called "&lt;a href="https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en" rel="noopener noreferrer"&gt;Retrieval Augmented Generation (RAG)&lt;/a&gt;", which is a powerful tool that can be used to index and search large amounts of unstructured data (e.g., PDF documents). For the details of how to use Azure AI Search service to index and search document contents in a blog storage, please refer to the &lt;a href="https://learn.microsoft.com/en-us/azure/search/search-blob-storage-integration" rel="noopener noreferrer"&gt;Azure AI Search service documentation&lt;/a&gt;. And then, we can configure an AI agent in the Azure AI Foundry to use the Azure AI Search service to search the instruction and extract the relevant information.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Create an AI Agent
&lt;/h3&gt;

&lt;p&gt;We will create an AI agent in the Azure AI Foundry project, which will analyze the instruction and provide a design recommendation into the design requirements. The following screenshot shows an example of the AI Agent to search an instruction and generate a design recommendation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe2qcdwbibaokow5tvno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpe2qcdwbibaokow5tvno.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, we create an AI agent and select the deployed gpt-3.5-turbo model and the created index in the "Knowledge" section.&lt;/p&gt;
&lt;h3&gt;
  
  
  4. Develop Code to Analyze User's Input
&lt;/h3&gt;

&lt;p&gt;Defining the design requirements would include factors such as target audience, style, preferences, and standards. For this component, we will let the designers to type the requirements in a text box and then the requirement will be used to create a prompt for the OpenAI model. In order to understand what type of images the designer is looking for, we need to analyze the user input and determine the image type. The following code snippet shows how to analyze the user input and determine the image type by leveraging OpenAI &lt;code&gt;gpt-4o&lt;/code&gt; model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_user_input_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Azure OpenAI client
&lt;/span&gt;    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-12-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_GPT4_MODEL_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# or gpt-4o-mini for cost saving
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant that can analyze the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input and then determine the image type.
                For example, if the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input is &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want to design a clothing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, the product type is &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clothing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. 
                If the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input is &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I want to design a house&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, the product type is &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;house&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. 
                Now, here is the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
                Please determine the image type based on the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input. If the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s input is not clear, return the general product type of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drawing image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
                &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;top_p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;presence_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="c1"&gt;# The image type response from the AI model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Develop Code to Add a Reference Image
&lt;/h3&gt;

&lt;p&gt;The next component is to add a reference image for the image design. The idea of using a reference image is to provide a visual representation of the silhouettes, design patterns or detailed requirements. This can help the AI model to better understand the user's input, guidance, or/and patterns, and generate more accurate design images. The reference image can be uploading an existing image or picking a pre-defined design outline image from a list. The uploaded image can be converted into a design outline image. For example, the following shows a reference image, which is a pre-defined design outline image for a girl's dress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F732zl2fiy4ibpuphm9c9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F732zl2fiy4ibpuphm9c9.jpg" alt=" " width="200" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This type of reference image will be used to generate new designs based on the user's input of the design requirements. The following code snippet shows how to convert an uploaded image into a reference image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;convert_image_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-04-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_IMAGE_MODEL_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Convert the given image to a simplified outline vector .svg image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-image-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;image_base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;b64_json&lt;/span&gt;
    &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_base64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/vector_image.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/vector_image.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6. Develop Code to Run the AI Agent for Instruction Analysis
&lt;/h3&gt;

&lt;p&gt;The following code snippet shows how to leverage the AI agent to generate a design recommendation based on the instruction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.identity&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_bearer_token_provider&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.ai.projects&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;azure.core.credentials&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AzureKeyCredential&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_instruction_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a designer and will review the design instruction by using file search tool and generate a design recommendation.
    Don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t ask the user any questions, just try your best to generate the recommendation based on the report review by following guidelines below:
    • One sentence to describe current and forecasting future trends (colors, silhouettes, patterns, themes).
    • One sentence to describe target customer needs, preferences, and lifestyle.
    • One sentence to describe any other relevant information, such as background, practices, and styles.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Azure AI Project client
&lt;/span&gt;    &lt;span class="n"&gt;agent_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AIProjectClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_connection_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DefaultAzureCredential&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;conn_str&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_PROJECT_CONNECTION_STRING&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Setup the Azure Monitor for the Azure AI Project agent
&lt;/span&gt;    &lt;span class="n"&gt;monitor_connection_string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;telemetry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_connection_string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;configure_azure_monitor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_string&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;monitor_connection_string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Setup the OpenTelemetry tracer
&lt;/span&gt;    &lt;span class="n"&gt;tracer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_tracer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_as_current_span&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze_instruction_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Azure AI Project agent
&lt;/span&gt;        &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AZURE_AGENT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_thread&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Instructions: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_instructions&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt; Request: What recommendations can you provide for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_and_process_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="c1"&gt;# The design recommendation response from the AI model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Develop Code to Generate Design Images
&lt;/h3&gt;

&lt;p&gt;The next component is to generate the design image(s) based on the user input, reference image, and instruction. The generated design images will be unique and personalized designs that meet the design requirements. The following code snippet shows how to generate the design based image(s).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_image_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_designs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reference_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        You are a professional image designer with deep understanding of patterns, styles, and audience preferences.
        Your task is to generate a design image based on the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s thoughts and ideas of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; and product type of &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_type&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        and the design recommendation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;recommendation&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, which is based on an instruction.
        Adhere to the image pattern from the given vector image. 
        Make the image a realistic image that would be found in an online website. 
        Make the background white.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AzureOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-04-01-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;azure_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_IMAGE_MODEL_ENDPOINT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AZURE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-image-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024x1024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_designs&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;edit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-image-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reference_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024x1024&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;high&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_designs&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;image_base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b64_json&lt;/span&gt;
        &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_base64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# save each generated image to a file
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp/generated_image_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following image shows an example of the generated design image based on the user input, reference image, and instruction.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzpbtidrsy8l5jkc0q6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzpbtidrsy8l5jkc0q6h.png" alt=" " width="200" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Develop Streamlit UI to Interact with the AI Agents
&lt;/h3&gt;

&lt;p&gt;Finally, we will develop a Streamlit UI to interact with the AI agent and generate design images. For the details of how to develop a Streamlit UI, please refer to the &lt;a href="https://docs.streamlit.io/" rel="noopener noreferrer"&gt;Streamlit documentation&lt;/a&gt;. Another place to explore on how to develop a Streamlit UI for AI applications is the &lt;a href="https://streamlit.io/generative-ai" rel="noopener noreferrer"&gt;Build powerful generative AI apps&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Azure AI Foundry offers a robust, scalable, and integrated platform for building, deploying, and managing AI agents across a wide range of industries and use cases. Its support for multi-agent orchestration, seamless integration with both proprietary and open-source models, and built-in monitoring and governance make it an ideal choice for organizations seeking to accelerate their AI adoption and innovation.&lt;/p&gt;

&lt;p&gt;This post has contributed a clear explanation of why Azure AI Foundry stands out among cloud-based and open-source AI agent frameworks, highlighting its unique advantages in flexibility, interoperability, and enterprise readiness. Additionally, the step-by-step guide provided here serves as a practical resource for anyone looking to build AI agents—from initial setup to deployment—using Azure AI Foundry. By following these best practices, organizations can unlock new opportunities for automation, insight, and collaboration, positioning themselves for long-term success in the evolving landscape of artificial intelligence.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://devblogs.microsoft.com/foundry/integrating-azure-ai-agents-mcp/" rel="noopener noreferrer"&gt;Introducing Model Context Protocol (MCP) in Azure AI Foundry&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/@eitansela/build-a-mcp-client-using-azure-ai-foundry-and-openai-agents-sdk-6c8e372f3a6a" rel="noopener noreferrer"&gt;Build a MCP client using Azure AI Foundry and OpenAI Agents SDK&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://devblogs.microsoft.com/foundry/integrating-azure-ai-agents-mcp-typescript/" rel="noopener noreferrer"&gt;Create an MCP Server with Azure AI Agent Service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>agentaichallenge</category>
      <category>azure</category>
    </item>
    <item>
      <title>Building AI Agents: Semantic Integration of Structured and Unstructured Data using OpenAI Agent SDK</title>
      <dc:creator>Yaohua Chen</dc:creator>
      <pubDate>Thu, 13 Mar 2025 00:17:39 +0000</pubDate>
      <link>https://dev.to/imaginex/building-ai-agents-semantic-integration-of-structured-and-unstructured-data-using-openai-agent-sdk-5641</link>
      <guid>https://dev.to/imaginex/building-ai-agents-semantic-integration-of-structured-and-unstructured-data-using-openai-agent-sdk-5641</guid>
      <description>&lt;h2&gt;
  
  
  What is this about?
&lt;/h2&gt;

&lt;p&gt;This post is about building AI agents that can semantically integrate structured and unstructured data for advanced search and analysis. The ability to search and analyze data is becoming increasingly important in today's data-driven world. With the explosion of data sources and the increasing complexity of data formats, it is becoming more and more difficult for humans to manually search and analyze data. AI agents can help to automate this process, allowing organizations to quickly and accurately extract insights from their data and web. As a demonstration, an agentic application proposed in this post shows capabilities of leveraging a set of tools and techniques to search and analyze structured and unstructured data. The code for this application is available on &lt;a href="https://github.com/chen115y/PoC-Projects/blob/master/AI_Agents/OpenAI_Agent_SDK/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the problem?
&lt;/h2&gt;

&lt;p&gt;The problem is that organizations are struggling to search and analyze structured and unstructured data together. Traditional methods of searching and analyzing data, such as keyword-based search engines and manual data processing tools, are no longer sufficient to handle the dynamic user requests as well as volume and complexity of data that organizations are dealing with. These methods are time-consuming, require a lot of engineering manpower, and the results are not always accurate, relevant, and up-to-date due to the insufficient understanding of the user's query or request, context and meaning of the data.&lt;/p&gt;

&lt;p&gt;For example, as an analyst in a healthcare organization, you may need to search and analyze patient records, medical reports, and research papers to identify patterns, trends, and insights that can help improve patient care. However, the data is stored in different formats, such as databases, spreadsheets, text documents, images, and videos, making it difficult to search and analyze the data effectively. Some organizations may put a lot of engineering effort to convert unstructured data into structured or key-word indexing data, but this is time-consuming and very costly. However, even with structured or key-word indexing data, the traditional methods of searching and analyzing data are not sufficient to handle the dynamic user requests. For instance, when you are trying to query a database to find the average age of patients with a specific medical condition, you may need to write complex SQL queries and join multiple tables, which can be time-consuming and error-prone, even if you are technically capable of doing so. And the results? They may still not be accurate and relevant, and even misleading due to the insufficient understanding of your request and background context (e.g., orgnaizational background or business scope).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is this important?
&lt;/h2&gt;

&lt;p&gt;Combining structured and unstructured data in the way of semantic integration allows organizations to extract insights quickly and accurately, leading to better decision-making and actionable insights.&lt;/p&gt;

&lt;p&gt;For example, as a healthcare analyst, you need to identify patterns and trends to improve patient care. By submitting your plain-language query, the system can automatically understand it semantically, retrieve and integrate relevenat data from various sources like patient records and medical reports, analyze the data, and present the results in an easy-to-understand format. This helps you make better decisions and draft reports with actionable insights for patient care improvement quickly.&lt;/p&gt;

&lt;p&gt;With this type of system, organizations can improve their search and analysis capabilities, enhance decision-making, increase productivity, drive innovation, improve customer experience, and gain a competitive advantage. The key of this type of system is how it can accurately understand the user's requests or queries, related context, and then retrieve the relevant data from structured and unstructured data sources, analyze the data, and present the results in a meaningful way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the semantic integration of structured and unstructured data?
&lt;/h2&gt;

&lt;p&gt;The semantically integration of structured and unstructured data is the process of combining structured data, such as databases and spreadsheets, with unstructured data, such as text documents, images, web pages, and videos, into a unified, meaningful structure by understanding and preserving the user requests or queries, context, meaning, and relationships of the data. Semantic integration focuses on the meaning and context of data rather than just its structure. This includes linking related entities, resolving conflicts in data interpretations, and ensuring interoperability between systems and data. For example, for a healthcare organization, semantic integration could involve combining structured patient records with unstructured medical reports (e.g., PDF files) to provide a comprehensive view of a patient's health history.&lt;/p&gt;

&lt;p&gt;The processing key components of semantic integration include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User Query Understanding&lt;/strong&gt;: Understanding the user's query and intent is the first step in semantic integration. This involves analyzing the user's query to determine the context, meaning, and relationships of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Retrieval and Integration&lt;/strong&gt;: Retrieving and integrating data from a variety of sources, including interal document stores, relational databases, and crawled web pages, is the next step in semantic integration. This involves combining data from different sources into a unified, meaningful structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Analysis and presentation&lt;/strong&gt;: Analyzing, interpreting, and presenting the integrated data is the final step in semantic integration. This involves extracting insights from the data, such as summaries, classifications, and predictions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Accordingly, the technical key components of semantic integration include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User Interactive Agent&lt;/strong&gt;: The user interactive agent is responsible for understanding the user's query and intent. This involves analyzing the user's query to determine the context, meaning, and which tools or function callings it could leverage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Structured Data Retrieval Agent&lt;/strong&gt;: The structured data retrieval agent is responsible for retrieving and integrating structured data from a variety of sources, such as databases and spreadsheets. This involves combining data from different sources into a unified, meaningful structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unstructured Data Retrieval Agent&lt;/strong&gt;: The unstructured data retrieval agent is responsible for retrieving and integrating unstructured data from a variety of sources, such as text documents, images, and videos. This usually leverages RAG (Retrieve, Analyze, Generate) approach and Retrival-Augmented Generation (RAG) technology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web Search Agent&lt;/strong&gt;: The web search agent is responsible for crawling and indexing web pages to retrieve and integrate data from the Internet. This involves searching for relevant data on the web and extracting insights from the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following diagram illustrates the technical key components of semantic integration of structured and unstructured data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg80a5nuruqb1neyq7zkq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg80a5nuruqb1neyq7zkq.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the other benefits?
&lt;/h2&gt;

&lt;p&gt;Beside the benefits mentioned above, the semantic integration of structured and unstructured data provides a number of other benefits, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Organizations can easily adapt to changing data sources, formats, and user requests. The system can also be easily extended to support new features and capabilities as well as integrated with other systems and tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: Organizations can process large amounts of data quickly and accurately because the system can automatically understand the user's requests, context, and meaning of the data and then use the appropriate tools and techniques to search and analyze the data based on the user's requests. The data volume and complexity are small and no longer a barrier to search and analyze data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relevancy&lt;/strong&gt;: Organizations can extract relevant and valuable insights from their data. The system can automatically understand the user's requests, context, and meaning of the data, and then retrieve and integrate the relevant data from structured and unstructured data sources, analyze the data, and present the results in a meaningful way.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-time insights&lt;/strong&gt;: Organizations can easily extract real-time insights from their data and Internet or web, allowing them to quickly respond to changing market conditions and user requests, and further improve their decision-making and customer experience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How can it be implemented?
&lt;/h2&gt;

&lt;p&gt;In this post, we will use &lt;a href="https://platform.openai.com/docs/guides/agents-sdk" rel="noopener noreferrer"&gt;OpenAI Agent SDK&lt;/a&gt; and other techniques to build an AI Agent application that can semantically integrate structured and unstructured data for advanced search and analysis. The application is built on top of the Python programming language and uses a number of popular libraries, including PandasAI, OpenAI, etc. The code for this application is available on &lt;a href="https://github.com/chen115y/PoC-Projects/blob/master/AI_Agents/OpenAI_Agent_SDK/" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use OpenAI Agent SDK?
&lt;/h3&gt;

&lt;p&gt;OpenAI has introduced new tools, including &lt;a href="https://github.com/openai/openai-agents-python" rel="noopener noreferrer"&gt;Agent SDK&lt;/a&gt;, to help developers and enterprises build, deploy, and scale AI agents. The Agents SDK is an open-source framework that simplifies the orchestration of single-agent and multi-agent workflows. It includes features like intelligent handoffs, configurable guardrails for safety, and observability tools for tracking performance. It supports built-in web search, file search, and computer use, making it easier to build agents that perform various tasks autonomously. These tools address challenges in building production-ready AI agents, such as complex orchestration and extensive prompt iteration. By replacing manual prompt engineering and custom scripting with advanced tools, OpenAI makes agent development faster and more practical. These innovations empower businesses to automate complex workflows without extensive technical expertise, transforming operations across various industries.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the approach for unstructured data
&lt;/h3&gt;

&lt;p&gt;One of the key techniques used in the application is the RAG approach, which stands for Retrival-Augmented Generation (RAG). The RAG approach is a three-step process that allows agents to quickly and accurately extract insights from unstructured data. In the Retrieve step, the agent retrieves relevant data from a variety of sources, including text documents, images, and videos based on the user's request. In the Analyze step, the agent analyzes the retrieved data using a variety of techniques, including natural language processing, computer vision, and machine learning. In the Generate step, the agent generates insights from the analyzed data, such as summaries, classifications, and predictions. The RAG technology is a powerful tool for searching and analyzing unstructured data, allowing agents to quickly extract valuable insights from their data.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the approach for structured data
&lt;/h3&gt;

&lt;p&gt;The application also uses a number of techniques to search and analyze structured data. One of the key techniques used in the application is PandasAI, which is a powerful library for data manipulation and analysis. PandasAI provides a number of tools for working with structured data, including data structures, data manipulation functions, and data analysis functions. By using PandasAI, agents can quickly and easily process structured data, allowing them to extract valuable insights from their data.&lt;/p&gt;

&lt;p&gt;Notice: we don't use the RAG approach here since structured data is already organized in a way that makes it easy to search and analyze. Structured data is typically stored in databases or spreadsheets, which provide a consistent format for storing and retrieving data. Because structured data is already organized, agents can quickly and easily extract insights from structured data without the need for the RAG approach. Instead, agents can use tools like PandasAI to quickly process structured data and extract valuable insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which agentic pattern does it use?
&lt;/h3&gt;

&lt;p&gt;In this agentic application, we use the "agent orchestration" pattern. This pattern is used to build agents that can perform complex tasks by breaking them down into smaller, more manageable sub-tasks. Each sub-task is handled by a separate agent, which is responsible for retrieving and processing the data. The agents work together to complete the task, passing data back and forth as needed. For example, you might have a frontline agent that receives a user's request, and then hands off to a specialized agent based on the language of the request. The specialized agent then retrieves and processes the data, and hands off the results back to the frontline agent, which presents the results to the user. This pattern is ideal for building agents that need to perform multiple tasks in parallel or in sequence, such as searching and analyzing structured and unstructured data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The agentic application proposed in this post provides a set of tools and techniques that can be used to build intelligent agents that can search and analyze structured and unstructured data. It is capable of processing large amounts of data quickly and accurately, allowing organizations to extract valuable insights from their data. The application is designed to be flexible and extensible, allowing developers to easily add new features and capabilities to their agents. The application is built on top of the Python programming language and uses a number of popular libraries, including PandasAI, OpenAI, etc. The application is designed to be easy to use and understand, making it ideal for both beginners and experienced developers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>agentic</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
