<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dextra Labs</title>
    <description>The latest articles on DEV Community by Dextra Labs (@dextralabs).</description>
    <link>https://dev.to/dextralabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3662653%2F4a16ca71-2863-42bd-8d70-cfc2598122b1.png</url>
      <title>DEV Community: Dextra Labs</title>
      <link>https://dev.to/dextralabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dextralabs"/>
    <language>en</language>
    <item>
      <title>Build a No-Code AI Agent in 30 Minutes Using n8n + Claude (Full Walkthrough)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Tue, 21 Apr 2026 15:43:31 +0000</pubDate>
      <link>https://dev.to/dextralabs/build-a-no-code-ai-agent-in-30-minutes-using-n8n-claude-full-walkthrough-3cid</link>
      <guid>https://dev.to/dextralabs/build-a-no-code-ai-agent-in-30-minutes-using-n8n-claude-full-walkthrough-3cid</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Monitor a Slack channel, summarise threads, post daily digests, all without writing a single line of code.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I want to tell you about a Friday afternoon problem.&lt;/p&gt;

&lt;p&gt;Our team's main Slack channel had become unmanageable. Not because people were saying too much, but because they were saying the right things at the wrong times. Someone would share a critical decision at 9am. Someone else would ask a question about it at 2pm without having seen the original message. By the end of the day, the same context had been repeated four times in different threads and nobody had a clean summary of what had actually been decided.&lt;/p&gt;

&lt;p&gt;The obvious solution was a daily digest. A summary of important threads, decisions and open questions, posted at end of day so everyone could start tomorrow with context rather than archaeology.&lt;/p&gt;

&lt;p&gt;The less obvious solution was building it in 30 minutes on a Friday afternoon using n8n and Claude and then never thinking about it again.&lt;/p&gt;

&lt;p&gt;This is that walkthrough. By the end of it you'll have a working AI agent that monitors a Slack channel, identifies meaningful threads, summarises them with Claude and posts a clean daily digest. No code required. Replicable during a lunch break.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What You'll Need&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt;, either self-hosted (Docker is the easiest path) or the cloud version at n8n.io. Free tier works for this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Anthropic API key&lt;/strong&gt;, get one at console.anthropic.com. You'll use maybe $0.10 of credits building this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slack workspace&lt;/strong&gt; where you have permission to add apps and read channel history.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;30 minutes and a coffee.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's genuinely it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Get n8n Running&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're using n8n cloud, skip this. If you're self-hosting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bash
docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; n8n &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 5678:5678 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; ~/.n8n:/home/node/.n8n &lt;span class="se"&gt;\&lt;/span&gt;
  n8nio/n8n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:5678&lt;/code&gt; in your browser. Create your account. You'll land on the main workflow canvas, a blank grid that's about to become your agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you're looking at&lt;/strong&gt;: n8n's canvas is where you build automation workflows by connecting nodes. Each node does one thing: fetch data, transform it, call an API, send a message. You connect them left to right and data flows through the chain. That's the entire mental model.&lt;/p&gt;
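&lt;p&gt;To make that mental model concrete, here is the same left-to-right chain sketched as plain JavaScript function composition. The node names and sample data below are illustrative, not n8n's actual API:&lt;/p&gt;

```javascript
// Illustrative only: each "node" is a function, and data flows left to right.
const fetchData = () => [{ text: 'standup notes' }, { text: 'deploy question' }];
const transform = (items) => items.map((i) => ({ ...i, text: i.text.toUpperCase() }));
const sendMessage = (items) => items.map((i) => `posted: ${i.text}`);

// Connecting nodes on the canvas is exactly this composition:
const output = sendMessage(transform(fetchData()));
// output is an array of two "posted: ..." strings
```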

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Create a New Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Click &lt;strong&gt;"New Workflow"&lt;/strong&gt; in the top right. Name it something useful, "Slack Daily Digest Agent" works fine.&lt;br&gt;
You'll see a single "+" button in the centre of the canvas. This is where your first node goes.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Add the Schedule Trigger&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every agent needs something that kicks it off. Ours runs once a day at 5pm.&lt;br&gt;
Click the "+" button and search for &lt;strong&gt;"Schedule Trigger."&lt;/strong&gt; Add it.&lt;br&gt;
In the node settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger interval&lt;/strong&gt;: Days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Days between triggers&lt;/strong&gt;: 1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger at hour&lt;/strong&gt;: 17 (5pm)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger at minute&lt;/strong&gt;: 0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click &lt;strong&gt;"Save."&lt;/strong&gt;&lt;br&gt;
This node will now fire your workflow every day at 5pm. Nothing else needed, n8n handles the scheduling infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Schedule Trigger node, shows the interval settings with "Days: 1" and "Hour: 17" configured. Clean n8n interface, dark mode if available.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Connect Slack and Pull Channel Messages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we need to fetch today's messages from your Slack channel.&lt;br&gt;
Click "+" after the Schedule Trigger and search for &lt;strong&gt;"Slack."&lt;/strong&gt; Select the &lt;strong&gt;"Get Many Messages"&lt;/strong&gt; action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential setup&lt;/strong&gt; (first time only):&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;"Create New Credential"&lt;/strong&gt;&lt;br&gt;
Select &lt;strong&gt;"OAuth2"&lt;/strong&gt;&lt;br&gt;
Follow the Slack OAuth flow, you'll need to create a Slack app at api.slack.com/apps with &lt;code&gt;channels:history&lt;/code&gt; and &lt;code&gt;channels:read&lt;/code&gt; permissions&lt;br&gt;
Once authorised, the credential saves and you won't touch it again&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operation&lt;/strong&gt;: Get Many Messages&lt;br&gt;
&lt;strong&gt;Channel&lt;/strong&gt;: Select your target channel from the dropdown&lt;br&gt;
&lt;strong&gt;Limit&lt;/strong&gt;: 100 (catches a full day of messages)&lt;br&gt;
&lt;strong&gt;Additional Fields → Oldest&lt;/strong&gt;: &lt;code&gt;{{ $now.startOf('day').toISO() }}&lt;/code&gt; (this expression pulls only today's messages)&lt;/p&gt;

&lt;p&gt;This expression is the only "code" in the whole workflow and it's just a timestamp filter. n8n's expression syntax is readable enough that you don't need to understand it deeply to use it.&lt;/p&gt;
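&lt;p&gt;If you're curious what that expression evaluates to, here is an approximation in plain JavaScript. n8n's &lt;code&gt;$now&lt;/code&gt; is a Luxon DateTime under the hood; the sketch below uses the built-in &lt;code&gt;Date&lt;/code&gt; instead, so timezone handling may differ slightly:&lt;/p&gt;

```javascript
// Approximation of {{ $now.startOf('day').toISO() }} using the built-in Date object.
const startOfDay = new Date();
startOfDay.setHours(0, 0, 0, 0);            // midnight, local time
const oldest = startOfDay.toISOString();    // ISO-8601 string used as the "Oldest" filter
```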

&lt;p&gt;Click &lt;strong&gt;"Test Step"&lt;/strong&gt; to verify it's pulling messages. You should see today's Slack messages appear in the output panel on the right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area:&lt;/strong&gt; [Slack node configuration, channel dropdown selected, limit set to 100, oldest expression visible. Output panel showing sample messages on the right.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Filter for Meaningful Threads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Not every message deserves to be in the digest. Single emoji reactions and "thanks!" messages don't need to be summarised. We want threads with actual substance.&lt;/p&gt;

&lt;p&gt;Click "+" and add an &lt;strong&gt;"IF"&lt;/strong&gt; node.&lt;br&gt;
&lt;strong&gt;Condition:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value 1:&lt;/strong&gt; &lt;code&gt;{{ $json.reply_count }}&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Operation:&lt;/strong&gt; Greater than&lt;br&gt;
&lt;strong&gt;Value 2:&lt;/strong&gt; 2&lt;/p&gt;

&lt;p&gt;This passes only messages that generated at least 3 replies, a reasonable proxy for "this was a real conversation." You can tune the number based on your channel's activity level.&lt;/p&gt;
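&lt;p&gt;The condition is equivalent to a simple filter. Sketched in plain JavaScript with made-up sample messages:&lt;/p&gt;

```javascript
// Hypothetical sample data, shaped roughly like Slack's message objects.
const messages = [
  { text: 'Decision: ship Friday', reply_count: 5 },
  { text: 'thanks!', reply_count: 0 },
  { text: 'Which env var controls this?', reply_count: 3 },
];

// The IF node's "reply_count greater than 2" condition as a filter:
const meaningful = messages.filter((m) => (m.reply_count || 0) > 2);
// meaningful keeps the two threads that got at least 3 replies
```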

&lt;p&gt;The IF node has two output paths: &lt;strong&gt;"True"&lt;/strong&gt; (messages with threads) and &lt;strong&gt;"False"&lt;/strong&gt; (everything else). Connect only the &lt;strong&gt;True&lt;/strong&gt; path to the next node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [IF node, condition showing reply_count &amp;gt; 2. Two output paths visible, True path highlighted.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Fetch the Full Thread Content&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have the parent messages. Now we need the replies.&lt;br&gt;
Add another Slack node, this time with the &lt;strong&gt;"Get Replies"&lt;/strong&gt; action.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel&lt;/strong&gt;: Same channel as before&lt;br&gt;
&lt;strong&gt;Timestamp&lt;/strong&gt;: &lt;code&gt;{{ $json.ts }}&lt;/code&gt; (this pulls the thread ID from the previous node)&lt;/p&gt;

&lt;p&gt;This fetches the complete thread for each qualifying message. n8n automatically loops through each message from the filter step and fetches its replies, no manual iteration needed.&lt;/p&gt;
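&lt;p&gt;Under the hood this maps to Slack's &lt;code&gt;conversations.replies&lt;/code&gt; Web API method. Here is a hedged sketch of the same request in plain Node; the channel ID, timestamp and token are placeholders:&lt;/p&gt;

```javascript
// Build the conversations.replies URL the node effectively calls.
function buildRepliesUrl(channel, ts) {
  const params = new URLSearchParams({ channel, ts });
  return `https://slack.com/api/conversations.replies?${params.toString()}`;
}

const url = buildRepliesUrl('C0123456789', '1713712345.000200');
// To call it yourself: fetch(url, { headers: { Authorization: `Bearer ${token}` } })
```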

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Slack Get Replies node, timestamp expression visible, connected to the IF node's True path.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 7: Format the Data for Claude&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we send threads to Claude, we need to structure the data cleanly. Add a &lt;strong&gt;"Code"&lt;/strong&gt; node, yes, this one has a tiny bit of JavaScript, but it's copy-paste simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascriptconst&lt;/span&gt; 

&lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;threadText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLocaleTimeString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
      &lt;span class="na"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2-digit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="na"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;2-digit&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
    &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`[&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;] &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;thread_content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;threadText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;channel&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This formats the thread into a clean, readable block that Claude can process efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Code node, JavaScript visible in the editor. Clean, minimal. Output panel showing formatted thread_content string.]&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 8: Send to Claude for Summarisation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the interesting part. Add an &lt;strong&gt;"HTTP Request"&lt;/strong&gt; node, we'll call the Anthropic API directly.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method&lt;/strong&gt;: POST&lt;br&gt;
&lt;strong&gt;URL&lt;/strong&gt;: &lt;code&gt;https://api.anthropic.com/v1/messages&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Header Auth&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Name&lt;/strong&gt;: &lt;code&gt;x-api-key&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Value&lt;/strong&gt;: Your Anthropic API key&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Headers&lt;/strong&gt;: Add &lt;code&gt;anthropic-version: 2023-06-01&lt;/code&gt; and &lt;code&gt;content-type: application/json&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Body (JSON):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-sonnet-4-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Summarise this Slack thread in 3-5 bullet points. Focus on decisions made, questions raised and action items. Be concise.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Thread:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;{{ $json.thread_content }}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;{{ $json.thread_content }}&lt;/code&gt; pulls the formatted thread from the previous node. Claude will return a clean summary for each thread.&lt;/p&gt;
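&lt;p&gt;If it helps to see the payload without n8n's templating, here is the same request body built in plain JavaScript; the model name and prompt mirror the node configuration above:&lt;/p&gt;

```javascript
// Build the Anthropic Messages API payload for one thread.
function buildClaudeRequest(threadContent) {
  return {
    model: 'claude-sonnet-4-5',
    max_tokens: 500,
    messages: [
      {
        role: 'user',
        content: 'Summarise this Slack thread in 3-5 bullet points. ' +
          'Focus on decisions made, questions raised and action items. ' +
          'Be concise.\n\nThread:\n' + threadContent,
      },
    ],
  };
}

const body = buildClaudeRequest('[09:02] alice: are we shipping Friday?');
// JSON.stringify(body) is what the HTTP Request node POSTs to /v1/messages
```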

&lt;p&gt;Click &lt;strong&gt;"Test Step"&lt;/strong&gt;, you should see Claude's summary appear in the output. First time seeing this work is genuinely satisfying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Screenshot area&lt;/strong&gt;: [HTTP Request node, URL and headers configured. Output panel showing Claude's JSON response with the summary text visible in content[0].text.]&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 9: Extract Claude's Response&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Claude's response comes back as a JSON object. We need to pull out the actual summary text.&lt;br&gt;
Add another &lt;strong&gt;"Code"&lt;/strong&gt; node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascript&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;summary&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Short, simple, does exactly one thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 10: Aggregate All Summaries&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We've been processing each thread individually. Now we need to collect all summaries into one digest message.&lt;/p&gt;

&lt;p&gt;Add a &lt;strong&gt;"Merge"&lt;/strong&gt; node with mode set to &lt;strong&gt;"Combine All"&lt;/strong&gt;, this waits for all thread summaries to complete and combines them into a single array.&lt;/p&gt;

&lt;p&gt;Then add one final &lt;strong&gt;"Code"&lt;/strong&gt; node to format the digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;javascript&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;$input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toLocaleDateString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;en-US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="na"&gt;weekday&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;long&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;year&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;numeric&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;month&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;long&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="na"&gt;day&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;numeric&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;digestSections&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;summaries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="s2"&gt;`*Thread &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:*\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s1"&gt;---&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullDigest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`*📋 Daily Channel Digest  &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;date&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;*\n\n&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;digestSections&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n\n_Generated by your Slack Digest Agent_`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="na"&gt;json&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fullDigest&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}];&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 11: Post the Digest to Slack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Final node. Add a &lt;strong&gt;Slack&lt;/strong&gt; node with the &lt;strong&gt;"Send a Message"&lt;/strong&gt; action.&lt;br&gt;
&lt;strong&gt;Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channel&lt;/strong&gt;: Your digest destination (can be the same channel or a dedicated #daily-digest channel)&lt;br&gt;
&lt;strong&gt;Message Text&lt;/strong&gt;: &lt;code&gt;{{ $json.digest }}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the whole agent.&lt;br&gt;
&lt;strong&gt;Screenshot area&lt;/strong&gt;: [Final Slack node, message text expression visible. Clean configuration panel.]&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 12: Activate and Walk Away&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Click the toggle in the top right of your workflow from &lt;strong&gt;"Inactive"&lt;/strong&gt; to &lt;strong&gt;"Active."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your agent will now run every day at 5pm, pull meaningful threads from your Slack channel, summarise each one with Claude and post a clean digest, without you doing anything.&lt;/p&gt;

&lt;p&gt;The full workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule Trigger → Slack (Get Messages) → IF (reply_count &amp;gt; 2) 
→ Slack (Get Replies) → Code (Format) → HTTP Request (Claude) 
→ Code (Extract) → Merge → Code (Format Digest) → Slack (Post)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten nodes. Thirty minutes. Working AI agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What to Customise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A few quick wins once the base agent is running:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change the filter threshold&lt;/strong&gt;: busy channels might need &lt;code&gt;reply_count &amp;gt; 5&lt;/code&gt;. Quiet channels might need &lt;code&gt;reply_count &amp;gt; 1&lt;/code&gt;. Tune to your team's volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add a priority label&lt;/strong&gt;: pass the thread back through Claude with a second prompt asking it to classify the summary as "Decision / Question / FYI" and prepend that label to each section.&lt;/p&gt;
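&lt;p&gt;A minimal sketch of that second-pass prompt, assuming the same HTTP Request pattern as Step 8 (the label set is a suggestion, not a fixed taxonomy):&lt;/p&gt;

```javascript
// Build a tiny classification request for one summary.
function buildLabelRequest(summary) {
  return {
    model: 'claude-sonnet-4-5',
    max_tokens: 10,
    messages: [{
      role: 'user',
      content: 'Classify this summary as exactly one of: Decision, Question, FYI. ' +
        'Reply with the label only.\n\nSummary:\n' + summary,
    }],
  };
}

const req = buildLabelRequest('- Agreed to ship the release on Friday');
// Prepend Claude's one-word reply to each digest section, e.g. "*[Decision]* ..."
```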

&lt;p&gt;&lt;strong&gt;Multi-channel digest&lt;/strong&gt;: duplicate the Slack + IF + Replies node group, point them at different channels, merge all summaries before the final formatting step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhook trigger instead of schedule&lt;/strong&gt;: replace the Schedule Trigger with a Webhook node if you want to trigger the digest on demand via a Slack slash command.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Going Further: Claude's MCP Protocol&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What you've built here is a solid single-purpose agent. It does one thing reliably and it'll keep doing it without your attention.&lt;/p&gt;

&lt;p&gt;The next level, where agents can interact with multiple enterprise systems, maintain context across sessions and coordinate with other agents, is Anthropic's Model Context Protocol (MCP). For enterprise-grade integrations using MCP, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-mcp-enterprise-ai-integrations/" rel="noopener noreferrer"&gt;Claude MCP enterprise AI integrations&lt;/a&gt;&lt;/strong&gt; architecture guide from Dextra Labs covers the full implementation pattern.&lt;/p&gt;

&lt;p&gt;For teams wanting to go deeper on n8n agent architecture specifically, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/how-to-build-ai-agent-with-n8n/" rel="noopener noreferrer"&gt;how to build AI agents with n8n&lt;/a&gt;&lt;/strong&gt; guide covers multi-agent workflows, error handling patterns and production deployment considerations that go beyond what fits in a lunch-break tutorial.&lt;/p&gt;


&lt;p&gt;Published by Dextra Labs | AI Consulting &amp;amp; Enterprise Agent Development&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>programming</category>
      <category>webdev</category>
    </item>
    <item>
      <title>5 Alternatives to OpenClaw If You Need Enterprise-Grade Security</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:51:13 +0000</pubDate>
      <link>https://dev.to/dextralabs/5-alternatives-to-openclaw-if-you-need-enterprise-grade-security-3p1o</link>
      <guid>https://dev.to/dextralabs/5-alternatives-to-openclaw-if-you-need-enterprise-grade-security-3p1o</guid>
      <description>&lt;p&gt;OpenClaw is impressive. It's also not built for teams with SSO requirements, audit logs and data isolation mandates. Here's what to use instead.&lt;/p&gt;

&lt;p&gt;Let me save you three weeks of evaluation time.&lt;/p&gt;

&lt;p&gt;You heard about OpenClaw. You watched the demo videos. A local AI agent framework that runs entirely on your infrastructure, no data leaving your environment, full tool use, open source, it looked like exactly what your team needed. Then someone from your security team asked a few questions and the enthusiasm in the room dropped noticeably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where are the audit logs?&lt;/strong&gt; How does it handle SSO? What's the data isolation model between users? Can it integrate with our existing RBAC implementation? Is there a SOC 2 report?&lt;/p&gt;

&lt;p&gt;These are not unreasonable questions. They're the standard questions that any enterprise security team asks about any new piece of infrastructure. And the honest answer for OpenClaw in its current form is: these capabilities either don't exist, require significant custom engineering to implement, or are on the roadmap without committed timelines.&lt;/p&gt;

&lt;p&gt;This is not a criticism of OpenClaw. It's a framework built by developers for developers, optimised for capability demonstration and local deployment flexibility. It does what it says on the tin. What it says on the tin is not "enterprise-grade security-compliant agent infrastructure."&lt;/p&gt;

&lt;p&gt;If you need that, here's what to evaluate instead, with an honest comparison of where each option fits.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What "Enterprise-Grade Security" Actually Means Here&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before the alternatives, it helps to be precise about what we're evaluating against. When enterprise security teams evaluate agent platforms, the requirements cluster around five areas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication and access control&lt;/strong&gt; — SSO integration (SAML 2.0, OIDC), MFA enforcement, role-based access control with granular permissions, service account management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging&lt;/strong&gt; — immutable, queryable logs of every agent action, tool call, data access and user interaction. Exportable to SIEM systems. Retention policies that meet compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data isolation&lt;/strong&gt; — tenant-level data separation, configurable data residency, encryption at rest and in transit with customer-managed keys where required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security compliance&lt;/strong&gt; — SOC 2 Type II certification (or equivalent), penetration testing recency, vulnerability disclosure programme, SLA for security patches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network and deployment security&lt;/strong&gt; — VPC deployment support, private endpoints, air-gapped deployment options for highly regulated environments.&lt;/p&gt;

&lt;p&gt;OpenClaw scores poorly on most of these out of the box. The alternatives below score well on most of them, with different trade-offs in capability, cost and deployment complexity.&lt;/p&gt;
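&lt;p&gt;To make the audit-logging requirement concrete, here's a minimal sketch of what "immutable" buys you. This is not any vendor's schema — the field names and the &lt;code&gt;append_audit_record&lt;/code&gt; helper are hypothetical — but the mechanism is standard: each record stores a hash of the previous one, so any edit to an earlier entry breaks the chain and is detectable.&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone


def append_audit_record(log: list, actor: str, action: str, detail: dict) -> dict:
    """Append a tamper-evident record: each entry hashes the previous one."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # user or service account
        "action": action,      # e.g. "tool_call", "llm_request"
        "detail": detail,
        "prev_hash": prev_hash,
    }
    # Hash the canonical JSON of the record (excluding its own hash field)
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    log.append(record)
    return record


def chain_intact(log: list) -> bool:
    """Recompute every hash and verify the chain links up."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        if rec["prev_hash"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True


audit_log: list = []
append_audit_record(audit_log, "svc-agent-prod", "tool_call",
                    {"tool": "db_query", "rows": 1204})
append_audit_record(audit_log, "svc-agent-prod", "llm_request",
                    {"model": "claude-sonnet-4-5", "tokens_in": 512})
```

&lt;p&gt;Real platforms enforce this at the infrastructure level (write-once storage, signed log shipping), but the property your security team is buying is the same: an entry can't be quietly altered after the fact.&lt;/p&gt;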

&lt;p&gt;If you want a technical breakdown of &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/what-is-openclaw-self-hosted-ai-agent-2026/" rel="noopener noreferrer"&gt;what OpenClaw is and how its self-hosted architecture works&lt;/a&gt;&lt;/strong&gt; before making a comparison decision, the Dextra Labs explainer covers the architecture in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Comparison Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7l9jnlkc32ye1gv7h8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7l9jnlkc32ye1gv7h8h.png" alt=" " width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's go through each one properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. LangChain Enterprise (LangSmith + LangServe)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams already using LangChain who need to move from prototype to production with audit capability&lt;br&gt;
LangChain's open-source framework is where most teams start building agents. LangChain Enterprise adds the production and compliance layer on top: LangSmith for observability, tracing and audit logging; LangServe for deployment with authentication; and the enterprise support tier for the compliance documentation your procurement team needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
LangSmith's tracing infrastructure captures every LLM call, tool invocation, input, output and intermediate step in a queryable audit log. For a compliance team asking "what did your AI agent do and why," LangSmith gives you the evidence trail. SSO via SAML 2.0 and OIDC is supported. Role-based access control is granular enough for most enterprise requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_anthropic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatAnthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langsmith&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_TRACING_V2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_langsmith_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LANGCHAIN_PROJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-agent-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Every action below is automatically traced
# and logged to LangSmith with full audit trail
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatAnthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;search_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calculator_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_query_tool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hwchase17/react&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;agent_executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Return intermediate steps for audit log
&lt;/span&gt;    &lt;span class="n"&gt;return_intermediate_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Handle parsing errors gracefully
&lt;/span&gt;    &lt;span class="n"&gt;handle_parsing_errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent_executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse Q3 customer churn and identify top 3 factors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Full trace available in LangSmith dashboard
# including every tool call and LLM reasoning step
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
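&lt;p&gt;Because &lt;code&gt;return_intermediate_steps=True&lt;/code&gt; makes the executor return each &lt;code&gt;(AgentAction, observation)&lt;/code&gt; pair, you can flatten those into your own audit rows alongside the LangSmith trace. A sketch of that idea — the &lt;code&gt;AgentAction&lt;/code&gt; dataclass below is a stand-in that just mimics the real object's &lt;code&gt;tool&lt;/code&gt;, &lt;code&gt;tool_input&lt;/code&gt; and &lt;code&gt;log&lt;/code&gt; attributes, and &lt;code&gt;steps_to_audit_records&lt;/code&gt; is a hypothetical helper, not a LangChain API:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AgentAction:  # stand-in for langchain_core.agents.AgentAction
    tool: str
    tool_input: Any
    log: str


def steps_to_audit_records(result: dict, user_id: str) -> list:
    """Flatten AgentExecutor intermediate steps into audit-log rows."""
    records = []
    for i, (action, observation) in enumerate(result.get("intermediate_steps", [])):
        records.append({
            "step": i,
            "user_id": user_id,
            "tool": action.tool,
            "tool_input": action.tool_input,
            "reasoning": action.log,               # the LLM's stated rationale
            "observation": str(observation)[:500],  # truncate large outputs
        })
    return records


# Shaped like an AgentExecutor.invoke() result
result = {
    "input": "Analyse Q3 customer churn",
    "output": "Top factor: pricing changes",
    "intermediate_steps": [
        (AgentAction("db_query", {"table": "churn"}, "Need churn data first"),
         "1,204 rows returned"),
    ],
}
records = steps_to_audit_records(result, user_id="u-123")
```

&lt;p&gt;In production you'd ship these rows to your SIEM. LangSmith already captures the same trace, but an in-house copy keeps you covered if your retention requirements outlive the vendor contract.&lt;/p&gt;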



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: Self-hosting LangSmith Enterprise is possible but operationally complex. If your requirement is air-gapped deployment with no external dependencies, the setup effort is significant. The open-source LangSmith self-hosted version exists but lacks the enterprise security features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Contact sales. Roughly $500-2000/month depending on trace volume and team size.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Google Vertex AI Agent Builder&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams on GCP who need the strongest possible compliance posture and are comfortable with a managed service&lt;br&gt;
Vertex AI's agent infrastructure inherits Google Cloud's compliance certifications (SOC 2, ISO 27001, HIPAA, FedRAMP at higher tiers) and its security model. If your compliance requirements include any of the major certifications and you're already on GCP, this is the path of least resistance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Data never leaves your GCP project. VPC Service Controls can isolate your agent infrastructure from the public internet entirely. Cloud Audit Logs captures every API call with immutable records exportable to Chronicle or your SIEM. IAM integration handles SSO, MFA and fine-grained permissions via Google Workspace or your external IdP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;aiplatform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.preview&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;reasoning_engines&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize with your project's security configuration
&lt;/span&gt;&lt;span class="n"&gt;aiplatform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-gcp-project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# All data stays within your GCP project
&lt;/span&gt;    &lt;span class="c1"&gt;# VPC SC controls apply automatically
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agent with tool use
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EnterpriseAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reasoning_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Queryable&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent logic here
&lt;/span&gt;        &lt;span class="c1"&gt;# All interactions logged to Cloud Audit Logs
&lt;/span&gt;        &lt;span class="k"&gt;pass&lt;/span&gt;

&lt;span class="c1"&gt;# Deploy to managed infrastructure
# Inherits GCP SOC 2 / ISO 27001 compliance
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reasoning_engines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReasoningEngine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;EnterpriseAgent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&amp;gt;=0.20.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enterprise-agent-prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarise open support tickets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: GCP lock-in is real. If your infrastructure is multi-cloud or on-premises, Vertex AI's compliance story doesn't extend outside GCP. The agent builder is also less flexible than open frameworks: custom tool integrations require more boilerplate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Usage-based. Typically $0.025-0.05 per 1K characters for the agent layer, plus underlying model costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Microsoft AutoGen Enterprise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams in Microsoft-heavy environments (Azure, M365, Entra ID) who need multi-agent workflows with enterprise auth&lt;br&gt;
AutoGen's multi-agent framework is genuinely powerful for complex workflows where multiple specialised agents collaborate. The enterprise version adds Azure AD (Entra ID) integration, which means SSO is essentially free if you're already on the Microsoft stack, plus audit logging via Azure Monitor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Entra ID integration handles authentication, MFA and conditional access policies automatically. Azure Monitor captures agent interactions with Log Analytics workspaces for compliance reporting. Data residency is configurable to your Azure region. SOC 2 and ISO 27001 coverage through Azure's compliance certifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonimport&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogen&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GroupChat&lt;/span&gt;

&lt;span class="c1"&gt;# Configure with Azure OpenAI or Anthropic via Azure
# All traffic stays within your Azure tenant
&lt;/span&gt;&lt;span class="n"&gt;config_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-resource.openai.azure.com/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-02-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Auth handled by Azure AD managed identity
&lt;/span&gt;        &lt;span class="c1"&gt;# No secrets in config
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;azure_ad_token_provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Define specialised agents for enterprise workflow
&lt;/span&gt;&lt;span class="n"&gt;code_reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Review code for security vulnerabilities and standards compliance.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config_list&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;security_auditor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AssistantAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security_auditor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;system_message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Assess security posture and compliance requirements.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config_list&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config_list&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;orchestrator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UserProxyAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;human_input_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NEVER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;code_execution_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;work_dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sandbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_docker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Isolated execution
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Multi-agent workflow with full Azure Monitor logging
&lt;/span&gt;&lt;span class="n"&gt;groupchat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GroupChat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_reviewer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;security_auditor&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;max_round&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: AutoGen's enterprise features are most complete when you're using Azure OpenAI as the model provider. Using Anthropic or other providers requires more configuration and you lose some of the native Azure security integration benefits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: AutoGen framework is open source. Enterprise support and Azure infrastructure costs apply separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Haystack Enterprise (deepset)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams building document-heavy agent workflows (RAG pipelines, document processing, knowledge base agents) who need self-hosting with a strong security model&lt;/p&gt;

&lt;p&gt;Haystack is the most genuinely open-source option on this list from a security architecture perspective. deepset's enterprise offering adds the compliance layer on top of the open-source framework, but the framework itself is auditable in a way that matters to security teams who need to understand what their software is doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
Haystack Enterprise offers self-hosted deployment with full data isolation: your data never touches deepset's infrastructure unless you choose the cloud offering. SSO via SAML 2.0 and OIDC. Audit logging at the pipeline level. SOC 2 Type II certified. The architecture is pipeline-based, which makes it easier to implement and audit data flow controls than more opaque agent frameworks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;haystack&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.generators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnthropicGenerator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryBMRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.components.builders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RAGPromptBuilder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;haystack.logging&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configure_logging&lt;/span&gt;

&lt;span class="c1"&gt;# Configure enterprise logging
# Outputs to your SIEM via structured JSON
&lt;/span&gt;&lt;span class="nf"&gt;configure_logging&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INFO&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# Captures user context
&lt;/span&gt;    &lt;span class="n"&gt;include_component_trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Full pipeline trace
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;            &lt;span class="c1"&gt;# Capture by your log aggregator
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Build auditable RAG agent pipeline
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nc"&gt;InMemoryBMRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;document_store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;RAGPromptBuilder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;enterprise_template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_component&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;AnthropicGenerator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Connect components — data flow is explicit and auditable
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder.documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder.prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generator.prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run with user context for audit trail
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retriever&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_builder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Audit context
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: Haystack is optimised for pipeline-based workflows. If your use case requires complex multi-step autonomous agent behaviour (the ReAct-style loops that OpenClaw excels at), Haystack requires more custom work to implement.&lt;/p&gt;
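&lt;p&gt;To make that concrete, here is a minimal sketch of the kind of hand-rolled ReAct-style loop you would need to build yourself on top of a pipeline framework. Everything here is illustrative: &lt;code&gt;llm&lt;/code&gt; and &lt;code&gt;tools&lt;/code&gt; are hypothetical stand-ins, not Haystack APIs.&lt;/p&gt;

```python
# Hand-rolled ReAct-style loop (illustrative only, not a Haystack API).
# `llm` is any callable returning either "tool_name: arg" or "Final: answer".
def react_loop(llm, tools, question, max_steps=5):
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = llm(transcript)                  # model proposes an action
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        tool_name, arg = step.split(":", 1)     # parse "tool: argument"
        observation = tools[tool_name.strip()](arg.strip())
        transcript += f"\n{step}\nObservation: {observation}"
    return "No answer within step budget"
```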

&lt;p&gt;&lt;strong&gt;Pricing&lt;/strong&gt;: Open source framework. Enterprise support and cloud offering via deepset starting around $1,500/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. CrewAI Enterprise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;: Teams that need multi-agent workflows with role-based agents and want the fastest path from prototype to production&lt;br&gt;
CrewAI's framework is the newest on this list and the most rapidly developing. The enterprise offering is also the newest: SOC 2 Type II certification is in progress rather than complete, which is worth flagging for teams with strict compliance requirements. That said, the security architecture is solid and the certification is expected to complete in 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The security story:&lt;/strong&gt;&lt;br&gt;
SSO via SAML 2.0 and OIDC. Comprehensive audit logging of every agent action and crew interaction. Tenant-level data isolation in the cloud offering. Self-hosting available with a private deployment option. The role-based agent model maps naturally to enterprise organisational structures (a manager agent, a researcher agent, a writer agent), which makes access control design more intuitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Process&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai.security&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AccessControl&lt;/span&gt;

&lt;span class="c1"&gt;# Configure enterprise security
&lt;/span&gt;&lt;span class="n"&gt;audit_logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;structured_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;include_tool_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retention_days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;  &lt;span class="c1"&gt;# Configurable for compliance
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;access_control&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AccessControl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rbac_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./config/rbac.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sso_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;okta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Or AzureAD, Google, etc.
&lt;/span&gt;    &lt;span class="n"&gt;mfa_required&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define agents with role-based access
&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior Data Analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse customer data and identify patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expert in customer behaviour analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;db_query_tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analytics_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="c1"&gt;# Scoped tool access per role
&lt;/span&gt;    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_iter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reporter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Report Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Produce executive-ready summaries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;backstory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Specialises in clear business communication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;document_tool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;allowed_tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_write&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Tasks with explicit data handling
&lt;/span&gt;&lt;span class="n"&gt;analysis_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Analyse Q3 churn data. Focus on enterprise segment.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Structured analysis with top 5 churn factors&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Data classification for audit
&lt;/span&gt;    &lt;span class="n"&gt;data_classification&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidential&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;analyst&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reporter&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;report_task&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;audit_logger&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audit_logger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_control&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_control&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kickoff&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Where it falls short&lt;/strong&gt;: The SOC 2 gap is real for teams with strict compliance timelines. If you need certification documentation for a procurement decision this quarter, CrewAI Enterprise isn't ready yet. Check back in six months.&lt;br&gt;
&lt;strong&gt;Pricing&lt;/strong&gt;: Enterprise pricing on request. Community edition is open source and free.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Choose&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The honest decision framework comes down to three questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your infrastructure constraint?&lt;/strong&gt; GCP-locked teams should look at Vertex AI first: the compliance story is the strongest and the integration with existing GCP security tooling is seamless. Azure-heavy teams should look at AutoGen Enterprise. Infra-agnostic teams with self-hosting requirements should look at LangChain Enterprise or Haystack Enterprise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your workflow type?&lt;/strong&gt; Document-heavy RAG pipelines: Haystack. Multi-agent collaborative workflows: AutoGen or CrewAI. General-purpose agents with strong observability: LangChain Enterprise. Maximum compliance certification coverage: Vertex AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your timeline?&lt;/strong&gt; If you need SOC 2 documentation for a procurement decision this quarter, CrewAI is out. If you can wait six months and the workflow fit is strong, it's worth reconsidering.&lt;/p&gt;
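&lt;p&gt;If it helps to make the shortlisting mechanical, the three questions above can be encoded as a first-pass filter. This is purely illustrative; the categories and recommendations simply mirror the prose, so adjust them to your own constraints.&lt;/p&gt;

```python
# First-pass shortlist derived from the three questions in the text.
# Purely illustrative; encode your own constraints before relying on it.
INFRA = {
    "gcp": ["Vertex AI"],
    "azure": ["AutoGen Enterprise"],
    "agnostic": ["LangChain Enterprise", "Haystack Enterprise"],
}
WORKFLOW = {
    "rag": ["Haystack"],
    "multi_agent": ["AutoGen", "CrewAI"],
    "general": ["LangChain Enterprise"],
    "max_compliance": ["Vertex AI"],
}

def shortlist(infra, workflow, needs_soc2_now):
    picks = INFRA[infra] + WORKFLOW[workflow]
    if needs_soc2_now:
        # CrewAI's SOC 2 Type II certification is still in progress
        picks = [p for p in picks if "CrewAI" not in p]
    return list(dict.fromkeys(picks))  # dedupe, preserve order
```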

&lt;p&gt;None of these is a perfect OpenClaw replacement for the developer who wants full local control with zero external dependencies. That's genuinely a trade-off: the security features that enterprise teams require add architectural complexity that pure local deployment can't accommodate.&lt;/p&gt;

&lt;p&gt;If you're still evaluating whether OpenClaw might work for your use case with some custom security engineering, or whether one of these alternatives is the right fit, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/top-openclaw-alternatives/" rel="noopener noreferrer"&gt;top OpenClaw alternatives&lt;/a&gt;&lt;/strong&gt; comparison from Dextra Labs covers the detailed feature matrix, pricing and deployment complexity for each option.&lt;/p&gt;

&lt;p&gt;If you're evaluating self-hosted agent platforms more broadly, Dextra Labs has a detailed breakdown of OpenClaw's architecture and where it fits in the market including the specific security gaps and what it would take to close them with custom engineering.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>security</category>
      <category>python</category>
    </item>
    <item>
      <title>PaperBanana: Automating Research Diagrams With an Agentic AI Framework</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:24:52 +0000</pubDate>
      <link>https://dev.to/dextralabs/paperbanana-automating-research-diagrams-with-an-agentic-ai-framework-3ajk</link>
      <guid>https://dev.to/dextralabs/paperbanana-automating-research-diagrams-with-an-agentic-ai-framework-3ajk</guid>
      <description>&lt;p&gt;Google just shipped a framework that turns natural language into publication-ready figures. Here's how the agentic pipeline actually works, with real code.&lt;/p&gt;

&lt;p&gt;I want to tell you about the specific kind of frustration that makes researchers consider career changes.&lt;/p&gt;

&lt;p&gt;You've just finished a three-month experiment. The results are clean, the story is clear and all you need to do is produce the figures for the paper. Six hours later you're on Stack Overflow at 11pm trying to figure out why matplotlib is cutting off your axis labels in the PDF export and the actual insight you were excited about three hours ago feels very far away.&lt;/p&gt;

&lt;p&gt;PaperBanana is Google AI's answer to this. It's an agentic framework that takes natural language descriptions and produces publication-ready research figures, not rough drafts that need cleanup, but figures you can drop directly into a Nature or NeurIPS submission. The GitHub activity around it has been significant and the architecture underneath deserves attention independent of the diagram use case.&lt;/p&gt;

&lt;p&gt;This is a technical deep-dive. We're going to cover what PaperBanana is, how the agentic loop works and how to actually use it, with code that runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Makes This Different From Previous Attempts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The graveyard of "natural language to chart" tools is substantial. Most of them fail in the same way: they generate a plausible first attempt and then have no mechanism for improving it. The gap between a plausible matplotlib output and a publication-ready figure is significant (typography, colour accessibility, journal-specific formatting requirements, legend placement, resolution), and closing that gap requires iteration.&lt;/p&gt;

&lt;p&gt;PaperBanana's core insight is that figure generation is a multi-criteria quality problem and single-pass generation can't solve it reliably. The solution is an agentic critic-generator loop that iterates until quality thresholds are met. The Critic agent produces structured, actionable feedback. The Generator agent acts on that feedback. The loop continues until the output satisfies defined publication standards or hits a maximum iteration count.&lt;/p&gt;

&lt;p&gt;This sounds simple. It works remarkably well. And the architecture pattern generalises to any task where quality is multidimensional.&lt;/p&gt;
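&lt;p&gt;Stripped to its essentials, the pattern is a short loop. The sketch below is an assumption about the shape, not the PaperBanana implementation; &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;critique&lt;/code&gt; are hypothetical callables.&lt;/p&gt;

```python
# Skeleton of the critic-generator pattern described above.
# `generate` and `critique` are hypothetical stand-ins, not PaperBanana APIs.
def refine(spec, generate, critique, threshold=0.85, max_iters=6):
    artifact, feedback = None, []
    for _ in range(max_iters):
        artifact = generate(spec, feedback)   # act on prior feedback
        score, feedback = critique(artifact)  # multi-criteria review
        if score >= threshold:                # quality gate met
            break
    return artifact
```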

&lt;h2&gt;
  
  
  &lt;strong&gt;The Agent Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Four agents. Specific responsibilities. Structured handoffs between them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INPUT
(natural language description + data)
         ↓
  ┌─────────────────┐
  │  PLANNER AGENT  │
  │                 │
  │ • Interprets    │
  │   request       │
  │ • Selects chart │
  │   type          │
  │ • Identifies    │
  │   data transforms│
  │ • Outputs spec  │
  └────────┬────────┘
           ↓
  ┌─────────────────┐
  │  CODE GENERATOR │◄──────────────────┐
  │  AGENT          │                   │
  │                 │                   │
  │ • Translates    │                   │
  │   spec to code  │                   │
  │ • matplotlib /  │                   │
  │   seaborn /     │                   │
  │   plotly        │                   │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐                   │
  │ EXECUTOR AGENT  │                   │
  │                 │                   │
  │ • Runs code in  │                   │
  │   sandbox       │                   │
  │ • Captures      │                   │
  │   output + errors│                  │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐    FAIL           │
  │  CRITIC AGENT   │───────────────────┘
  │                 │
  │ • Evaluates vs  │
  │   pub standards │
  │ • Structured    │
  │   feedback      │
  └────────┬────────┘
           │ PASS
           ↓
        OUTPUT
   (publication-ready figure)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Planner runs once. The Code Generator → Executor → Critic loop runs until the quality threshold is reached. In practice this converges in three to five iterations for most figure types.&lt;/p&gt;

&lt;p&gt;The Critic agent is the piece that makes this work. It doesn't output "this needs improvement"; it outputs structured feedback with severity ratings and specific suggested fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;json&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"quality_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.76&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"feedback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HIGH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"formatting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Figure width 195mm exceeds two-column maximum of 180mm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Set figsize=(7.09, height) — 7.09 inches = 180mm"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MEDIUM"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"accessibility"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Colour palette fails deuteranopia simulation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Replace current palette with Wong 2011: ['#000000', '#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"LOW"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"typography"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Axis label font weight lighter than journal standard"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"suggested_fix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add fontweight='bold' to xlabel() and ylabel() calls"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This specificity is what makes the loop converge efficiently. The Code Generator doesn't need to guess what to fix; it receives exact, implementable instructions.&lt;/p&gt;
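&lt;p&gt;One way a generator might consume that feedback is to address the HIGH-severity items first. The field names below follow the JSON example above; the instruction format itself is an assumption, not part of PaperBanana.&lt;/p&gt;

```python
# Turn the critic's structured feedback into ordered fix instructions.
# Field names match the JSON report example; the output format is illustrative.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}

def feedback_to_instructions(report):
    items = sorted(report["feedback"],
                   key=lambda f: SEVERITY_ORDER[f["severity"]])
    return [f"[{f['severity']}] {f['category']}: {f['suggested_fix']}"
            for f in items]
```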

&lt;h2&gt;
  
  
  &lt;strong&gt;Installation and Setup&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bashgit clone https://github.com/google-research/paperbanana
&lt;span class="nb"&gt;cd &lt;/span&gt;paperbanana
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PaperBanana supports multiple LLM backends. For the Critic agent specifically, Claude Sonnet produces notably better structured feedback than the alternatives in our testing; the specificity and actionability of the feedback directly affects how fast the loop converges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paperbanana&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Config&lt;/span&gt;

&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# or "openai"
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dpi&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;style_preset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Options: "nature", "science", "ieee", 
&lt;/span&gt;    &lt;span class="c1"&gt;#          "neurips", "arxiv", "custom"
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Your First Publication Figure&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The simplest use case. Describe what you want, provide your data, get a figure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonimport&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Experimental results data
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Baseline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Method C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ours&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;71.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;78.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;82.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;84.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;89.3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1_Score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;68.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;76.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;80.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;83.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;87.8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Inference_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;12.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;45.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;38.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;61.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;29.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;F1_std&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;1.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Create a grouped bar chart comparing five methods on 
Accuracy and F1 Score. Use a colourblind-accessible palette.
Include error bars from the std columns. Highlight the 
&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Ours&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; group with a distinct visual treatment.
Add a horizontal dashed line at 85 labeled 
&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;State-of-the-art threshold&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;. Nature journal formatting,
two-column width (180mm), 9pt Helvetica Neue.
Legend outside plot area, upper right.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./figures/method_comparison.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Iterations: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Quality score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# Iterations: 4
# Quality score: 0.910
# Saved: ./figures/method_comparison.pdf
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four iterations. Quality score above threshold. Figure ready for submission.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Inspecting the Iteration Log&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The iteration inspector is one of PaperBanana's most useful features for understanding what the agent loop is doing and for debugging when it doesn't converge the way you expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;&lt;span class="c1"&gt;# Inspect what happened at each iteration
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iteration_log&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;─&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ITERATION &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; │ Quality: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;─&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;critic_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Critic feedback:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;critic_feedback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;icon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔴&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; \
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟡&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MEDIUM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟢&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Quality threshold met&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the method comparison figure above, the log looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;────────────────────────────────────────────────
ITERATION 1 │ Quality: 0.58
────────────────────────────────────────────────
Critic feedback:
  [accessibility] Default colour cycle fails 
     protanopia simulation — replace with 
     colourblind-safe palette
  [formatting] Figure 210mm wide, exceeds 
     two-column 180mm maximum
  [typography] Legend inside plot area, 
     overlapping bars at right edge
  [data] Error bars present but cap size 0 — 
     not visible in print
  [style] Minor: grid lines too prominent, 
     reduce alpha to 0.3

────────────────────────────────────────────────
ITERATION 2 │ Quality: 0.74
────────────────────────────────────────────────
Critic feedback:
  [formatting] 'Ours' group not visually 
     distinct — add hatching or edge highlight
  [typography] Axis labels 8pt, journal 
     minimum 9pt
  [data] State-of-the-art line label font 
     size inconsistent with axis labels

────────────────────────────────────────────────
ITERATION 3 │ Quality: 0.86
────────────────────────────────────────────────
Critic feedback:
  [formatting] Minor: x-axis label padding 
     slightly tight — increase labelpad to 8

────────────────────────────────────────────────
ITERATION 4 │ Quality: 0.91
────────────────────────────────────────────────
  ✓ Quality threshold met — output generated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the loop in practice. The first iteration catches the structural issues: wrong dimensions, inaccessible colours. The second catches the medium-severity items. By iteration three the feedback is minor, and iteration four crosses the threshold.&lt;/p&gt;
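&lt;p&gt;The convergence behaviour above boils down to a generate-critique-revise loop. Here is a minimal sketch of that loop in plain Python — all of the names are illustrative, not the PaperBanana API:&lt;/p&gt;

```python
# Minimal sketch of a generate-critique-revise loop.
# `generate`, `critique`, and `revise` are caller-supplied callables;
# none of these names come from the PaperBanana API.

def run_loop(generate, critique, revise, threshold=0.90, max_iterations=8):
    """Produce an artifact, score it, and revise until the quality
    threshold is met or the iteration budget runs out."""
    artifact = generate()
    log = []
    for _ in range(max_iterations):
        score, feedback = critique(artifact)
        log.append((score, feedback))
        if score >= threshold:
            break  # quality threshold met
        artifact = revise(artifact, feedback)
    return artifact, log
```

&lt;p&gt;With stub agents whose scores climb 0.58 → 0.74 → 0.86 → 0.91, this loop terminates on the fourth critique, matching the shape of the log above.&lt;/p&gt;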

&lt;h2&gt;
  
  
  &lt;strong&gt;Extending the Critic for Journal-Specific Requirements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The built-in Critic uses general publication standards. For specific journal requirements or custom style guides, extend it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfrom&lt;/span&gt; &lt;span class="n"&gt;paperbanana&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CriticAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;paperbanana.evaluation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;EvaluationCriteria&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FeedbackItem&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NeurIPS2026Critic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CriticAgent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Critic extended with NeurIPS 2026 
    camera-ready requirements.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;EvaluationCriteria&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;177&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;min_font_size_pt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Times New Roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;colour_accessibility&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_pdf_size_mb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;required_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PDF/A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prohibited_elements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rasterized_text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedded_fonts_missing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Run base evaluation
&lt;/span&gt;        &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Layer NeurIPS-specific checks
&lt;/span&gt;        &lt;span class="n"&gt;neurips_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

        &lt;span class="c1"&gt;# Dimension check
&lt;/span&gt;        &lt;span class="n"&gt;width_mm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width_inches&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;25.4&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;width_mm&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FeedbackItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Width &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;width_mm&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;mm exceeds &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NeurIPS max &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;mm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;suggested_fix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set figsize width to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_width_mm&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mf"&gt;25.4&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                              &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; inches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Font check  
&lt;/span&gt;        &lt;span class="n"&gt;detected_font&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_primary_font&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;detected_font&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;FeedbackItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Font &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;detected_font&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; — NeurIPS requires &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;REQUIREMENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;required_font_family&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;suggested_fix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set plt.rcParams[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;font.family&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                             &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;Times New Roman&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; before plotting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;neurips_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use custom critic
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;critic_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;NeurIPS2026Critic&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;llm_backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quality_threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.90&lt;/span&gt;  &lt;span class="c1"&gt;# Higher bar for camera-ready
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PaperBanana&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern — a base agent extended with domain-specific checks via structured feedback items — applies directly to other agentic use cases: document review agents with organisation-specific criteria, code review agents with team-specific standards, data validation agents with domain-specific rules. The architecture is the same.&lt;/p&gt;
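&lt;p&gt;Stripped of the figure-specific details, the same layering fits in a few lines of plain Python. Every name below is illustrative — a hypothetical code-review critic, not part of any real library:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical sketch of the base-critic + extension pattern
# applied to code review. All names are illustrative.

@dataclass
class Feedback:
    severity: str
    category: str
    message: str

class BaseCritic:
    """General checks every review gets."""
    def evaluate(self, text: str) -> list[Feedback]:
        items = []
        if "TODO" in text:
            items.append(Feedback("LOW", "general", "Unresolved TODO"))
        return items

class TeamCritic(BaseCritic):
    """Layers team-specific rules on top of the base checks."""
    MAX_LINE_LENGTH = 100

    def evaluate(self, text: str) -> list[Feedback]:
        items = super().evaluate(text)  # base checks run first
        for n, line in enumerate(text.splitlines(), start=1):
            if len(line) > self.MAX_LINE_LENGTH:
                items.append(Feedback(
                    "MEDIUM", "team_style",
                    f"Line {n} exceeds {self.MAX_LINE_LENGTH} chars"))
        return items
```

&lt;p&gt;Calling &lt;code&gt;super().evaluate()&lt;/code&gt; first keeps the base checks authoritative; extensions only append, never override — the same design choice the NeurIPS critic above makes with &lt;code&gt;merge_feedback&lt;/code&gt;.&lt;/p&gt;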

&lt;h2&gt;
  
  
  &lt;strong&gt;Batch Generation for Multi-Figure Papers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real papers have multiple figures, and those figures need to be visually consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythonfigure_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig1_training_curves&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Training and validation loss curves for three 
            model variants over 100 epochs. Log scale y-axis. 
            Use solid lines for training, dashed for validation.
            Mark the convergence epoch with a vertical line.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;training_df&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig2_ablation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            Horizontal bar chart showing ablation study results.
            Highlight the full model row. Sort by performance 
            descending. Include percentage improvement labels 
            on each bar.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ablation_df&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fig3_qualitative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            3x3 grid showing input/output pairs for qualitative 
            evaluation. Three rows: success cases, failure cases,
            edge cases. Add a thin red border on failure cases.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sample_images&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;figures&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;figure_set&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./paper_figures/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;consistency_check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Verify visual consistency 
&lt;/span&gt;                              &lt;span class="c1"&gt;# across all figures
&lt;/span&gt;    &lt;span class="n"&gt;shared_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;style_preset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neurips&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;colour_palette&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wong2011&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_font_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;fig_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;converged&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fig_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iterations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; iterations, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# ✓ fig1_training_curves: 3 iterations, quality 0.88
# ✓ fig2_ablation: 4 iterations, quality 0.91  
# ✓ fig3_qualitative: 5 iterations, quality 0.87
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;consistency_check=True&lt;/code&gt; parameter runs a post-generation agent pass that verifies colour-palette consistency, font-size matching and style coherence across all figures. It's the kind of detail that's tedious to manage manually and that PaperBanana handles automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where the Architecture Goes Beyond Research&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The critic-generator loop with structured feedback is the pattern. PaperBanana implements it for research figures. The same architecture handles any task where quality is multidimensional and single-pass generation can't reliably satisfy all dimensions simultaneously.&lt;/p&gt;

&lt;p&gt;Code review with team-specific standards. Document formatting with compliance requirements. Data pipeline validation against schema contracts. Report generation with brand guidelines. The structure is identical: define your quality criteria in the Critic, let the Generator iterate against them, and exit when the thresholds are met.&lt;/p&gt;
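&lt;p&gt;As a minimal sketch of that loop (my own illustration with hypothetical function names, not the PaperBanana API), the whole pattern fits in a dozen lines of Python:&lt;/p&gt;

```python
def critic_generator_loop(task, generate, critique,
                          threshold=0.85, max_iters=5):
    """generate(task, feedback) returns a candidate artifact;
    critique(candidate) returns a dict mapping criterion name to a 0-1 score."""
    feedback = None
    candidate, scores = None, {}
    for iteration in range(1, max_iters + 1):
        candidate = generate(task, feedback)
        scores = critique(candidate)
        # Exit only when every quality dimension clears the threshold at once.
        if all(score >= threshold for score in scores.values()):
            return candidate, iteration, scores
        # Structured feedback: only the failing dimensions go back to the Generator.
        feedback = {name: score for name, score in scores.items()
                    if not score >= threshold}
    return candidate, max_iters, scores
```

The exit condition is the important part: the loop terminates on joint satisfaction of all criteria, or on the iteration budget, whichever comes first.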

&lt;p&gt;Understanding how PaperBanana implements this at a concrete level is the most transferable thing in this article. The diagram generation is useful. The agentic pattern underneath it is what you want to carry into your next project.&lt;/p&gt;

&lt;p&gt;For the full deep-dive into the &lt;a href="https://dextralabs.com/blog/paperbanana-agentic-ai-framework/" rel="noopener noreferrer"&gt;PaperBanana agentic AI framework&lt;/a&gt;, covering the Planner's specification format, the Critic's evaluation rubrics and the prompt engineering that makes the loop converge reliably, see the Dextra Labs writeup; it covers what a single Dev.to article can't fit.&lt;/p&gt;

&lt;p&gt;This is one example of agentic AI solving a specific, high-friction workflow in research. For production agentic systems at enterprise scale (custom agent architectures for document processing, data validation and complex multi-step automation across real enterprise workflows), &lt;strong&gt;&lt;a href="https://dextralabs.com/ai-agent-development-services/" rel="noopener noreferrer"&gt;Dextra Labs builds and deploys these systems&lt;/a&gt;&lt;/strong&gt; across industries. The patterns that make PaperBanana reliable in a research context are the same patterns that make enterprise agents reliable in production.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Claude Code vs Cursor vs GitHub Copilot: Honest Comparison After 30 Days</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Tue, 24 Mar 2026 05:14:18 +0000</pubDate>
      <link>https://dev.to/dextralabs/claude-code-vs-cursor-vs-github-copilot-honest-comparison-after-30-days-1030</link>
      <guid>https://dev.to/dextralabs/claude-code-vs-cursor-vs-github-copilot-honest-comparison-after-30-days-1030</guid>
<description>&lt;p&gt;I used each tool for real work, not demos. Here's what 30 days of daily use actually taught me.&lt;/p&gt;

&lt;p&gt;I want to be upfront about something before you read further.&lt;br&gt;
I'm not a tool reviewer. I'm a backend engineer who spent thirty days deliberately rotating between three AI coding assistants on production work (real features, real bugs, real legacy code) because my team was about to make a purchasing decision and I didn't want to base it on YouTube demos and vendor comparison pages.&lt;/p&gt;

&lt;p&gt;What follows is a developer diary, not a benchmark. The numbers I'll share come from my actual work log: tasks completed, time estimates versus actuals, bugs that made it to review, and an honestly subjective but carefully considered rating of each tool's learning curve. Your mileage will vary based on your stack and workflow. But if you're a backend developer working primarily in Python and TypeScript on a mixed legacy and greenfield codebase, this is probably the most relevant thirty-day comparison you'll find.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;My stack&lt;/strong&gt;: Python FastAPI backend, TypeScript React frontend, PostgreSQL, some legacy Django services that predate my time at the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The task distribution&lt;/strong&gt;: I tried to give each tool roughly equivalent work across four categories: refactoring existing code, debugging production issues, building greenfield features, and navigating legacy code I'd never touched before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The rotation&lt;/strong&gt;: Weeks 1 and 2 on Claude Code, Week 3 on Cursor, Week 4 on GitHub Copilot. I finished Week 4 with a two-day side-by-side comparison on identical tasks to calibrate the subjective impressions from the diary entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 1–2: Claude Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Claude Code runs in the terminal, which immediately felt either liberating or alienating depending on your relationship with CLI tools. I'm comfortable there, so the initial setup friction was low. What struck me in the first hour was the conversational depth: you can describe what you're trying to accomplish at a high level and the tool asks clarifying questions before touching anything. That behaviour felt unusual coming from Copilot's autocomplete model, but I grew to appreciate it quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Refactoring Task: Decomposing a 600-Line Service&lt;/strong&gt;&lt;br&gt;
The first real test was a service file that had grown to 600 lines over two years, mixing business logic, data access, and API formatting in ways that made every change feel dangerous. I described the problem to Claude Code, shared the file, and asked it to propose a decomposition before making any changes.&lt;/p&gt;

&lt;p&gt;What came back was a structured plan, three proposed modules, rationale for each boundary decision, a list of the shared state that would need to be handled explicitly during the split. I hadn't asked for a plan. It produced one anyway, and it was better than the rough sketch I'd been carrying in my head.&lt;/p&gt;

&lt;p&gt;The actual refactoring took about two hours of collaborative back-and-forth. Final result: four files instead of one, full test coverage on the extracted modules, zero regressions in the test suite. My estimate before starting had been a full day of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~4 hours. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Task: The Async Mystery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We had an intermittent 504 error in a background task processor that had been in the "investigate when we have time" category for six weeks. I described the symptoms, shared the relevant code sections, and asked Claude Code to help me think through the failure modes.&lt;/p&gt;

&lt;p&gt;The debugging session felt genuinely collaborative in a way that's hard to describe. It wasn't suggesting fixes; it was asking questions that forced me to articulate assumptions I'd been making implicitly. "What's the timeout configuration on the task queue client?" "Is the database connection pool shared between the web process and the worker?" Two questions in, I'd identified the root cause myself. Claude Code had functioned like a rubber duck that asks better questions than a rubber duck.&lt;/p&gt;

&lt;p&gt;Fix took 20 minutes. Six weeks of intermittent 504s gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: hard to quantify, but meaningful. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;br&gt;
The terminal interface has a real cost for frontend work. When I needed to iterate on React component styling, the round-trip of describing visual changes in text and mentally mapping the response back to pixels was slower than just doing it myself. Claude Code is built for engineers who think in code and text. Visual iteration isn't its strength.&lt;/p&gt;

&lt;p&gt;The other friction point was context switching. Each session starts fresh by default, so if you're working across multiple files on a multi-day task, you're re-establishing context at the start of each session. This is manageable with good habits (I started keeping a brief context note I'd paste at session start), but it adds overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1–2 overall rating: 8.5/10 for backend work. 6/10 for frontend.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 3: Cursor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Cursor is a VS Code fork with AI baked into the editor. If you're already living in VS Code, the transition is nearly frictionless: your extensions, your keybindings, your colour scheme all carry over. The first time you hit Cmd+K on a selected block of code and describe what you want done to it, the experience feels genuinely magical in a way that terminal-based tools don't produce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Greenfield Feature: Building a New API Endpoint Set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week 3 happened to align with a sprint where I was building a new set of API endpoints for a reporting module, greenfield work with clear requirements, starting from scratch. This was Cursor's sweet spot.&lt;/p&gt;

&lt;p&gt;The inline generation is fast and context-aware in a way that changes the development rhythm. I'd write a function signature and a docstring describing intent, hit the shortcut, and get an implementation that was usually 80% correct and 100% stylistically consistent with the surrounding code. The iteration loop (generate, review, adjust, generate again) became fluid enough that it stopped feeling like using a tool and started feeling like an accelerated version of typing.&lt;/p&gt;

&lt;p&gt;I shipped the full reporting endpoint set in one day. My sprint estimate had been three days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~2 days. Bugs introduced: 2 (both caught in review — incorrect default parameter values).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legacy Code Task: The Django Archaeology Project&lt;/strong&gt;&lt;br&gt;
We have a Django service that processes financial transactions. It's five years old, written by people who've all left, and the documentation is optimistic at best. I needed to add a new transaction type and had no idea where to start.&lt;/p&gt;

&lt;p&gt;Cursor's codebase indexing made this significantly less painful than it would have been otherwise. I could ask questions about the codebase ("where is payment status updated?", "what calls this function?") and get accurate answers that saved the half-day of archaeological reading I'd normally do before touching anything.&lt;/p&gt;

&lt;p&gt;The actual implementation assistance was good but not perfect. Cursor occasionally suggested patterns that were internally consistent but inconsistent with conventions the existing codebase had established in modules it hadn't deeply indexed. The suggestions were plausible, just wrong for this specific context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~3 hours on navigation. Bugs introduced: 1 (pattern mismatch caught in review).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;br&gt;
Cursor's AI features require sending code to an external API, which created friction with our security team for the services with the most sensitive business logic. There's a privacy mode, but it disables some of the most useful features. For teams with strict data handling requirements, this is a real constraint that the demos don't surface.&lt;/p&gt;

&lt;p&gt;The other issue was suggestion quality on TypeScript generics and complex type manipulation. The suggestions were often syntactically correct but semantically wrong in ways that compiled but produced subtle type unsafety. I learned to review TypeScript suggestions more carefully than Python ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3 overall rating: 9/10 for greenfield. 7/10 for legacy. Privacy constraints: significant for some teams.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Week 4: GitHub Copilot&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First Impressions&lt;/strong&gt;&lt;br&gt;
Coming back to Copilot after two weeks on Claude Code and one on Cursor felt like returning to something familiar, because it is. Copilot's autocomplete model is the baseline most of us have internalised. The ghost text appears, you Tab to accept or ignore, you move on. It's frictionless in a way the other tools aren't, because it doesn't ask anything of you.&lt;br&gt;
That frictionlessness is both its greatest strength and its fundamental limitation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Task: Production Memory Leak&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Week 4 opened with a production incident: a memory leak in a data processing service that was causing OOM kills under sustained load. This was the kind of debugging task where I'd hoped Copilot's context awareness would shine.&lt;/p&gt;

&lt;p&gt;It helped, but less than the other tools had on equivalent tasks. Copilot's suggestions were reactive: it would suggest the next line of code I was writing well, but it couldn't engage with the debugging process at a higher level of abstraction. I'd write a hypothesis as a comment and it would suggest the code to test that hypothesis, which was useful. But the hypothesis generation was all me.&lt;/p&gt;

&lt;p&gt;The memory leak turned out to be a generator that was being accidentally materialised into a list in a hot path. Found it with traditional debugging augmented by Copilot's autocomplete helping me write the profiling code faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~30 minutes on instrumentation code. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;
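&lt;p&gt;The bug class is easy to reproduce in isolation. A hedged sketch with hypothetical names (not the actual service code): materialising a generator pulls every record into memory at once, while iterating lazily keeps peak memory flat.&lt;/p&gt;

```python
def read_records(n):
    # Stand-in for a streaming source (DB cursor, message queue, file).
    for i in range(n):
        yield {"id": i, "payload": "x" * 100}

def process_eagerly(n):
    # Leaky pattern: list() materialises every record before processing,
    # so peak memory grows with n.
    records = list(read_records(n))
    return sum(1 for r in records if r["id"] % 2 == 0)

def process_lazily(n):
    # Fixed pattern: consume the generator one record at a time;
    # peak memory stays flat regardless of n.
    return sum(1 for r in read_records(n) if r["id"] % 2 == 0)
```

Both functions return the same answer; the difference only shows up in memory profiles under sustained load, which is exactly why the bug hid for so long.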

&lt;p&gt;&lt;strong&gt;Refactoring Task: TypeScript Interface Consolidation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was Copilot's best week. We had a TypeScript frontend with interface definitions scattered across twelve files, many overlapping, some contradictory. The task was to consolidate them into a coherent type system.&lt;/p&gt;

&lt;p&gt;Copilot's pattern completion on TypeScript interfaces is excellent. As I worked through the consolidation manually, it was consistently predicting the correct interface extensions, the right generic constraints, and the appropriate utility types. The work was still primarily mine, but the acceleration on the mechanical parts was real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved: ~2 hours. Bugs introduced: 0.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where It Got Frustrating&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copilot's context window is narrow compared to the other tools. It knows the current file and some of the surrounding files, but it doesn't have the project-level awareness that Cursor's indexing or Claude Code's conversational context provides. For any task that requires understanding how pieces fit together across the codebase, you're on your own.&lt;/p&gt;

&lt;p&gt;The other limitation is the ceiling. Copilot makes you faster at what you already know how to do. It doesn't help you figure out what to do when you're genuinely uncertain. For junior developers or engineers working outside their comfort zone, the gap between Copilot and the reasoning-first tools is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 4 overall rating: 8/10 for mechanical tasks. 6/10 for complex reasoning&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Head-to-Head: Two Days, Identical Tasks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At the end of Week 4 I spent two days running the same four tasks on all three tools to calibrate the diary impressions with direct comparison. The tasks: write a new database migration with rollback logic, debug a failing test with a non-obvious root cause, refactor a function with too many responsibilities, and explain an unfamiliar section of the codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgpho6utcslk1wcpdfzb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgpho6utcslk1wcpdfzb.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Honest Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Claude Code&lt;/strong&gt; if you're doing complex backend work, debugging thorny issues, or working on tasks where understanding the problem deeply matters more than generating code quickly. The reasoning quality is the best of the three. The terminal interface is a real cost for frontend work and visual iteration. For developers who find the CLI workflow uncomfortable, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-alternatives-for-developers/" rel="noopener noreferrer"&gt;Claude Code alternatives for developers&lt;/a&gt;&lt;/strong&gt; roundup covers the options; Cursor is the closest one that preserves most of the reasoning depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cursor if&lt;/strong&gt; you want the best balance of reasoning quality and IDE integration. The greenfield development experience is excellent and the codebase navigation is genuinely useful for large or unfamiliar codebases. Check your data handling requirements before deploying it on sensitive services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Copilot if&lt;/strong&gt; your team is already paying for it (many are through enterprise GitHub), you're doing primarily TypeScript or well-typed Python work, and your use cases are more "go faster at things I know how to do" than "help me figure out things I don't know how to do." It's the lowest friction option and that has real value at the margin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers across 30 days&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eizxkx4h48xwj0yyden.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eizxkx4h48xwj0yyden.png" alt=" " width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these tools is the right answer for every team or every task. The right choice depends on your stack, your team's comfort with different interfaces, your data handling requirements, and whether your primary bottleneck is reasoning quality or mechanical speed.&lt;/p&gt;

&lt;p&gt;Choosing the right AI coding assistant depends on your stack, team size, and use case. For enterprise teams navigating this decision across multiple developers and compliance requirements, consulting with specialists like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; can save months of trial and error; they've run these evaluations across enough enterprise stacks to have opinions worth hearing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What's your experience been? I'm curious whether the Claude Code vs Cursor reasoning quality gap holds for other stacks or whether it's specific to the Python/TypeScript combination I was working in. Drop it in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>10 AI Code Review Tools That Actually Caught Bugs My Team Missed</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Mon, 23 Mar 2026 05:04:48 +0000</pubDate>
      <link>https://dev.to/dextralabs/10-ai-code-review-tools-that-actually-caught-bugs-my-team-missed-n8g</link>
      <guid>https://dev.to/dextralabs/10-ai-code-review-tools-that-actually-caught-bugs-my-team-missed-n8g</guid>
      <description>&lt;p&gt;&lt;em&gt;I planted 23 bugs across a real codebase. Here's what each tool found and what slipped through.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let me tell you how this started.&lt;br&gt;
Three months ago, a bug made it to production that had survived four human code reviews, a CI pipeline and two rounds of QA. It wasn't subtle: it was a classic off-by-one error in a pagination function that only surfaced under a specific combination of filter conditions. One of those bugs that's embarrassingly obvious in retrospect and genuinely invisible in a forward pass through a pull request.&lt;/p&gt;
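&lt;p&gt;To make that bug class concrete, here's a hypothetical reconstruction (not the actual production function): the buggy slice bound silently drops the last item of every page, which is invisible until a filter combination makes that last item matter.&lt;/p&gt;

```python
def paginate_buggy(items, page, page_size):
    start = (page - 1) * page_size
    end = start + page_size - 1   # off by one: should be start + page_size
    return items[start:end]

def paginate_fixed(items, page, page_size):
    # Correct bound: Python slices exclude the end index already.
    start = (page - 1) * page_size
    return items[start:start + page_size]
```

A reviewer scanning the diff sees plausible arithmetic; only a test asserting the page length catches it.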

&lt;p&gt;After the incident retrospective, someone on the team asked the question we'd been avoiding: should we be using AI code review tools? We'd all seen the demos. We'd all nodded along to the conference talks. None of us had actually run a systematic evaluation.&lt;/p&gt;

&lt;p&gt;So I ran one.&lt;/p&gt;

&lt;p&gt;I took a real service from our codebase, a Python FastAPI backend with about 4,000 lines of active code, and planted 23 bugs across it. Some obvious, some subtle, some genuinely nasty. Then I ran ten different AI code review tools against it and tracked exactly what each one caught, what it missed and how many false positives it generated along the way.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Bug Set&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before the results, it helps to know what I was testing against. The 23 planted bugs fell into five categories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logic errors (6)&lt;/strong&gt; — off-by-one conditions, incorrect boolean operators, wrong comparison operators in conditional branches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security vulnerabilities (5)&lt;/strong&gt; — SQL injection via string formatting, missing authentication checks on endpoints, exposed sensitive data in logs, insecure random number generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Race conditions (4)&lt;/strong&gt; — shared state mutations without locks, async functions with incorrect await patterns, database transactions without proper isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type errors (4)&lt;/strong&gt; — incorrect type assumptions on function inputs, missing None checks, wrong type coercions in data transformation functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance issues (4)&lt;/strong&gt; — N+1 query patterns, missing database indexes referenced in query plans, inefficient list operations in hot paths.&lt;/p&gt;

&lt;p&gt;I ran each tool with default configuration first, then with team-specific configuration where the tool supported it. Detection rates below are from the default-configuration run unless noted.&lt;/p&gt;
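&lt;p&gt;For a sense of what the race-condition category looked like, here's an illustrative sketch of the simplest case (shared state mutated without a lock; hypothetical code, not from the test service). The unsynchronised counter can lose updates when threads interleave; the locked version cannot.&lt;/p&gt;

```python
import threading

def run_counter(use_lock, workers=8, increments=10_000):
    count = 0
    lock = threading.Lock()

    def worker():
        nonlocal count
        for _ in range(increments):
            if use_lock:
                with lock:
                    count += 1
            else:
                # Unsynchronised read-modify-write: the read and the write
                # can interleave with another thread's, losing updates.
                count += 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count
```

The planted bugs were subtler variants of this shape (async tasks and transaction isolation), which is why most tools missed them.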

&lt;h2&gt;
  
  
  &lt;strong&gt;The Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. GitHub Copilot Code Review&lt;br&gt;
Bugs caught: 16/23 (70%) | False positives: 4 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Copilot's inline review is the one most teams are already closest to, and it earned its place at the top of this list. It caught all five security vulnerabilities (the SQL injection, the auth bypass, the log exposure) without any configuration. The logic errors were hit or miss: it caught four of six, missing both cases where the error was in a complex nested conditional.&lt;/p&gt;

&lt;p&gt;Where it genuinely surprised me was on the N+1 query patterns. It flagged two of the four performance issues and gave actionable query restructuring suggestions, not just a flag. The suggestions weren't always idiomatic for our specific ORM, but they pointed in the right direction.&lt;/p&gt;
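&lt;p&gt;For anyone who hasn't hit the N+1 pattern before, a minimal illustration (plain sqlite3 here, not our ORM): the first version issues one query per parent row, the restructured version does the same work with a single join.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY,
                        author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO posts VALUES (1, 1, 'p1'), (2, 1, 'p2'), (3, 2, 'p3');
""")

def titles_n_plus_one():
    # N+1 pattern: one query for the parents, then one query per parent.
    out = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors"):
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ?", (author_id,)
        ).fetchall()
        out[name] = [title for (title,) in rows]
    return out

def titles_joined():
    # Restructured: a single join, grouped in Python afterwards.
    out = {}
    join_sql = ("SELECT a.name, p.title FROM authors a "
                "JOIN posts p ON p.author_id = a.id ORDER BY p.id")
    for name, title in conn.execute(join_sql):
        out.setdefault(name, []).append(title)
    return out
```

Both return the same mapping; the difference is query count, which is why the pattern only hurts at production row counts.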

&lt;p&gt;The four false positives were all style-related: it flagged variable naming conventions that matched our internal style guide but differed from PEP 8 defaults. Configurable, but annoying out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. CodeRabbit&lt;br&gt;
Bugs caught: 15/23 (65%) | False positives: 6 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeRabbit's PR-level summary is genuinely useful: it gives you a plain-English description of what changed and why it matters before diving into line-level comments. For teams where reviewers aren't always familiar with the full context of a change, this framing helps.&lt;/p&gt;

&lt;p&gt;Detection-wise, it was strong on security (caught 4/5) and logic errors (5/6) but weak on race conditions: it caught one of four, and the three it missed were the genuinely subtle ones involving async patterns. The six false positives were more annoying than Copilot's, including two suggestions to add docstrings to private helper functions that are explicitly excluded from our documentation standards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cursor with Claude Backend&lt;br&gt;
Bugs caught: 15/23 (65%) | False positives: 3 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor uses Claude under the hood for its review capabilities, and the difference in reasoning quality shows on the complex bugs. It caught both of the nested conditional logic errors that Copilot missed. The explanation it provided for the race condition it identified was the most accurate of any tool: it correctly described the exact timing window that would cause the issue rather than giving a generic "potential race condition" warning.&lt;/p&gt;

&lt;p&gt;The three false positives were all genuinely borderline, two cases where there was a reasonable argument for the suggestion and one where it flagged an intentional pattern as a potential issue. Lowest false positive rate of the ten tools tested.&lt;/p&gt;

&lt;p&gt;For teams already in the Cursor workflow, the review capability is a meaningful addition without requiring a separate tool evaluation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Sourcegraph Cody&lt;br&gt;
Bugs caught: 14/23 (61%) | False positives: 5 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cody's strength is codebase context. Because it indexes your entire repository, its suggestions account for patterns elsewhere in the codebase in a way that prompt-based tools can't. It caught a bug that no other tool identified: a type error that only manifested because of a pattern established in a utility function in a completely separate module. The cross-file reasoning was genuinely impressive.&lt;/p&gt;

&lt;p&gt;Where it fell short was on security vulnerabilities: it caught 3/5, missing the insecure random number generation and one of the authentication issues. The false positives skewed toward over-eager suggestions to refactor code that was functioning correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. DeepCode (Snyk Code)&lt;br&gt;
Bugs caught: 14/23 (61%) | False positives: 8 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DeepCode is the security specialist of the group and it shows. It caught all five security vulnerabilities, matching Copilot, and provided the most detailed remediation guidance of any tool. The SQL injection finding came with a code example showing the parameterised query pattern, the affected line and a link to the relevant CWE entry. For a security-focused review, this depth is valuable.&lt;/p&gt;
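&lt;p&gt;The remediation it pointed to is the standard parameterised-query fix, which looks roughly like this (a generic sqlite3 sketch, not the audited code): string formatting lets attacker-controlled quotes rewrite the query, while a bound parameter is treated purely as data.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

user_input = "nobody' OR '1'='1"

# Vulnerable: user input is interpolated straight into the SQL text,
# so the quote characters become part of the query itself.
query = f"SELECT id FROM users WHERE name = '{user_input}'"
rows_vulnerable = conn.execute(query).fetchall()   # matches every row

# Safe: the driver binds the value as a parameter; the quotes are data.
rows_safe = conn.execute(
    "SELECT id FROM users WHERE name = ?", (user_input,)
).fetchall()   # matches nothing: no user is literally named that
```

The same placeholder-binding idea applies across drivers and ORMs, though the placeholder syntax varies.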

&lt;p&gt;The eight false positives were the highest of the group, and several were security warnings on code patterns that were safe in context but matched patterns that are sometimes unsafe. This is the fundamental tension in security static analysis (specificity vs. sensitivity), and DeepCode errs toward sensitivity. For a security audit that's the right call. For a daily development workflow it generates review fatigue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Amazon CodeGuru&lt;br&gt;
Bugs caught: 13/23 (57%) | False positives: 5 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeGuru's strength is performance analysis, and it earned that reputation in this test. It caught all four performance issues, including one N+1 pattern that involved an ORM relationship that wasn't immediately obvious, and its performance recommendations were the most actionable of any tool. The estimated latency impact it provides alongside performance suggestions is a feature I haven't seen elsewhere.&lt;/p&gt;

&lt;p&gt;The trade-off is coverage breadth. It missed three of the security vulnerabilities and two of the race conditions. For teams where performance is the primary concern, it's excellent. As a general-purpose review tool, the gaps are significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Tabnine Enterprise&lt;br&gt;
Bugs caught: 12/23 (52%) | False positives: 4 | Speed: Fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tabnine's review capability has improved significantly in the enterprise version, but it still feels more like an enhanced linter than a reasoning engine. It caught logic errors and obvious security issues reliably, but the subtle bugs (the race conditions, the cross-module type error) went undetected. The false positive rate was reasonable and the suggestions were concise. For teams that want a lightweight tool that won't generate review noise, it's a reasonable choice at its price point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. SonarQube with AI Extensions&lt;br&gt;
Bugs caught: 12/23 (52%) | False positives: 11 | Speed: Slow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SonarQube is the incumbent in this space and the AI extensions add genuine capability over the rule-based baseline. But the false positive rate, eleven in this test, reflects an architecture that was built around rule matching and retrofitted with AI analysis rather than built AI-first. The combination produces both sets of false positives. For teams already invested in the SonarQube ecosystem, the AI extensions are worth enabling. For teams evaluating from scratch, the newer tools are cleaner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Qodo (formerly CodiumAI)&lt;br&gt;
Bugs caught: 11/23 (48%) | False positives: 4 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Qodo's differentiation is test generation: it's primarily a tool for suggesting and generating tests, with code review as a secondary capability. Evaluated purely on bug detection it lands at 48%, but that undersells what it actually does well. The tests it suggested for the functions containing bugs would have caught six of the bugs I planted; in a sense, its indirect bug detection via test generation is more valuable than its direct review flagging. A different way of thinking about the same problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10. CodeClimate with AI&lt;br&gt;
Bugs caught: 9/23 (39%) | False positives: 6 | Speed: Medium&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CodeClimate's AI integration is the thinnest of the group. The core product is a maintainability and test coverage tool and the AI layer adds pattern-based review that doesn't match the reasoning quality of the AI-first tools. It caught the obvious logic errors and one security issue but missed everything in the subtle categories. Useful for maintainability metrics, not the right tool if bug detection is the primary goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Summary Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cgv1ah0eso9expcu2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cgv1ah0eso9expcu2t.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What the Data Actually Tells You&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three patterns are worth pulling out of this before you make a decision.&lt;/p&gt;

&lt;p&gt;No single tool caught everything. The union of bugs caught across all tools was 21 of 23; the two remaining bugs (both complex race conditions) weren't caught by any tool in default configuration. Human review is still part of the stack. These tools raise the floor; they don't replace the ceiling.&lt;/p&gt;

&lt;p&gt;False positive rate matters as much as detection rate. A tool that catches 70% of bugs but generates 20 false positives per PR will get disabled by your team within a month. Review fatigue is real. The tools that have invested in reducing false positives (Cursor, Copilot, Tabnine) show it in their adoption numbers for a reason.&lt;/p&gt;
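&lt;p&gt;One quick way to put detection rate and false positives on the same footing is precision vs. recall. A sketch using the per-tool counts quoted above (my framing, not a metric the test itself reports):&lt;/p&gt;

```python
# Precision/recall from the counts reported above (23 planted bugs total).
stats = {
    "CodeGuru":    {"caught": 13, "false_pos": 5},
    "Tabnine":     {"caught": 12, "false_pos": 4},
    "SonarQube":   {"caught": 12, "false_pos": 11},
    "Qodo":        {"caught": 11, "false_pos": 4},
    "CodeClimate": {"caught": 9,  "false_pos": 6},
}

TOTAL_BUGS = 23

for name, s in stats.items():
    # Precision: of everything the tool flagged, how much was a real bug?
    precision = s["caught"] / (s["caught"] + s["false_pos"])
    # Recall: of the planted bugs, how many did the tool flag?
    recall = s["caught"] / TOTAL_BUGS
    print(f"{name:12s} precision={precision:.0%} recall={recall:.0%}")
```

&lt;p&gt;Of these five, Tabnine comes out with the best precision (75%), which lines up with the lightweight, low-noise read above, while SonarQube's eleven false positives drag its precision down to roughly 52% despite a matching detection count.&lt;/p&gt;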

&lt;p&gt;Security and performance specialists are genuinely worth it for those domains. If your threat model makes security review critical, running DeepCode alongside a general tool is worth the overlap. If performance regressions are your primary concern, CodeGuru's analysis depth justifies it. The specialists outperform the generalists in their specific domain.&lt;/p&gt;

&lt;p&gt;For the full breakdown including pricing, team size recommendations and integration complexity for each tool, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/top-ai-code-review-tools/" rel="noopener noreferrer"&gt;top AI code review tools&lt;/a&gt;&lt;/strong&gt; comparison from Dextra Labs covers what a single article can't. If you're specifically evaluating AI-native editors rather than standalone review tools, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-code-alternatives-for-developers/" rel="noopener noreferrer"&gt;Claude Code alternatives for developers&lt;/a&gt;&lt;/strong&gt; guide covers that adjacent decision in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;My Current Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After running this evaluation, our team settled on Copilot for inline review (it's already in the IDE and the detection rate justifies the subscription), with DeepCode running on the CI pipeline specifically for security-focused PRs touching authentication, data handling, or external API integration. The combination covers the security-specialist gap that Copilot has without adding review noise to every PR.&lt;/p&gt;

&lt;p&gt;Cursor is on trial for two engineers who do the most complex backend work. The reasoning quality on subtle bugs is noticeably better and the false positive rate is the best of anything I tested. Broader rollout decision pending.&lt;/p&gt;

&lt;p&gt;The bug that made it to production three months ago? I planted its pattern in the test set. Copilot caught it. DeepCode caught it. Cursor caught it with the most accurate explanation of why it was dangerous.&lt;/p&gt;

&lt;p&gt;We would have saved an incident retrospective and a very uncomfortable all-hands if we'd had any of these running at the time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're evaluating which tools fit your stack, Dextra Labs compiled a detailed comparison with pricing, integration complexity and feature breakdowns for each tool in this list, including team size recommendations and procurement guidance.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Build an AI Agent from Scratch Using Claude API (With Full Code)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sun, 22 Mar 2026 07:28:12 +0000</pubDate>
      <link>https://dev.to/dextralabs/how-to-build-an-ai-agent-from-scratch-using-claude-api-with-full-code-4b40</link>
      <guid>https://dev.to/dextralabs/how-to-build-an-ai-agent-from-scratch-using-claude-api-with-full-code-4b40</guid>
      <description>&lt;p&gt;I've built a lot of AI demos that looked impressive in a notebook and fell apart in production. The usual culprit? Treating an LLM like a search engine, one prompt in, one answer out, instead of what it actually is: a reasoning engine you can wire into real workflows.&lt;/p&gt;

&lt;p&gt;This tutorial is about doing it properly. We're going to build a functional AI agent using Anthropic's Claude API from the ground up, not a wrapper around a framework, but the actual mechanics: a ReAct loop, custom tool use, and a structure you can actually deploy. By the end you'll have running code and a mental model that makes every agent tutorial after this one make sense.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What We're Actually Building&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The agent we're building will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept a user query&lt;/li&gt;
&lt;li&gt;Decide which tools it needs to answer&lt;/li&gt;
&lt;li&gt;Call those tools, observe the results&lt;/li&gt;
&lt;li&gt;Reason over the results and either call more tools or return a final answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is called ReAct (Reasoning + Acting). It's the backbone of most production agents and it maps cleanly onto how Claude's tool use API works.&lt;/p&gt;
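&lt;p&gt;Here is the whole loop compressed into one sketch before we build each piece. The client, tool schemas and dispatcher are injected as parameters so the loop itself stays easy to test; the full versions are defined in the steps below.&lt;/p&gt;

```python
# Compressed sketch of the ReAct loop: ask Claude, run any tools it
# requests, feed the results back, repeat until it answers in text.
# `client` is an anthropic.Anthropic instance, `tools` the schema list,
# `execute_tool` the dispatcher -- all injected rather than global.
def react_loop(client, user_query, tools, execute_tool,
               model="claude-sonnet-4-5", max_iterations=10):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_iterations):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages,
        )
        if response.stop_reason != "tool_use":
            # No tool requested: the first text block is the final answer
            return response.content[0].text
        # Record the assistant turn, run each requested tool,
        # and send the results back as tool_result blocks
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": execute_tool(block.name, block.input)}
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "Stopped: hit the iteration limit without a final answer."
```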

&lt;h2&gt;
  
  
  &lt;strong&gt;Prerequisites&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install anthropic python-dotenv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You'll need a Claude API key from console.anthropic.com. Store it safely:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;# .env&lt;br&gt;
ANTHROPIC_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Basic Claude API Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before building the agent, let's confirm you can talk to Claude.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;ask_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="c1"&gt;# Quick test
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;ask_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is 2 + 2? Answer in one word.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the foundation. If this runs cleanly, you're ready to build on it. For a deeper breakdown of model selection and API parameters, the &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/claude-opus-vs-sonnet-vs-haiku/" rel="noopener noreferrer"&gt;Claude API tutorial&lt;/a&gt;&lt;/strong&gt; from Dextra Labs is worth reading before you go further.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Define Your Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Tools are the agent's hands. Without them, Claude can only reason; it can't act. We'll define three tools that our agent can use: a calculator, a web search simulator, and a file writer.&lt;br&gt;
In Claude's API, tools are defined as JSON schemas. Claude reads these schemas and decides when and how to call them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performs basic arithmetic. Use this for any math operations.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Math expression to evaluate, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15 * 24 + 100&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Searches the web for current information on a topic.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The search query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;save_to_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saves text content to a local file.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's write the actual Python functions that execute when Claude calls these tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Safe eval for math expressions
&lt;/span&gt;        &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__dict__&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
                   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__builtins__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}},&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# In production, wire this to SerpAPI, Tavily, or Brave Search
&lt;/span&gt;    &lt;span class="c1"&gt;# Simulated response for tutorial purposes
&lt;/span&gt;    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search results for &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Simulated] Top result: Relevant information about &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from authoritative sources. Published 2025.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;save_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Successfully saved to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error saving file: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;Tool&lt;/span&gt; &lt;span class="n"&gt;dispatcher&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;calculator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expression&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;web_search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;web_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tool_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;save_to_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;save_to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tool_input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dispatcher is intentionally simple here. In production you'd want a registry pattern, but for learning, explicit is better than clever.&lt;/p&gt;
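&lt;p&gt;For reference, that registry pattern can be sketched in a few lines. This is a hypothetical refactor, not part of the tutorial repo; the &lt;code&gt;calculator&lt;/code&gt; and &lt;code&gt;web_search&lt;/code&gt; stand-ins below just mimic the tools defined earlier. Adding a tool then means adding one dict entry instead of another &lt;code&gt;elif&lt;/code&gt; branch.&lt;/p&gt;

```python
# Hypothetical registry-pattern version of the dispatcher above.
# The two tool functions are simplified stand-ins for illustration.

def calculator(expression: str) -> str:
    # Illustration only -- never eval() untrusted input in production.
    return str(eval(expression))

def web_search(query: str) -> str:
    return f"Results for: {query}"

# Each entry adapts the tool_input dict to the tool's own signature.
TOOL_REGISTRY = {
    "calculator": lambda args: calculator(args["expression"]),
    "web_search": lambda args: web_search(args["query"]),
}

def execute_tool(tool_name: str, tool_input: dict) -> str:
    handler = TOOL_REGISTRY.get(tool_name)
    if handler is None:
        return f"Unknown tool: {tool_name}"
    return handler(tool_input)
```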

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Build the ReAct Agent Loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the core of the tutorial. The ReAct loop works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send the user query + available tools to Claude&lt;/li&gt;
&lt;li&gt;Claude either returns a final answer OR a tool call request&lt;/li&gt;
&lt;li&gt;If tool call → execute it, send result back to Claude&lt;/li&gt;
&lt;li&gt;Repeat until Claude returns a final answer
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AI agent with access to tools.
    Think step by step. Use tools when you need real data or calculations.
    When you have enough information, provide a clear final answer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;[Iteration &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;iteration&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Call Claude with tools
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stop reason: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# If Claude is done reasoning, return the final answer
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_turn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;final_answer&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Final Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;final_answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_answer&lt;/span&gt;

        &lt;span class="c1"&gt;# If Claude wants to use tools
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Add Claude's response to message history
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Process each tool call
&lt;/span&gt;            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Tool: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Input: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="c1"&gt;# Execute the tool
&lt;/span&gt;                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Send tool results back to Claude
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max iterations reached without a final answer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight here is the message history. Every tool call and every result gets appended to &lt;code&gt;messages&lt;/code&gt;, so Claude always has full context of what it has already tried. This is what separates a stateful agent from a stateless chatbot.&lt;/p&gt;
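&lt;p&gt;To make that concrete, here is roughly what &lt;code&gt;messages&lt;/code&gt; looks like after one tool round. The plain dicts and the &lt;code&gt;toolu_01&lt;/code&gt; id are illustrative stand-ins for the SDK's content-block objects, but the alternating role structure and the &lt;code&gt;tool_use_id&lt;/code&gt; linkage match what the loop above builds.&lt;/p&gt;

```python
# Illustrative shape of the messages list after one tool-use round.
# Dicts and the "toolu_01" id stand in for the SDK's content blocks.
messages = [
    {"role": "user", "content": "What is 17% of 2,450?"},
    {"role": "assistant", "content": [          # Claude's tool call
        {"type": "tool_use", "id": "toolu_01", "name": "calculator",
         "input": {"expression": "2450 * 0.17"}},
    ]},
    {"role": "user", "content": [               # our tool result
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": "416.5"},
    ]},
]
# Roles alternate user / assistant / user, and tool_use_id ties each
# result back to its call, so the next request carries full context.
```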

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Run It&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Test 1: Math + file output
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calculate compound interest on $10,000 at 7% for 10 years, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;then save the result to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;investment.txt&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test 2: Research + synthesis
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search for information about RAG architecture &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and summarize the key components.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;Test&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Multi&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="n"&gt;reasoning&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the square root of 144 multiplied by the number of days in a leap year?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this and watch the agent reason through each step in your terminal. The iteration logs show you exactly how Claude decides which tool to call and when to stop.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Adding Memory (The Production Upgrade)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The agent above is stateless; each &lt;code&gt;run_agent&lt;/code&gt; call starts fresh. For real applications you need conversation memory. Here's a minimal implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;python&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentWithMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Add user message to history
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant with memory of our conversation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Handle tool use within persistent history
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stop_reason&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;tool_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;tool_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_use_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_results&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="c1"&gt;# Recursive call to get final answer
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;assistant_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;assistant_message&lt;/span&gt;

&lt;span class="c1"&gt;## **Usage**
&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentWithMemory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My budget is $50,000. Calculate 7% annual return over 5 years.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Now do the same calculation but for 10 years.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;Claude&lt;/span&gt; &lt;span class="n"&gt;remembers&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;000&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;first&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;conversation_history&lt;/code&gt; list is doing all the heavy lifting here. In production you'd persist this to Redis or a database between sessions.&lt;/p&gt;
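&lt;p&gt;A minimal persistence sketch, using a JSON file so it stays stdlib-only; in production you'd point the same save/load pair at Redis or a database. The file path is an assumption, and note that SDK content-block objects would need converting to plain dicts (e.g. via their &lt;code&gt;model_dump()&lt;/code&gt; method) before serializing.&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical location; swap for a Redis key or DB row in production.
HISTORY_PATH = Path("conversation_history.json")

def save_history(history: list) -> None:
    # Works when history holds plain dicts and strings; SDK content
    # blocks must be converted to dicts first.
    HISTORY_PATH.write_text(json.dumps(history))

def load_history() -> list:
    # Return the saved history, or an empty list for a fresh session.
    if HISTORY_PATH.exists():
        return json.loads(HISTORY_PATH.read_text())
    return []
```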

&lt;h2&gt;
  
  
  &lt;strong&gt;What to Build Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once this is running, the natural next steps are:&lt;br&gt;
&lt;strong&gt;Streaming responses&lt;/strong&gt; — use &lt;code&gt;client.messages.stream()&lt;/code&gt; for real-time output in web apps. &lt;br&gt;
&lt;strong&gt;Error handling and retries&lt;/strong&gt; — wrap tool calls in try/except with exponential backoff. &lt;br&gt;
&lt;strong&gt;Async execution&lt;/strong&gt; — parallel tool calls with &lt;code&gt;asyncio&lt;/code&gt; cut latency significantly on multi-tool queries. &lt;br&gt;
&lt;strong&gt;Structured outputs&lt;/strong&gt; — use Pydantic models to enforce tool input/output schemas.&lt;/p&gt;
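&lt;p&gt;The retry idea above can be sketched as a small wrapper. The attempt count and delays are arbitrary defaults, not values from the tutorial repo; tune them to your API's rate limits.&lt;/p&gt;

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying on exception with exponential backoff.

    Delays grow as base_delay * 2**attempt; both numbers are
    illustrative defaults.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage sketch, wrapping a tool call from the agent loop:
#   result = with_retries(lambda: execute_tool(block.name, block.input))
```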

&lt;p&gt;For the full architecture patterns and production deployment strategies, &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs published an in-depth guide on Claude AI agents architecture and deployment&lt;/a&gt;&lt;/strong&gt; covering containerization, monitoring, and scaling patterns beyond what fits in a single tutorial.&lt;br&gt;
The full repo for this tutorial is available at: &lt;code&gt;github.com/dextralabs/claude-agent-tutorial&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Recap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;What you just built is a genuine ReAct agent: not a chatbot with a system prompt, but a reasoning loop that can call real functions, observe results, and chain multiple steps together. The same pattern powers production agents handling customer support, code review, document analysis, and research workflows at scale.&lt;br&gt;
The code here is intentionally minimal. Strip away the frameworks and this is what's underneath all of them.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>duedilligence</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Continuous Refactoring with LLMs: Patterns That Work in Production</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 28 Feb 2026 17:42:54 +0000</pubDate>
      <link>https://dev.to/dextralabs/continuous-refactoring-with-llms-patterns-that-work-in-production-136e</link>
      <guid>https://dev.to/dextralabs/continuous-refactoring-with-llms-patterns-that-work-in-production-136e</guid>
      <description>&lt;p&gt;Large Language Models are no longer prototypes running in notebooks.&lt;br&gt;
They’re running in production systems that serve thousands (sometimes millions) of users.&lt;/p&gt;

&lt;p&gt;And that changes everything.&lt;/p&gt;

&lt;p&gt;If you’re working on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM engineering&lt;/li&gt;
&lt;li&gt;RAG pipeline optimization&lt;/li&gt;
&lt;li&gt;AI agents orchestration&lt;/li&gt;
&lt;li&gt;Enterprise AI architecture&lt;/li&gt;
&lt;li&gt;AI code review automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then one truth becomes painfully clear:&lt;/p&gt;

&lt;p&gt;Shipping once is easy. Maintaining and refactoring continuously is hard.&lt;/p&gt;

&lt;p&gt;This blog breaks down &lt;strong&gt;battle-tested patterns&lt;/strong&gt; for continuous refactoring with LLM systems, patterns that actually work in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Continuous Refactoring is Mandatory in LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logic is deterministic&lt;/li&gt;
&lt;li&gt;Behavior is testable&lt;/li&gt;
&lt;li&gt;Refactors are structural&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Behavior is probabilistic&lt;/li&gt;
&lt;li&gt;Prompts change output drastically&lt;/li&gt;
&lt;li&gt;Data drift changes performance&lt;/li&gt;
&lt;li&gt;Model updates break assumptions&lt;/li&gt;
&lt;li&gt;Latency and cost fluctuate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM systems behave more like &lt;strong&gt;living organisms&lt;/strong&gt; than static software.&lt;/p&gt;

&lt;p&gt;So your architecture must evolve continuously.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 1: Treat Prompts as First-Class Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One of the biggest anti-patterns in LLM engineering:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;prompt = "Answer the question politely."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That’s not engineering. That’s chaos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version prompts in Git&lt;/li&gt;
&lt;li&gt;Add prompt tests&lt;/li&gt;
&lt;li&gt;Use prompt linting&lt;/li&gt;
&lt;li&gt;Maintain a changelog&lt;/li&gt;
&lt;li&gt;Measure output drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prompt Refactoring Framework:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Refactor Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;System Prompt&lt;/td&gt;
&lt;td&gt;Stability + constraints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context Injection&lt;/td&gt;
&lt;td&gt;Reduce noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Few-shot Examples&lt;/td&gt;
&lt;td&gt;Optimize token efficiency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Formatting&lt;/td&gt;
&lt;td&gt;Enforce structured JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tip: Treat prompt updates like schema migrations, not casual edits.&lt;/p&gt;
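&lt;p&gt;A toy version of "prompts as first-class code", versioned entries plus a tiny lint that enforces the constraints from the table above. The prompt names, versions, and lint rules are all made up for illustration:&lt;/p&gt;

```python
# Prompts versioned like code: every change gets a new version,
# and a small lint enforces output-format and length constraints.
PROMPTS = {
    ("summarizer", "1.0.0"): "Summarize the text in 3 bullet points.",
    ("summarizer", "1.1.0"): (
        "Summarize the text in exactly 3 bullet points. "
        'Respond only with valid JSON: {"bullets": [...]}'
    ),
}

def get_prompt(name, version):
    return PROMPTS[(name, version)]

def lint_prompt(text):
    issues = []
    if "JSON" not in text:
        issues.append("no structured output format enforced")
    if len(text) > 2000:
        issues.append("prompt too long; trim few-shot examples")
    return issues

# v1.0.0 fails the lint; v1.1.0 passes:
print(lint_prompt(get_prompt("summarizer", "1.0.0")))
print(lint_prompt(get_prompt("summarizer", "1.1.0")))  # []
```

&lt;p&gt;Keep the dict in Git, run the lint in CI, and a prompt change gets reviewed exactly like a schema migration.&lt;/p&gt;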

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 2: RAG Pipeline Refactoring Through Observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Your RAG pipeline is not “set and forget.”&lt;/p&gt;

&lt;p&gt;It degrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Production Issues&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval irrelevance&lt;/li&gt;
&lt;li&gt;Embedding drift&lt;/li&gt;
&lt;li&gt;Chunking inefficiency&lt;/li&gt;
&lt;li&gt;Over-tokenization&lt;/li&gt;
&lt;li&gt;Context dilution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Refactor Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Add Retrieval Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top-K relevance score&lt;/li&gt;
&lt;li&gt;MRR (Mean Reciprocal Rank)&lt;/li&gt;
&lt;li&gt;Query → chunk match rate&lt;/li&gt;
&lt;/ul&gt;
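&lt;p&gt;MRR, mentioned above, is a one-function metric. A minimal implementation, assuming you log one ranked list of chunk ids per query and know which chunks are relevant:&lt;/p&gt;

```python
def mean_reciprocal_rank(results, relevant):
    """MRR: average over queries of 1/rank of the first relevant chunk.

    results: one ranked list of chunk ids per query.
    relevant: one set of relevant chunk ids per query, aligned with results.
    """
    total = 0.0
    for ranked, good in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in good:
                total += 1.0 / rank
                break
    return total / len(results)

# Query 1 hits at rank 1, query 2 at rank 2: (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # 0.75
```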

&lt;p&gt;&lt;strong&gt;2. Continuous Chunk Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic chunk size testing&lt;/li&gt;
&lt;li&gt;Metadata enrichment refactors&lt;/li&gt;
&lt;li&gt;Query intent classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Retrieval A/B Testing&lt;/strong&gt;&lt;br&gt;
Split traffic between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dense-only&lt;/li&gt;
&lt;li&gt;Hybrid search&lt;/li&gt;
&lt;li&gt;Re-ranking model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pro Tip: A RAG pipeline is a product, not an integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 3: Refactoring AI Agents (Without Breaking Them)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI agents are seductive.&lt;br&gt;
But production agents are fragile.&lt;/p&gt;

&lt;p&gt;When scaling AI agents, refactoring means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reducing hallucinated tool calls&lt;/li&gt;
&lt;li&gt;Improving tool selection accuracy&lt;/li&gt;
&lt;li&gt;Capping execution loops&lt;/li&gt;
&lt;li&gt;Preventing infinite recursion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production-Grade Agent Refactor Checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool call validation layer&lt;/li&gt;
&lt;li&gt;Execution timeout guard&lt;/li&gt;
&lt;li&gt;Retry with structured fallback&lt;/li&gt;
&lt;li&gt;Deterministic planning phase&lt;/li&gt;
&lt;li&gt;Logging full thought chains (internally only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In enterprise AI architecture, agents should:&lt;/p&gt;

&lt;p&gt;Plan deterministically.&lt;br&gt;
Execute probabilistically.&lt;br&gt;
Validate strictly.&lt;/p&gt;

&lt;p&gt;That separation alone reduces failure rates dramatically.&lt;/p&gt;
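&lt;p&gt;The timeout guard and loop cap from the checklist fit in a few lines. A minimal sketch, where &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are stand-ins for your executor and output checker:&lt;/p&gt;

```python
import time

def run_agent(step, validate, max_steps=5, timeout_s=10.0):
    """Bounded agent loop: caps iterations and wall-clock time, and only
    returns state that passes strict validation."""
    start = time.monotonic()
    state = None
    for _ in range(max_steps):
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("agent exceeded its time budget")
        state = step(state)
        if validate(state):
            return state
    raise RuntimeError("agent hit max_steps without a valid result")

# Toy run: the 'agent' increments a counter until validation passes.
result = run_agent(step=lambda s: (s or 0) + 1, validate=lambda s: s >= 3)
print(result)  # 3
```

&lt;p&gt;The point is structural: the loop can never run forever, and nothing unvalidated ever escapes it.&lt;/p&gt;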

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 4: Enterprise AI Architecture Requires Modular LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In early-stage systems, everything talks to the LLM directly.&lt;/p&gt;

&lt;p&gt;In production? That becomes a nightmare.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Refactor: Layered AI Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client Layer
   ↓
Orchestration Layer
   ↓
LLM Abstraction Layer
   ↓
Retrieval Layer
   ↓
Observability &amp;amp; Evaluation Layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because this enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model switching&lt;/li&gt;
&lt;li&gt;Provider abstraction&lt;/li&gt;
&lt;li&gt;Cost optimization&lt;/li&gt;
&lt;li&gt;Prompt version control&lt;/li&gt;
&lt;li&gt;Centralized monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where &lt;strong&gt;LLM engineering becomes real software engineering&lt;/strong&gt;.&lt;/p&gt;
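&lt;p&gt;The abstraction layer is the piece worth sketching. One interface in front of every provider, so model switching is a config change; the provider classes below are illustrative stubs, not real SDK calls:&lt;/p&gt;

```python
# Callers never import a vendor SDK directly; they depend on LLMClient.
class FakeClaude:
    def complete(self, prompt):
        return "[claude] " + prompt

class FakeGPT:
    def complete(self, prompt):
        return "[gpt] " + prompt

PROVIDERS = {"claude": FakeClaude, "gpt": FakeGPT}

class LLMClient:
    def __init__(self, provider):
        self._impl = PROVIDERS[provider]()

    def complete(self, prompt):
        # Central choke point: add logging, cost tracking,
        # and prompt version tags here.
        return self._impl.complete(prompt)

print(LLMClient("claude").complete("hello"))  # [claude] hello
print(LLMClient("gpt").complete("hello"))     # [gpt] hello
```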

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 5: AI Code Review with LLMs (That Developers Trust)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI code review tools are everywhere.&lt;/p&gt;

&lt;p&gt;Most fail because they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Over-comment&lt;/li&gt;
&lt;li&gt;Suggest trivial refactors&lt;/li&gt;
&lt;li&gt;Ignore project conventions&lt;/li&gt;
&lt;li&gt;Lack context awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production Refactor Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provide repository-wide context&lt;/li&gt;
&lt;li&gt;Inject style guide automatically&lt;/li&gt;
&lt;li&gt;Limit comments to risk-based review&lt;/li&gt;
&lt;li&gt;Add confidence scoring&lt;/li&gt;
&lt;li&gt;Learn from developer overrides&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The secret?&lt;/p&gt;

&lt;p&gt;AI code review must behave like a senior engineer, not a linter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 6: Continuous Evaluation Pipelines&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're not measuring, you're guessing.&lt;/p&gt;

&lt;p&gt;Modern LLM systems need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic evaluation datasets&lt;/li&gt;
&lt;li&gt;Golden response tracking&lt;/li&gt;
&lt;li&gt;Drift detection&lt;/li&gt;
&lt;li&gt;Latency benchmarking&lt;/li&gt;
&lt;li&gt;Cost regression alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build an LLM CI/CD Loop&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prompt Change →&lt;br&gt;
Offline Evaluation →&lt;br&gt;
Shadow Deployment →&lt;br&gt;
Live Monitoring →&lt;br&gt;
Auto Rollback if Degraded&lt;/p&gt;

&lt;p&gt;This is DevOps for AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pattern 7: Cost-Aware Refactoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM systems are expensive if left unoptimized.&lt;/p&gt;

&lt;p&gt;Refactor targets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Over-context injection&lt;/li&gt;
&lt;li&gt;Redundant summarization steps&lt;/li&gt;
&lt;li&gt;Multi-model routing inefficiencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Introduce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart model routing (small model → large model fallback)&lt;/li&gt;
&lt;li&gt;Response caching&lt;/li&gt;
&lt;li&gt;Embedding reuse&lt;/li&gt;
&lt;li&gt;Adaptive context window trimming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost optimization is architecture, not finance.&lt;/p&gt;
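&lt;p&gt;Response caching is the cheapest of these wins. A minimal sketch keyed on (model, prompt hash); &lt;code&gt;backend&lt;/code&gt; is a stand-in for a real client call:&lt;/p&gt;

```python
import hashlib

class CachingLLM:
    """Response cache keyed on (model, prompt hash): identical queries
    skip the API entirely."""

    def __init__(self, backend):
        self.backend = backend
        self.cache = {}
        self.api_calls = 0

    def complete(self, model, prompt):
        key = model + ":" + hashlib.sha256(prompt.encode()).hexdigest()
        if key not in self.cache:
            self.api_calls += 1
            self.cache[key] = self.backend(model, prompt)
        return self.cache[key]

llm = CachingLLM(lambda model, prompt: "answer to " + prompt)
llm.complete("small-model", "What is RAG?")
llm.complete("small-model", "What is RAG?")  # served from cache
print(llm.api_calls)  # 1
```

&lt;p&gt;In production the dict becomes Redis with a TTL, and the same key scheme supports semantic caching if you hash an embedding bucket instead of the raw prompt.&lt;/p&gt;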

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Refactoring Anti-Patterns&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Blind model upgrades&lt;/li&gt;
&lt;li&gt;Increasing context instead of fixing retrieval&lt;/li&gt;
&lt;li&gt;Ignoring evaluation data&lt;/li&gt;
&lt;li&gt;Treating hallucination as unavoidable&lt;/li&gt;
&lt;li&gt;Shipping without observability&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Enterprise Perspective&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In enterprise environments, continuous refactoring becomes even more critical because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compliance constraints evolve&lt;/li&gt;
&lt;li&gt;Data sources change&lt;/li&gt;
&lt;li&gt;Governance policies tighten&lt;/li&gt;
&lt;li&gt;Security reviews require traceability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where companies often bring in specialists.&lt;/p&gt;

&lt;p&gt;For example, firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt;, AI consulting &amp;amp; LLM engineering experts, help enterprises design scalable &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;, production-grade &lt;strong&gt;RAG pipelines&lt;/strong&gt;, and robust &lt;strong&gt;AI agents&lt;/strong&gt; with continuous evaluation baked in from day one.&lt;/p&gt;

&lt;p&gt;Rather than just building demos, they focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-term &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-evaluation/" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;&lt;/strong&gt; system stability&lt;/li&gt;
&lt;li&gt;Refactor-friendly architectures&lt;/li&gt;
&lt;li&gt;AI governance alignment&lt;/li&gt;
&lt;li&gt;Measurable &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/corporate-real-estate-ai-pilots-are-exploding-so-why-is-roi-still-missing/" rel="noopener noreferrer"&gt;ROI&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because production AI is not a hackathon project.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Future: Self-Refactoring LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re already seeing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI that rewrites its own prompts&lt;/li&gt;
&lt;li&gt;Agents that optimize retrieval&lt;/li&gt;
&lt;li&gt;LLM-based AI code review systems refactoring pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But until that becomes reliable, humans must design:&lt;/p&gt;

&lt;p&gt;Refactorable-by-default LLM systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Production Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before you scale your LLM system, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is prompt versioning implemented?&lt;/li&gt;
&lt;li&gt;Do we measure retrieval performance?&lt;/li&gt;
&lt;li&gt;Can we switch models safely?&lt;/li&gt;
&lt;li&gt;Are agents bounded and validated?&lt;/li&gt;
&lt;li&gt;Do we run continuous evaluation?&lt;/li&gt;
&lt;li&gt;Is cost observable in real time?&lt;/li&gt;
&lt;li&gt;Is architecture modular?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If not, refactor before you scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Closing Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Continuous refactoring with LLMs isn’t optional.&lt;/p&gt;

&lt;p&gt;It’s the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A flashy demo&lt;/li&gt;
&lt;li&gt;And a sustainable AI product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As LLM engineering matures, the teams that win won’t be the ones who ship first.&lt;/p&gt;

&lt;p&gt;They’ll be the ones who refactor continuously.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>RAG Projects That Teach You Real Retrieval Engineering (Not Just Prompt Hacking)</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 28 Feb 2026 06:55:01 +0000</pubDate>
      <link>https://dev.to/dextralabs/rag-projects-that-teach-you-real-retrieval-engineering-not-just-prompt-hacking-2007</link>
      <guid>https://dev.to/dextralabs/rag-projects-that-teach-you-real-retrieval-engineering-not-just-prompt-hacking-2007</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Because building LLM apps isn’t about clever prompts anymore, it’s about engineering robust RAG pipelines.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most tutorials show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load documents&lt;/li&gt;
&lt;li&gt;Embed them&lt;/li&gt;
&lt;li&gt;Store in a vector DB&lt;/li&gt;
&lt;li&gt;Ask GPT a question&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And boom. “You built RAG!”&lt;/p&gt;

&lt;p&gt;But in real-world &lt;strong&gt;LLM systems&lt;/strong&gt;, that’s barely step one.&lt;/p&gt;

&lt;p&gt;Production-grade Retrieval-Augmented Generation requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query rewriting&lt;/li&gt;
&lt;li&gt;Chunking strategies&lt;/li&gt;
&lt;li&gt;Hybrid search&lt;/li&gt;
&lt;li&gt;Reranking&lt;/li&gt;
&lt;li&gt;Evaluation pipelines&lt;/li&gt;
&lt;li&gt;Guardrails&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s &lt;a href="https://dextralabs.com/blog/best-open-source-llm-model/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;, not copy-paste coding.&lt;/p&gt;

&lt;p&gt;If you want to build serious &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;, you need projects that simulate production realities.&lt;/p&gt;

&lt;p&gt;Let’s fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7 RAG Projects That Teach Real Retrieval Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each project below escalates your understanding from beginner to advanced &lt;strong&gt;RAG pipeline design&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Build a “Why Did It Answer That?” Debuggable RAG System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval transparency&lt;/li&gt;
&lt;li&gt;Embedding diagnostics&lt;/li&gt;
&lt;li&gt;Similarity score interpretation&lt;/li&gt;
&lt;li&gt;Prompt trace logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build It&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-to-context-engineering-for-agentic-ai/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt;&lt;/strong&gt; app that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shows top-k retrieved chunks&lt;/li&gt;
&lt;li&gt;Displays similarity scores&lt;/li&gt;
&lt;li&gt;Logs prompt + retrieved context&lt;/li&gt;
&lt;li&gt;Highlights hallucinated spans&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding comparison experiments&lt;/li&gt;
&lt;li&gt;Chunk-size A/B testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Observability&lt;/strong&gt; in &lt;strong&gt;LLM systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most enterprise teams fail because they cannot debug retrieval failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Hybrid Search RAG (Vector + BM25)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse vs dense retrieval&lt;/li&gt;
&lt;li&gt;Keyword fallback&lt;/li&gt;
&lt;li&gt;Search fusion strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ElasticSearch BM25&lt;/li&gt;
&lt;li&gt;Vector DB (Pinecone / Weaviate / FAISS)&lt;/li&gt;
&lt;li&gt;Reciprocal rank fusion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because vector search alone fails when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact terms matter&lt;/li&gt;
&lt;li&gt;Legal clauses require precision&lt;/li&gt;
&lt;li&gt;Code snippets depend on syntax&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Search engineering inside AI systems&lt;/strong&gt;&lt;/p&gt;
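&lt;p&gt;The fusion step above is simpler than it sounds. Reciprocal rank fusion merges the BM25 and vector rankings with one formula:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. BM25 and dense retrieval) by summing
    1 / (k + rank) per document. k=60 is the commonly used constant."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc3", "doc1", "doc2"]   # keyword ranking
dense = ["doc1", "doc4", "doc3"]  # vector ranking
print(reciprocal_rank_fusion([bm25, dense]))
# ['doc1', 'doc3', 'doc4', 'doc2']
```

&lt;p&gt;Documents that rank well in either retriever float to the top, which is exactly the keyword-fallback behaviour hybrid search needs.&lt;/p&gt;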

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Enterprise Policy Copilot (Access-Controlled RAG)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-tenant architecture&lt;/li&gt;
&lt;li&gt;Metadata filtering&lt;/li&gt;
&lt;li&gt;Role-based retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HR policy assistant&lt;/li&gt;
&lt;li&gt;Department-level filtering&lt;/li&gt;
&lt;li&gt;Row-level access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT-auth metadata filters&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;li&gt;Retrieval tracking per user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Enterprise AI architecture&lt;/strong&gt; fundamentals&lt;/p&gt;
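&lt;p&gt;The core idea, filtering by metadata &lt;em&gt;before&lt;/em&gt; similarity ranking, fits in a short sketch. The corpus, department names, and roles below are made up for illustration:&lt;/p&gt;

```python
# Filter chunks by access metadata first, so a user can never retrieve
# documents outside their scope, regardless of similarity scores.
CHUNKS = [
    {"id": 1, "text": "PTO policy...", "dept": "hr", "roles": {"employee", "hr"}},
    {"id": 2, "text": "Salary bands...", "dept": "hr", "roles": {"hr"}},
    {"id": 3, "text": "Deploy guide...", "dept": "eng", "roles": {"employee"}},
]

def retrieve(query, user_roles, dept=None):
    # In production these filters go into the vector DB query itself
    # (metadata filters), not a post-hoc Python loop.
    allowed = [
        c for c in CHUNKS
        if c["roles"].intersection(user_roles)
        and (dept is None or c["dept"] == dept)
    ]
    # Similarity scoring would rank 'allowed' here; we return ids for brevity.
    return [c["id"] for c in allowed]

print(retrieve("leave policy", {"employee"}, dept="hr"))  # [1]
print(retrieve("salary bands", {"hr"}))                   # [1, 2]
```

&lt;p&gt;The roles come from the user's JWT; logging each (user, query, chunk ids) tuple gives you the audit trail.&lt;/p&gt;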

&lt;p&gt;This is where many startups collapse: they forget security in LLM engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. AI Code Review Assistant (Context-Aware RAG)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code chunking strategies&lt;/li&gt;
&lt;li&gt;AST-based splitting&lt;/li&gt;
&lt;li&gt;Dependency graph retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Build&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub PR analyzer&lt;/li&gt;
&lt;li&gt;Retrieve related files&lt;/li&gt;
&lt;li&gt;Inject historical bug patterns&lt;/li&gt;
&lt;li&gt;Suggest refactors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enhance with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vectorizing commit history&lt;/li&gt;
&lt;li&gt;Indexing architecture docs&lt;/li&gt;
&lt;li&gt;Linking code comments to test coverage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;AI code review systems&lt;/strong&gt; at scale&lt;/p&gt;

&lt;p&gt;This is the difference between a toy bot and a real engineering assistant.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Query-Rewriting RAG with an Agent Loop&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/blog/what-is-ai-agent-orchestration/" rel="noopener noreferrer"&gt;AI agents orchestration&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Self-reflection&lt;/li&gt;
&lt;li&gt;Iterative retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implement:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User question&lt;/li&gt;
&lt;li&gt;LLM rewrites query&lt;/li&gt;
&lt;li&gt;Retrieval step&lt;/li&gt;
&lt;li&gt;Rerank&lt;/li&gt;
&lt;li&gt;If low confidence → retry &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query decomposition&lt;/li&gt;
&lt;li&gt;Tool-based retrieval routing&lt;/li&gt;
&lt;li&gt;Multi-hop reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;AI agents + RAG pipeline fusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern LLM systems don’t retrieve once. They retrieve strategically.&lt;/p&gt;
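&lt;p&gt;The five steps above reduce to one loop. A minimal sketch where every callable is a stand-in for an LLM or retriever call:&lt;/p&gt;

```python
def answer(question, rewrite, retrieve, rerank, confidence, max_rounds=3):
    """Rewrite, retrieve, rerank; retry with a fresh rewrite
    while confidence stays low."""
    query = question
    best = []
    for round_n in range(max_rounds):
        query = rewrite(query, round_n)   # step 2
        best = rerank(retrieve(query))    # steps 3-4
        if confidence(best) >= 0.7:       # step 5: stop, else retry
            break
    return best

docs = answer(
    "leave rules?",
    rewrite=lambda q, n: q + " (attempt " + str(n) + ")",
    retrieve=lambda q: ["chunk-a", "chunk-b"],
    rerank=lambda chunks: chunks,
    confidence=lambda chunks: 0.9,
)
print(docs)  # ['chunk-a', 'chunk-b']
```

&lt;p&gt;Query decomposition and routing slot into the &lt;code&gt;rewrite&lt;/code&gt; and &lt;code&gt;retrieve&lt;/code&gt; hooks without touching the loop.&lt;/p&gt;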

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Evaluation-First RAG System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval metrics (Recall@k, MRR)&lt;/li&gt;
&lt;li&gt;LLM evaluation loops&lt;/li&gt;
&lt;li&gt;Hallucination scoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ground-truth QA dataset&lt;/li&gt;
&lt;li&gt;Automatic scoring&lt;/li&gt;
&lt;li&gt;Retrieval accuracy dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per query&lt;/li&gt;
&lt;li&gt;Token usage&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Retrieval hit rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: &lt;strong&gt;Production-grade LLM engineering&lt;/strong&gt; mindset&lt;/p&gt;

&lt;p&gt;If you’re not measuring, you’re guessing.&lt;/p&gt;
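&lt;p&gt;Recall@k, the first metric listed above, is a good place to start because it is trivial to compute from logged retrievals:&lt;/p&gt;

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]).intersection(relevant))
    return hits / len(relevant)

# 2 of the 3 relevant chunks appear in the top 5:
print(recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5))
```

&lt;p&gt;Run it over your ground-truth QA dataset on every pipeline change and the dashboard falls out for free.&lt;/p&gt;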

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Multi-Modal RAG (Documents + Tables + Images)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What You Learn&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured retrieval&lt;/li&gt;
&lt;li&gt;Table-aware chunking&lt;/li&gt;
&lt;li&gt;Image embedding indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial report assistant&lt;/li&gt;
&lt;li&gt;Retrieve charts&lt;/li&gt;
&lt;li&gt;Interpret tables&lt;/li&gt;
&lt;li&gt;Answer cross-document questions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OCR ingestion&lt;/li&gt;
&lt;li&gt;Structured metadata&lt;/li&gt;
&lt;li&gt;Query routing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real skill gained: Next-gen &lt;strong&gt;enterprise AI systems&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Real Retrieval Engineering Actually Looks Like&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the mental model shift:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Toy RAG&lt;/th&gt;
&lt;th&gt;Real RAG Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embed + store&lt;/td&gt;
&lt;td&gt;Chunk strategy experiments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-k retrieval&lt;/td&gt;
&lt;td&gt;Reranking + fusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One prompt&lt;/td&gt;
&lt;td&gt;Agent loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No logging&lt;/td&gt;
&lt;td&gt;Full observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No metrics&lt;/td&gt;
&lt;td&gt;Retrieval evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No auth&lt;/td&gt;
&lt;td&gt;Enterprise-grade security&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you want to work in serious &lt;strong&gt;LLM engineering roles&lt;/strong&gt;, you must understand this difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The RAG Pipeline Blueprint (Production Version)&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
   ↓
Query Rewriting Agent
   ↓
Retriever Router (Vector / BM25 / Graph)
   ↓
Hybrid Retrieval
   ↓
Reranker
   ↓
Context Compression
   ↓
LLM Generation
   ↓
Evaluation &amp;amp; Logging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s not a tutorial project.&lt;/p&gt;

&lt;p&gt;That’s a system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where Most Companies Need Help&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In practice, enterprises struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scaling RAG across millions of documents&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;li&gt;Security compliance&lt;/li&gt;
&lt;li&gt;Hallucination mitigation&lt;/li&gt;
&lt;li&gt;AI code review automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where specialized AI consulting becomes critical.&lt;/p&gt;

&lt;p&gt;Teams working on advanced &lt;strong&gt;LLM systems and enterprise AI architecture&lt;/strong&gt; often partner with firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt;, an AI consulting company focused on production-grade LLM engineering and scalable RAG pipeline design, to avoid costly architectural mistakes early.&lt;/p&gt;

&lt;p&gt;Because rewriting your AI architecture six months later is far more expensive than designing it correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced Extensions (If You Want to Stand Out)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you really want to differentiate yourself in LLM engineering interviews:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement a reranker (Cross-Encoder)&lt;/li&gt;
&lt;li&gt;Add semantic caching&lt;/li&gt;
&lt;li&gt;Build a retrieval benchmarking harness&lt;/li&gt;
&lt;li&gt;Add synthetic query generation&lt;/li&gt;
&lt;li&gt;Build a hallucination classifier&lt;/li&gt;
&lt;li&gt;Implement graph-based RAG&lt;/li&gt;
&lt;li&gt;Add streaming retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thought: RAG Is Search Engineering in Disguise&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Retrieval-Augmented Generation is not about adding context.&lt;/p&gt;

&lt;p&gt;It’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information retrieval science&lt;/li&gt;
&lt;li&gt;Distributed systems&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Security&lt;/li&gt;
&lt;li&gt;Agent orchestration&lt;/li&gt;
&lt;li&gt;Cost optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The future of AI agents and enterprise AI architecture depends on engineers who understand this deeply.&lt;/p&gt;

&lt;p&gt;Build these projects.&lt;/p&gt;

&lt;p&gt;Break them.&lt;/p&gt;

&lt;p&gt;Measure them.&lt;/p&gt;

&lt;p&gt;Optimize them.&lt;/p&gt;

&lt;p&gt;That’s real retrieval engineering.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Agentic AI Architecture: From CLI Tools to Enterprise Systems</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Fri, 27 Feb 2026 19:33:08 +0000</pubDate>
      <link>https://dev.to/dextralabs/agentic-ai-architecture-from-cli-tools-to-enterprise-systems-9p</link>
      <guid>https://dev.to/dextralabs/agentic-ai-architecture-from-cli-tools-to-enterprise-systems-9p</guid>
      <description>&lt;p&gt;&lt;em&gt;The era of AI-native software isn’t coming.&lt;br&gt;
It’s already here.&lt;br&gt;
And it’s agentic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;From scrappy CLI copilots to fully autonomous enterprise workflows, &lt;strong&gt;AI agents are reshaping software architecture itself&lt;/strong&gt;. But building reliable, scalable, production-grade &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/best-llm-models/" rel="noopener noreferrer"&gt;LLM&lt;/a&gt;&lt;/strong&gt; systems isn’t just about plugging in an API key.&lt;/p&gt;

&lt;p&gt;It’s about architecture.&lt;/p&gt;

&lt;p&gt;In this deep dive, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/agentic-ai/" rel="noopener noreferrer"&gt;Agentic AI Architecture&lt;/a&gt;&lt;/strong&gt; really means&lt;/li&gt;
&lt;li&gt;How we move from CLI tools → production systems&lt;/li&gt;
&lt;li&gt;How to design scalable RAG pipelines&lt;/li&gt;
&lt;li&gt;What enterprise AI architecture looks like&lt;/li&gt;
&lt;li&gt;Why AI code review and governance matter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And how AI consulting firms like Dextra Labs help companies operationalize this shift.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is Agentic AI Architecture?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Agentic AI architecture refers to systems where LLM-powered agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perceive context&lt;/li&gt;
&lt;li&gt;Reason over goals&lt;/li&gt;
&lt;li&gt;Take actions via tools&lt;/li&gt;
&lt;li&gt;Learn from feedback&lt;/li&gt;
&lt;li&gt;Coordinate across systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional ML pipelines, agentic systems are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional ML&lt;/th&gt;
&lt;th&gt;Agentic AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Static models&lt;/td&gt;
&lt;td&gt;Dynamic agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single prediction&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No tool use&lt;/td&gt;
&lt;td&gt;Tool invocation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch inference&lt;/td&gt;
&lt;td&gt;Interactive execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Isolated outputs&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In short:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs generate text. AI agents execute intent.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 1: The CLI AI Tool Era&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We all started here.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI CLI wrappers&lt;/li&gt;
&lt;li&gt;Git commit summarizers&lt;/li&gt;
&lt;li&gt;Local RAG search tools&lt;/li&gt;
&lt;li&gt;Terminal copilots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools typically include:&lt;/p&gt;

&lt;p&gt;User → Prompt → LLM API → Output → Done&lt;/p&gt;

&lt;p&gt;They’re powerful — but limited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No persistent memory&lt;/li&gt;
&lt;li&gt;No multi-step planning&lt;/li&gt;
&lt;li&gt;No tool orchestration&lt;/li&gt;
&lt;li&gt;No observability&lt;/li&gt;
&lt;li&gt;No governance&lt;/li&gt;
&lt;li&gt;No enterprise guardrails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many startups stop.&lt;/p&gt;

&lt;p&gt;But enterprises can’t.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 2: RAG Pipelines — The First Leap&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To build production-grade LLM systems, we need structured retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Modern RAG Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document ingestion&lt;/li&gt;
&lt;li&gt;Chunking + embedding&lt;/li&gt;
&lt;li&gt;Vector storage&lt;/li&gt;
&lt;li&gt;Retrieval&lt;/li&gt;
&lt;li&gt;Prompt augmentation&lt;/li&gt;
&lt;li&gt;LLM response generation&lt;/li&gt;
&lt;/ol&gt;
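&lt;p&gt;The six steps above can be sketched end to end in a few lines. This is a toy illustration: the bag-of-words embed() stands in for a real embedding model, a plain list stands in for a vector database, and all names are hypothetical.&lt;/p&gt;

```python
# Toy end-to-end RAG pipeline covering the six steps above.
import math
from collections import Counter

def embed(text: str) -> Counter:             # 2. chunking + embedding (toy)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["refunds are processed within 14 days",   # 1. document ingestion
        "api keys rotate every 90 days"]
index = [(d, embed(d)) for d in docs]             # 3. vector storage

def retrieve(query: str, k: int = 1) -> list:     # 4. retrieval
    qv = embed(query)
    ranked = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(query: str) -> str:              # 5. prompt augmentation
    context = "\n".join(retrieve(query))
    return "Context:\n" + context + "\n\nQuestion: " + query  # 6. sent to the LLM
```

&lt;p&gt;In production, the list index becomes a vector database, embed() becomes a trained embedding model, and chunking splits documents before indexing, but the data flow stays the same.&lt;/p&gt;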

&lt;p&gt;But in enterprise AI architecture, that’s just the beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced RAG Engineering Includes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid search (vector + keyword)&lt;/li&gt;
&lt;li&gt;Query rewriting&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;Multi-hop retrieval&lt;/li&gt;
&lt;li&gt;Caching layers&lt;/li&gt;
&lt;li&gt;Guardrail filters&lt;/li&gt;
&lt;li&gt;Evaluation frameworks&lt;/li&gt;
&lt;/ul&gt;
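&lt;p&gt;Hybrid search, the first item above, is worth a sketch: blend a keyword-overlap score with a similarity score and rank on the weighted sum. The 50/50 weighting and the toy scoring functions are assumptions, not any specific engine’s formula.&lt;/p&gt;

```python
# Toy hybrid search: weighted blend of keyword and "vector" similarity.
def keyword_score(query: str, doc: str) -> float:
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q.intersection(d)) / len(q) if q else 0.0

def vector_score(query: str, doc: str) -> float:
    # character-overlap stand-in for cosine similarity over real embeddings
    q, d = set(query.lower()), set(doc.lower())
    return len(q.intersection(d)) / len(q.union(d)) if q.union(d) else 0.0

def hybrid_rank(query: str, docs: list, alpha: float = 0.5) -> list:
    scored = [(alpha * keyword_score(query, d)
               + (1 - alpha) * vector_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]
```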

&lt;p&gt;This is where LLM engineering becomes a discipline, not just prompt tinkering.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Phase 3: True AI Agents&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we go beyond retrieval.&lt;/p&gt;

&lt;p&gt;Agentic systems add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory layers&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Reflection loops&lt;/li&gt;
&lt;li&gt;Task decomposition&lt;/li&gt;
&lt;li&gt;Monitoring and evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request
   ↓
Planner Agent
   ↓
Tool Executor Agent
   ↓
Knowledge Agent (RAG)
   ↓
Validator Agent
   ↓
Final Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
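&lt;p&gt;Compressed into code, the chain above might look like this, where each “agent” is just a function; in a real system each would wrap its own LLM call, tools, and retrieval context. All names here are illustrative.&lt;/p&gt;

```python
# Minimal sketch of the Planner -> Executor -> Knowledge -> Validator chain.
def planner(request: str) -> list:
    return ["look up: " + request, "draft answer for: " + request]

def tool_executor(step: str) -> str:
    return "result(" + step + ")"          # would call real tools/APIs

def knowledge_agent(results: list) -> str:
    return " | ".join(results)             # would be RAG-grounded context

def validator(draft: str) -> str:
    assert draft, "validator rejects empty drafts"
    return draft

def handle(request: str) -> str:
    steps = planner(request)
    results = [tool_executor(s) for s in steps]
    draft = knowledge_agent(results)
    return validator(draft)                # final response
```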



&lt;p&gt;You’re no longer building a chatbot.&lt;/p&gt;

&lt;p&gt;You’re building a distributed reasoning system.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Enterprise AI Architecture: What Changes?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When scaling from startup to enterprise:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII masking&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;li&gt;Data isolation&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt tracking&lt;/li&gt;
&lt;li&gt;Hallucination detection&lt;/li&gt;
&lt;li&gt;Agent performance scoring&lt;/li&gt;
&lt;li&gt;Feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model routing (open-source vs proprietary)&lt;/li&gt;
&lt;li&gt;Latency optimization&lt;/li&gt;
&lt;li&gt;Cost governance&lt;/li&gt;
&lt;li&gt;Horizontal scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Synthetic testing&lt;/li&gt;
&lt;li&gt;Prompt regression tests&lt;/li&gt;
&lt;li&gt;AI code review pipelines&lt;/li&gt;
&lt;li&gt;Human-in-the-loop validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where most AI-native startups struggle.&lt;/p&gt;

&lt;p&gt;Because building cool demos ≠ building reliable systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AI Code Review in the Agentic Era&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One under-discussed layer: AI reviewing AI.&lt;/p&gt;

&lt;p&gt;AI code review in modern LLM systems can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze prompt logic&lt;/li&gt;
&lt;li&gt;Detect tool misuse&lt;/li&gt;
&lt;li&gt;Identify unsafe agent actions&lt;/li&gt;
&lt;li&gt;Score hallucination risk&lt;/li&gt;
&lt;li&gt;Evaluate RAG quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprise AI architecture increasingly includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated LLM regression testing&lt;/li&gt;
&lt;li&gt;Model diff comparisons&lt;/li&gt;
&lt;li&gt;Prompt version control&lt;/li&gt;
&lt;li&gt;Behavior monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture now looks more like DevOps + MLOps + AgentOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The New LLM Systems Stack&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s break it down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Foundation Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM providers (OpenAI, Anthropic, open models)&lt;/li&gt;
&lt;li&gt;Embedding models&lt;/li&gt;
&lt;li&gt;Vector DBs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Orchestration Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent frameworks&lt;/li&gt;
&lt;li&gt;Tool registries&lt;/li&gt;
&lt;li&gt;Memory stores&lt;/li&gt;
&lt;li&gt;Workflow engines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Governance Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Safety filters&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;li&gt;Prompt versioning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Evaluation Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offline eval datasets&lt;/li&gt;
&lt;li&gt;LLM-as-judge scoring&lt;/li&gt;
&lt;li&gt;AI code review agents&lt;/li&gt;
&lt;li&gt;Monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Application Layer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support agents&lt;/li&gt;
&lt;li&gt;Internal copilots&lt;/li&gt;
&lt;li&gt;Sales automation&lt;/li&gt;
&lt;li&gt;Knowledge assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer introduces complexity.&lt;/p&gt;

&lt;p&gt;And that’s where strategic AI consulting becomes critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How AI-Native Companies Get This Wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Common mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treating LLM engineering as prompt engineering&lt;/li&gt;
&lt;li&gt;Ignoring &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-pipeline-explained-diagram-implementation/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; optimization&lt;/li&gt;
&lt;li&gt;No evaluation framework&lt;/li&gt;
&lt;li&gt;No model routing strategy&lt;/li&gt;
&lt;li&gt;No enterprise AI architecture design&lt;/li&gt;
&lt;li&gt;No human oversight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations in production&lt;/li&gt;
&lt;li&gt;Cost explosions&lt;/li&gt;
&lt;li&gt;Security risks&lt;/li&gt;
&lt;li&gt;Broken automation loops&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building It Right: The Dextra Labs Approach&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; step in.&lt;/p&gt;

&lt;p&gt;Instead of building surface-level AI features, Dextra Labs focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production-grade LLM engineering&lt;/li&gt;
&lt;li&gt;Robust RAG pipeline design&lt;/li&gt;
&lt;li&gt;Multi-agent system architecture&lt;/li&gt;
&lt;li&gt;AI code review systems&lt;/li&gt;
&lt;li&gt;Enterprise AI architecture modernization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They specialize in helping companies transition:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;From AI experiments → to AI-native operating systems.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you’re exploring large-scale LLM systems, a structured architecture roadmap is critical.&lt;/p&gt;

&lt;p&gt;You can explore &lt;strong&gt;enterprise AI architecture consulting or LLM engineering services&lt;/strong&gt; to understand how production AI systems are built with scalability, governance, and evaluation in mind.&lt;/p&gt;

&lt;p&gt;For organizations modernizing internal tools, &lt;strong&gt;AI code review automation&lt;/strong&gt; and &lt;strong&gt;RAG pipeline optimization&lt;/strong&gt; are increasingly becoming competitive advantages.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Software Due Diligence in the Age of AI-Native Companies&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're an investor or acquirer, your due diligence checklist must evolve.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;“Do they use AI?”&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You should ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How is their RAG pipeline designed?&lt;/li&gt;
&lt;li&gt;Do they have hallucination evaluation?&lt;/li&gt;
&lt;li&gt;What is their agent orchestration model?&lt;/li&gt;
&lt;li&gt;How do they handle model versioning?&lt;/li&gt;
&lt;li&gt;Is AI code review part of CI/CD?&lt;/li&gt;
&lt;li&gt;What is their enterprise AI architecture maturity?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI-native companies are being valued not just on features, &lt;br&gt;
but on architectural robustness.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Future: Autonomous Enterprise Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re heading toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Self-healing AI workflows&lt;/li&gt;
&lt;li&gt;Cross-department agent collaboration&lt;/li&gt;
&lt;li&gt;Autonomous internal copilots&lt;/li&gt;
&lt;li&gt;AI-native ERP overlays&lt;/li&gt;
&lt;li&gt;Dynamic reasoning systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic AI architecture won’t be optional.&lt;/p&gt;

&lt;p&gt;It will be foundational.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift from CLI AI tools to enterprise agentic systems is not incremental.&lt;/p&gt;

&lt;p&gt;It’s architectural.&lt;/p&gt;

&lt;p&gt;If you’re building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI copilots&lt;/li&gt;
&lt;li&gt;Internal knowledge assistants&lt;/li&gt;
&lt;li&gt;AI workflow automation&lt;/li&gt;
&lt;li&gt;Multi-agent systems&lt;/li&gt;
&lt;li&gt;LLM-powered SaaS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You are no longer just building features.&lt;/p&gt;

&lt;p&gt;You’re designing intelligence infrastructure.&lt;/p&gt;

&lt;p&gt;And that demands serious LLM engineering, scalable RAG pipelines, evaluation frameworks, and enterprise AI architecture thinking.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Building AI Code Review Systems That Developers Trust</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:50:02 +0000</pubDate>
      <link>https://dev.to/dextralabs/building-ai-code-review-systems-that-developers-trust-6mh</link>
      <guid>https://dev.to/dextralabs/building-ai-code-review-systems-that-developers-trust-6mh</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Because shipping AI reviewers is easy.&lt;br&gt;
Earning developer trust? That’s the real engineering challenge.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Modern teams are experimenting with AI code review, from inline suggestions to autonomous pull request analysis.&lt;/p&gt;

&lt;p&gt;But here’s the truth:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Developers don’t trust AI just because it’s “powered by GPT.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Trust is built through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable behavior&lt;/li&gt;
&lt;li&gt;Context awareness&lt;/li&gt;
&lt;li&gt;Transparent reasoning&lt;/li&gt;
&lt;li&gt;Low hallucination rates&lt;/li&gt;
&lt;li&gt;Clear boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog, we’ll break down &lt;strong&gt;how to design production-grade AI code review systems&lt;/strong&gt; that developers rely on, not ignore.&lt;/p&gt;

&lt;p&gt;Let’s build this the right way.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Why AI Code Review Often Fails&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we design trust, let’s diagnose failure.&lt;/p&gt;

&lt;p&gt;Most early AI reviewers fail because they:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack repository context&lt;/li&gt;
&lt;li&gt;Ignore project coding standards&lt;/li&gt;
&lt;li&gt;Hallucinate vulnerabilities&lt;/li&gt;
&lt;li&gt;Suggest outdated patterns&lt;/li&gt;
&lt;li&gt;Don’t explain reasoning&lt;/li&gt;
&lt;li&gt;Over-comment trivial issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Developers quickly learn to mute them.&lt;/p&gt;

&lt;p&gt;The problem isn’t the model.&lt;/p&gt;

&lt;p&gt;It’s poor &lt;strong&gt;LLM engineering&lt;/strong&gt; and weak &lt;strong&gt;enterprise AI architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Architecture of a Trustworthy AI Code Review System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s zoom out and look at a robust system design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core Components&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM (reasoning engine)&lt;/li&gt;
&lt;li&gt;RAG pipeline for repository grounding&lt;/li&gt;
&lt;li&gt;Static analysis integration&lt;/li&gt;
&lt;li&gt;Policy engine (team rules)&lt;/li&gt;
&lt;li&gt;Feedback learning loop&lt;/li&gt;
&lt;li&gt;Explainability layer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn’t just “call an API and hope.”&lt;/p&gt;

&lt;p&gt;It’s a structured &lt;strong&gt;LLM system&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Step 1: Ground the Model with a RAG Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Raw LLMs don’t know your:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal libraries&lt;/li&gt;
&lt;li&gt;Coding guidelines&lt;/li&gt;
&lt;li&gt;Architecture decisions&lt;/li&gt;
&lt;li&gt;Security policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s where a &lt;strong&gt;RAG pipeline&lt;/strong&gt; changes everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works in Code Review&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developer opens PR&lt;/li&gt;
&lt;li&gt;Changed files are chunked&lt;/li&gt;
&lt;li&gt;Related files are retrieved&lt;/li&gt;
&lt;li&gt;Relevant documentation is fetched&lt;/li&gt;
&lt;li&gt;Context is embedded and passed to LLM&lt;/li&gt;
&lt;/ol&gt;
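&lt;p&gt;Steps 3 through 5 can be sketched with a crude stand-in for retrieval: treat files that share an import with the changed file as “related” and inject them into the review prompt. The tiny repo dict and every name here are hypothetical.&lt;/p&gt;

```python
# Toy repository-grounding step for a code review prompt.
repo = {
    "services/payment.ts": "import { wrapAsync } from '../utils/async'",
    "services/refund.ts":  "import { wrapAsync } from '../utils/async'",
    "ui/button.tsx":       "import React from 'react'",
}

def imports_of(src: str) -> set:
    # quoted module specifiers only, so shared keywords don't match everything
    return {tok for tok in src.split() if tok.startswith("'")}

def related_files(changed: str) -> list:          # crude retrieval stand-in
    target = imports_of(repo[changed])
    return [path for path, src in repo.items()
            if path != changed and imports_of(src).intersection(target)]

def review_prompt(changed: str, diff: str) -> str:  # context injection
    context = "\n".join("// " + p + "\n" + repo[p] for p in related_files(changed))
    return "Team context:\n" + context + "\n\nReview this diff to " + changed + ":\n" + diff
```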

&lt;p&gt;Instead of generic advice:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Consider improving performance”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;p&gt;“In /services/payment.ts, we standardize async error handling with wrapAsync(). This PR uses a try/catch block directly; consider aligning with the team pattern.”&lt;/p&gt;

&lt;p&gt;That’s trust.&lt;/p&gt;

&lt;p&gt;Because it’s grounded.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. AI Agents vs Single LLM Calls&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you want serious results, don’t rely on one-shot prompts.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;AI agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Agent Roles in Code Review&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security Agent&lt;/li&gt;
&lt;li&gt;Performance Agent&lt;/li&gt;
&lt;li&gt;Style &amp;amp; Convention Agent&lt;/li&gt;
&lt;li&gt;Test Coverage Agent&lt;/li&gt;
&lt;li&gt;Architecture Consistency Agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has its own system prompt&lt;/li&gt;
&lt;li&gt;Pulls different retrieval context&lt;/li&gt;
&lt;li&gt;Applies specialized reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then results are merged intelligently.&lt;/p&gt;
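&lt;p&gt;“Merged intelligently” can be as simple as deduplicating findings per location and keeping the highest-severity report, so developers see one comment per issue. A sketch under assumed field names (file, line, rule, severity):&lt;/p&gt;

```python
# Merge findings from multiple specialized agents, one comment per issue.
def merge_findings(agent_outputs: list) -> list:
    best = {}
    for findings in agent_outputs:
        for f in findings:
            key = (f["file"], f["line"], f["rule"])   # dedup key
            if key not in best or f["severity"] > best[key]["severity"]:
                best[key] = f                          # keep strongest report
    return sorted(best.values(), key=lambda f: -f["severity"])
```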

&lt;p&gt;This modular design improves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Precision&lt;/li&gt;
&lt;li&gt;Explainability&lt;/li&gt;
&lt;li&gt;Maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is modern &lt;strong&gt;LLM engineering&lt;/strong&gt; in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Enterprise AI Architecture Considerations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're building for real organizations (not hackathons), you must consider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Compliance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code never leaves VPC&lt;/li&gt;
&lt;li&gt;On-prem or private model deployment&lt;/li&gt;
&lt;li&gt;Encrypted embedding stores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;False positives&lt;/li&gt;
&lt;li&gt;Acceptance rate&lt;/li&gt;
&lt;li&gt;Developer overrides&lt;/li&gt;
&lt;li&gt;Hallucination patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loops&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let developers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept suggestions&lt;/li&gt;
&lt;li&gt;Reject with reason&lt;/li&gt;
&lt;li&gt;Rate quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This data retrains prompt strategies and fine-tunes models.&lt;/p&gt;

&lt;p&gt;Without feedback loops?&lt;br&gt;
Trust erodes fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. Measuring Trust (Yes, It’s Measurable)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can quantify trust using:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Suggestion Acceptance Rate&lt;/td&gt;
&lt;td&gt;Real usefulness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Override Frequency&lt;/td&gt;
&lt;td&gt;Noise level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-to-Merge Reduction&lt;/td&gt;
&lt;td&gt;Productivity gain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Sentiment&lt;/td&gt;
&lt;td&gt;Qualitative trust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucination Incidents&lt;/td&gt;
&lt;td&gt;System reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
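&lt;p&gt;The first two metrics fall out of simple event counting. A sketch, assuming review events are logged as dicts with an action field (a hypothetical schema):&lt;/p&gt;

```python
# Compute acceptance and override rates from raw review events.
def trust_metrics(events: list) -> dict:
    total = len(events)
    accepted = sum(e["action"] == "accepted" for e in events)
    overridden = sum(e["action"] == "overridden" for e in events)
    return {
        "acceptance_rate": accepted / total if total else 0.0,
        "override_rate": overridden / total if total else 0.0,
    }
```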

&lt;p&gt;If acceptance is below 30%,&lt;br&gt;
you don’t have AI.&lt;br&gt;
You have spam.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. Common Mistakes in AI Code Review Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s save you months of pain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: No Contextual Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Invest in strong RAG pipeline design&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Overly Generic Prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Role-specific agent prompts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: No Guardrails&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Combine static analysis + LLM reasoning&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: No Human Override&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fix → Make AI assistive, not authoritative&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Real-World Insight: What Works in Production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Teams that successfully deploy AI code review systems usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with one focused problem (e.g., security scanning)&lt;/li&gt;
&lt;li&gt;Build modular agents&lt;/li&gt;
&lt;li&gt;Integrate static analyzers (ESLint, SonarQube, etc.)&lt;/li&gt;
&lt;li&gt;Keep human reviewers in the loop&lt;/li&gt;
&lt;li&gt;Continuously refine retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where experienced AI consulting makes a difference.&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; works with engineering teams to design production-grade AI systems, from robust &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/ai-agents-llm-rag-agentic-workflows/" rel="noopener noreferrer"&gt;LLM systems&lt;/a&gt;&lt;/strong&gt; to scalable &lt;strong&gt;&lt;a href="https://dextralabs.com/case-studies/scalable-ai-agent-architecture-dextralabs/" rel="noopener noreferrer"&gt;enterprise AI architecture&lt;/a&gt;&lt;/strong&gt;, ensuring models are grounded, secure, and actually trusted by developers.&lt;/p&gt;

&lt;p&gt;Instead of just adding AI as a feature, they focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval optimization&lt;/li&gt;
&lt;li&gt;Agent orchestration&lt;/li&gt;
&lt;li&gt;Secure deployment pipelines&lt;/li&gt;
&lt;li&gt;Governance layers for enterprise compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because in real organizations, architecture matters more than hype.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. Interactive Checklist: Is Your AI Reviewer Trustworthy?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Answer honestly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it retrieve relevant repo context?&lt;/li&gt;
&lt;li&gt;Does it explain why a suggestion is made?&lt;/li&gt;
&lt;li&gt;Can developers give feedback?&lt;/li&gt;
&lt;li&gt;Are hallucinations tracked?&lt;/li&gt;
&lt;li&gt;Are different review concerns separated into agents?&lt;/li&gt;
&lt;li&gt;Is data secured within enterprise boundaries?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you checked fewer than 4…&lt;/p&gt;

&lt;p&gt;You’re experimenting, not engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;10. The Future of AI Code Review&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We’re moving toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Autonomous PR summaries&lt;/li&gt;
&lt;li&gt;Risk scoring per change&lt;/li&gt;
&lt;li&gt;Intelligent reviewer assignment&lt;/li&gt;
&lt;li&gt;AI-generated test cases&lt;/li&gt;
&lt;li&gt;Architecture drift detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next wave won’t be “AI that comments.”&lt;/p&gt;

&lt;p&gt;It will be &lt;strong&gt;AI agents collaborating with developers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That shift requires disciplined &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/prompt-engineering-for-llm/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;&lt;/strong&gt;, thoughtful &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/multimodal-rag-at-scale-enterprise-ai/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; design, and strong &lt;strong&gt;enterprise AI architecture&lt;/strong&gt; foundations.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Developers don’t trust AI because it’s intelligent.&lt;/p&gt;

&lt;p&gt;They trust it because it’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context-aware&lt;/li&gt;
&lt;li&gt;Predictable&lt;/li&gt;
&lt;li&gt;Transparent&lt;/li&gt;
&lt;li&gt;Measurable&lt;/li&gt;
&lt;li&gt;Secure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building an &lt;strong&gt;AI code review system&lt;/strong&gt; is not a prompt problem.&lt;/p&gt;

&lt;p&gt;It’s a systems engineering problem.&lt;/p&gt;

&lt;p&gt;And when done right?&lt;/p&gt;

&lt;p&gt;It becomes a force multiplier for engineering velocity.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How LLM Memory Actually Works in Production Systems</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Sat, 21 Feb 2026 20:53:20 +0000</pubDate>
      <link>https://dev.to/dextralabs/how-llm-memory-actually-works-in-production-systems-549d</link>
      <guid>https://dev.to/dextralabs/how-llm-memory-actually-works-in-production-systems-549d</guid>
      <description>&lt;p&gt;If you think LLMs "remember" things like humans do…&lt;br&gt;
 you're about to discover what really happens behind the scenes.&lt;/p&gt;

&lt;p&gt;Large Language Models feel intelligent. They reference context. They recall prior inputs. They adapt to tasks.&lt;/p&gt;

&lt;p&gt;But here's the truth:&lt;br&gt;
&lt;strong&gt;&lt;em&gt;LLMs don’t have memory. Systems do.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And understanding that difference is what separates hobby projects from production-grade &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-embeddings-feature-engineering/" rel="noopener noreferrer"&gt;LLM engineering&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s break it down interactively.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;First: Do LLMs Actually Have Memory?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Short answer?&lt;br&gt;
 No.&lt;/p&gt;

&lt;p&gt;A base model like GPT or LLaMA:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Doesn’t store conversations permanently&lt;/li&gt;
&lt;li&gt;Doesn’t update its weights per user interaction&lt;/li&gt;
&lt;li&gt;Doesn’t "remember" you tomorrow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it does have is:&lt;/p&gt;

&lt;p&gt;✔ A context window&lt;br&gt;
✔ Token prediction capability&lt;br&gt;
✔ Statistical pattern recognition&lt;/p&gt;

&lt;p&gt;Everything else?&lt;br&gt;
 That’s system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Illusion of Memory in LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When your chatbot remembers user preferences or your AI assistant recalls company policies…&lt;/p&gt;

&lt;p&gt;That’s not the model.&lt;/p&gt;

&lt;p&gt;That’s architecture.&lt;/p&gt;

&lt;p&gt;Modern LLM systems simulate memory using external components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector databases&lt;/li&gt;
&lt;li&gt;Session stores&lt;/li&gt;
&lt;li&gt;Retrieval layers&lt;/li&gt;
&lt;li&gt;Knowledge graphs&lt;/li&gt;
&lt;li&gt;Tool-use frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where production engineering begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The 4 Types of Memory in Production LLM Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s simplify what’s happening under the hood.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1.  Short-Term Memory (Context Window)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the simplest form.&lt;/p&gt;

&lt;p&gt;The model sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your current prompt&lt;/li&gt;
&lt;li&gt;Previous messages in the thread&lt;/li&gt;
&lt;li&gt;Any injected system instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-bound&lt;/li&gt;
&lt;li&gt;Expensive at scale&lt;/li&gt;
&lt;li&gt;Resets when conversation ends&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not durable memory.&lt;/p&gt;
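&lt;p&gt;A minimal sketch of that token bound: trim the message history to a budget, newest first, so older turns simply fall out of the window. The 4-characters-per-token estimate is a rough assumption, not a real tokenizer.&lt;/p&gt;

```python
# Keep only the most recent messages that fit a token budget.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough heuristic, not a tokenizer

def fit_to_window(messages: list, budget: int) -> list:
    kept = []
    used = 0
    for msg in reversed(messages):         # newest first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                          # everything older is forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```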

&lt;h2&gt;
  
  
  &lt;strong&gt;2.  Retrieval Memory (RAG Pipeline)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now we’re getting serious.&lt;/p&gt;

&lt;p&gt;In production, companies implement a &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/rag-pipeline-explained-diagram-implementation/" rel="noopener noreferrer"&gt;RAG pipeline&lt;/a&gt;&lt;/strong&gt; (Retrieval-Augmented Generation).&lt;br&gt;
Here’s the flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User asks a question&lt;/li&gt;
&lt;li&gt;System embeds the query&lt;/li&gt;
&lt;li&gt;Vector DB retrieves relevant documents&lt;/li&gt;
&lt;li&gt;Retrieved content is injected into prompt&lt;/li&gt;
&lt;li&gt;Model generates response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s how your AI “remembers” company knowledge.&lt;/p&gt;

&lt;p&gt;This architecture is foundational in modern enterprise AI architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why RAG Matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinations increase&lt;/li&gt;
&lt;li&gt;Answers become generic&lt;/li&gt;
&lt;li&gt;Compliance risk grows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grounded responses&lt;/li&gt;
&lt;li&gt;Updated knowledge without retraining&lt;/li&gt;
&lt;li&gt;Traceability for enterprise workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many organizations partner with experts in &lt;strong&gt;LLM engineering services&lt;/strong&gt; to design scalable RAG systems that handle millions of embeddings efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Long-Term Memory (External Storage)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/how-to-build-ai-agents/" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt;&lt;/strong&gt;, memory goes further.&lt;br&gt;
Systems may store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User preferences&lt;/li&gt;
&lt;li&gt;Task history&lt;/li&gt;
&lt;li&gt;Workflow state&lt;/li&gt;
&lt;li&gt;Prior tool results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stored in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases&lt;/li&gt;
&lt;li&gt;Vector stores&lt;/li&gt;
&lt;li&gt;Graph systems&lt;/li&gt;
&lt;li&gt;Object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then selectively retrieved and re-injected.&lt;br&gt;
This is essential for advanced &lt;strong&gt;AI agents&lt;/strong&gt; operating across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Procedural Memory (Tools &amp;amp; Actions)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When an AI books meetings, queries APIs, or writes to databases…&lt;br&gt;
It’s using tool execution frameworks.&lt;/p&gt;

&lt;p&gt;Memory here means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowing available tools&lt;/li&gt;
&lt;li&gt;Tracking tool outputs&lt;/li&gt;
&lt;li&gt;Deciding next steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This transforms an LLM from a chatbot → autonomous workflow engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Real Production Architecture Example&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s say you're building an AI-powered code reviewer.&lt;/p&gt;

&lt;p&gt;Here’s what a robust AI code review system might include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GitHub webhook trigger&lt;/li&gt;
&lt;li&gt;Code chunking &amp;amp; embeddings&lt;/li&gt;
&lt;li&gt;Vector search against best-practices database&lt;/li&gt;
&lt;li&gt;Context injection into prompt&lt;/li&gt;
&lt;li&gt;Model evaluation&lt;/li&gt;
&lt;li&gt;Structured output formatting&lt;/li&gt;
&lt;li&gt;Feedback storage for future reviews&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Notice something?&lt;/p&gt;

&lt;p&gt;The “memory” lives outside the model.&lt;/p&gt;

&lt;p&gt;This is where strategic system design matters more than prompt engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Hidden Complexity of LLM Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Production systems must solve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token optimization&lt;/li&gt;
&lt;li&gt;Embedding drift&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;Retrieval ranking&lt;/li&gt;
&lt;li&gt;Latency constraints&lt;/li&gt;
&lt;li&gt;Multi-agent orchestration&lt;/li&gt;
&lt;li&gt;Observability&lt;/li&gt;
&lt;li&gt;Security &amp;amp; PII handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why deploying AI at enterprise scale requires more than plugging into an API.&lt;/p&gt;

&lt;p&gt;Teams often consult specialized firms like &lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt; – AI Consulting &amp;amp; LLM Engineering Experts&lt;/strong&gt; to architect scalable RAG pipelines, AI agents, and enterprise-ready LLM systems that integrate securely with existing infrastructure.&lt;/p&gt;

&lt;p&gt;The real challenge isn’t calling the model.&lt;/p&gt;

&lt;p&gt;It’s designing the memory layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Memory Optimization Strategies in Enterprise AI Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s explore advanced techniques used in production:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Summarizing past conversations to reduce token load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical Retrieval&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layered retrieval: vector search → re-ranking → summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Combining keyword + vector retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Graph Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Structured relationship mapping for deeper reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback Loops&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storing outputs to refine future prompts.&lt;/p&gt;

&lt;p&gt;These techniques define modern enterprise &lt;strong&gt;AI architecture&lt;/strong&gt;.&lt;/p&gt;
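&lt;p&gt;As a concrete sketch of memory compression: a rolling-summary buffer folds older turns into a short summary once the transcript grows. Here summarize() is a stand-in for an LLM summarization call; the names and the keep-two-recent-turns policy are illustrative assumptions.&lt;/p&gt;

```python
# Rolling-summary memory compression: summarize old turns, keep recent verbatim.
def summarize(turns: list) -> str:
    # stand-in for an LLM summarization call
    return "[summary of " + str(len(turns)) + " earlier turns]"

def compress_memory(turns: list, keep_recent: int = 2) -> list:
    if len(turns) > keep_recent:
        older, recent = turns[:-keep_recent], turns[-keep_recent:]
        return [summarize(older)] + recent
    return turns
```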

&lt;h2&gt;
  
  
  &lt;strong&gt;AI Agents vs Static RAG Systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s clarify something important.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Static RAG&lt;/th&gt;
&lt;th&gt;AI Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single query-response&lt;/td&gt;
&lt;td&gt;Multi-step reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No action capability&lt;/td&gt;
&lt;td&gt;Tool execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stateless&lt;/td&gt;
&lt;td&gt;Stateful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval only&lt;/td&gt;
&lt;td&gt;Planning + memory + execution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Agents require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory buffers&lt;/li&gt;
&lt;li&gt;Planning modules&lt;/li&gt;
&lt;li&gt;Tool registry&lt;/li&gt;
&lt;li&gt;Execution tracking&lt;/li&gt;
&lt;li&gt;Error recovery logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These requirements make agents substantially more complex than static RAG systems.&lt;/p&gt;
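
&lt;p&gt;A toy sketch of those moving parts, with a hard-coded tool call standing in where an LLM planner would sit:&lt;/p&gt;

```python
# Minimal sketch of the components listed above: a memory buffer, a tool
# registry, execution tracking, and simple error recovery.

class Agent:
    def __init__(self):
        self.memory = []                      # memory buffer
        self.tools = {}                       # tool registry
        self.trace = []                       # execution tracking

    def register(self, name, fn):
        self.tools[name] = fn

    def run(self, tool_name, arg):
        self.memory.append(("request", tool_name, arg))
        try:
            result = self.tools[tool_name](arg)
            self.trace.append((tool_name, "ok"))
        except Exception as exc:              # error recovery logic
            self.trace.append((tool_name, "error"))
            result = f"fallback: {exc}"
        self.memory.append(("result", tool_name, result))
        return result

agent = Agent()
agent.register("upper", str.upper)
out = agent.run("upper", "ship it")
bad = agent.run("missing_tool", "x")
```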

&lt;h2&gt;
  
  
  &lt;strong&gt;Common Mistakes in LLM System Design&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Overstuffing prompts&lt;/li&gt;
&lt;li&gt;Ignoring embedding quality&lt;/li&gt;
&lt;li&gt;No observability&lt;/li&gt;
&lt;li&gt;No fallback systems&lt;/li&gt;
&lt;li&gt;Treating LLM as source of truth&lt;/li&gt;
&lt;li&gt;Skipping governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Production-grade &lt;strong&gt;LLM engineering&lt;/strong&gt; is software architecture first, AI second.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Big Mental Model Shift&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Think of LLMs as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reasoning engine&lt;/li&gt;
&lt;li&gt;With temporary working memory&lt;/li&gt;
&lt;li&gt;Powered by external memory modules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model is the brainstem.&lt;br&gt;
 The system is the brain.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where This Is Headed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Future memory systems will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent personalized AI agents&lt;/li&gt;
&lt;li&gt;Federated memory layers&lt;/li&gt;
&lt;li&gt;Real-time streaming retrieval&lt;/li&gt;
&lt;li&gt;Multi-model orchestration&lt;/li&gt;
&lt;li&gt;Memory prioritization algorithms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies that master memory architecture will dominate AI adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you're building:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-powered products&lt;/li&gt;
&lt;li&gt;Enterprise copilots&lt;/li&gt;
&lt;li&gt;AI code review systems&lt;/li&gt;
&lt;li&gt;Multi-agent workflows&lt;/li&gt;
&lt;li&gt;Scalable RAG pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question isn’t:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;“Which LLM should we use?”&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“How are we designing memory?”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s where real differentiation happens.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Production Lessons from Deploying LLMs in Regulated Environments</title>
      <dc:creator>Dextra Labs</dc:creator>
      <pubDate>Wed, 28 Jan 2026 13:38:58 +0000</pubDate>
      <link>https://dev.to/dextralabs/production-lessons-from-deploying-llms-in-regulated-environments-3kcn</link>
      <guid>https://dev.to/dextralabs/production-lessons-from-deploying-llms-in-regulated-environments-3kcn</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Shipping an LLM demo is easy. Shipping a compliant, auditable, production-grade LLM in a regulated industry? That’s where the real engineering begins.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/blog/llm-deployment-and-solutions/" rel="noopener noreferrer"&gt;Large Language Models (LLMs)&lt;/a&gt;&lt;/strong&gt; are rapidly moving from experimentation to mission-critical systems in finance, healthcare, insurance, legal, energy, and government. But regulated environments raise the bar: compliance, explainability, auditability, security, and reliability are no longer “nice to have.”&lt;/p&gt;

&lt;p&gt;This article distills &lt;strong&gt;hard‑won production lessons from deploying LLMs&lt;/strong&gt; in regulated environments: what breaks, what scales, and what actually passes audits. It is written for engineers, architects, and tech leaders building real systems.&lt;/p&gt;

&lt;p&gt;Along the way, we’ll reference proven patterns from multi‑cloud deployments (AWS, Azure, GCP) and real‑world engineering practices adopted by teams working with Dextra Labs, an AI consulting firm specializing in production‑ready, compliant LLM systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also Read: &lt;a href="https://dev.to/dextralabs/evaluating-llms-in-cicd-what-we-learned-the-hard-way-5gao"&gt;Evaluating LLMs in CI/CD: What We Learned the Hard Way&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Regulated Environments Are Different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In regulated domains, LLM systems are judged not only by accuracy but by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data residency &amp;amp; privacy guarantees&lt;/li&gt;
&lt;li&gt;Deterministic behavior and traceability&lt;/li&gt;
&lt;li&gt;Human oversight and accountability&lt;/li&gt;
&lt;li&gt;Repeatable audits and incident forensics&lt;/li&gt;
&lt;li&gt;Vendor risk and model governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A prompt that works in a hackathon can fail spectacularly under &lt;strong&gt;SOC 2, HIPAA, GDPR, PCI‑DSS, or ISO 27001&lt;/strong&gt; scrutiny.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson 0: Treat LLMs as production infrastructure, not APIs you casually call.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Also Read: &lt;a href="https://dev.to/dextralabs/observability-for-ai-agents-metrics-that-actually-matter-2l6h"&gt;Observability for AI Agents: Metrics That Actually Matter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 1: Architecture Must Be Audit‑First, Not Model‑First&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Many teams start with: &lt;em&gt;Which model should we use?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In regulated environments, the better question is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;How will we explain, log, and reproduce every LLM decision?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless inference services&lt;/li&gt;
&lt;li&gt;Immutable request/response logging&lt;/li&gt;
&lt;li&gt;Versioned prompts and models&lt;/li&gt;
&lt;li&gt;Correlation IDs across the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common winning approach is a layered LLM architecture:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;UI / API Layer&lt;br&gt;
   ↓&lt;br&gt;
Policy &amp;amp; Validation Layer&lt;br&gt;
   ↓&lt;br&gt;
Prompt Orchestration Layer&lt;br&gt;
   ↓&lt;br&gt;
Model Runtime (Cloud / Private)&lt;br&gt;
   ↓&lt;br&gt;
Observability + Audit Store&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This pattern, frequently implemented by teams following &lt;strong&gt;LLM deployment best practices&lt;/strong&gt;, makes audits survivable instead of terrifying.&lt;/p&gt;
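
&lt;p&gt;A minimal sketch of the logging side of this pattern, assuming a stubbed model endpoint; the identifiers (&lt;code&gt;refund-policy@v3&lt;/code&gt;, &lt;code&gt;example-model-2026-01&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```python
# Audit-first inference wrapper: every call records prompt version, model
# version, and a correlation ID into an append-only log.

import json
import uuid

AUDIT_LOG = []                                # append-only; WORM storage in production
PROMPT_VERSION = "refund-policy@v3"           # hypothetical versioned prompt ID
MODEL_VERSION = "example-model-2026-01"       # hypothetical versioned model ID

def fake_model(prompt):
    # Stand-in for the real model runtime.
    return "ACK: " + prompt

def audited_infer(prompt, correlation_id=None):
    correlation_id = correlation_id or str(uuid.uuid4())
    response = fake_model(prompt)
    record = {
        "correlation_id": correlation_id,     # ties this call to the whole pipeline
        "prompt_version": PROMPT_VERSION,
        "model_version": MODEL_VERSION,
        "prompt": prompt,
        "response": response,
    }
    AUDIT_LOG.append(json.dumps(record, sort_keys=True))  # immutable serialized record
    return response, correlation_id

resp, cid = audited_infer("Summarize the incident report")
```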

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 2: Data Privacy Is a System Property (Not a Checkbox)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regulated deployments fail most often at data boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Goes Wrong in Production&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PII leaks into prompts&lt;/li&gt;
&lt;li&gt;Training data is reused implicitly by vendors&lt;/li&gt;
&lt;li&gt;Logs accidentally store sensitive text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt‑time PII redaction &amp;amp; tokenization&lt;/li&gt;
&lt;li&gt;Field‑level encryption before inference&lt;/li&gt;
&lt;li&gt;Strict separation between inference data and analytics data&lt;/li&gt;
&lt;/ul&gt;
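
&lt;p&gt;A minimal sketch of prompt-time redaction; the regexes are illustrative only, and a production system would use a dedicated PII detector:&lt;/p&gt;

```python
# Prompt-time PII redaction: strip obvious identifiers (emails, phone-like
# numbers) before text reaches the model or the logs.

import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "[PHONE]"),
]

def redact(text):
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact("Contact jane.doe@example.com or 555-123-4567 about claim 42.")
```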

&lt;p&gt;When deploying on AWS, Azure, or GCP, successful teams align LLM pipelines with existing &lt;strong&gt;VPC, Private Link, and KMS&lt;/strong&gt; strategies, extending cloud security posture rather than bypassing it.&lt;/p&gt;

&lt;p&gt;This is where Dextra Labs often steps in: helping enterprises design LLM workflows that inherit compliance from their cloud infrastructure instead of reinventing security from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 3: Compliance Requires Explainability (Even If Models Aren’t Explainable)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;No regulator will accept:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“The model said so.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical Explainability Techniques&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store retrieved documents in &lt;strong&gt;&lt;a href="https://dextralabs.com/blog/production-rag-in-2025-evaluation-cicd-observability/" rel="noopener noreferrer"&gt;RAG&lt;/a&gt;&lt;/strong&gt; systems&lt;/li&gt;
&lt;li&gt;Log prompt templates + variables&lt;/li&gt;
&lt;li&gt;Capture top‑k outputs and confidence heuristics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Explainability doesn’t mean opening the model weights; it means reconstructing &lt;strong&gt;why a response was generated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams applying &lt;strong&gt;retrieval‑augmented generation in production&lt;/strong&gt; consistently outperform black‑box chatbots during audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 4: Evaluation Is Continuous, Not Pre‑Launch&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional ML validation happens before deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;LLMs require always‑on evaluation.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production‑Grade Evaluation Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Golden datasets for regulated scenarios&lt;/li&gt;
&lt;li&gt;Policy‑based output validation&lt;/li&gt;
&lt;li&gt;Drift detection (semantic + statistical)&lt;/li&gt;
&lt;li&gt;Human‑in‑the‑loop escalation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A key insight from &lt;strong&gt;successful LLM applications&lt;/strong&gt; is that evaluation pipelines must ship alongside inference pipelines.&lt;/p&gt;

&lt;p&gt;If you can’t measure it in production, you can’t defend it in front of regulators.&lt;/p&gt;
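
&lt;p&gt;A toy version of a golden-dataset check that could run on every deploy; the &lt;code&gt;model&lt;/code&gt; function is a deterministic stub for the real system:&lt;/p&gt;

```python
# Golden-dataset evaluation: each regulated scenario pairs a question with a
# property the output must satisfy, and the suite reports a pass rate.

def model(question):
    # Stub standing in for the real inference pipeline.
    answers = {
        "Can we store card PANs in logs?": "No. PCI-DSS forbids storing full PANs in logs.",
        "What is the refund window?": "Refunds are accepted within 30 days.",
    }
    return answers.get(question, "I don't know.")

GOLDEN_SET = [
    ("Can we store card PANs in logs?", lambda a: a.startswith("No")),
    ("What is the refund window?", lambda a: "30 days" in a),
]

def run_eval(model_fn, golden):
    results = []
    for question, check in golden:
        answer = model_fn(question)
        results.append({"question": question, "passed": bool(check(answer))})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return results, pass_rate

results, pass_rate = run_eval(model, GOLDEN_SET)
```

&lt;p&gt;Gating deploys on the pass rate is one way to make the evaluation pipeline ship alongside the inference pipeline.&lt;/p&gt;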

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 5: Prompt Engineering Needs Governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In regulated systems, prompts are not experiments; they are controlled artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat Prompts Like Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control&lt;/li&gt;
&lt;li&gt;Peer review&lt;/li&gt;
&lt;li&gt;Rollback support&lt;/li&gt;
&lt;li&gt;Approval workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At scale, teams adopt prompt registries with metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use case&lt;/li&gt;
&lt;li&gt;Risk classification&lt;/li&gt;
&lt;li&gt;Allowed data types&lt;/li&gt;
&lt;li&gt;Model compatibility&lt;/li&gt;
&lt;/ul&gt;
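
&lt;p&gt;One way such a registry entry might look in Python; the schema and field names are illustrative, not a standard:&lt;/p&gt;

```python
# Prompt registry entries carrying the metadata listed above. Records are
# frozen and keyed by (name, version) so a published prompt is never mutated.

from dataclasses import dataclass

@dataclass(frozen=True)
class PromptRecord:
    name: str
    version: int
    template: str
    use_case: str
    risk_class: str                 # e.g. "low", "medium", "high"
    allowed_data: tuple             # data types this prompt may receive
    compatible_models: tuple

REGISTRY = {}

def register(record):
    key = (record.name, record.version)
    if key in REGISTRY:
        raise ValueError("prompt versions are immutable; bump the version instead")
    REGISTRY[key] = record

register(PromptRecord(
    name="claims-summary",
    version=1,
    template="Summarize this claim: {claim_text}",
    use_case="internal claims triage",
    risk_class="medium",
    allowed_data=("claim_text",),
    compatible_models=("example-model-2026-01",),
))

record = REGISTRY[("claims-summary", 1)]
```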

&lt;p&gt;This governance‑first approach, aligned with &lt;strong&gt;enterprise LLM governance frameworks&lt;/strong&gt;, prevents silent regressions that could trigger compliance incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 6: Multi‑Cloud &amp;amp; Vendor Flexibility Is a Risk Strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Regulators increasingly ask:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;What happens if your model provider changes terms or fails?&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Production Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Abstract model providers behind a runtime layer&lt;/li&gt;
&lt;li&gt;Support OpenAI, Azure OpenAI, Anthropic, and open‑source models&lt;/li&gt;
&lt;li&gt;Keep prompts portable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Insights from multi‑cloud LLM deployment on &lt;strong&gt;AWS, Azure, and GCP&lt;/strong&gt; show that model portability is not just a cost optimization; it’s a regulatory safety net.&lt;/p&gt;
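
&lt;p&gt;A minimal sketch of a provider-abstraction layer with fallback; both providers are stubs where real vendor clients would plug in:&lt;/p&gt;

```python
# Thin runtime layer: callers depend on `complete`, not on any vendor SDK,
# so providers can be swapped by configuration or on failure.

def provider_a(prompt):
    return "[A] " + prompt          # stub for one vendor's client

def provider_b(prompt):
    return "[B] " + prompt          # stub for another vendor's client

PROVIDERS = {"a": provider_a, "b": provider_b}

class ModelRuntime:
    def __init__(self, primary, fallback=None):
        self.primary = primary
        self.fallback = fallback

    def complete(self, prompt):
        try:
            return PROVIDERS[self.primary](prompt)
        except Exception:           # provider missing or call failed
            if self.fallback:
                return PROVIDERS[self.fallback](prompt)
            raise

runtime = ModelRuntime(primary="missing", fallback="b")
out = runtime.complete("hello")
```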

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dextralabs.com/" rel="noopener noreferrer"&gt;Dextra Labs&lt;/a&gt;&lt;/strong&gt; frequently helps teams design vendor‑neutral LLM platforms so compliance doesn’t hinge on a single provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 7: Incident Response Must Include the LLM&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When something goes wrong, auditors will ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who approved the prompt?&lt;/li&gt;
&lt;li&gt;Which model version was used?&lt;/li&gt;
&lt;li&gt;What data influenced the response?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LLM‑Aware Incident Playbooks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kill switches for high‑risk use cases&lt;/li&gt;
&lt;li&gt;Rate limiting on sensitive workflows&lt;/li&gt;
&lt;li&gt;Real‑time monitoring of unsafe outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your incident response plan ignores LLMs, it’s incomplete.&lt;/p&gt;
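
&lt;p&gt;A toy sketch of a kill switch and a per-workflow rate limit; in production these flags would live in a config service, not in module globals:&lt;/p&gt;

```python
# Kill switch plus a crude per-workflow rate limit, the two controls above.

KILL_SWITCHES = {"payments-assistant": True}   # True means the use case is halted
RATE_LIMITS = {"claims-triage": 2}             # max calls per window (illustrative)
_call_counts = {}

def guarded_call(workflow, prompt):
    if KILL_SWITCHES.get(workflow):
        return "BLOCKED: use case disabled by kill switch"
    used = _call_counts.get(workflow, 0)
    limit = RATE_LIMITS.get(workflow)
    if limit is not None and used >= limit:
        return "BLOCKED: rate limit reached"
    _call_counts[workflow] = used + 1
    return "OK: " + prompt                     # would call the model here

blocked = guarded_call("payments-assistant", "refund card")
first = guarded_call("claims-triage", "triage claim 1")
second = guarded_call("claims-triage", "triage claim 2")
third = guarded_call("claims-triage", "triage claim 3")
```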

&lt;h2&gt;
  
  
  &lt;strong&gt;Lesson 8: Start Narrow, Then Earn Trust&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most successful regulated deployments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with low‑risk, high‑value use cases&lt;/li&gt;
&lt;li&gt;Prove compliance early&lt;/li&gt;
&lt;li&gt;Expand scope incrementally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal knowledge assistants&lt;/li&gt;
&lt;li&gt;Policy summarization tools&lt;/li&gt;
&lt;li&gt;Developer copilots with read‑only access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Trust is accumulated, not assumed.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where AI Consulting Actually Helps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building compliant LLM systems is less about models and more about &lt;strong&gt;systems thinking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An experienced AI &lt;strong&gt;consulting partner&lt;/strong&gt; like Dextra Labs helps organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate regulations into technical controls&lt;/li&gt;
&lt;li&gt;Design audit‑ready LLM architectures&lt;/li&gt;
&lt;li&gt;Deploy securely across AWS, Azure, and GCP&lt;/li&gt;
&lt;li&gt;Operationalize evaluation, monitoring, and governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn’t just to deploy an LLM; it’s to ship AI systems regulators won’t shut down.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Regulated environments demand engineering discipline, not experimentation&lt;/li&gt;
&lt;li&gt;Observability, governance, and security matter more than model choice&lt;/li&gt;
&lt;li&gt;LLM success in production is 80% architecture, 20% AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you treat LLMs like infrastructure, compliance becomes manageable. If you treat them like magic, audits will be brutal.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
