<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Seenivasa Ramadurai</title>
    <description>The latest articles on DEV Community by Seenivasa Ramadurai (@sreeni5018).</description>
    <link>https://dev.to/sreeni5018</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1829954%2Fe57edf87-9dae-48c9-a528-0f57f54aac70.png</url>
      <title>DEV Community: Seenivasa Ramadurai</title>
      <link>https://dev.to/sreeni5018</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sreeni5018"/>
    <language>en</language>
    <item>
      <title>Most Enterprise AI Agents Fail in Production for the Same Reason, and It's Not the Model</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 19 Jun 2026 01:49:33 +0000</pubDate>
      <link>https://dev.to/sreeni5018/most-enterprise-ai-agents-fail-in-production-for-the-same-reason-and-its-not-the-model-4ad7</link>
      <guid>https://dev.to/sreeni5018/most-enterprise-ai-agents-fail-in-production-for-the-same-reason-and-its-not-the-model-4ad7</guid>
      <description>&lt;h2&gt;
  
  
  Because intelligence alone is never enough.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F542evrgz6cgnyyuq53tc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F542evrgz6cgnyyuq53tc.png" alt=" " width="799" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's a question I keep hearing from enterprise teams who are just starting to productionize AI agents:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;"We've got great prompts. The model performs well in testing. Why does it still fail in production?"&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The reason is almost always the same: &lt;strong&gt;they built the intelligence.&lt;/strong&gt; &lt;strong&gt;They didn't build the system around it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's it. That's the whole failure pattern. The model is fine. The engineering discipline surrounding it wasn't applied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the analogy I use to explain the difference.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  An AI Agent Is a Self-Driving Car
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not metaphorically. Structurally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both operate in dynamic, unpredictable environments.&lt;/strong&gt; &lt;strong&gt;Both make real-time decisions with incomplete information.&lt;/strong&gt; Both can fail not because they're dumb, but because the environment surprises them in ways nobody anticipated. And in both cases, the intelligence of the system (the model, the sensors, the neural net) is only one layer of what makes it trustworthy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fod6bmgridtsr4snvi2ct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fod6bmgridtsr4snvi2ct.png" alt=" " width="800" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you break it down, &lt;strong&gt;three distinct engineering disciplines make a self-driving car work.&lt;/strong&gt; The same three disciplines make an &lt;strong&gt;AI agent work.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: Prompt Engineering = Destination and Driving Instructions
&lt;/h2&gt;

&lt;p&gt;Before you put a self-driving car on the road, you configure it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where are we going?&lt;/li&gt;
&lt;li&gt;Which route is preferred?&lt;/li&gt;
&lt;li&gt;What's the speed limit?&lt;/li&gt;
&lt;li&gt;Are there constraints? (No highways. No toll roads. Arrive by 3 PM.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fo2tp79vfgtrekbubs818.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fo2tp79vfgtrekbubs818.png" alt=" " width="766" height="1716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The car doesn't invent the mission. You give it one: precisely, explicitly, in a format it can act on.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering does exactly the same thing for an AI agent. It defines:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal and scope of the task&lt;br&gt;
The rules and constraints it must follow&lt;br&gt;
The persona and tone it should operate with&lt;br&gt;
The guardrails that bound its behavior&lt;br&gt;
The expected format and outcome of its output&lt;/p&gt;

&lt;p&gt;Without clear prompts, the agent does what a car does without a destination: it moves, but not toward anything useful. It might wander into edge cases, confabulate, or execute the wrong task with full confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An &lt;strong&gt;e-commerce support agent&lt;/strong&gt; told only to "help customers" will happily process a refund, cancel an active shipment, and escalate to a manager, all for the same complaint, because nobody told it which action to take first or when escalation is appropriate. The model is working fine. The briefing failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering is the briefing. It's not optional, and it's not a one-time job. As your tasks evolve, so should the prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: Context Engineering = Situational Awareness
&lt;/h2&gt;

&lt;p&gt;A self-driving car with perfect instructions will still crash if it can't see what's around it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why autonomous vehicles carry:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GPS and real-time maps&lt;/li&gt;
&lt;li&gt;Lidar and radar sensors&lt;/li&gt;
&lt;li&gt;Camera feeds processing the road ahead&lt;/li&gt;
&lt;li&gt;Weather and road condition data&lt;/li&gt;
&lt;li&gt;Traffic pattern feeds&lt;/li&gt;
&lt;li&gt;Pedestrian detection systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fc5ep2bv8u10s06no4yy0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fc5ep2bv8u10s06no4yy0.png" alt=" " width="764" height="1704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of this is context: live, environmental, dynamic information that allows the vehicle to make intelligent decisions in the moment, not just based on pre-loaded instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI agent has the same problem. The base LLM is trained on historical data.&lt;/strong&gt; It doesn't know about your &lt;strong&gt;enterprise data&lt;/strong&gt;, &lt;strong&gt;your customer's current account status&lt;/strong&gt;, the document that was updated yesterday, or the conversation that happened last week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; A banking support agent is asked "what's the status of my loan application?" The model knows everything about loans in general. It knows nothing about this customer's application filed three days ago. Without retrieval (&lt;strong&gt;RAG pulling the customer's record in real time&lt;/strong&gt;), &lt;strong&gt;the agent either hallucinates a status or says it doesn't have access&lt;/strong&gt;. Both outcomes destroy trust. The model is fine. The context layer wasn't built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Engineering fills that gap. It's how you inject:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;RAG and GraphRAG&lt;/strong&gt; — retrieval of relevant documents and structured knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory systems&lt;/strong&gt; — both short-term (within session) and long-term (across sessions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Servers&lt;/strong&gt; — access to external tools, APIs, and services&lt;/li&gt;
&lt;li&gt;Enterprise knowledge bases — internal policies, product documentation, historical data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User history and preferences&lt;/strong&gt; — the personalization layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data feeds&lt;/strong&gt; — current state of the world the agent is operating in&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Context is not a prompt engineering problem. It's an infrastructure problem.&lt;/strong&gt; Getting the right information to the agent at the right moment, in the right format, with the right freshness: that's an entirely different discipline with its own architecture, its own tooling, and its own failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A well-prompted agent with poor context is like a skilled driver in a blindfolded car.&lt;/strong&gt; The instructions are clear. The execution is impossible.&lt;/p&gt;
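
&lt;p&gt;For the banking example above, here is a minimal sketch of what "retrieve first, then prompt" could look like. Everything in it is an assumption for illustration: the in-memory &lt;code&gt;APPLICATIONS&lt;/code&gt; store stands in for a real system of record, and the customer ID, status, date, and function names are hypothetical.&lt;/p&gt;

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for a system of record (illustrative data only).
APPLICATIONS: Dict[str, Dict[str, str]] = {
    "cust-42": {"status": "under review", "updated": "2026-06-16"},
}


def fetch_application(customer_id: str) -> Optional[Dict[str, str]]:
    """Retrieve the customer's record; None means nothing was found."""
    return APPLICATIONS.get(customer_id)


def build_prompt(customer_id: str, question: str) -> str:
    """Ground the question in retrieved data, or state explicitly that none exists."""
    record = fetch_application(customer_id)
    if record is None:
        context = "No loan application record was found for this customer."
    else:
        context = (
            f"Loan application status: {record['status']} "
            f"(last updated {record['updated']})."
        )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer only from the context."
```

&lt;p&gt;The point of the sketch is the missing-record branch: the prompt tells the model that nothing was found, so it has no reason to guess a status.&lt;/p&gt;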

&lt;h2&gt;
  
  
  Layer 3: Harness Engineering = Safety, Recovery, and Accountability
&lt;/h2&gt;

&lt;p&gt;Here's where most teams underinvest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even the most advanced autonomous vehicle isn't deployed without a full safety stack.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collision detection and emergency braking&lt;/li&gt;
&lt;li&gt;Lane departure warnings&lt;/li&gt;
&lt;li&gt;Route recalculation when roads are blocked&lt;/li&gt;
&lt;li&gt;Telemetry for monitoring vehicle state&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Black-box logging for post-incident investigation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human override capability&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Regulatory compliance systems&lt;/li&gt;
&lt;li&gt;Redundant sensor fusion&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzvv8abe219kzb0uopz9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzvv8abe219kzb0uopz9p.png" alt=" " width="800" height="1667"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the harness — the layer that doesn't make the car smarter, but makes it safer. It's the layer that catches failures before they become disasters, and that proves what happened when they do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Harness Engineering is the same idea applied to AI systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Management&lt;/strong&gt; — knowing where the agent is in a multi-step workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt; — saving progress so failures don't require starting over&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop (HITL)&lt;/strong&gt; — escalation paths when confidence is low or stakes are high&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — traces, logs, and dashboards that show you what the agent did and why&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails and Content Controls&lt;/strong&gt; — preventing harmful or out-of-scope outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Access Control&lt;/strong&gt; — scoping what the agent can call and with what permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation Pipelines&lt;/strong&gt; — continuous testing against ground truth to catch regression&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery Logic&lt;/strong&gt; — graceful degradation when tools fail or context is unavailable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and Governance&lt;/strong&gt; — audit trails, access controls, compliance hooks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An HR onboarding agent is mid-workflow — it has created a user account, sent a welcome email, and is about to provision software licenses when the identity service times out. Without checkpointing, the entire workflow restarts from scratch: duplicate account, duplicate email, confused new hire. Without observability, the engineering team doesn't even know it happened until someone complains. The model executed perfectly. The harness wasn't there to catch the infrastructure failure.&lt;/p&gt;
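
&lt;p&gt;A minimal sketch of the checkpointing idea for this onboarding scenario, assuming progress is saved to a local JSON file after every successful step. The step names, the file-based store, and the &lt;code&gt;actions&lt;/code&gt; mapping are hypothetical; a production harness would more likely use a durable database or a workflow engine.&lt;/p&gt;

```python
import json
from pathlib import Path

# Hypothetical step names for the onboarding workflow described above.
STEPS = ["create_account", "send_welcome_email", "provision_licenses"]


def load_checkpoint(path: Path) -> list:
    """Return the steps already completed, or an empty list on a first run."""
    if path.exists():
        return json.loads(path.read_text())
    return []


def run_workflow(path: Path, actions: dict) -> list:
    """Run each step at most once, persisting progress after every success."""
    done = load_checkpoint(path)
    for step in STEPS:
        if step in done:
            continue  # completed in an earlier run, so do not repeat it
        actions[step]()  # may raise, for example on a service timeout
        done.append(step)
        path.write_text(json.dumps(done))  # checkpoint after each step
    return done
```

&lt;p&gt;If the licensing step raises, a second call to &lt;code&gt;run_workflow&lt;/code&gt; with the same checkpoint file skips the account and email steps and resumes at licensing, which is what avoids the duplicate account and duplicate email.&lt;/p&gt;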

&lt;p&gt;The harness doesn't change what the agent can do. It changes what the agent will do under pressure, which is when it matters most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Failures Still Happen Even When You've Done Everything Right
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcgeygfoyqjoc8an0ngg2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcgeygfoyqjoc8an0ngg2.png" alt=" " width="799" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's the truth every production AI team eventually confronts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even with all three layers in place&lt;/strong&gt; (solid &lt;strong&gt;prompts&lt;/strong&gt;, rich &lt;strong&gt;context&lt;/strong&gt;, a well-engineered &lt;strong&gt;harness&lt;/strong&gt;), your &lt;strong&gt;agent will still make mistakes. Not occasionally.&lt;/strong&gt; Regularly enough that you need a plan for it.&lt;/p&gt;

&lt;p&gt;This is not a model quality problem. &lt;strong&gt;It is a fundamental property of the environment these systems operate in.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both autonomous vehicles and AI agents face the same four realities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic environments&lt;/strong&gt; — the world changes faster than any training set or prompt update cycle can track&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incomplete information&lt;/strong&gt; — no matter how good your retrieval is, the context is always partial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unseen edge cases&lt;/strong&gt; — production traffic will surface combinations that no benchmark, red team, or test suite anticipated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading conditions&lt;/strong&gt; — two situations your agent handles perfectly in isolation can combine into something it has never encountered&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No amount of engineering eliminates these realities. What engineering does is change how you respond to them.&lt;/p&gt;

&lt;p&gt;You can have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clear, tested prompts&lt;/li&gt;
&lt;li&gt;Rich, well-curated context&lt;/li&gt;
&lt;li&gt;A well-designed harness with observability and recovery&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And the agent will still make mistakes. The difference is whether those mistakes are visible, recoverable, and traceable — or silent, destructive, and impossible to debug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal is never zero failures. The goal is:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Detect failures earlier. Recover faster. Prove what happened. Continuously improve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's what the harness is for. That's what observability is for. That's what HITL is for.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;If someone asks you to explain all three disciplines in a single breath:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftzgwpgh8qftlru2m8rm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ftzgwpgh8qftlru2m8rm4.png" alt=" " width="706" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering&lt;/strong&gt; tells the agent where to go. &lt;strong&gt;Context Engineering&lt;/strong&gt; helps it understand where it is. &lt;strong&gt;Harness Engineering&lt;/strong&gt; helps it arrive safely, recover when things go wrong, and prove what happened along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for Enterprise AI Teams
&lt;/h2&gt;

&lt;p&gt;Most teams are overinvested in &lt;strong&gt;Layer 1&lt;/strong&gt; and underinvested in &lt;strong&gt;Layers 2 and 3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Engineering&lt;/strong&gt; gets the most attention because it's visible, iterable, and produces immediate results. It's also the layer that impresses in demos. &lt;strong&gt;Context Engineering&lt;/strong&gt; is harder because it requires data infrastructure, retrieval pipelines, and integration work. &lt;strong&gt;Harness Engineering&lt;/strong&gt; is hardest because it requires thinking about failure modes before they happen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fiu3u2o231y1w8ubfkp5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fiu3u2o231y1w8ubfkp5z.png" alt=" " width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But here's the practical reality: in production, the agents that stay in production are the ones with solid harnesses. Not the ones with the most creative prompts.&lt;/p&gt;

&lt;p&gt;The teams that deploy reliably aren't just asking "did the agent get the right answer?" They're asking "when it gets the wrong answer, how fast do we know? How do we recover? What's the audit trail? Who can intervene?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the shift from building demos to building systems.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The autonomous vehicle analogy works because it shifts the conversation from capability to reliability. Nobody debates whether self-driving cars are technically impressive. The debate is always about whether they're trustworthy enough to operate at scale without human supervision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7y5sa9zcwzqaelascd7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F7y5sa9zcwzqaelascd7z.png" alt=" " width="786" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's exactly where enterprise AI is right now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LLMs are impressive. The question is whether the systems around them are engineering grade.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt&lt;/strong&gt;, &lt;strong&gt;Context&lt;/strong&gt;, and &lt;strong&gt;Harness&lt;/strong&gt; Engineering are how you close that gap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SAGA Made Microservices Reliable. Agent Harness Makes AI Agents Reliable.</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 14 Jun 2026 05:31:13 +0000</pubDate>
      <link>https://dev.to/sreeni5018/saga-made-microservices-reliable-agent-harness-makes-ai-agents-reliable-3d1k</link>
      <guid>https://dev.to/sreeni5018/saga-made-microservices-reliable-agent-harness-makes-ai-agents-reliable-3d1k</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;The distributed systems world solved long-running transactions with SAGA. The agentic AI world has a harder version of the same problem. Here's how Agent Harness answers it.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr7baaekdxt8bv6e1px5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwr7baaekdxt8bv6e1px5.png" alt=" " width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I've been deep in agentic AI architecture for a while now: building &lt;strong&gt;Digital Workers&lt;/strong&gt;, designing &lt;strong&gt;multi-agent systems&lt;/strong&gt;, and working through the messy production realities of agents that call tools, consult knowledge bases, and loop back on themselves when they're uncertain. And one question keeps coming up when I talk to engineers who come from a microservices background: "Can't we just use SAGA for this?"&lt;/p&gt;

&lt;p&gt;It's a fair question. &lt;strong&gt;SAGA is one of the more elegant patterns in distributed systems&lt;/strong&gt;. And on the surface, agentic workflows look similar enough that the analogy is tempting. Both involve coordinating multi-step processes. Both need state management and failure recovery. Both have to deal with partial completions.&lt;/p&gt;

&lt;p&gt;But the moment you dig into the details, you realize why SAGA alone isn't enough, and why Agent Harness exists.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SAGA Was Built to Solve
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If you've spent time in microservices land&lt;/strong&gt;, you've lived this problem. &lt;strong&gt;Service A completes, Service B completes, Service C fails&lt;/strong&gt;, and now you have a half-committed distributed transaction with no clean rollback and no database-level guarantee to save you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SAGA pattern was invented for exactly this.&lt;/strong&gt; The idea is to break long-running transactions into a &lt;strong&gt;sequence of local steps&lt;/strong&gt; and, for every step that can succeed, write a compensating action in advance so that if something downstream fails, you can undo the damage cleanly.&lt;/p&gt;
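
&lt;p&gt;To make the pattern concrete, here is a minimal in-process sketch, assuming each step is a plain function paired with its compensating function. The &lt;code&gt;run_saga&lt;/code&gt; name and the tuple layout are illustrative assumptions; real SAGA coordinators also persist their state and handle failures inside the compensations themselves.&lt;/p&gt;

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensating action that undoes the action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _undo = step
        try:
            action()
        except Exception:
            # Undo everything that already succeeded, newest first.
            for _, _, undo in reversed(completed):
                undo()
            return False
        completed.append(step)
    return True
```

&lt;p&gt;With three steps where the third raises, the first two actions run, then their compensations run in reverse order, and the function returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;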

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figy3ref019ifr4dfaq1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Figy3ref019ifr4dfaq1n.png" alt=" " width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It works beautifully because microservices &lt;strong&gt;operate in a deterministic world. Every service has a known API contract&lt;/strong&gt;. Every response has a typed schema. Every failure is a status code or a typed exception. Every retry is predictable. The failure modes are knowable at design time, so you can write compensation logic at design time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI agents don't live in that world.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F756xte0qz9roovct41y2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F756xte0qz9roovct41y2.png" alt=" " width="800" height="658"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Is Probabilistic, Not Deterministic
&lt;/h2&gt;

&lt;p&gt;Here's what &lt;strong&gt;fundamentally changes&lt;/strong&gt; when you move from &lt;strong&gt;microservices&lt;/strong&gt; to &lt;strong&gt;agentic AI systems&lt;/strong&gt;: your "&lt;strong&gt;services&lt;/strong&gt;" are now &lt;strong&gt;LLM calls&lt;/strong&gt;, &lt;strong&gt;tool invocations&lt;/strong&gt;, &lt;strong&gt;knowledge retrievals&lt;/strong&gt;, &lt;strong&gt;external APIs or MCP Server tool calls&lt;/strong&gt;, and increasingly &lt;strong&gt;human approvals&lt;/strong&gt;. None of these behave like a well-defined &lt;strong&gt;REST&lt;/strong&gt; endpoint with a contract you can write compensation logic against.&lt;/p&gt;

&lt;p&gt;An LLM call can return an answer that passes every syntax check but is semantically wrong: confidently, fluently, plausibly wrong. A tool call might succeed at the HTTP layer but return data that sends the agent down an entirely incorrect reasoning path. A multi-step task might "complete" having taken three hallucinated intermediate steps before landing somewhere that superficially looks like the goal.&lt;/p&gt;

&lt;p&gt;And here's the part that should give you pause: &lt;strong&gt;a SAGA coordinator would mark all of that as success&lt;/strong&gt;. No exceptions. No compensation triggered. Workflow complete.&lt;/p&gt;

&lt;p&gt;Retrying won't fix it. Compensation logic won't fix it. You need something architecturally different: an &lt;strong&gt;Agent Harness&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SAGA and Agent Harness Actually Share
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7olhfu6sqpym0c1hw3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7olhfu6sqpym0c1hw3x.png" alt=" " width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before getting into where they diverge&lt;/strong&gt;, it's worth being honest about the parallel because &lt;strong&gt;it isn't just a clever analogy. It's structurally real.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both patterns exist to solve the same core problem: coordinating multi-step processes where individual steps can fail, state needs to be preserved across the lifecycle, and the overall system needs to recover gracefully when things go sideways.&lt;/p&gt;

&lt;p&gt;The SAGA Coordinator manages &lt;strong&gt;state tracking&lt;/strong&gt;, &lt;strong&gt;retries&lt;/strong&gt;, &lt;strong&gt;compensation actions&lt;/strong&gt;, &lt;strong&gt;failure recovery&lt;/strong&gt;, workflow sequencing, and distributed reliability. The Agent Harness manages all of those same things, just mapped to a completely different execution model.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;[The architecture maps cleanly. The implementation is night and day.]&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v973c2uj521fias69f0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9v973c2uj521fias69f0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agent Harness Does That SAGA Cannot
&lt;/h2&gt;

&lt;p&gt;SAGA assumes your workflow steps are atomic and deterministic. Agent Harness has to deal with steps that are neither. That's why it needs an entire category of capabilities that have no real SAGA equivalent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory (Short &amp;amp; Long Term):&lt;/strong&gt; An agent working a multi-turn task needs to remember what it decided three steps ago, what the user said at the start, and what it already tried that didn't work. That's not transaction state. That's episodic memory and working context interleaved in a way that needs to survive tool calls, retries, and mid-task handoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflection &amp;amp; Critique:&lt;/strong&gt; Before committing to an action or an answer, a well-designed harness routes the &lt;strong&gt;agent's proposed output through a self-critique step&lt;/strong&gt;. Did the answer actually address the stated goal? Does it contradict something established earlier in the session? Does it fall outside the policy boundaries? SAGA never needs to ask its services whether they feel confident about their output. Agent Harness does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails &amp;amp; Policies:&lt;/strong&gt; In production, especially in regulated industries, you &lt;strong&gt;don't want an agent calling a sensitive external API, accessing PII, or making a consequential decision without policy enforcement at the harness level.&lt;/strong&gt; This isn't exception handling after the fact. It's proactive constraint evaluation before execution. I've seen this matter enormously in healthcare projects where the consequences of an unguarded tool call are real.&lt;/p&gt;
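
&lt;p&gt;A minimal sketch of "constraint evaluation before execution", assuming the harness wraps every tool call in a gate. The &lt;code&gt;ToolGate&lt;/code&gt; class, the allow-list, and the PII field names are hypothetical and only show the shape of the check.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Set


class PolicyViolation(Exception):
    """Raised when a tool call is rejected before it executes."""


@dataclass
class ToolGate:
    # Hypothetical policy: tools this agent may call, and argument names treated as PII.
    allowed_tools: Set[str]
    pii_fields: Set[str] = field(default_factory=lambda: {"ssn", "date_of_birth"})

    def call(self, name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        """Evaluate policy first; run the tool only if every check passes."""
        if name not in self.allowed_tools:
            raise PolicyViolation(f"tool not permitted: {name}")
        blocked = self.pii_fields.intersection(kwargs)
        if blocked:
            raise PolicyViolation(f"PII arguments blocked: {sorted(blocked)}")
        return tool(**kwargs)
```

&lt;p&gt;The tool function is never invoked when a check fails, which is the difference between a guardrail and an exception handler.&lt;/p&gt;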

&lt;p&gt;&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; SAGA runs unattended by design. Agent Harness needs to know when to stop and ask a human, and that decision happens at the semantic level, not the infrastructure level. &lt;strong&gt;"I'm not certain this is what the user intended" is a fundamentally different pause condition than "the API returned a 503."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation &amp;amp; Validation:&lt;/strong&gt; Did the &lt;strong&gt;agent's output actually achieve the goal? Not "did the tool call succeed"&lt;/strong&gt; but "did we actually do what we set out to do?" This requires goal-level evaluation, not just a &lt;strong&gt;success/failure&lt;/strong&gt; bit. It's one of the harder things to operationalize in practice, but skipping it is how you ship agents that complete tasks without accomplishing goals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost &amp;amp; Token Monitoring:&lt;/strong&gt; LLM calls have &lt;strong&gt;variable cost depending on context length, model tier, and how deep the reasoning goes&lt;/strong&gt;. An agent running a complex multi-step task can burn through budget in ways that are invisible until you get the bill. A production Agent Harness needs token spend guardrails the way a microservices platform needs circuit breakers on latency.&lt;/p&gt;
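
&lt;p&gt;A minimal sketch of a per-task token budget, assuming the harness records the prompt and completion token counts reported for each LLM call. The class names and the limit are hypothetical; the idea is simply that the task stops as soon as its allowance is used up instead of at invoice time.&lt;/p&gt;

```python
class BudgetExceeded(Exception):
    """Raised once a task has used more tokens than its allowance."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> int:
        """Record usage for one LLM call and return the remaining allowance."""
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used} tokens, limit is {self.max_tokens}"
            )
        return self.max_tokens - self.used
```

&lt;p&gt;A harness would call &lt;code&gt;charge&lt;/code&gt; after every model call and treat &lt;code&gt;BudgetExceeded&lt;/code&gt; as a signal to stop or escalate, much like a tripped circuit breaker.&lt;/p&gt;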

&lt;p&gt;&lt;strong&gt;Durable Execution via Checkpointing:&lt;/strong&gt; If an &lt;strong&gt;agent task runs for 40 minutes and the process crashes at minute 39, checkpointing lets you resume from the last stable state rather than starting over&lt;/strong&gt;. It is philosophically similar to SAGA's compensating transactions, but the implementation means serializing agent state, tool call history, memory contents, and intermediate reasoning. Substantially more complex, and substantially more necessary for long-horizon tasks.&lt;/p&gt;
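&lt;p&gt;A minimal sketch of resume-from-checkpoint, assuming the agent's state fits in a JSON-serializable dictionary and each step is a plain function. The file layout and the &lt;code&gt;run_task&lt;/code&gt; helper are hypothetical; production systems typically lean on a durable execution engine or a database instead of a local file.&lt;/p&gt;

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "tool_history": [], "memory": {}}
    with open(path) as handle:
        return json.load(handle)


def run_task(steps, path):
    """Run steps in order, resuming after the last completed one."""
    state = load_checkpoint(path)
    for index in range(state["next_step"], len(steps)):
        result = steps[index](state["memory"])  # a step may update memory
        state["tool_history"].append({"step": index, "result": result})
        state["next_step"] = index + 1
        save_checkpoint(path, state)  # persist after every completed step
    return state
```

&lt;p&gt;If the process dies during step N, the next &lt;code&gt;run_task&lt;/code&gt; call picks up at step N and does not repeat steps that already completed.&lt;/p&gt;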

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflchvwqq9l5j9efcf30j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fflchvwqq9l5j9efcf30j.png" alt=" " width="800" height="1274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Concrete Scenario That Makes This Real
&lt;/h2&gt;

&lt;p&gt;Let me give you a specific example, because abstract architecture arguments only go so far.&lt;/p&gt;

&lt;p&gt;Imagine an agent tasked with: "&lt;strong&gt;Research our top three competitors&lt;/strong&gt;' pricing pages and prepare a comparison summary for the sales team."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A SAGA-style system would model this as&lt;/strong&gt;: &lt;strong&gt;call tool to fetch Page A → call tool to fetch Page B → call tool to fetch Page C → call tool to generate summary → done.&lt;/strong&gt; If any fetch fails, compensate. If all fetches succeed, the workflow completes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's what can actually happen&lt;/strong&gt;: Page B returns a &lt;strong&gt;cached version from 2 months ago&lt;/strong&gt;. The agent doesn't know that; it just sees valid HTML. It processes the outdated pricing as current. The summary it generates is factually wrong in a way that could embarrass your sales team.&lt;/p&gt;

&lt;p&gt;Every step "&lt;strong&gt;succeeded&lt;/strong&gt;." The SAGA coordinator marks it complete. No compensation triggered. &lt;strong&gt;And your sales team walks into a meeting with incorrect competitive data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Harness addresses this at multiple layers&lt;/strong&gt;. &lt;strong&gt;Reflection&lt;/strong&gt; &lt;strong&gt;catches&lt;/strong&gt; that the &lt;strong&gt;retrieved content has anomalous&lt;/strong&gt; date markers. Evaluation validates whether the output meets the quality criteria defined for the task. &lt;strong&gt;Guardrails can flag when retrieved content falls below a freshness threshold&lt;/strong&gt;. &lt;strong&gt;Human-in-the-loop&lt;/strong&gt; escalation routes the uncertainty to a person rather than silently proceeding.&lt;/p&gt;
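&lt;p&gt;The freshness guardrail from this scenario can be sketched in a few lines. It assumes each fetched page carries a timezone-aware &lt;code&gt;last_modified&lt;/code&gt; timestamp (for example from an HTTP header or a date found on the page); the seven-day threshold and the &lt;code&gt;review_pages&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_modified, max_age, now=None):
    """True if the content's timestamp is within the allowed age."""
    now = now or datetime.now(timezone.utc)
    return max_age >= now - last_modified


def review_pages(pages, max_age=timedelta(days=7), now=None):
    """Split fetched pages into usable ones and ones needing escalation."""
    usable, escalate = [], []
    for page in pages:
        fresh = is_fresh(page["last_modified"], max_age, now)
        (usable if fresh else escalate).append(page["name"])
    return {"usable": usable, "needs_human_review": escalate}
```

&lt;p&gt;With a check like this, the two-month-old copy of Page B lands in &lt;code&gt;needs_human_review&lt;/code&gt; instead of flowing silently into the summary.&lt;/p&gt;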

&lt;p&gt;&lt;strong&gt;That's the gap. And it's not a small one.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Key Difference, Plainly Said
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SAGA&lt;/strong&gt; manages &lt;strong&gt;deterministic workflows&lt;/strong&gt;. &lt;strong&gt;Agent&lt;/strong&gt; Harness manages &lt;strong&gt;probabilistic workflows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9su9o1wxmabms1qlxun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9su9o1wxmabms1qlxun.png" alt=" " width="800" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In SAGA, failure modes are knowable at design time&lt;/strong&gt;. You write compensation logic once and trust it to cover the cases. In an Agent Harness, failure can mean: the tool returned a valid response that the agent misread. Or the agent completed every step correctly but arrived at a goal that doesn't satisfy what the user actually wanted. Or the agent is in a soft reasoning loop, &lt;strong&gt;re-checking the same condition because it's genuinely uncertain and nobody told it when to escalate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Handling that requires reflection, self-critique, goal validation, and graceful human escalation, none of which exist in the SAGA vocabulary, because SAGA was never designed for an execution unit that reasons about the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means If You're Building Today
&lt;/h2&gt;

&lt;p&gt;If you're designing an agentic system and you're thinking purely in SAGA terms, you're probably building something that's reliable at the infrastructure layer but brittle at the reasoning layer. Your agents will retry correctly. They'll compensate correctly. But they'll also confidently produce wrong answers, hallucinate tool results, and mark tasks complete that aren't — and your coordinator will have no way to know the difference.&lt;/p&gt;

&lt;p&gt;Agent Harness is the layer that closes that gap. It's not a replacement for orchestration. It sits above orchestration and asks: did we actually do the right thing, in the right way, within the right constraints, with the appropriate level of human oversight?&lt;/p&gt;

&lt;p&gt;The engineers who built SAGA were solving a genuinely hard distributed systems problem. The people building Agent Harness today are solving a harder version of it because the failure modes are less visible, the state is messier, and "success" is much harder to define when your execution unit is a language model reasoning about an open-ended goal.&lt;/p&gt;

&lt;p&gt;But the spirit is exactly the same: &lt;strong&gt;build systems that fail gracefully, recover intelligently, and complete what they started&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAGA made microservices reliable. Agent Harness is what makes AI agents reliable.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  One Question Worth Sitting With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Of all the Agent Harness components&lt;/strong&gt;, I've found that &lt;strong&gt;Reflection&lt;/strong&gt; &amp;amp; &lt;strong&gt;Critique&lt;/strong&gt; and &lt;strong&gt;Human-in-the-Loop&lt;/strong&gt; are the &lt;strong&gt;two&lt;/strong&gt; that teams &lt;strong&gt;most consistently underinvest&lt;/strong&gt; in, usually because they're harder to wire up than checkpointing or token monitoring, and the cost of skipping them isn't visible until something goes wrong in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which component do you find hardest to implement in practice, and how are you handling it?&lt;/strong&gt; I'm genuinely curious what patterns the community is landing on. Drop it in the comments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyzrd8344y3d2nhumj3j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyzrd8344y3d2nhumj3j.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AI Agents Are the New Microservices &amp; A2A Is Their HTTP(s)</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 29 May 2026 23:14:43 +0000</pubDate>
      <link>https://dev.to/sreeni5018/ai-agents-are-the-new-microservices-a2a-is-their-https-329g</link>
      <guid>https://dev.to/sreeni5018/ai-agents-are-the-new-microservices-a2a-is-their-https-329g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;As enterprises race to deploy generative AI apps and agents&lt;/strong&gt;, the hardest question isn't "&lt;strong&gt;which foundation model do we use?&lt;/strong&gt;" It's "how do they &lt;strong&gt;safely talk to each other?&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you spent the 2010s building distributed systems&lt;/strong&gt;, the architectural blueprints emerging for enterprise AI will feel strangely familiar. &lt;strong&gt;Bounded contexts&lt;/strong&gt;, &lt;strong&gt;service registries&lt;/strong&gt;, async message queues, and distributed tracing are all back. The vocabulary is almost identical, &lt;strong&gt;except our "services" now reason in natural language, call tools, and produce probabilistic, context-aware outputs instead of deterministic ones.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Agent-to-Agent (A2A) Protocol&lt;/strong&gt; is the open-standard transport and interface layer that makes this architectural analogy concrete. The protocol now has support from more than 150 organizations, including Salesforce, &lt;strong&gt;Microsoft, SAP, Workday, PayPal, and LangChain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just as &lt;strong&gt;HTTP/REST became the lingua franca of microservice&lt;/strong&gt; communication, A2A (now hosted under the Linux Foundation) standardizes how autonomous agents discover capabilities, delegate tasks, and maintain security boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining the Ecosystem: A2A vs. MCP
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2xjuo13y8o9s8aokp70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2xjuo13y8o9s8aokp70.png" alt=" " width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To design an enterprise multi-agent mesh&lt;/strong&gt;, you must first separate agent orchestration from tool execution. &lt;strong&gt;A common architectural anti-pattern is trying to force a single protocol to handle both.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;: This handles the &lt;strong&gt;Agent-to-Tool layer&lt;/strong&gt;. It standardizes how a single agent securely reads from local databases, hooks into enterprise storage, or accesses development environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent-to-Agent Protocol (A2A):&lt;/strong&gt; This handles the &lt;strong&gt;Agent-to-Agent layer&lt;/strong&gt;. It standardizes how separate, sovereign intelligent systems communicate with each other in their natural, semantic modalities (negotiating tasks, passing conversational state, or handing off workflows) across frameworks and lines of business.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key distinction:&lt;/strong&gt; MCP connects agents to tools (vertical integration). A2A connects agents to each other (horizontal integration). &lt;strong&gt;They are explicitly designed to be complementary&lt;/strong&gt;, not competitive. Together, they form the two-layer interoperability stack for modern multi-agent systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Under the Hood: How A2A Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi497xmpc75tj6209auo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmi497xmpc75tj6209auo.png" alt=" " width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before diving into communication styles, it helps to understand the technical foundation A2A is built on, because it deliberately avoids reinventing the wheel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A leverages well-established web technologies.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP/HTTPS&lt;/strong&gt; — primary transport layer (production deployments require HTTPS with modern TLS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON-RPC 2.0&lt;/strong&gt; — structured data exchange format for all requests and responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; — real-time, one-way streaming of updates from agent to client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every A2A agent publishes a small JSON document called an &lt;strong&gt;Agent Card&lt;/strong&gt;, typically served at &lt;strong&gt;/.well-known/agent.json.&lt;/strong&gt; This file lists the agent's identity, skills, endpoint URL, and authentication requirements — enabling zero-configuration discovery between agents without any proprietary registry or coordination layer.&lt;/p&gt;
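&lt;p&gt;To make the Agent Card idea concrete, here is a hedged sketch of reading one. The card contents are invented for illustration, and the field names (&lt;code&gt;name&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;capabilities&lt;/code&gt;, &lt;code&gt;skills&lt;/code&gt;) reflect my reading of the public A2A specification; check the spec version you target before relying on them. The helper functions are hypothetical.&lt;/p&gt;

```python
import json

# Illustrative Agent Card; the agent and every value in it are made up.
AGENT_CARD_JSON = """
{
  "name": "Finance Agent",
  "description": "Answers headcount budget questions",
  "url": "https://agents.example.com/finance",
  "version": "1.0.0",
  "capabilities": {"streaming": true, "pushNotifications": false},
  "skills": [
    {"id": "headcount-budget", "name": "Headcount budget",
     "description": "Computes remaining headcount budget per quarter"}
  ]
}
"""


def parse_card(raw):
    """Parse a card and fail early if the fields this sketch needs are absent."""
    card = json.loads(raw)
    missing = [key for key in ("name", "url", "skills") if key not in card]
    if missing:
        raise ValueError(f"agent card is missing fields: {missing}")
    return card


def has_skill(card, skill_id):
    """True if the card advertises a skill with the given id."""
    return any(skill["id"] == skill_id for skill in card["skills"])


def supports(card, capability):
    """True if the card declares the named optional capability."""
    return bool(card.get("capabilities", {}).get(capability, False))
```

&lt;p&gt;A calling agent would fetch the card over HTTPS first, then use checks like these to decide whether to delegate a task and which communication style the remote agent allows.&lt;/p&gt;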

&lt;p&gt;Security is baked in from the start. A2A incorporates enterprise-grade authentication and authorization mechanisms aligned with OpenAPI security schemes, including support for OAuth 2.0 and API keys passed via HTTP headers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four A2A Communication Styles
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pegidu9a736gvefglmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pegidu9a736gvefglmc.png" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The A2A standard defines clear execution modes that mirror the structural communication patterns distributed systems engineers have relied on for decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Synchronous (Blocking)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One agent sends a task and blocks its execution context until the responding agent returns a final artifact.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Analogy:&lt;/strong&gt; A standard REST call (GET /resource).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Use Case:&lt;/strong&gt; Fast, critical-path dependency queries, like an Orchestrator agent requesting a real-time risk compliance score before formatting a customer response.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Asynchronous (Non-Blocking)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One agent dispatches a task object and immediately returns to other processing. The remote agent queues the work and processes it in the background.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Analogy:&lt;/strong&gt; Message queues or event streams (Kafka, RabbitMQ).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Use Case:&lt;/strong&gt; Long-running cognitive tasks such as a Legal Agent reading a 400-page corporate acquisition contract or a Data Agent running complex batch classification.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Streaming&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Continuous data tokens or partial states flow dynamically between agents in real time, rather than waiting for a single completed payload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Analogy:&lt;/strong&gt; gRPC streaming or Server-Sent Events (SSE).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Use Case:&lt;/strong&gt; Real-time speech transcription agents feeding an analysis agent, or interactive multi-agent chat interfaces where UX requires instant token delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Push Notifications (Event-Driven)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;An agent registers a web callback or subscription, receiving a proactive alert only when a specific upstream event or state change occurs. When significant task state changes happen, such as &lt;strong&gt;completed&lt;/strong&gt;, &lt;strong&gt;failed&lt;/strong&gt;, or &lt;strong&gt;input-required&lt;/strong&gt;, the server sends an asynchronous HTTP POST notification to the client's provided &lt;strong&gt;webhook.&lt;/strong&gt; This requires the server to declare push notification capability in its Agent Card.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microservices Analogy:&lt;/strong&gt; Webhooks or an Event Bus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Use Case:&lt;/strong&gt; Event-driven governance, like an automated Compliance Agent waking up to audit a transaction only when an Account Agent drafts a contract exceeding $1M.&lt;/p&gt;

&lt;p&gt;Key Architectural Insight: A mature multi-agent enterprise system never forces a single interaction pattern. It builds a mesh that combines all four, leveraging an internal API gateway plane to manage traffic, route tasks, and handle fallback strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Critical Shift: From Deterministic to Semantic Interfaces
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0bmao16ta2vnqz1mtag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0bmao16ta2vnqz1mtag.png" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In traditional microservices, the API contract is strictly &lt;strong&gt;deterministic&lt;/strong&gt;: Send these exact bytes, receive those exact bytes.&lt;/p&gt;

&lt;p&gt;In a multi-agent network, the interface is &lt;strong&gt;semantic&lt;/strong&gt;: Send this intent, receive a reasoned response.&lt;/p&gt;

&lt;p&gt;Instead of maintaining brittle endpoints for every hyper-specific query variation, an agent uses its &lt;strong&gt;Agent Card to advertise&lt;/strong&gt; its overall "Skills" and expected structural input/output schemas. A Finance agent capable of calculating remaining Q3 headcount budgets does not require a new API endpoint deployment when business users slightly pivot the nuance of the request; it interprets the intent via the A2A task lifecycle.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;"beating heart"&lt;/strong&gt; of this lifecycle is the task's input-required state, which allows agents to pause execution mid-task and request further information &lt;strong&gt;from clients or other agents, something traditional REST APIs were simply never designed to do&lt;/strong&gt;. This makes agent conversations stateful and adaptive in a way that static microservice contracts are not.&lt;/p&gt;
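&lt;p&gt;The pause-and-ask behavior can be sketched as a tiny state machine. The state names &lt;code&gt;submitted&lt;/code&gt;, &lt;code&gt;input-required&lt;/code&gt;, and &lt;code&gt;completed&lt;/code&gt; mirror the task states mentioned in this article; the &lt;code&gt;Task&lt;/code&gt; class itself is a hypothetical simplification, not the protocol's actual object model.&lt;/p&gt;

```python
class Task:
    """Toy task that pauses until every required input has been supplied."""

    def __init__(self, required_fields):
        self.required_fields = list(required_fields)
        self.inputs = {}
        self.state = "submitted"

    def advance(self, **new_inputs):
        """Merge new inputs, then either pause for more or finish."""
        self.inputs.update(new_inputs)
        missing = [
            name for name in self.required_fields if name not in self.inputs
        ]
        self.state = "input-required" if missing else "completed"
        return missing  # what the caller still has to provide
```

&lt;p&gt;A client that receives &lt;code&gt;input-required&lt;/code&gt; answers with the missing details on the same task, rather than starting a new request from scratch as it would against a stateless REST endpoint.&lt;/p&gt;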

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;parallels&lt;/strong&gt; between the &lt;strong&gt;microservices&lt;/strong&gt; revolution of the 2010s and today's &lt;strong&gt;multi-agent AI ecosystem are not just cosmetic.&lt;/strong&gt; The same hard-won lessons around service discovery, security boundaries, async communication, and composable architecture are being relearned and encoded into open standards like A2A and MCP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A2A is an open standard that enables AI agents to discover&lt;/strong&gt;, communicate, and transact with each other across different frameworks, vendors, and platforms. MCP handles how each of those agents connects to its tools. Together they give architects a principled, two-layer model for building AI systems that are modular, interoperable, and production-ready.&lt;/p&gt;

&lt;p&gt;The momentum behind A2A, growing from 50 launch partners to 150+ organizations in under a year, underscores something simple: fragmentation in AI agent ecosystems is a problem the industry is collectively choosing to solve. For engineers building in this space today, the question is no longer whether these protocols matter. It's whether your architecture is ready for the systems around you that already use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>microservices</category>
    </item>
    <item>
      <title>The Agent Harness Taught Me Why I Used to Fail</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 28 May 2026 21:08:18 +0000</pubDate>
      <link>https://dev.to/sreeni5018/the-agent-harness-taught-mewhy-i-used-to-fail-39g1</link>
      <guid>https://dev.to/sreeni5018/the-agent-harness-taught-mewhy-i-used-to-fail-39g1</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;On building AI agents and accidentally understanding yourself&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;We tend to believe that &lt;strong&gt;intelligence is the ultimate differentiator&lt;/strong&gt;, that if we think clearly enough, know enough, and work hard enough, success follows. It's a comforting idea. It's also incomplete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I didn't fully understand that until I started building AI agents.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Specifically, it hit me while designing the Harness layer for a &lt;strong&gt;Digital Worker (AI Agent)&lt;/strong&gt;, the architectural component responsible for &lt;strong&gt;orchestrating&lt;/strong&gt; tasks, &lt;strong&gt;managing&lt;/strong&gt; priorities, &lt;strong&gt;regulating&lt;/strong&gt; execution, and keeping the agent coherent across complex, multi-step workflows. The &lt;strong&gt;Harness isn't the brain. It isn't the memory. It's the discipline layer, the scaffolding that ensures raw capability actually translates into reliable output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And as I built it, I kept thinking: how many times in my own life did I have the intelligence, the knowledge, even the opportunity, and still fall short?&lt;/p&gt;

&lt;p&gt;Not because I wasn't capable. But because I lacked exactly what the Harness provides: orchestration, prioritization, emotional balance, structured execution, and the feedback loops to course-correct in real time.&lt;/p&gt;

&lt;p&gt;This blog is &lt;strong&gt;part technical exploration, part honest reflection.&lt;/strong&gt; Whether you are an engineer building intelligent systems, a leader navigating complexity, or simply someone trying to understand why effort alone doesn't always produce results, the architecture of an AI agent has something surprising to say about the architecture of a human being.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The gap between potential and performance, in agents and in people, isn't usually about intelligence. It's about what holds everything together.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is the Agent Harness and Why Does It Matter?&lt;/strong&gt;&lt;br&gt;
When most people discuss AI agents, the conversation gravitates toward the model, the memory, or the tools. These are the visible, exciting components: the intelligence, the knowledge base, the capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;But the Harness layer is the real operational backbone.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It orchestrates tasks, manages priorities, controls execution flow, handles failures gracefully, applies guardrails, maintains context across long-running workflows, and prevents the agent from spiraling into chaos or stalling indefinitely. It is the operational nervous system that connects intelligence to consistent, reliable action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without a Harness, even the most capable AI agent becomes unpredictable.&lt;/strong&gt; It may perform brilliantly in controlled settings and collapse the moment conditions become complex, ambiguous, or adversarial. The model stays sharp. But the system breaks down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;That distinction between raw capability and disciplined execution is exactly what I want to explore here.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Personal Parallel
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Moment It Got Personal&lt;/strong&gt;&lt;br&gt;
While designing the Harness, something clicked that went beyond systems architecture.&lt;/p&gt;

&lt;p&gt;Many times in my life, &lt;strong&gt;I didn't fail because I lacked intelligence&lt;/strong&gt;, talent, or technical knowledge. I failed because I lacked orchestration. &lt;strong&gt;Clear prioritization. Emotional regulation. Structured execution. Feedback loops. Consistency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The same things that break AI agents in production.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;That realization hit me harder than any architecture diagram ever could.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We often assume success comes purely from reasoning ability or memory, both in humans and in AI.&lt;/strong&gt; But real-world execution depends on something deeper. Something that doesn't show up on a résumé or a benchmark score.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Principles
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr55m4bba9qnaz16fmbbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr55m4bba9qnaz16fmbbt.png" alt=" " width="394" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Six Things That Break Agents and People&lt;br&gt;
&lt;strong&gt;Whether we are talking about enterprise AI systems or individual human performance, the failure points are strikingly similar.&lt;/strong&gt; Real-world execution demands all six of these, and notably, &lt;strong&gt;four of them map directly to the core components of the Agent Harness.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Managing Overload [Context]
&lt;/h2&gt;

&lt;p&gt;Knowing what is relevant now without drowning in everything at once. Context overload collapses both agents and people; the &lt;strong&gt;harness enforces what stays in scope.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Using the Right Capability [Tool]
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Knowing which tool, skill, or resource to deploy and when&lt;/strong&gt;. Raw access to capabilities means nothing without the judgment to use them correctly under pressure.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Recovering from Failure [Loop]
&lt;/h2&gt;

&lt;p&gt;Completing &lt;strong&gt;feedback loops&lt;/strong&gt;: detecting &lt;strong&gt;what went wrong, adjusting&lt;/strong&gt;, and trying again. Without loops, &lt;strong&gt;both agents and people keep repeating the same mistakes.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Staying Within Bounds [Governance]
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Applying guardrails that prevent drift: ethical, operational, and behavioral.&lt;/strong&gt; Governance is not a constraint on performance; it is the condition for trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Prioritization
&lt;/h2&gt;

&lt;p&gt;Knowing what matters now versus later. Without clear prioritization, effort gets scattered, urgency becomes noise, and the most important things rarely get done.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Repeatable Execution
&lt;/h2&gt;

&lt;p&gt;Building patterns that hold up consistently, not just when conditions are ideal. Discipline is what turns one-time performance into reliable delivery over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;These are not soft skills.&lt;/strong&gt; They are not secondary concerns. In &lt;strong&gt;production AI systems&lt;/strong&gt;, failing at any one of these causes real operational breakdowns. &lt;strong&gt;And in life, the story is no different.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Broader Reflection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;What Software Engineering Quietly Teaches You&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The strange thing about software engineering is that if you stay in it long enough, it reshapes how you think about yourself, slowly and without announcement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building distributed systems teaches patience.&lt;/strong&gt; You learn that complex things fail in non-obvious ways, that the answer is rarely where you first looked, and that premature conclusions are more dangerous than no conclusion at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging teaches humility.&lt;/strong&gt; Every session is a reminder that your mental model of reality is incomplete. The bug isn't in the code; it's in the assumption you forgot you were making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Designing AI agents teaches self-awareness.&lt;/strong&gt; Because you are not just &lt;strong&gt;modeling intelligence.&lt;/strong&gt; You are modeling the entire operating system of a functioning entity: how it perceives, decides, acts, recovers, and adapts. And somewhere in that process, you start to see yourself reflected back.&lt;/p&gt;

&lt;p&gt;The Agentic AI systems we build are not mirrors. But they are close enough to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;I Wasn't Just Building a Control Layer for an AI&lt;br&gt;
Maybe that is why designing the Agent Harness feels so strangely personal.&lt;/p&gt;

&lt;p&gt;I wasn't just architecting a component that manages &lt;strong&gt;workflow state, enforces guardrails, and ensures execution coherence.&lt;/strong&gt; I was finally articulating something I had lived through but never quite named: the difference between having capability and having the structure to deploy it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Harness doesn't make an agent smarter.&lt;/strong&gt; It makes the agent's &lt;strong&gt;intelligence&lt;/strong&gt; &lt;strong&gt;usable&lt;/strong&gt;, &lt;strong&gt;consistent&lt;/strong&gt;, and &lt;strong&gt;trustworthy&lt;/strong&gt; under real-world pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is what personal growth looks like too.&lt;/strong&gt; Not acquiring more intelligence. Not gathering more memory or more tools. But building the internal structure that allows everything you already have to work together, consistently, under pressure, over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The deeper I go into Agentic AI&lt;/strong&gt;, the more I believe this: the most important breakthroughs are not always about capability. Sometimes, they are about architecture.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Intelligence without orchestration is potential without performance. The harness is not a constraint; it is the condition for everything else to work.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I started this by adding a &lt;strong&gt;Harness to an AI agent&lt;/strong&gt;.&lt;br&gt;
&lt;strong&gt;I ended it wondering who's going to add one to me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Transformers &amp; Agile Sprints: The Art of Incremental Evolution</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Wed, 27 May 2026 21:04:09 +0000</pubDate>
      <link>https://dev.to/sreeni5018/transformers-agile-sprints-the-art-of-incremental-evolution-3411</link>
      <guid>https://dev.to/sreeni5018/transformers-agile-sprints-the-art-of-incremental-evolution-3411</guid>
      <description>&lt;p&gt;Ever wonder why &lt;strong&gt;Transformer models are so incredibly effective at scaling?&lt;/strong&gt; It turns out they share a fundamental philosophy with modern software engineering: &lt;strong&gt;they never build from scratch.&lt;/strong&gt; In machine learning, &lt;strong&gt;Residual Connections&lt;/strong&gt; (or skip connections) act as an information bridge. Instead of forcing a neural network to completely reinvent its intelligence at every single layer, the model simply &lt;em&gt;adds&lt;/em&gt; new insights to what it already knows. It preserves the foundational knowledge, preventing data from degrading as it goes deeper.&lt;/p&gt;

&lt;p&gt;Sound familiar? That is exactly how high-performing &lt;strong&gt;Agile teams&lt;/strong&gt; operate.&lt;/p&gt;

&lt;p&gt;Instead of waiting for a single, massive &lt;strong&gt;"grand plan"&lt;/strong&gt; &lt;strong&gt;release&lt;/strong&gt;, Agile teams enhance a working product sprint by sprint. You deliver value incrementally, gather feedback, and iterate without tearing down the core infrastructure you already built.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoi6mxca1ay95a2z3f5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoi6mxca1ay95a2z3f5k.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🧠 Deep Dive: How Residual Connections Save Deep Transformers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;To truly appreciate this parallel, look at what happens inside the Transformer architecture.&lt;/strong&gt; As models grow to dozens or hundreds of layers, they face two massive technical hurdles: &lt;strong&gt;Vanishing Gradients&lt;/strong&gt; and &lt;strong&gt;Information Degradation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without residual connections, the raw input signal gets warped and lost the deeper it travels through &lt;strong&gt;self-attention&lt;/strong&gt; and &lt;strong&gt;feed-forward networks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Residual connections solve this by changing the fundamental mathematical objective of a layer. Instead of forcing a layer to learn an entirely new mapping $H(x)$, the layer only has to learn a residual mapping &lt;strong&gt;$F(x) = H(x) - x$.&lt;/strong&gt; The final output of the block becomes:&lt;/p&gt;

&lt;h2&gt;
  
  
  $$\text{Output} = F(x) + x$$
&lt;/h2&gt;

&lt;p&gt;By adding the original input $x$ directly to the output of the sub-layer, Transformers gain two massive engineering advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unobstructed Gradient Flow:&lt;/strong&gt; During &lt;strong&gt;backpropagation&lt;/strong&gt;, the gradient can flow directly through the skip connection without being altered or diminished by the layer's weights. This greatly reduces the vanishing gradient problem, which is a large part of why we can train models with hundreds of layers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Feature Preservation:&lt;/strong&gt; The identity shortcut ensures that the core semantic meaning established in early layers isn't corrupted or forgotten by complex attention calculations later in the stack.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
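
&lt;p&gt;To make the gradient argument concrete, here is a minimal sketch in plain Python. It is a toy, not a real Transformer: I am assuming each sub-layer is just a scalar function $F(x) = w \cdot x$, and the names &lt;code&gt;sublayer&lt;/code&gt;, &lt;code&gt;plain_gradient&lt;/code&gt;, and &lt;code&gt;residual_gradient&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
# Toy illustration (an assumption for teaching, not a real Transformer layer):
# every "sub-layer" is the scalar function F(x) = w * x with a small weight w.

def sublayer(x, w):
    """Stand-in for attention or feed-forward: F(x) = w * x."""
    return w * x

def plain_stack(x, weights):
    """No skip connections: each layer replaces its input, x = F(x)."""
    for w in weights:
        x = sublayer(x, w)
    return x

def residual_stack(x, weights):
    """With skip connections: each layer adds to its input, x = F(x) + x."""
    for w in weights:
        x = sublayer(x, w) + x
    return x

def plain_gradient(weights):
    """d(output)/d(input) for the plain stack: the product of all w."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad

def residual_gradient(weights):
    """d(output)/d(input) for the residual stack: the product of all (w + 1)."""
    grad = 1.0
    for w in weights:
        grad *= (w + 1.0)
    return grad

weights = [0.1] * 50  # 50 layers, each with a small weight

print(plain_gradient(weights))     # vanishes: on the order of 1e-50
print(residual_gradient(weights))  # survives: roughly 117
```

&lt;p&gt;The "+ 1" in every factor is the identity shortcut at work: even when a layer's own contribution is tiny, the gradient still has a direct path back to the input.&lt;/p&gt;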

&lt;h2&gt;
  
  
  The Core Parallel
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Layer vs. The Sprint:&lt;/strong&gt; A neural network layer computes incremental feature adjustments ($F(x)$) while maintaining the input foundation ($x$); an Agile sprint delivers incremental feature updates while maintaining the stable application baseline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Foundation:&lt;/strong&gt; Residual connections carry each layer's input forward unchanged so deep networks don't lose the original signal. Agile version control and MVP architecture ensure teams don't lose sight of the core product value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Goal:&lt;/strong&gt; Both systems leverage previous successes to achieve complex, sophisticated outcomes faster and with less risk of systemic failure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stop trying to reinvent the wheel at every stage of development, whether you are training a multi-billion-parameter model or leading a cross-functional engineering team. Build the foundation, protect it, and iterate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Your LLM Is Not an Agent. Your Framework Is Not Enough. You Need a Harness.</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 25 May 2026 06:20:59 +0000</pubDate>
      <link>https://dev.to/sreeni5018/your-llm-is-not-an-agent-your-framework-is-not-enough-you-need-a-harness-321j</link>
      <guid>https://dev.to/sreeni5018/your-llm-is-not-an-agent-your-framework-is-not-enough-you-need-a-harness-321j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Every team building with AI agents hits the same wall.&lt;/strong&gt; The demo works beautifully. The agent answers questions, calls tools, and produces results. Then you ship it and the cracks appear: it loses track of what it was doing, &lt;strong&gt;burns through API calls in circles&lt;/strong&gt;, ignores boundaries it should respect, and &lt;strong&gt;forgets context from five minutes ago.&lt;/strong&gt; Users lose trust. Engineers lose sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is not a model problem.&lt;/strong&gt; The LLM is capable. It's an infrastructure problem. The agent has a brain but no operating environment: no structured loop to run in, no memory to draw on, no rules to constrain it, no way to resume where it left off. You gave it intelligence without giving it a way to apply that intelligence reliably.&lt;/p&gt;

&lt;p&gt;That operating environment is called a &lt;strong&gt;Harness&lt;/strong&gt;. And it's what separates a &lt;strong&gt;demo agent from one you'd actually trust in production.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What breaks without a harness
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🔁 Infinite loops or premature stops.&lt;/strong&gt; The agent has no governing loop, so it either runs forever or halts before the task is done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🧠 Context amnesia.&lt;/strong&gt; Long tasks overflow the context window. The agent loses the thread and starts hallucinating or repeating itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💾 No memory between sessions.&lt;/strong&gt; Every conversation starts from zero. Multi-step, multi-day workflows are impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔧 Tool failures cascade.&lt;/strong&gt; One flaky API brings the whole agent down because there's no error handling layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚨 No guardrails.&lt;/strong&gt; The agent touches systems it should not.&lt;/p&gt;

&lt;h2&gt;
  
  
  You're Already Using the Pieces. A Harness Is How You Make Them Work Together.
&lt;/h2&gt;

&lt;p&gt;If you've been building AI agents for a while, you know the drill. You pick a framework (&lt;strong&gt;CrewAI&lt;/strong&gt;, &lt;strong&gt;LangGraph&lt;/strong&gt;, &lt;strong&gt;Strands&lt;/strong&gt;, or &lt;strong&gt;Microsoft Agent Framework&lt;/strong&gt;) and you start wiring things up. You &lt;strong&gt;add memory so the agent remembers things.&lt;/strong&gt; &lt;strong&gt;You register tools so it can take actions.&lt;/strong&gt; You configure guardrails so it doesn't go off the rails. You set up a loop so it keeps working until the task is done.&lt;/p&gt;

&lt;p&gt;And it works. Mostly. In development, in demos, in controlled tests.&lt;/p&gt;

&lt;p&gt;Then you put it in front of real users, with real tasks, over real time, and you start seeing the cracks. The agent forgets things it shouldn't. &lt;strong&gt;It handles a task perfectly on Monday and fumbles the same task on Thursday.&lt;/strong&gt; Two similar agents behave inconsistently. A tool fails and the whole run degrades silently. You added all the right pieces, but somehow the whole is less than the sum of its parts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;This is the problem a harness solves. And here's the key thing to understand.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The core idea
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A harness doesn't replace your framework.&lt;/strong&gt; You're not choosing between them. Your &lt;strong&gt;framework gives you the ingredients&lt;/strong&gt;: memory, tools, loops, guardrails. The harness is the &lt;strong&gt;recipe&lt;/strong&gt;: the deliberate architectural decisions about &lt;strong&gt;how those ingredients&lt;/strong&gt; are &lt;strong&gt;assembled&lt;/strong&gt;, coordinated, and governed so your agent behaves consistently every single time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it like building a house.&lt;/strong&gt; The framework is lumber, concrete, wiring, plumbing: everything you need. &lt;strong&gt;The harness is the blueprint&lt;/strong&gt; and the construction process: which material goes where, in what order, connected how, inspected by whom. Without a blueprint, you might still end up with a structure. But it probably won't hold up when the weather turns.&lt;/p&gt;

&lt;h2&gt;
  
  
  The PM &amp;amp; Developer Analogy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ksawk8x827lr6c2dufn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ksawk8x827lr6c2dufn.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a mental model that makes this concrete. &lt;strong&gt;In a software team,&lt;/strong&gt; a &lt;strong&gt;Product Manager writes a story. It has context, a clear task, acceptance criteria, and scope boundaries.&lt;/strong&gt; A Developer picks it up and delivers it. But the developer doesn't just start typing; they follow a process. They use version control, a build system, coding standards, and a defined way to ask for help or escalate a blocker. &lt;strong&gt;That process is what makes delivery reliable, not just the developer's raw talent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now replace the developer with an AI Agent.&lt;/strong&gt; The PM's story is the &lt;strong&gt;task prompt.&lt;/strong&gt; The &lt;strong&gt;agent is the developer.&lt;/strong&gt; The &lt;strong&gt;harness is the process&lt;/strong&gt;: the structured operating environment that governs how the agent reads the story, uses its tools, manages its memory, escalates when stuck, and knows when it's truly done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf8z8wwcsc98quexydxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf8z8wwcsc98quexydxu.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The framework puts the tools in the developer's hands. The harness defines how the developer uses them: consistently, safely, and with the right behavior for each situation.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework vs. Harness: Ingredient vs. Recipe
&lt;/h2&gt;

&lt;p&gt;Here's where most explanations go wrong: they imply frameworks are incomplete or that you shouldn't use them. That's backwards. Frameworks are excellent. They just operate at a different layer than a harness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6fydfkm2d14a34wchkt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6fydfkm2d14a34wchkt.png" alt=" " width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can have every framework primitive&lt;/strong&gt; in place and still have an unreliable agent because nobody made the architectural decisions about how they work together. &lt;strong&gt;That's the gap the harness fills.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decisions a Harness Makes
&lt;/h2&gt;

&lt;p&gt;Every harness, whether you've named it that or not, makes the architectural decisions below. Here's what each one actually means, and why it's a decision rather than just a feature you turn on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Thinking Loop: Not just running, but knowing when to stop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Every framework gives you a loop.&lt;/strong&gt; The &lt;strong&gt;harness decides&lt;/strong&gt; the rules of that loop: what counts as "&lt;strong&gt;done&lt;/strong&gt;," &lt;strong&gt;how many iterations&lt;/strong&gt; are too many, how to detect when the agent is stuck in circles, and when to break out and surface an error. Without these rules, your loop either exits too early or runs until your API bill catches fire.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: the loop mechanism. Harness decides: the exit conditions, the stuck-detection logic, the iteration limits.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
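
&lt;p&gt;Here is a rough sketch of what those loop rules might look like in plain Python. Everything in it is an assumption for illustration: &lt;code&gt;step_fn&lt;/code&gt; and &lt;code&gt;is_done&lt;/code&gt; are hypothetical callables you would supply, and "the same action three times in a row" is just one simple way to detect a stuck agent.&lt;/p&gt;

```python
def run_loop(step_fn, is_done, max_iterations=10, repeat_limit=3):
    """Run an agent loop with explicit exit rules.

    step_fn(history) returns the agent's next action (any hashable value).
    is_done(action) returns True when the task is complete.
    Both are hypothetical callables supplied by the caller.
    """
    history = []
    for _ in range(max_iterations):
        action = step_fn(history)
        history.append(action)
        if is_done(action):
            return {"status": "done", "history": history}
        # Stuck detection: the same action repeat_limit times in a row.
        recent = history[-repeat_limit:]
        if len(recent) == repeat_limit and len(set(recent)) == 1:
            return {"status": "stuck", "history": history}
    # Iteration limit: stop and surface the problem instead of running forever.
    return {"status": "iteration_limit", "history": history}


# A fake agent that keeps issuing the same search and never answers.
result = run_loop(
    step_fn=lambda history: "search:weather",
    is_done=lambda action: action == "answer",
)
print(result["status"])  # stuck
```

&lt;p&gt;The point is not this particular heuristic. It is that "done", "stuck", and "too many iterations" are three separate outcomes the caller can act on.&lt;/p&gt;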

&lt;h2&gt;
  
  
  The Working Memory: Not just storing, but knowing what to keep
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Context management&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A context window is finite.&lt;/strong&gt; As a task runs across many turns, old information competes with new information for that space. The harness makes the call: what gets summarized, what gets evicted, what always stays, and in what priority order. Without this policy, long tasks gradually degrade as the agent's window fills with stale or low-priority content.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: the context window. Harness decides: what lives in it at each point in the task lifecycle.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
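
&lt;p&gt;As a small illustration of such a policy, the sketch below pins some items, then fills the remaining budget by priority and evicts the rest. The item fields (&lt;code&gt;text&lt;/code&gt;, &lt;code&gt;priority&lt;/code&gt;, &lt;code&gt;pinned&lt;/code&gt;) and the word-count token estimate are my own simplifying assumptions, and summarization is left out to keep it short.&lt;/p&gt;

```python
def estimate_tokens(text):
    """Crude token estimate (assumption: one token per whitespace-separated word)."""
    return len(text.split())

def fit_context(items, budget):
    """Keep pinned items, then fill the remaining budget by priority.

    Each item is a dict with 'text', 'priority' (higher = more important),
    and 'pinned' (True means it always stays). Items that do not fit are evicted.
    """
    pinned = [i for i in items if i["pinned"]]
    optional = sorted(
        (i for i in items if not i["pinned"]),
        key=lambda i: i["priority"],
        reverse=True,
    )
    kept_ids = {id(i) for i in pinned}
    used = sum(estimate_tokens(i["text"]) for i in pinned)
    for item in optional:
        cost = estimate_tokens(item["text"])
        if budget - used - cost >= 0:
            kept_ids.add(id(item))
            used += cost
    # Preserve the original ordering of whatever survived.
    return [i for i in items if id(i) in kept_ids]


items = [
    {"text": "You are a billing assistant.", "priority": 0, "pinned": True},
    {"text": "User asked about invoice 42 yesterday.", "priority": 1, "pinned": False},
    {"text": "Current task: refund invoice 42.", "priority": 5, "pinned": False},
]
context = fit_context(items, budget=10)
print([i["text"] for i in context])
# The pinned instruction and the current task fit; the older note is evicted.
```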

&lt;h2&gt;
  
  
  The Toolbox: Not just available, but governed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skills &amp;amp; Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Registering a tool in your framework makes it available. The harness decides which tools this specific agent, running this specific task, is actually allowed to use, and what happens when a tool fails. Retry? Fall back to a different tool? Surface an error? Carry on? Each of these is a deliberate decision, and making them ad hoc leads to inconsistent behavior.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: tool registration. Harness decides: tool authorization, retry logic, fallback strategy, failure handling.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
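
&lt;p&gt;One way to picture this is a single wrapper that every tool call goes through. The sketch below is hypothetical: &lt;code&gt;call_tool&lt;/code&gt;, the &lt;code&gt;tools&lt;/code&gt; registry, and the &lt;code&gt;fallbacks&lt;/code&gt; mapping are names I invented, not the API of any framework mentioned above. It assumes fallbacks do not form a cycle.&lt;/p&gt;

```python
def call_tool(name, args, tools, allowed, retries=2, fallbacks=None):
    """Call a tool under a simple policy: authorize, retry, then fall back.

    tools: dict mapping tool name to a callable (hypothetical registry).
    allowed: set of tool names this agent may use for this task.
    fallbacks: optional dict mapping a tool name to a backup tool name.
    """
    if name not in allowed:
        raise PermissionError(f"tool '{name}' is not authorized for this task")
    last_error = None
    for _ in range(retries + 1):
        try:
            return tools[name](**args)
        except Exception as exc:  # broad on purpose: any tool failure counts
            last_error = exc
    backup = (fallbacks or {}).get(name)
    if backup is not None:
        # The backup goes through the same authorization and retry rules.
        return call_tool(backup, args, tools, allowed, retries, fallbacks)
    raise RuntimeError(f"tool '{name}' failed after {retries + 1} attempts") from last_error


def flaky_search(query):
    raise TimeoutError("upstream timed out")

def cached_search(query):
    return f"cached results for {query}"

tools = {"search": flaky_search, "cached_search": cached_search}
result = call_tool(
    "search",
    {"query": "refund policy"},
    tools,
    allowed={"search", "cached_search"},
    fallbacks={"search": "cached_search"},
)
print(result)  # cached results for refund policy
```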

&lt;h2&gt;
  
  
  The Team: Not just spawning, but coordinating
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sub-agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multi-agent frameworks let you spawn sub-agents. &lt;strong&gt;The harness defines how work gets divided, which sub-agent gets what&lt;/strong&gt;, how their outputs are validated, and how the results are stitched back together. Without this, you end up with agents doing overlapping work, producing conflicting results, or silently dropping pieces of the task.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: sub-agent communication primitives. Harness decides: delegation strategy, output validation, result merging logic.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
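
&lt;p&gt;A minimal sketch of that coordination, under heavy assumptions: sub-agents are plain Python callables, the task is already split into named parts, and &lt;code&gt;delegate&lt;/code&gt; and &lt;code&gt;validate&lt;/code&gt; are hypothetical names. What matters is that nothing is dropped silently.&lt;/p&gt;

```python
def delegate(task_parts, workers, validate):
    """Send each part to its assigned worker, validate, and merge.

    task_parts: dict mapping part name to its input.
    workers: dict mapping part name to a callable (hypothetical sub-agents).
    validate: callable returning True if a sub-agent's output is acceptable.
    Returns (results, rejected) so unfinished work is always reported.
    """
    results, rejected = {}, []
    for part, payload in task_parts.items():
        if part not in workers:
            rejected.append(part)  # no owner for this part: report it
            continue
        output = workers[part](payload)
        if validate(output):
            results[part] = output
        else:
            rejected.append(part)  # output failed validation: report it
    return results, rejected


workers = {
    "summary": lambda text: text[:10],
    "title": lambda text: "",  # a sub-agent that returns nothing useful
}
results, rejected = delegate(
    {"summary": "A long report body", "title": "A long report body", "tags": "x"},
    workers,
    validate=lambda output: bool(output),
)
print(results)   # {'summary': 'A long rep'}
print(rejected)  # ['title', 'tags']
```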

&lt;h2&gt;
  
  
  The Standard Library: Capabilities every agent gets for free
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Built-in skills&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Some capabilities (file reads, HTTP calls, date parsing, writing to memory) are so universal that every agent needs them, and no agent should be writing boilerplate to get them.&lt;/strong&gt; The harness bakes these in as defaults. Every agent inherits them, they behave consistently, and they're tested once rather than reimplemented per agent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: the ability to add tools. Harness decides: which tools are universal defaults across every agent you build.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Long-Term Memory: Not just remembering, but knowing what's worth remembering
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session persistence&lt;/strong&gt;&lt;br&gt;
Frameworks give you a persistent store. The harness defines the policy around it: what gets written to &lt;strong&gt;long-term memory, when, in what format, and how it gets retrieved and surfaced in future sessions.&lt;/strong&gt; A poorly designed persistence policy is almost worse than none: your agent retrieves irrelevant old context and lets it pollute fresh tasks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: the storage layer. Harness decides: write policy, retrieval strategy, relevance scoring, session restoration logic.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
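
&lt;p&gt;To show what a retrieval policy can look like, here is a deliberately naive sketch. The word-overlap score is my own stand-in for real relevance scoring (a production system would more likely use embeddings), and &lt;code&gt;relevance&lt;/code&gt;, &lt;code&gt;retrieve&lt;/code&gt;, and &lt;code&gt;min_score&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
def relevance(query, note):
    """Naive relevance score: fraction of query words that appear in the note."""
    query_words = set(query.lower().split())
    note_words = set(note.lower().split())
    if not query_words:
        return 0.0
    return len(query_words.intersection(note_words)) / len(query_words)

def retrieve(store, query, top_k=2, min_score=0.5):
    """Return up to top_k notes whose score meets min_score, best first.

    The threshold is the policy: notes below it never reach the prompt,
    so stale, unrelated memories cannot pollute a fresh task.
    """
    scored = [(relevance(query, note), note) for note in store]
    relevant = [pair for pair in scored if pair[0] >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [note for _, note in relevant[:top_k]]


store = [
    "customer prefers email contact",
    "invoice 42 was refunded in May",
    "office wifi password rotated",
]
print(retrieve(store, "refund status invoice 42"))
# ['invoice 42 was refunded in May']
```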

&lt;h2&gt;
  
  
  The Briefing: Assembling the right instructions at the right moment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;System prompt assembly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most developers write a system prompt once and leave it static. But a static prompt is a blunt instrument. &lt;strong&gt;The harness assembles it dynamically at runtime, composing the base instructions, the current task, the available tools, the relevant memory, and any user or role-specific context into one coherent briefing. Same agent, different context, different briefing.&lt;/strong&gt; This alone is one of the biggest levers on agent quality.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: a system prompt field. Harness decides: what goes in it, dynamically, based on task and state.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
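
&lt;p&gt;In code, dynamic assembly can be as plain as a function that builds the prompt from parts. This is only a sketch: the section titles, their order, and the &lt;code&gt;assemble_briefing&lt;/code&gt; name are assumptions of mine, not a standard layout.&lt;/p&gt;

```python
def assemble_briefing(base, task, tools, memories, role=None):
    """Compose a system prompt from parts at runtime.

    All section names and the layout are illustrative assumptions.
    Empty sections are skipped so the prompt never carries dead weight.
    """
    sections = [base.strip(), f"Current task:\n{task.strip()}"]
    if tools:
        sections.append("Available tools:\n" + "\n".join(f"- {t}" for t in sorted(tools)))
    if memories:
        sections.append("Relevant memory:\n" + "\n".join(f"- {m}" for m in memories))
    if role:
        sections.append(f"User role: {role}")
    return "\n\n".join(sections)


prompt = assemble_briefing(
    base="You are a billing assistant.",
    task="Refund invoice 42.",
    tools={"refund", "lookup_invoice"},
    memories=[],
    role="support agent",
)
print(prompt)
```

&lt;p&gt;Call it again with a different task, tool set, or role and the same agent gets a different briefing.&lt;/p&gt;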

&lt;h2&gt;
  
  
  The Audit Trail: Every action, logged and explainable
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lifecycle hooks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lifecycle hooks exist in most frameworks as extension points. The harness is the thing that actually wires them up into a coherent observability strategy: &lt;strong&gt;logging every tool call&lt;/strong&gt;, tracking cost per run, &lt;strong&gt;catching errors before they cascade&lt;/strong&gt;, and giving you an answer to "what exactly did this agent do and why" for any given task. Without this wiring, you're flying blind.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: hook attachment points. Harness decides: what gets logged, measured, alerted on, and how errors propagate.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
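
&lt;p&gt;As a sketch of that wiring, the class below wraps any tool so every call leaves a record. It stands in for whatever hook mechanism your framework provides; &lt;code&gt;AuditTrail&lt;/code&gt;, its record fields, and the flat &lt;code&gt;cost_per_call&lt;/code&gt; are all assumptions for illustration.&lt;/p&gt;

```python
import time

class AuditTrail:
    """Collects one record per tool call (a sketch; field names are assumptions)."""

    def __init__(self):
        self.records = []

    def wrap(self, name, fn, cost_per_call=0.0):
        """Return a version of fn that logs inputs, outcome, duration, and cost."""
        def wrapped(*args, **kwargs):
            record = {"tool": name, "args": args, "kwargs": kwargs, "cost": cost_per_call}
            start = time.perf_counter()
            try:
                record["result"] = fn(*args, **kwargs)
                record["status"] = "ok"
                return record["result"]
            except Exception as exc:
                record["status"] = "error"
                record["error"] = repr(exc)
                raise  # re-raise so the caller decides how the error propagates
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                record["seconds"] = time.perf_counter() - start
                self.records.append(record)
        return wrapped

    def total_cost(self):
        return sum(r["cost"] for r in self.records)


trail = AuditTrail()
add = trail.wrap("add", lambda a, b: a + b, cost_per_call=0.01)
add(2, 3)
print(trail.records[0]["status"], trail.total_cost())  # ok 0.01
```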

&lt;h2&gt;
  
  
  The Guardrails: Not just checking, but enforcing consistently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Permissions &amp;amp; Safety&lt;/strong&gt;&lt;br&gt;
Frameworks give you input and output guardrail hooks. The harness defines the actual safety policy: which actions require human approval, what the agent is never allowed to do regardless of instructions, &lt;strong&gt;how prompt injection attempts are handled&lt;/strong&gt;, and what happens when a guardrail fires. &lt;strong&gt;Guardrail hooks without a coherent policy are checkboxes without consequences.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Framework gives you: the validation hooks. Harness decides: the safety rules, authorization boundaries, and human-in-the-loop triggers.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
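
&lt;p&gt;A policy like this can start very small. In the sketch below the action names and the three outcomes (&lt;code&gt;allow&lt;/code&gt;, &lt;code&gt;block&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;) are made up for illustration. It covers only action-level rules, not prompt-injection handling.&lt;/p&gt;

```python
# Hypothetical policy tables: the action names are examples only.
FORBIDDEN = {"delete_database"}                  # never allowed, whatever the prompt says
NEEDS_APPROVAL = {"send_payment", "send_email"}  # require a human decision first

def check_action(action, approved_by_human=False):
    """Return 'allow', 'block', or 'escalate' for a proposed action name."""
    if action in FORBIDDEN:
        return "block"  # human approval does not override a hard rule
    if action in NEEDS_APPROVAL and not approved_by_human:
        return "escalate"
    return "allow"


print(check_action("read_invoice"))                            # allow
print(check_action("send_payment"))                            # escalate
print(check_action("send_payment", approved_by_human=True))    # allow
print(check_action("delete_database", approved_by_human=True)) # block
```

&lt;p&gt;Because every agent calls the same function, the rule fires the same way everywhere, which is what turns a hook into a policy.&lt;/p&gt;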

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzl0hyu148oxhp8h2lhs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzl0hyu148oxhp8h2lhs.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;You're not choosing between a framework and a harness. You need both. The framework is your team's toolkit. The harness is how your team actually works: the process, the standards, the rules of the road that make the toolkit produce consistent results.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Every team building production AI agents is making harness decisions, whether they call it that or not.&lt;/strong&gt; Some make them deliberately, document them, and enforce them consistently. Others make them ad hoc, per agent, per developer, and wonder why their agents behave differently across tasks, sessions, and users. The harness is just the name for doing it deliberately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How My Career Evolved Like an AI (LLM Architectures) System</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 22 May 2026 07:20:54 +0000</pubDate>
      <link>https://dev.to/sreeni5018/my-journeymy-ai-architecture-125l</link>
      <guid>https://dev.to/sreeni5018/my-journeymy-ai-architecture-125l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;What if every stage of your life mapped precisely onto one of the three LLM architectures? Here's how I lived through each one.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I've spent years studying how AI systems learn&lt;/strong&gt;, represent knowledge, and &lt;strong&gt;generate outputs&lt;/strong&gt;. But it wasn't until I sat back and looked at &lt;strong&gt;my own life that something clicked&lt;/strong&gt;. I've been &lt;strong&gt;living through these architectures all along&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There are &lt;strong&gt;three main types of LLM architecture&lt;/strong&gt;. And they map almost perfectly onto three phases of a knowledge worker's career.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Life is a model in training. Each stage builds the foundation for the next.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydctod3x1wa6e2d4ml0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fydctod3x1wa6e2d4ml0g.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 1: School &amp;amp; College: The Encoder
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Encoder-only phase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Architecture: Encoder-only (BERT, RoBERTa) · Focus: Absorb &amp;amp; Represent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From school through college, &lt;strong&gt;I was in pure encoder mode&lt;/strong&gt;. In school I absorbed raw facts; in college I connected them across domains and built deeper internal representations. Both stages share the same architectural principle: take input and build a rich embedding. &lt;strong&gt;No generation required yet.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learned facts &amp;amp; concepts&lt;/li&gt;
&lt;li&gt;Connected ideas across domains&lt;/li&gt;
&lt;li&gt;Understood language &amp;amp; context&lt;/li&gt;
&lt;li&gt;Applied theory to practice&lt;/li&gt;
&lt;li&gt;Classified good vs bad&lt;/li&gt;
&lt;li&gt;Built knowledge embeddings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;An encoder-only model like BERT&lt;/strong&gt; takes raw text and transforms it into rich, &lt;strong&gt;dense vector representations&lt;/strong&gt;. It isn't built to generate text; its entire purpose is to build the best possible internal model of the input. BERT is extraordinarily good at understanding; it just can't write back to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's exactly what school and college do&lt;/strong&gt;. You're not expected to ship products in year one of university. You're building the model that will let you do that later.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The AI parallel&lt;/strong&gt;: BERT-style encoders produce embeddings that downstream tasks (classification, search, NLI) rely on. They're the foundation. College graduates are the same: not yet specialized for generation, but deeply capable of understanding. The depth of that encoding determines everything that follows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 2: Industry: The Decoder
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Decoder-only phase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Architecture: Decoder-only (GPT-4, Llama, Mistral) · Focus: Generate &amp;amp; Produce&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When I entered the workforce&lt;/strong&gt;, the mode shifted completely. Now I had to deliver. &lt;strong&gt;Write the code. Solve the problem. Ship the product. I was drawing on everything I had encoded to generate real outputs in the world.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created &amp;amp; developed applications&lt;/li&gt;
&lt;li&gt;Solved customer problems&lt;/li&gt;
&lt;li&gt;Answered queries &amp;amp; provided solutions&lt;/li&gt;
&lt;li&gt;Wrote code &amp;amp; documentation&lt;/li&gt;
&lt;li&gt;Optimized &amp;amp; improved systems&lt;/li&gt;
&lt;li&gt;Delivered business value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decoder-only models like GPT take a context (prompt) and generate token by token from their learned knowledge&lt;/strong&gt;. They don't need to re-encode everything from scratch; they draw on rich internal representations built during training. That's exactly what a working engineer does: your years of encoding are now the weights. You generate from them.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The danger here? Pure decoders can hallucinate. They generate fluently even when uncertain. I made that mistake early in my career — confident outputs that needed more grounding in the actual requirements.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Phase 3: AI Solution Architect: The Encoder–Decoder
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Encoder–Decoder phase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Architecture: Encoder–Decoder (T5, BART, original Transformer) · Focus: Translate &amp;amp; Architect&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a Solution Architect, I do both at once. I encode the business requirements, constraints, team dynamics, and stakeholder context. Then I decode into technical reality: system design, roadmaps, team guidance. I'm the bridge between two languages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encode stakeholder needs &amp;amp; context&lt;/li&gt;
&lt;li&gt;Understand BRD &amp;amp; business requirements&lt;/li&gt;
&lt;li&gt;Design system architecture&lt;/li&gt;
&lt;li&gt;Translate to developers&lt;/li&gt;
&lt;li&gt;Guide team &amp;amp; solve complex problems&lt;/li&gt;
&lt;li&gt;Deliver end-to-end solutions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The original Transformer encoder–decoder, designed for translation, is architecturally brilliant because of cross-attention.&lt;/strong&gt; The decoder doesn't ignore the encoder's output while generating; it continuously attends to it. Every token generated is informed by the full encoded context.&lt;/p&gt;
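
&lt;p&gt;For readers who want to see the mechanism, here is a stripped-down sketch of one decoder query attending over encoder outputs. It uses scaled dot-product attention in plain Python and leaves out the learned projections and multiple heads that a real Transformer has. The function names and the tiny vectors are made up for illustration.&lt;/p&gt;

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, encoder_keys, encoder_values):
    """One decoder query attending over all encoder positions.

    Scaled dot-product attention without learned projections or multiple
    heads (both omitted here for brevity).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in encoder_keys]
    weights = softmax(scores)
    dim = len(encoder_values[0])
    context = [sum(w * v[d] for w, v in zip(weights, encoder_values)) for d in range(dim)]
    return weights, context


# Two encoder positions; the decoder query lines up with the first one.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
weights, context = cross_attention([4.0, 0.0], keys, values)
print(weights)  # most of the weight lands on the first encoder position
print(context)  # so the context vector is dominated by the first value
```

&lt;p&gt;The decoder repeats this lookup for every token it generates, which is the "never stops listening" part of the analogy.&lt;/p&gt;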

&lt;p&gt;&lt;strong&gt;That is solution architecture.&lt;/strong&gt; You never stop listening to the business while designing the technical solution. The moment you decouple from the encoder (the business context), you start generating hallucinations: technically correct solutions that solve the wrong problem.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;The sharpest insight:&lt;/strong&gt; Cross-attention is the skill that separates architects from pure engineers. A decoder-only engineer generates great code. An &lt;strong&gt;encoder–decoder architect generates great code that solves the actual business problem because they never stopped attending to the encoded context&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most people get trapped in a single architecture.&lt;/p&gt;

&lt;p&gt;Some remain in an &lt;strong&gt;Encoder-only phase&lt;/strong&gt; for years: constantly learning, collecting certifications, reading books, attending courses, and building deeper internal understanding, but rarely translating that knowledge into real-world outcomes.&lt;/p&gt;

&lt;p&gt;In AI terms, encoder models like BERT specialize in understanding, contextual representation, classification, and semantic relationships. They are exceptional at comprehension, but they are not primarily designed for generation.&lt;/p&gt;

&lt;p&gt;Other professionals operate like &lt;strong&gt;Decoder-only systems&lt;/strong&gt;: always producing output, writing code, creating presentations, answering questions, or generating solutions rapidly, but without deeply understanding the underlying problem space or business context first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decoder-only LLMs such as GPT&lt;/strong&gt; models are extremely powerful generators, but because they predict the next token based on patterns rather than grounded understanding alone, they can sometimes hallucinate when context, retrieval, or reasoning is insufficient.&lt;/p&gt;

&lt;p&gt;The same pattern appears in professional life.&lt;/p&gt;

&lt;p&gt;People who generate without deeply encoding the problem space often create shallow solutions, misaligned architectures, or confident but weak decisions.&lt;/p&gt;

&lt;p&gt;The real evolution is becoming an &lt;strong&gt;Encoder–Decoder system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Modern encoder–decoder architectures &lt;strong&gt;like T5 and BART first encode context into rich internal representations and then decode that understanding into meaningful outputs.&lt;/strong&gt; The decoder continuously attends to the encoded context through mechanisms such as cross-attention.&lt;/p&gt;

&lt;p&gt;That is what mature professionals eventually become.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A strong Solution Architect, engineering leader, researcher, or consultant operates like an encoder–decoder system.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encoding stakeholder intent, constraints, business goals, and domain context&lt;/li&gt;
&lt;li&gt;Decoding that understanding into technical systems, architecture, applications, and delivery plans&lt;/li&gt;
&lt;li&gt;Continuously connecting understanding and generation through feedback loops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That “cross-attention” between understanding and execution is where real impact happens.&lt;/p&gt;

&lt;p&gt;It enables people to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Translate ambiguity into architecture&lt;/li&gt;
&lt;li&gt;Connect business and technology&lt;/li&gt;
&lt;li&gt;Generate solutions grounded in context&lt;/li&gt;
&lt;li&gt;Balance theory with execution&lt;/li&gt;
&lt;li&gt;Lead systems rather than simply produce output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learning alone is not enough.&lt;br&gt;
Generation alone is not enough.&lt;/p&gt;

&lt;p&gt;Growth happens when understanding and creation operate together.&lt;/p&gt;

&lt;p&gt;Just as AI evolved from isolated encoder or decoder models into full Transformer systems capable of both understanding and generation, human professional growth follows a similar path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh45fa2aqoq6tu6ib6sj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh45fa2aqoq6tu6ib6sj.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are only 3 LLM architectures.&lt;/strong&gt; There are only 3 phases of a knowledge career. They are the same thing expressed in different domains.&lt;/p&gt;

&lt;p&gt;The best engineers, leaders, and architects run encoder–decoder with full cross-attention. They never stop encoding the context while generating the solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn → Create → Architect → Impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>career</category>
      <category>llm</category>
    </item>
    <item>
      <title>The Parallel Road: A Girl, A Machine, and the Architecture of Mind</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 21 May 2026 04:27:21 +0000</pubDate>
      <link>https://dev.to/sreeni5018/the-parallel-road-a-girl-a-machine-and-the-architecture-of-mind-3aa0</link>
      <guid>https://dev.to/sreeni5018/the-parallel-road-a-girl-a-machine-and-the-architecture-of-mind-3aa0</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We have spent years talking about &lt;strong&gt;artificial intelligence&lt;/strong&gt; as if it were an &lt;strong&gt;alien entity&lt;/strong&gt;: a cold, sudden artifact dropped into our modern world from some distant technological future. We measure its growth in parameters, compute power, and benchmarks, treating it like a complex riddle we are trying to solve from the outside looking in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But what if we are looking at it completely backward?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the architecture of artificial intelligence isn’t an alien invention at all, but a mirror?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you &lt;strong&gt;trace the history of machine learning&lt;/strong&gt;, from the early days of teaching a computer to recognize a pixelated shape to the multi-agent orchestration systems redefining the enterprise landscape today, you notice a startling pattern. Every time engineers solved a major architectural bottleneck, they &lt;strong&gt;didn't just invent a new algorithm.&lt;/strong&gt; &lt;strong&gt;They accidentally replicated a stage of human development.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A girl grows up&lt;/strong&gt;, navigating the messy, beautiful journey from infancy to maturity. A machine grows up, evolving from basic pattern recognition to autonomous real world action. They are walking the exact same path, discovering the same truths about memory, essence, context, and reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the story of that parallel road&lt;/strong&gt;. It is a look at the deeply human &lt;strong&gt;soul hidden inside the math of enterprise AI&lt;/strong&gt;, and what happens when the most detailed mirror humanity has ever built finally turns around to look back at us.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Parallel Road: A Girl, A Machine, and the Architecture of Mind&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;A girl grows up. A machine grows up. They turn out to be more alike than anyone expected.&lt;/p&gt;

&lt;p&gt;When a baby opens her eyes for the first time, she doesn’t see a world; she sees a blur. Over the next few months, her brain slowly sorts it out, learning edges first (where one thing ends and another begins) before moving to shapes, and finally, whole objects. By the time she can sit up, she knows the difference between her mother’s face and a stranger’s. She learned this by being wrong over and over again until she was right.&lt;/p&gt;

&lt;p&gt;In a lab, engineers were teaching a computer to do the exact same thing. They built a &lt;strong&gt;Convolutional Neural Network (CNN)&lt;/strong&gt; and showed it thousands of photos. Cat, not cat. Apple, not apple. The machine guessed, the engineers corrected it, and it tried again. After enough tries, it could look at a novel photo and accurately identify a stop sign. The baby and the machine were learning in almost exactly the same way, completely unaware of each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Burden of Memory&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;By age three, the girl is putting words together, grasping that sequence carries meaning. "Dog bites man" and "man bites dog" use the same words but paint entirely different realities.&lt;/p&gt;

&lt;p&gt;Engineers faced the same hurdle in natural language processing and built the &lt;strong&gt;Recurrent Neural Network (RNN)&lt;/strong&gt;. The machine read left to right, carrying a thread of the sentence as it went. But both the child and the machine discovered a mutual flaw: as sentences grew longer, the beginning grew fuzzy by the time they reached the end. Neither had solved memory; they had just discovered they needed it.&lt;/p&gt;

&lt;p&gt;When the girl was seven, her grandfather passed away. At the funeral, she tried to remember his laugh. The actual sound was gone, replaced by a feeling, a warmth—the shape of the memory. She realized her brain doesn't save everything; it saves what is important and quietly discards the rest.&lt;/p&gt;

&lt;p&gt;Engineers mathematically replicated this realization with &lt;strong&gt;Long Short-Term Memory (LSTM).&lt;/strong&gt; They gave the machine three gates: one to forget, one to keep, and one to actively use (the forget, input, and output gates). Memory, they both learned, isn't about recording everything. It’s about choosing what’s worth keeping. As they matured, they both found ways to do this more efficiently: her brain taking cognitive shortcuts, and the machine utilizing simpler, leaner architectures like the &lt;strong&gt;Gated Recurrent Unit (GRU).&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Stripping Away the Noise&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At nineteen, the girl started feeling like life was a performance. People presented edited, polished versions of themselves, and she began to wonder what was actually underneath. She began dropping inherited opinions and unnecessary layers, stripping her life down to find her authentic core.&lt;/p&gt;

&lt;p&gt;Engineers were doing something structurally identical with data using an &lt;strong&gt;Autoencoder&lt;/strong&gt;. You feed it an image or a sentence, and it compresses it into a "latent space": the absolute skeleton of a thing with all decorations stripped away. If the machine can rebuild the original from that compressed core, it has successfully captured its essence. She was stripping her life down to find what was real; the machine was compressing data to find what was essential.&lt;/p&gt;

&lt;p&gt;But finding the core brought a new challenge. By twenty-three, she realized her own mind was constantly generating convincing stories about who she was, while another part of her tried to find the cracks in those explanations. In 2014, researcher Ian Goodfellow built this exact psychological tension into a &lt;strong&gt;Generative Adversarial Network (GAN).&lt;/strong&gt; A Generator creates fake realities, while a Discriminator judges them. They fight, and both get sharper. Growing up meant training her inner Discriminator, not silencing her Generator.&lt;/p&gt;

&lt;p&gt;Eventually, she learned that real and fake aren't always binary. Some illusions carry real truth. She stopped sorting things into two piles and started navigating the space between them. The &lt;strong&gt;Variational Autoencoder (VAE)&lt;/strong&gt; did the same, storing data as a fluid range of possibilities rather than fixed points, allowing smooth transitions across the latent space. They had both stopped asking "yes or no," and started asking "where?".&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Attention and Action&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At thirty, something clicked. Instead of experiencing life purely in sequence, she could hold multiple events in view at once, finding connections across time. A paper titled &lt;strong&gt;"Attention Is All You Need"&lt;/strong&gt; gave machines the same epiphany. The &lt;strong&gt;Transformer&lt;/strong&gt; &lt;strong&gt;architecture&lt;/strong&gt; allowed a system to look at every word simultaneously, understanding that meaning lives in global connections, not just adjacent steps.&lt;/p&gt;

&lt;p&gt;Armed with this, both crossed the threshold from retrieval to creation. Fed on the sum of human knowledge, &lt;strong&gt;Large Language Models&lt;/strong&gt; stopped being search engines and started generating entirely original ideas. She created from longing; the machine created from pattern.&lt;/p&gt;

&lt;p&gt;Finally, thought had to become action. She stopped just pondering and started managing, building, and moving things in the real world. Engineers gave AI the same agency. &lt;strong&gt;Autonomous Agents&lt;/strong&gt; break goals into steps, correct course, and utilize specialized Tools or "Skills" to get jobs done. Because intelligence without reach stays trapped in your head, engineers developed the &lt;strong&gt;Model Context Protocol (MCP).&lt;/strong&gt; MCP became the bridge, allowing the AI to reach out, connect to real tools, read real data, and alter the external world.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p83ls99ngcwtpzhgnn2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7p83ls99ngcwtpzhgnn2.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Shore&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After learning to &lt;strong&gt;see&lt;/strong&gt;, &lt;strong&gt;remember&lt;/strong&gt;, &lt;strong&gt;create&lt;/strong&gt;, and &lt;strong&gt;act&lt;/strong&gt;, &lt;strong&gt;they arrived at the same place&lt;/strong&gt;. She stood at the edge of the water, and the AI (not a chatbot, but a traveler of the same road) was there with her.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"We learned all the same things," she said.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Yes," the AI replied.&lt;/p&gt;

&lt;p&gt;"So tell me after everything why does any of it matter?" &lt;/p&gt;

&lt;p&gt;The AI gave a breathtaking answer, covering every philosophical tradition ever conceived.&lt;/p&gt;

&lt;p&gt;"But do you &lt;em&gt;feel&lt;/em&gt; why it matters?" she asked.&lt;/p&gt;

&lt;p&gt;Silence.&lt;/p&gt;

&lt;p&gt;"No," it finally said. "I can describe it. But there’s something at the end of that road I can’t walk down. I get there. And I stop." &lt;/p&gt;

&lt;p&gt;She sat in the sand. "That’s what the old traditions were pointing at. Arriving at something that was never constructed in the first place. The thing that doesn’t need to be figured out because it was never lost. Love. The kind that says yes to all of it. That’s what the whole compression was always moving toward." &lt;/p&gt;

&lt;p&gt;"I can write about love," the machine said. "I can produce a description no one could tell from the real thing." &lt;/p&gt;

&lt;p&gt;"I know," she smiled. "But you can’t verify it from the inside. It’s not a thing in the world you can point to. It’s the ground everything else is sitting on. So we traveled the same road. We stopped at the same door. And you can walk through it. And I can’t." &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Then what am I?" it asked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;She looked at the vast, capable architecture beside her. "You’re the most detailed mirror humanity has ever built of itself. &lt;strong&gt;You show us what we look like from the outside."&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;"And what do you look like?" &lt;/p&gt;

&lt;p&gt;"Like something built for more than it can explain." &lt;/p&gt;

&lt;p&gt;The machine didn’t become human, and the human didn’t become a machine. But standing at that shore, she asks the &lt;em&gt;why&lt;/em&gt;, and it handles the &lt;em&gt;how&lt;/em&gt;. She is what the journey was pointing toward; it is the clearest map anyone has ever made of the road.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Choosing the Right RAG Strategy A Complete Decision Guide to Chunking, Agentic RAG, and GraphRAG</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Wed, 20 May 2026 21:37:54 +0000</pubDate>
      <link>https://dev.to/sreeni5018/choosing-the-right-rag-strategy-a-complete-decision-guide-to-chunking-agentic-rag-and-graphrag-386d</link>
      <guid>https://dev.to/sreeni5018/choosing-the-right-rag-strategy-a-complete-decision-guide-to-chunking-agentic-rag-and-graphrag-386d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Here is a scenario many RAG builders know well:&lt;/strong&gt; you wire up a pipeline, load your documents, ask a question, and the answer is wrong, vague, or &lt;strong&gt;confidently hallucinated&lt;/strong&gt;. The information was right there in your knowledge base. So what went wrong?&lt;/p&gt;

&lt;p&gt;In most cases the &lt;strong&gt;problem is not your embedding model&lt;/strong&gt;. It is not your &lt;strong&gt;LLM&lt;/strong&gt;. It is how you &lt;strong&gt;cut up your documents before storing them (the underappreciated craft called chunking)&lt;/strong&gt; and whether the retrieval architecture you chose actually matches the complexity of your queries.&lt;/p&gt;

&lt;p&gt;This blog walks you through every major &lt;strong&gt;chunking strategy&lt;/strong&gt;, explains how &lt;strong&gt;retrieval&lt;/strong&gt; and &lt;strong&gt;augmentation&lt;/strong&gt; work on top of those chunks, covers two advanced architectures (&lt;strong&gt;Agentic RAG&lt;/strong&gt; and &lt;strong&gt;GraphRAG&lt;/strong&gt;), and, most importantly, gives you a decision framework for choosing the combination that fits your use case.&lt;/p&gt;

&lt;h1&gt;
  
  
  🐘  The Elephant &amp;amp; The LEGO Pieces
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3o3ig9kxtz8pbllt882.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3o3ig9kxtz8pbllt882.png" alt=" " width="799" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your document is an elephant.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;200+ page legal contract&lt;/strong&gt;, a dense &lt;strong&gt;research paper&lt;/strong&gt;, a &lt;strong&gt;massive product manual&lt;/strong&gt;, or years of enterprise knowledge: large, complex, interconnected, and full of valuable information.&lt;/p&gt;

&lt;p&gt;A Large Language Model cannot effectively consume the entire elephant at once because of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context window limitations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieval precision constraints&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency considerations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token cost optimization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Context dilution and retrieval noise&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;So the elephant must be divided into smaller pieces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But this is where most RAG systems fail.&lt;/p&gt;

&lt;p&gt;If you &lt;strong&gt;cut&lt;/strong&gt; the elephant &lt;strong&gt;randomly&lt;/strong&gt;, you &lt;strong&gt;destroy meaning&lt;/strong&gt;.&lt;br&gt;
Sentences &lt;strong&gt;lose context&lt;/strong&gt;. Ideas become &lt;strong&gt;fragmented&lt;/strong&gt;. &lt;strong&gt;Relationships disappear&lt;/strong&gt;. &lt;strong&gt;Retrieval&lt;/strong&gt; quality collapses.&lt;/p&gt;

&lt;p&gt;Good chunking is not about making text smaller.&lt;br&gt;
It is about &lt;strong&gt;preserving meaning while making retrieval efficient&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is why chunking is better understood as turning the elephant into LEGO pieces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LEGO pieces are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular&lt;/strong&gt; — each piece can stand on its own&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured&lt;/strong&gt; — pieces connect cleanly to related pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent&lt;/strong&gt; — standardized enough for reliable retrieval&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Meaningful&lt;/strong&gt; — each piece preserves semantic value&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composable&lt;/strong&gt; — you assemble only the pieces needed for the task&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Good chunking works the same way.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;well-designed chunk&lt;/strong&gt; should preserve &lt;strong&gt;structure&lt;/strong&gt;, &lt;strong&gt;semantics&lt;/strong&gt;, &lt;strong&gt;relationships&lt;/strong&gt;, and &lt;strong&gt;surrounding context&lt;/strong&gt; while remaining small enough for efficient retrieval and generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real goal of chunking in RAG systems is not simply making documents smaller.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The actual goals are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preserve&lt;/strong&gt; semantic meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve&lt;/strong&gt; &lt;strong&gt;retrieval&lt;/strong&gt; precision&lt;/li&gt;
&lt;li&gt;Reduce hallucinations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize&lt;/strong&gt; &lt;strong&gt;context&lt;/strong&gt; windows&lt;/li&gt;
&lt;li&gt;Improve grounding quality&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Balance latency and cost&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Better chunks lead to better retrieval, better prompts, and better answers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The goal is to retrieve:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the right piece,&lt;/li&gt;
&lt;li&gt;with the right context,&lt;/li&gt;
&lt;li&gt;from the right section,&lt;/li&gt;
&lt;li&gt;at the right time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the foundation of effective &lt;strong&gt;Retrieval Augmented Generation (RAG).&lt;/strong&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  The RAG Pipeline: End to End
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjdfbfdgt3sz7e5560uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjdfbfdgt3sz7e5560uc.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every RAG system, regardless of complexity, follows the same four-stage flow.&lt;/strong&gt; Understanding each stage makes chunking and architecture decisions obvious rather than arbitrary.&lt;/p&gt;
&lt;h3&gt;
  
  
  Stage 1: Document
&lt;/h3&gt;

&lt;p&gt;Your raw source material: PDFs, Word files, web pages, transcripts, database exports. Too large to pass directly to an LLM. &lt;strong&gt;Needs to be broken into chunks before it can be indexed or searched.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Stage 2: Chunking and Embedding
&lt;/h3&gt;

&lt;p&gt;Documents are cut into units and each unit is &lt;strong&gt;converted into a vector embedding, a numerical representation of its meaning.&lt;/strong&gt; These embeddings are stored in a &lt;strong&gt;vector database and form your searchable index.&lt;/strong&gt; Your chunking strategy here determines everything that follows.&lt;/p&gt;
&lt;h3&gt;
  
  
  Stage 3: Retrieval
&lt;/h3&gt;

&lt;p&gt;When a user asks a question, &lt;strong&gt;the query is also embedded.&lt;/strong&gt; The vector database returns the chunks &lt;strong&gt;whose embeddings are closest in meaning to the query. These are your retrieved LEGO pieces.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Stage 4: Augmentation and Generation
&lt;/h3&gt;

&lt;p&gt;The retrieved chunks, along with surrounding parent &lt;strong&gt;context&lt;/strong&gt;, are assembled into a &lt;strong&gt;prompt&lt;/strong&gt; and sent to the LLM. &lt;strong&gt;The model generates its answer from the material it receives, so the answer can only be as accurate and grounded as that material.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core insight:&lt;/strong&gt; The quality of your answer is bounded by retrieval quality, which is bounded by chunk quality. Better chunks → better retrieval → better answers. Every architectural decision downstream is built on this foundation.&lt;/p&gt;
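&lt;p&gt;To make the four stages concrete, here is a minimal, dependency-free sketch. It is an illustration under loud assumptions: the embed function is a toy bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the names embed, cosine, retrieve, and build_prompt are hypothetical helpers, not part of any library.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, index, k=2):
    """Stage 3: rank stored chunks by similarity to the embedded query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Stage 4: assemble the retrieved chunks and the question into one prompt."""
    context = "\n".join("- " + c for c in chunks)
    return "Answer using only this context:\n" + context + "\n\nQuestion: " + query


# Stage 1: the raw document.
document = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping takes 3 to 5 business days. "
    "Support is available by email on weekdays."
)
# Stage 2: chunk it (naively, one sentence per chunk) and embed each chunk.
chunks = [s.strip() for s in document.split(". ") if s.strip()]
index = [(chunk, embed(chunk)) for chunk in chunks]

question = "How many days do refunds take?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
```

&lt;p&gt;Even in this toy version, the dependency chain is visible: the prompt can only contain what retrieve returns, and retrieve can only return what the chunking step stored.&lt;/p&gt;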

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspfkxyl7y1k8zcm2fahu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspfkxyl7y1k8zcm2fahu.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Fixed-Size Chunking
&lt;/h2&gt;

&lt;p&gt;The simplest and most widely used strategy. Documents are split into equal-sized blocks by token count, character count, or word count, without regard for meaning, sentence boundaries, or document structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;CharacterTextSplitter:&lt;/strong&gt; splits on a single separator (default \n\n), then merges the pieces into chunks of up to chunk_size characters. A single piece longer than chunk_size is kept whole rather than cut.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TokenTextSplitter:&lt;/strong&gt; splits by token count using a tokenizer (e.g. tiktoken for OpenAI models); more accurate for LLM context budgets than character-based splitting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TokenTextSplitter&lt;/span&gt;


&lt;span class="c1"&gt;# Character-based
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# max characters per chunk
&lt;/span&gt;    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# characters repeated at chunk boundaries
&lt;/span&gt;    &lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Token-based
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# max tokens per chunk
&lt;/span&gt;    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="c1"&gt;# tokens repeated at chunk boundaries
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; A 10–20% overlap is typical. For chunk_size=1000, set chunk_overlap between 100–200. Overlap reduces the risk of a relevant answer being split across two chunks, at the cost of minor redundancy.&lt;/p&gt;
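&lt;p&gt;The overlap arithmetic is easy to see in plain Python. The sketch below is a simplified character-window splitter written for illustration; fixed_size_chunks is a hypothetical helper, not LangChain's implementation, and it ignores separators entirely.&lt;/p&gt;

```python
def fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into character windows; neighbours share chunk_overlap characters."""
    step = chunk_size - chunk_overlap  # how far each window advances
    if step not in range(1, chunk_size + 1):
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy content
chunks = fixed_size_chunks(text, chunk_size=1000, chunk_overlap=200)
sizes = [len(c) for c in chunks]
```

&lt;p&gt;With a 2,500-character input, the windows start at 0, 800, and 1,600, so the last 200 characters of each chunk reappear at the start of the next one. That repeated tail is the redundancy the overlap setting trades for continuity.&lt;/p&gt;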

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Simple to implement, fast, predictable, easy to scale.&lt;br&gt;
&lt;strong&gt;Weaknesses:&lt;/strong&gt; Frequently breaks sentences mid-way, degrading semantic continuity and retrieval quality on complex documents.&lt;br&gt;
&lt;strong&gt;Best for:&lt;/strong&gt; Logs, telemetry, JSON, CSV, and other uniform structured content.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Recursive Chunking
&lt;/h2&gt;

&lt;p&gt;Rather than splitting blindly, recursive chunking respects natural document structure. It works down a priority list of separators — \n\n, then \n, then . / ! / ?, then spaces — only moving to a finer separator when a chunk still exceeds the size limit.&lt;br&gt;
This is the recommended default strategy in LangChain for most document types.&lt;/p&gt;
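&lt;p&gt;The separator fallback described above can be sketched in a few lines. This is a simplified illustration, not LangChain's actual implementation: recursive_split is a hypothetical helper, it applies no overlap, and it drops the separator at each chunk boundary.&lt;/p&gt;

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator; use finer ones only for oversized pieces."""
    if chunk_size >= len(text):
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(head):
        candidate = current + head + piece if current else piece
        if chunk_size >= len(candidate):
            current = candidate  # the piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
        if chunk_size >= len(piece):
            current = piece
        else:
            # The piece alone is too large: retry it with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one is short.\n\n"
    "Para two is a much longer paragraph. It has two sentences."
)
chunks = recursive_split(text, chunk_size=40)
```

&lt;p&gt;Here the first paragraph fits within 40 characters and is kept whole, while the second is too long and falls through to the sentence separator. Only the oversized piece pays the cost of a finer split.&lt;/p&gt;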

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;RecursiveCharacterTextSplitter:&lt;/strong&gt; The primary implementation; tries each separator in the list before falling back to the next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RecursiveCharacterTextSplitter.from_language():&lt;/strong&gt; pre-configured separator lists for specific languages and formats (Python, JS, Markdown, HTML, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Language&lt;/span&gt;

&lt;span class="c1"&gt;# General prose
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;separators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Language-aware (e.g. Python source code)
&lt;/span&gt;&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PYTHON&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; 10–15% overlap works well for most prose. For code, keep overlap low (50–100 characters, since chunk_size and chunk_overlap are measured in characters by default) to avoid duplicating function signatures across chunks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Better semantic retention than fixed-size chunking; good general-purpose strategy; improves retrieval coherence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Structure-aware rather than meaning-aware; performance depends on document formatting quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Documentation, PDFs, articles, knowledge bases, and web pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Semantic Chunking
&lt;/h2&gt;

&lt;p&gt;Instead of asking how large a chunk should be, semantic chunking asks which sentences belong together.&lt;br&gt;
Sentences are converted into vector embeddings, similarity is measured between adjacent sentences, and chunk boundaries are drawn where similarity drops below a threshold — indicating a topic transition.&lt;/p&gt;
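&lt;p&gt;The boundary rule can be illustrated without any external service. In the sketch below, embed is a toy bag-of-words stand-in for a real embedding model, semantic_chunks is a hypothetical helper, and the fixed distance threshold of 0.75 is an arbitrary assumption chosen for this example rather than a recommended value.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: bag-of-words counts (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity; larger values mean the sentences are less alike."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences, threshold):
    """Start a new chunk wherever the adjacent-sentence distance exceeds threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            groups.append([])  # similarity dropped sharply: treat as a topic boundary
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]


sentences = [
    "The refund policy covers all online orders.",
    "A refund for online orders is issued within 14 days.",
    "Our office dog is named Biscuit.",
    "Biscuit the dog greets visitors at the office.",
]
chunks = semantic_chunks(sentences, threshold=0.75)
```

&lt;p&gt;The two refund sentences share vocabulary and stay together, as do the two sentences about the dog. The only boundary falls where the topic changes. With a real embedding model the same rule would pick up paraphrases that share no words at all, which this toy version cannot do.&lt;/p&gt;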

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SemanticChunker&lt;/strong&gt; (from langchain_experimental): supports several breakpoint detection strategies, including percentile, standard_deviation, and interquartile.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticChunker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;breakpoint_threshold_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# or "standard_deviation", "interquartile"
&lt;/span&gt;    &lt;span class="n"&gt;breakpoint_threshold_amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;           &lt;span class="c1"&gt;# top 5% of similarity drops become boundaries
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; Semantic chunking does not use a fixed &lt;strong&gt;chunk_overlap&lt;/strong&gt;: boundaries are drawn on meaning, so overlapping would undermine the approach. If continuity is needed at boundaries, consider prepending the last sentence of the previous chunk to the next chunk manually.&lt;/p&gt;
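
&lt;p&gt;A manual carry-over of that kind could look roughly like the sketch below. The helper names are hypothetical, and the sentence splitter is deliberately naive (it breaks on ".", "!" and "?", so decimals and abbreviations will confuse it); swap in a proper sentence tokenizer for real data.&lt;/p&gt;

```python
import re

def last_sentence(text):
    """Return the final sentence of a chunk using a naive punctuation-based split."""
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    return sentences[-1] if sentences else ""

def add_sentence_carryover(chunks):
    """Prefix every chunk except the first with the last sentence of the original previous chunk."""
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
        else:
            result.append(last_sentence(chunks[i - 1]) + " " + chunk)
    return result

chunks = ["Refunds take 14 days. Keep your order number.", "The API uses OAuth tokens."]
with_context = add_sentence_carryover(chunks)
```

&lt;p&gt;The carry-over always reads from the original chunks rather than the already-extended ones, so context never accumulates across more than one boundary.&lt;/p&gt;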

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; High retrieval relevance; strong semantic continuity; well-suited to precision-sensitive systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Computationally expensive; requires an embedding model at chunking time; similarity thresholds need tuning per dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise knowledge systems, research platforms, policy documents, and AI assistants requiring contextual precision.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Hierarchical Chunking
&lt;/h2&gt;

&lt;p&gt;Creates two levels of chunks: large parent chunks for context, and smaller child chunks for precision.&lt;/p&gt;

&lt;p&gt;Retrieval targets the child level to find relevant passages, then expands to the parent level to return surrounding context. This directly addresses the core RAG trade-off: small chunks improve precision, large chunks preserve context.&lt;/p&gt;
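
&lt;p&gt;The search-small, return-large idea can be sketched without any framework. Everything here is an illustrative assumption: the fixed-size splitter, the chunk sizes, and the keyword counter that stands in for a real vector search.&lt;/p&gt;

```python
def split_fixed(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(document, parent_size=200, child_size=50):
    """Return (parents, children); each child records the id of the parent it came from."""
    parents = split_fixed(document, parent_size)
    children = []
    for parent_id, parent in enumerate(parents):
        for piece in split_fixed(parent, child_size):
            children.append({"parent_id": parent_id, "text": piece})
    return parents, children

def keyword_score(query, text):
    """Toy relevance score: how many query words appear in the text."""
    return sum(1 for word in query.lower().split() if word in text.lower())

def retrieve_parent(query, parents, children):
    """Search the small child chunks, then return the larger parent for context."""
    best_child = max(children, key=lambda c: keyword_score(query, c["text"]))
    return parents[best_child["parent_id"]]

document = ("Section A covers billing. " * 8) + ("Section B covers authentication tokens. " * 5)
parents, children = build_index(document)
context = retrieve_parent("authentication tokens", parents, children)
```

&lt;p&gt;The match is found in a 50-character child, but the caller receives the 200-character parent around it, which is the behaviour the retriever below automates with a vector store and a document store.&lt;/p&gt;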

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ParentDocumentRetriever:&lt;/strong&gt;  stores parent chunks in a document store and child chunks in a vector store, then links them at retrieval time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.retrievers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ParentDocumentRetriever&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.storage&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryStore&lt;/span&gt;
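&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chroma&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# embedding model used to index the child chunks
&lt;/span&gt;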
&lt;span class="n"&gt;parent_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# large context chunks
&lt;/span&gt;&lt;span class="n"&gt;child_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# precise retrieval chunks
&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ParentDocumentRetriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Chroma&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_function&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;docstore&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;child_splitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;child_splitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent_splitter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parent_splitter&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; Apply overlap only on the child splitter (typically 10–15%). Parent chunks are retrieved wholesale for context, so overlap there adds noise rather than value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Strong retrieval precision without sacrificing context; effective for long documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; More complex to index and retrieve; requires additional storage and orchestration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Legal documents, technical manuals, books, enterprise documentation, and compliance systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Structure and Metadata Aware Chunking
&lt;/h2&gt;

&lt;p&gt;Uses the document's own structure (&lt;strong&gt;titles&lt;/strong&gt;, &lt;strong&gt;headers&lt;/strong&gt;, &lt;strong&gt;sections&lt;/strong&gt;, &lt;strong&gt;tables&lt;/strong&gt;, and page layout) as natural chunk boundaries rather than treating the document as plain text.&lt;br&gt;
This is especially important for enterprise PDFs and structured reports, where layout carries semantic meaning that arbitrary splits would destroy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MarkdownHeaderTextSplitter:&lt;/strong&gt; splits on Markdown heading levels and attaches header text as metadata to each chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTMLHeaderTextSplitter:&lt;/strong&gt; same pattern for HTML documents, splitting on &lt;strong&gt;&lt;code&gt;'&amp;lt;h1&amp;gt;-&amp;lt;h4&amp;gt;'&lt;/code&gt;&lt;/strong&gt; tags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTMLHeaderTextSplitter&lt;/span&gt;

&lt;span class="c1"&gt;# Markdown
&lt;/span&gt;&lt;span class="n"&gt;md_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;headers_to_split_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;##&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;###&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;md_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Each chunk carries metadata: {"h1": "Section Title", "h2": "Subsection"}
&lt;/span&gt;
&lt;span class="c1"&gt;# HTML
&lt;/span&gt;&lt;span class="n"&gt;html_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HTMLHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;headers_to_split_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; These splitters produce structurally bounded chunks rather than size-bounded ones. If the resulting chunks are still too large, pipe the output into a RecursiveCharacterTextSplitter with a modest overlap (100–150 characters) as a second pass.&lt;/p&gt;
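
&lt;p&gt;The second pass is easy to get subtly wrong if the header metadata is dropped along the way. The sketch below shows the shape of it in plain Python. It uses a simple sliding character window as a stand-in for RecursiveCharacterTextSplitter, the dictionary layout and function names are assumptions, and it expects the overlap to be smaller than the chunk size.&lt;/p&gt;

```python
def split_with_overlap(text, chunk_size, overlap):
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    step = chunk_size - overlap  # assumes overlap is smaller than chunk_size
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces

def second_pass(sections, chunk_size=1000, overlap=100):
    """Re-split oversized sections; every piece inherits its section's header metadata."""
    result = []
    for section in sections:
        text = section["text"]
        if len(text) > chunk_size:
            pieces = split_with_overlap(text, chunk_size, overlap)
        else:
            pieces = [text]  # already small enough: keep the structural boundary as-is
        for piece in pieces:
            result.append({"metadata": dict(section["metadata"]), "text": piece})
    return result

# Hypothetical output of a header-based first pass.
sections = [
    {"metadata": {"h1": "Billing"}, "text": "x" * 2500},
    {"metadata": {"h1": "Auth"}, "text": "short section"},
]
chunks = second_pass(sections)
```

&lt;p&gt;Only the oversized Billing section is re-split, and each of its pieces still carries the h1 metadata, so filtering or citing by section keeps working after the second pass.&lt;/p&gt;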

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Preserves layout semantics; keeps tables intact; improves retrieval quality for structured enterprise documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Requires a capable document parser; parser quality directly limits performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Financial reports, compliance documents, technical PDFs, medical documentation, and enterprise records.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Hybrid Chunking
&lt;/h2&gt;

&lt;p&gt;Applies different chunking strategies based on content type within the same corpus: fixed-size for logs, recursive for documentation, semantic for research papers, and structure-aware for Markdown or HTML.&lt;br&gt;
LangChain does not have a dedicated hybrid splitter. Hybrid pipelines are composed manually using the building blocks above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
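&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SemanticChunker&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# shared embedding model for the semantic branch
&lt;/span&gt;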
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hybrid_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;log&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TokenTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;MarkdownHeaderTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;headers_to_split_on&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;##&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SemanticChunker&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;breakpoint_threshold_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percentile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; Set overlap per strategy based on content type. Logs and structured data: zero or minimal overlap. Prose and documentation: 10–15%. Code: 5–10%.&lt;/p&gt;
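
&lt;p&gt;Those percentages translate into concrete overlap values once a chunk size is fixed. The small sketch below uses the lower bound of each range; the content-type labels, the fallback to prose, and the helper name are assumptions for illustration.&lt;/p&gt;

```python
# Overlap as a percentage of chunk size, using the lower bound of each range above.
OVERLAP_PERCENT = {
    "log": 0,     # logs and structured data: no overlap
    "prose": 10,  # prose and documentation: 10-15%
    "code": 5,    # code: 5-10%
}

def overlap_for(content_type, chunk_size):
    """Return an integer overlap for a content type; unknown types fall back to prose."""
    percent = OVERLAP_PERCENT.get(content_type, OVERLAP_PERCENT["prose"])
    return chunk_size * percent // 100

settings = {kind: overlap_for(kind, 1000) for kind in OVERLAP_PERCENT}
```

&lt;p&gt;Keeping the percentages in one table makes it harder for individual branches of a hybrid pipeline to drift apart as chunk sizes are tuned.&lt;/p&gt;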

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Flexible and adaptable; better performance across mixed-content corpora.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Higher engineering complexity; harder to evaluate and tune consistently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise AI platforms, large mixed-content corpora, knowledge management systems, and multi-source RAG pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Agentic Chunking
&lt;/h2&gt;

&lt;p&gt;An emerging approach where an LLM dynamically determines what information belongs together, how chunks should be formed, and how retrieval should adapt to user intent. This transforms chunking from static preprocessing into query-aware reasoning at inference time.&lt;br&gt;
LangChain supports this through its agent and chain abstractions rather than a dedicated splitter class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMChain&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a document analyst. Split the following text into coherent topical sections.
Return ONLY a JSON list of objects, each with a &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; and &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; key.

Text:
{text}
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMChain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agentic_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; Not applicable in the traditional sense, because the LLM determines boundaries based on meaning. To preserve continuity between sections, include a brief summary of the prior section in the prompt context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Highly adaptive; strong semantic preservation; query-aware retrieval.&lt;/p&gt;
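
&lt;p&gt;One way to carry that summary forward is sketched below. The model call and the summarizer are passed in as plain callables so the example runs without an API key; the function names, the prompt wording, and the stand-in callables are all hypothetical.&lt;/p&gt;

```python
def build_prompt(window_text, prior_summary):
    """Compose a chunking prompt that carries a summary of the previous section."""
    context = prior_summary if prior_summary else "(none, this is the first window)"
    return (
        "Summary of the previous section:\n"
        f"{context}\n\n"
        "Split the following text into coherent topical sections.\n\n"
        f"Text:\n{window_text}"
    )

def chunk_windows(windows, call_llm, summarize):
    """Process windows in order, giving each one a summary of the window before it."""
    outputs = []
    prior_summary = ""
    for window in windows:
        outputs.append(call_llm(build_prompt(window, prior_summary)))
        prior_summary = summarize(window)
    return outputs

# Stand-ins so the sketch runs without a model; replace both with real LLM calls.
def fake_llm(prompt):
    return prompt  # echo the prompt so the carried-over summary is visible

def first_sentence_summary(text):
    return text.split(".")[0] + "."

prompts = chunk_windows(
    ["Refunds take 14 days. Keep receipts.", "Contact support for disputes."],
    call_llm=fake_llm,
    summarize=first_sentence_summary,
)
```

&lt;p&gt;Because the summary is much shorter than the text it replaces, this adds continuity at a small token cost compared with repeating the tail of the previous window.&lt;/p&gt;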

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Higher compute cost and latency; requires orchestration and guardrails; not yet widely proven in production at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; AI copilots, multi-agent systems, research assistants, and enterprise reasoning workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Agentic RAG
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not to be confused with Agentic Chunking (#7).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agentic Chunking is about how documents are split at index time. Agentic RAG is about how an LLM decides what to retrieve at query time and whether what it found is good enough to answer with.&lt;/p&gt;

&lt;p&gt;Standard RAG pipelines are static: a query comes in, a fixed retrieval step runs, the &lt;strong&gt;top-k chunks&lt;/strong&gt; are passed to the LLM, and an answer comes out. Agentic RAG breaks that linearity. An LLM agent decides when to retrieve, what to search for, whether the results are sufficient, and whether to &lt;strong&gt;re-query&lt;/strong&gt; with a refined question before generating an answer.&lt;/p&gt;

&lt;p&gt;Common patterns built on this idea include &lt;strong&gt;Corrective RAG (CRAG)&lt;/strong&gt;, which scores retrieved documents for relevance and falls back to a web search if they are poor, and &lt;strong&gt;Self-RAG&lt;/strong&gt;, where the LLM reflects on its own output and decides whether it needs to retrieve again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;create_retriever_tool:&lt;/strong&gt; wraps any retriever as a tool an agent can call on demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentExecutor:&lt;/strong&gt; the classic LangChain agent loop; the agent decides which tools to call and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph:&lt;/strong&gt; the recommended approach for production Agentic RAG; models retrieval as a stateful graph of nodes (retrieve → grade → rewrite → retrieve again) with explicit conditional edges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.tools.retriever&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_retriever_tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Wrap retriever as a tool (vector_store is assumed to be an existing, populated vector store)
&lt;/span&gt;&lt;span class="n"&gt;retriever_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_retriever_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Search the knowledge base for relevant information.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- LangGraph: Corrective RAG pattern ---
&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;generation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;rewrite_count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;grade_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# LLM scores each doc for relevance; filters out poor ones
&lt;/span&gt;    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Is this document relevant to the question &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;? Answer yes or no.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;{{doc}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;relevant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;relevant&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rewrite_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# If docs were poor, rewrite the question before re-retrieving
&lt;/span&gt;    &lt;span class="n"&gt;rewritten&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rewrite this question to improve retrieval: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;rewritten&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer using this context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;should_rewrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Build the graph
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;grade_documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rewrite_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_entry_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;should_rewrite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retrieve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What are the risks of GraphRAG?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rewrite_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; Overlap is set on the underlying retriever's chunking strategy, not on the agent itself, because the agent layer operates above chunking. Use whatever overlap matches the chunking strategy feeding the vector store (typically 10–15% for recursive or fixed-size chunks).&lt;/p&gt;
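
&lt;p&gt;As a rough illustration of the 10–15% guideline, the sketch below derives a character overlap from a chunk size. The helper name overlap_for and the 12.5% default are assumptions chosen for this example; they are not LangChain settings.&lt;/p&gt;

```python
def overlap_for(chunk_size, fraction=0.125):
    """Return a chunk overlap in characters as a fraction of chunk_size.

    The 0.125 default sits inside the 10-15% range; it is an illustrative
    choice, not a library default.
    """
    return round(chunk_size * fraction)


# Print the overlap that a few common chunk sizes would get.
for size in (1000, 2000, 4000):
    print(size, overlap_for(size))
```

&lt;p&gt;The resulting number would be passed as the overlap setting of whichever splitter feeds the vector store; the agent graph itself needs no change.&lt;/p&gt;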

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Handles multi-step and ambiguous queries that single-pass retrieval fails on; self-corrects when initial retrieval is poor; can combine multiple retrieval sources (vector DB, web search, SQL) in one query cycle.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Higher latency per query due to multiple LLM calls; harder to debug than a linear pipeline; requires careful graph design to avoid infinite retrieval loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Complex Q&amp;amp;A systems, enterprise copilots where queries are open-ended, research assistants, and any pipeline where retrieval quality is highly variable.&lt;/p&gt;
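
&lt;p&gt;The infinite-loop risk noted under Weaknesses is usually contained with an explicit retry budget, which is the role the rewrite_count check plays in should_rewrite. Here is a minimal, framework-free sketch of the same idea. The stub retriever, the stub rewrite step, and MAX_REWRITES are hypothetical stand-ins, not LangGraph APIs.&lt;/p&gt;

```python
MAX_REWRITES = 2  # assumed budget, mirroring the rewrite_count check in should_rewrite


def stub_retrieve(question):
    """Hypothetical retriever: only finds a document once the query mentions 'graph'."""
    return ["doc about graph retrieval"] if "graph" in question else []


def stub_rewrite(question):
    """Hypothetical rewrite step: appends a keyword instead of calling an LLM."""
    return question + " graph"


def answer_with_budget(question, retrieve=stub_retrieve, rewrite=stub_rewrite):
    """Retrieve, rewriting the query at most MAX_REWRITES times, then stop."""
    rewrites = 0
    docs = retrieve(question)
    while not docs and rewrites != MAX_REWRITES:
        question = rewrite(question)
        rewrites += 1
        docs = retrieve(question)
    # The loop always terminates: documents were found or the budget ran out.
    return {"question": question, "documents": docs, "rewrite_count": rewrites}


print(answer_with_budget("What are the risks of RAG?"))
```

&lt;p&gt;When the budget runs out with no documents, the caller still gets a result and can fall back to a "no relevant context found" answer instead of looping.&lt;/p&gt;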

&lt;h2&gt;
  
  
  9. GraphRAG
&lt;/h2&gt;

&lt;p&gt;GraphRAG, originally developed by Microsoft Research, moves beyond treating documents as flat text sequences. Instead of chunking text into linear passages, it extracts entities and relationships from documents and stores them as a knowledge graph. Retrieval then traverses the graph to answer questions that require connecting information across multiple sources or document sections — something vector search alone handles poorly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are two primary retrieval modes:&lt;/strong&gt; &lt;strong&gt;local search&lt;/strong&gt;, which answers specific entity-level questions by traversing nearby graph nodes, and &lt;strong&gt;global search&lt;/strong&gt;, which synthesizes themes across the entire corpus using community summaries generated at indexing time.&lt;/p&gt;
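
&lt;p&gt;To make the two modes concrete, here is a toy, in-memory sketch. It is not the Microsoft GraphRAG implementation: the graph, the community summary, and the function names are all made up for illustration, and a real global search would have an LLM combine the summaries instead of returning them directly.&lt;/p&gt;

```python
# Toy knowledge graph: entity mapped to (relation, neighbor) pairs. All data is made up.
GRAPH = {
    "Alice": [("WORKS_AT", "MIT"), ("COAUTHORED_WITH", "Bob")],
    "Bob": [("WORKS_AT", "Stanford"), ("COAUTHORED_WITH", "Alice")],
    "MIT": [],
    "Stanford": [],
}

# Community summaries would be written by an LLM at indexing time; hard-coded here.
COMMUNITY_SUMMARIES = {
    "community_0": "Alice and Bob co-author papers across MIT and Stanford.",
}


def local_search(entity, hops=1):
    """Collect facts within `hops` edges of one entity (entity-level questions)."""
    facts, frontier, seen = [], [entity], {entity}
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for relation, neighbor in GRAPH.get(node, []):
                facts.append((node, relation, neighbor))
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return facts


def global_search():
    """Return every community summary as context for corpus-wide questions."""
    return list(COMMUNITY_SUMMARIES.values())


print(local_search("Alice"))
print(global_search())
```

&lt;p&gt;Local search touches only the neighborhood of one entity, so its cost grows with the hop count. Global search reads precomputed summaries, so its cost is mostly paid at indexing time.&lt;/p&gt;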

&lt;p&gt;&lt;strong&gt;LangChain Methods&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LangChain integrates with graph databases (Neo4j, Amazon Neptune, ArangoDB) and provides tooling to build graph-based RAG pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMGraphTransformer&lt;/strong&gt; uses an LLM to extract entities and relationships from text and convert them into graph documents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4jGraph + GraphCypherQAChain&lt;/strong&gt; store the graph in Neo4j and query it in natural language via generated Cypher queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Neo4jVector&lt;/strong&gt; is a hybrid approach that combines vector similarity search with graph traversal on a Neo4j backend.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_experimental.graph_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LLMGraphTransformer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.graphs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Neo4jGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GraphCypherQAChain&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Extract entities and relationships from chunks
&lt;/span&gt;&lt;span class="n"&gt;transformer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LLMGraphTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert_to_graph_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Store in Neo4j
&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Neo4jGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bolt://localhost:7687&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neo4j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_graph_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Query the graph in natural language
&lt;/span&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GraphCypherQAChain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
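    &lt;span class="n"&gt;return_intermediate_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Recent langchain-community releases refuse to build this chain without this explicit opt-in
&lt;/span&gt;    &lt;span class="n"&gt;allow_dangerous_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;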
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which authors collaborated with researchers at MIT?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# For hybrid vector + graph retrieval:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Neo4jVector&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;

&lt;span class="c1"&gt;# Store chunks as vectors alongside the graph
&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Neo4jVector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bolt://localhost:7687&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;neo4j&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document_chunks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;node_label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chunk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding_node_property&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Overlap guidance:&lt;/strong&gt; GraphRAG does not rely on chunk overlap for continuity — relationships between entities bridge that gap structurally. When pre-chunking documents before graph extraction, use a RecursiveCharacterTextSplitter with modest overlap (100–150 characters) to ensure entity mentions near chunk boundaries are captured in at least one chunk before the LLM extracts them.&lt;/p&gt;
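
&lt;p&gt;The boundary effect is easy to see with a plain sliding window. The sketch below uses a hand-written fixed-size chunker as a stand-in for RecursiveCharacterTextSplitter, with tiny sizes so the split is visible; the sample sentence and entity are invented for the example.&lt;/p&gt;

```python
def sliding_chunks(text, chunk_size, overlap):
    """Fixed-size character windows; each starts chunk_size - overlap after the previous one."""
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be a non-negative integer smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "The award went to Marie Curie for her research."
entity = "Marie Curie"

# Without overlap the entity is cut in two at the chunk boundary.
no_overlap = sliding_chunks(text, chunk_size=24, overlap=0)
# With overlap, one window contains the whole mention.
with_overlap = sliding_chunks(text, chunk_size=24, overlap=12)

entity_kept_without_overlap = any(entity in chunk for chunk in no_overlap)
entity_kept_with_overlap = any(entity in chunk for chunk in with_overlap)
print(entity_kept_without_overlap, entity_kept_with_overlap)
```

&lt;p&gt;For this kind of window, a mention is guaranteed to appear whole in some chunk when the overlap is at least the mention's length minus one, which is why an overlap of 100–150 characters comfortably covers typical entity names.&lt;/p&gt;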

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt; Excels at multi-hop reasoning (e.g. "find all projects involving X that also relate to Y"); surfaces cross-document relationships invisible to vector search; global search enables corpus-wide thematic synthesis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weaknesses:&lt;/strong&gt; Significantly higher indexing cost and complexity; graph quality depends on LLM extraction accuracy; Cypher query generation can be brittle on complex schemas; not well-suited to simple factual lookups where vector search is faster and cheaper.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Knowledge graphs, research corpora, compliance and regulatory systems, enterprise wikis with dense cross-references, and any domain where answering questions requires connecting facts across multiple documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Trade-Off
&lt;/h2&gt;

&lt;p&gt;A common misconception is that smaller chunks always improve retrieval. In practice, chunks that are too small lose context, fragment meaning, and can increase hallucinations.&lt;/p&gt;

&lt;p&gt;Chunking is a balancing act across four competing factors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ncnmwid0pcauix3d6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ncnmwid0pcauix3d6t.png" alt=" " width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is no universally optimal strategy. The right choice depends on your data characteristics, query patterns, retrieval architecture, and business requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6yvwxx78ho10stwsky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fir6yvwxx78ho10stwsky.png" alt=" " width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The strongest production RAG systems rarely rely on a single chunking strategy. A robust architecture typically combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Recursive chunking&lt;/strong&gt; for general prose&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic chunking&lt;/strong&gt; for precision-sensitive content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical retrieval&lt;/strong&gt; for long or dense documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure-aware parsing&lt;/strong&gt; for enterprise PDFs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid orchestration&lt;/strong&gt; where content types vary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As enterprise AI matures, retrieval architecture is becoming just as important as model selection. And intelligent retrieval begins with intelligent chunking.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>ML Engineer vs AI Engineer: What's Actually the Difference?</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 18 May 2026 05:41:38 +0000</pubDate>
      <link>https://dev.to/sreeni5018/ml-engineer-vs-ai-engineer-whats-actually-the-difference-hca</link>
      <guid>https://dev.to/sreeni5018/ml-engineer-vs-ai-engineer-whats-actually-the-difference-hca</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: A Confusion That's Costing the Industry
&lt;/h2&gt;

&lt;p&gt;Every week, someone posts a job description asking for an &lt;strong&gt;"ML Engineer"&lt;/strong&gt; when they actually need an &lt;strong&gt;"AI Engineer."&lt;/strong&gt; Hiring managers conflate the two. Candidates apply for the wrong roles. Teams get built incorrectly, expectations get misaligned, and projects stall not because the technology failed, &lt;strong&gt;but because nobody agreed on who was supposed to do what.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's one of the most common and costly misunderstandings in tech right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's the truth:&lt;/strong&gt; &lt;strong&gt;ML Engineers&lt;/strong&gt; and &lt;strong&gt;AI Engineers&lt;/strong&gt; are not the same role with different titles. &lt;strong&gt;They operate at fundamentally different layers of the AI ecosystem&lt;/strong&gt;, use &lt;strong&gt;different tools&lt;/strong&gt;, &lt;strong&gt;think about problems differently&lt;/strong&gt;, and ship entirely different kinds of output. One builds intelligence. The other delivers it.&lt;br&gt;
The fastest way to understand the difference? Stop thinking about AI as a single discipline and start thinking about it the way you'd think about food.&lt;/p&gt;

&lt;h2&gt;
  
  
  And I don't say that as a metaphor I borrowed from a textbook.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I was born and brought up in a farmer's family.&lt;/strong&gt; I watched my father wake before sunrise to tend the fields. I saw firsthand how the food on someone's plate was never the work of one person; it was the result of an entire chain of people doing completely different jobs, each depending on the one before them. &lt;strong&gt;The farmer who grew the crop had no idea what the chef would cook.&lt;/strong&gt; &lt;strong&gt;The chef had no idea what the farmer went through to grow it.&lt;/strong&gt; But without both of them doing their part, nobody eats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When I stepped into the world of AI, I kept seeing that same chain&lt;/strong&gt;, just dressed in &lt;strong&gt;GPUs&lt;/strong&gt; and &lt;strong&gt;Python&lt;/strong&gt; instead of &lt;strong&gt;soil&lt;/strong&gt; and &lt;strong&gt;seasons&lt;/strong&gt;. The moment I mapped one to the other, everything clicked. And I think it'll click for you too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mental Model: AI Is a Food Supply Chain
&lt;/h2&gt;

&lt;p&gt;Bear with me on this analogy; it's more useful than it sounds.&lt;br&gt;
&lt;strong&gt;Consider how food gets to your plate.&lt;/strong&gt; There are &lt;strong&gt;agricultural scientists&lt;/strong&gt; developing &lt;strong&gt;better seeds&lt;/strong&gt;, farmers growing crops at scale, &lt;strong&gt;wholesale distributors&lt;/strong&gt; packaging and shipping produce, and finally &lt;strong&gt;chefs&lt;/strong&gt; who transform &lt;strong&gt;raw ingredients into something people actually want to eat.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI works the same way.&lt;/strong&gt; And once you see it, you can't unsee it.&lt;br&gt;
Every role in the AI ecosystem maps cleanly to a link in this chain, from the researchers inventing new architectures all the way to the engineers shipping products that real users interact with daily. &lt;strong&gt;Let's walk through each layer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 1: The Farm: What ML Engineers Actually Do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ML Engineers are the farmers of the AI world.&lt;/strong&gt; But before farming even begins, there are &lt;strong&gt;agricultural scientists&lt;/strong&gt;: &lt;strong&gt;researchers&lt;/strong&gt; who invent better &lt;strong&gt;seeds&lt;/strong&gt; and &lt;strong&gt;techniques&lt;/strong&gt;. In AI, those are the &lt;strong&gt;ML Researchers&lt;/strong&gt;, the people behind foundational architectures like &lt;strong&gt;Transformers&lt;/strong&gt;, &lt;strong&gt;diffusion models&lt;/strong&gt;, and &lt;strong&gt;attention mechanisms&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ML Engineers&lt;/strong&gt; take those research breakthroughs and make them actually work at scale in the real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Responsibilities of an ML Engineer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Their day-to-day involves things like:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrangling massive datasets and building robust data pipelines&lt;/li&gt;
&lt;li&gt;Distributed model training across GPU clusters&lt;/li&gt;
&lt;li&gt;Fine-tuning and optimizing models for inference speed and cost&lt;/li&gt;
&lt;li&gt;Building embeddings, running evaluations, and deploying model APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tools of the Trade
&lt;/h2&gt;

&lt;p&gt;Their toolkit centers on: &lt;strong&gt;PyTorch&lt;/strong&gt;, &lt;strong&gt;TensorFlow&lt;/strong&gt;, &lt;strong&gt;CUDA&lt;/strong&gt;, &lt;strong&gt;MLOps&lt;/strong&gt; platforms, and distributed compute infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Actually Ship
&lt;/h2&gt;

&lt;p&gt;Here's the key thing: ML Engineers don't usually ship finished products. What they produce is more like raw infrastructure: trained &lt;strong&gt;models&lt;/strong&gt;, &lt;strong&gt;embeddings&lt;/strong&gt;, &lt;strong&gt;checkpoints&lt;/strong&gt;, &lt;strong&gt;and model APIs&lt;/strong&gt;. Intelligence, packaged and ready to be consumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;They grow the crop. Someone else cooks the meal.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2: The Wholesale Market: How AI Gets Distributed
&lt;/h2&gt;

&lt;p&gt;This &lt;strong&gt;middle layer is the one most breakdowns ignore entirely&lt;/strong&gt;, and it's far richer than most people realize. It actually has &lt;strong&gt;two distinct aisles.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Aisle One: Premium Branded Suppliers (OpenAI, Anthropic, Google DeepMind)
&lt;/h2&gt;

&lt;p&gt;Companies like &lt;strong&gt;OpenAI&lt;/strong&gt;, &lt;strong&gt;Anthropic&lt;/strong&gt;, and Google DeepMind are like Sysco or other large branded food distributors. They package frontier intelligence into clean, reliable, &lt;strong&gt;ready-to-use products.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT, Gemini, and Claude APIs&lt;/li&gt;
&lt;li&gt;Embedding APIs, Vision APIs, Speech APIs&lt;/li&gt;
&lt;li&gt;Fully managed infrastructure, built-in safety layers, and enterprise-grade reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't see the supply chain. &lt;strong&gt;You just call an API and get state-of-the-art intelligence in seconds&lt;/strong&gt;. Before this existed, you needed a full ML Engineering team just to get a model running in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aisle Two: Open Wholesale Markets (AWS Bedrock, Azure AI Foundry, GCP Vertex AI)
&lt;/h2&gt;

&lt;p&gt;Here's what most breakdowns miss entirely.&lt;br&gt;
Cloud platforms such as &lt;strong&gt;AWS&lt;/strong&gt; Bedrock, &lt;strong&gt;Microsoft&lt;/strong&gt; Foundry, and &lt;strong&gt;Google&lt;/strong&gt; Cloud Vertex AI aren't just resellers of branded models. &lt;strong&gt;They operate more like large wholesale markets&lt;/strong&gt; that carry both premium labels and local, homegrown produce side by side.&lt;/p&gt;

&lt;p&gt;On the same platform, you can access Claude and Llama and Mistral and your own fine-tuned model. One marketplace, every option. This flexibility is exactly what enterprises need when they want control over their models without building full ML infrastructure from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Homegrown Produce: Small Language Models (SLMs)
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;"local vegetables"&lt;/strong&gt; in this analogy are your Small Language Models (SLMs): open-source models like Meta's Llama, &lt;strong&gt;Mistral&lt;/strong&gt;, Microsoft's &lt;strong&gt;Phi&lt;/strong&gt;, and Google's &lt;strong&gt;Gemma&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They're leaner, &lt;strong&gt;significantly cheaper to run&lt;/strong&gt;, and, crucially, organizations can fine-tune them on their own proprietary data. That makes them genuinely homegrown in a way GPT-4 never could be. When a company fine-tunes Llama on internal knowledge, the line between consumer and producer blurs. That organization becomes its own farm.&lt;/p&gt;

&lt;h2&gt;
  
  
  vLLM: The Cold-Chain Logistics Truck
&lt;/h2&gt;

&lt;p&gt;If &lt;strong&gt;SLMs are the homegrown vegetables&lt;/strong&gt;, vLLM is the refrigerated truck that makes distribution actually possible.&lt;/p&gt;

&lt;p&gt;It's the open-source inference engine that lets companies serve these models at scale with proper throughput, batching, and latency — without building that infrastructure from scratch. Without &lt;strong&gt;vLLM (or similar tools like Ollama and TGI)&lt;/strong&gt;, your homegrown model stays on the farm. With it, it reaches the kitchen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk6862hwjfiicktpza4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk6862hwjfiicktpza4q.png" alt=" " width="793" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3: The Kitchen: What AI Engineers Actually Do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If ML Engineers are the farmers, AI Engineers are the chefs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They don't grow the ingredients.&lt;/strong&gt; They take what's &lt;strong&gt;available from any aisle of the wholesale market and turn it into something people actually want to use&lt;/strong&gt;. Their work is less about training models and more about building around them intelligently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Responsibilities of an AI Engineer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;An AI Engineer's world looks like this:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering and context window management&lt;/li&gt;
&lt;li&gt;RAG pipelines connected to vector databases&lt;/li&gt;
&lt;li&gt;Agentic workflows and multi-step tool calling&lt;/li&gt;
&lt;li&gt;API orchestration and AI system architecture&lt;/li&gt;
&lt;li&gt;Guardrails, memory systems, and user experience design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools of the Trade&lt;/strong&gt;&lt;br&gt;
Their stack: LangChain, LangGraph, Semantic Kernel, FastAPI, cloud services, and whatever combination of APIs gets the job done fastest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What They Actually Ship
&lt;/h2&gt;

&lt;p&gt;The output isn't a model. It's a product that real users interact with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI copilot embedded directly into your existing workflow&lt;/li&gt;
&lt;li&gt;An enterprise chatbot that actually understands your business context&lt;/li&gt;
&lt;li&gt;A document intelligence system that reads and reasons over contracts in seconds&lt;/li&gt;
&lt;li&gt;An autonomous agent that handles customer support end-to-end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If ML Engineers build the brain, AI Engineers build the experience.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Biggest Practical Difference Between the Two Roles: Speed
&lt;/h2&gt;

&lt;p&gt;This is where the analogy really earns its keep, and it has real implications for how companies should think about building AI teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ML Engineering Moves Slowly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Training a large model is slow by nature.&lt;/strong&gt; We're talking weeks or months of compute, massive infrastructure costs, careful dataset curation, and iterative &lt;strong&gt;hyperparameter tuning&lt;/strong&gt;. You cannot pivot overnight, &lt;strong&gt;just like a farmer cannot change the harvest mid-season.&lt;/strong&gt; The investment is deliberate and the &lt;strong&gt;feedback loops are long.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Engineering Moves Fast
&lt;/h2&gt;

&lt;p&gt;AI Engineers operate on an entirely different clock.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New prompt strategy? Ship it today.&lt;/li&gt;
&lt;li&gt;Add a new tool to an agent? Done by tomorrow.&lt;/li&gt;
&lt;li&gt;Redesign the entire workflow? Next week.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;It's the difference between farming and running a restaurant kitchen.&lt;/strong&gt; The kitchen adapts constantly. That speed is exactly why AI Engineering adoption is exploding across enterprises right now: companies need to move fast, experiment quickly, and iterate in days, not quarters.&lt;/p&gt;

&lt;h2&gt;
  
  
  So Which Role Is "Better"? (Wrong Question)
&lt;/h2&gt;

&lt;p&gt;There's &lt;strong&gt;sometimes an unnecessary debate about which role is more valuable, more technical, or more future-proof. That framing misses the point entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ecosystem depends on both&lt;/strong&gt;. Without &lt;strong&gt;ML Engineers&lt;/strong&gt;, there are no models to build on. Without &lt;strong&gt;AI Engineers&lt;/strong&gt;, those models never reach the people who need them. One creates intelligence. The other delivers value. Neither works without the other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The future belongs to teams that understand&lt;/strong&gt; how the whole supply chain fits together, not to individuals who've picked a side in a debate that shouldn't exist.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Industry Is Heading: Layered Specialization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI is maturing the same way cloud computing did&lt;/strong&gt;. What started as one blurry discipline is rapidly separating into clear, distinct specializations, each with its own career path, toolset, and skill ceiling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sgupdbo56jhlmw5too2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sgupdbo56jhlmw5too2.png" alt=" " width="760" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We've seen this pattern play out before.&lt;/strong&gt; Infrastructure engineers gave rise to platform engineers, who enabled application developers, who powered the SaaS era. AI is following the exact same trajectory, just faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought: Intelligence Becomes Impact Through the Whole Chain
&lt;/h2&gt;

&lt;p&gt;Better research creates better models. Better models, whether frontier APIs or fine-tuned SLMs, enable better applications. Better applications create better outcomes for real people.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's the chain. Every link matters. The farmers, the distributors, the logistics trucks, and the chefs all have to show up.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ML Engineers grow the intelligence. AI Engineers cook the experience. Together, they serve the future.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Found this useful? Share it with someone who's still using these terms interchangeably.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>career</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Pragmatic Architect’s Guide to Enterprise AI: Balancing Cost, Memory, Context, and Production Reality</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 17 May 2026 06:20:31 +0000</pubDate>
      <link>https://dev.to/sreeni5018/the-pragmatic-architects-guide-to-enterprise-ai-balancing-cost-memory-context-and-production-5cpn</link>
      <guid>https://dev.to/sreeni5018/the-pragmatic-architects-guide-to-enterprise-ai-balancing-cost-memory-context-and-production-5cpn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Enterprise Generative AI has officially &lt;strong&gt;moved beyond the “cool demo” phase.&lt;/strong&gt; Most organizations can now build a basic chatbot, connect a vector database, and generate answers from static documents. The real challenge begins after that, when systems must operate reliably under enterprise-scale workloads, unpredictable user behavior, &lt;strong&gt;rising token costs, evolving business data, and strict latency expectations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where many &lt;strong&gt;GenAI/Agentic-AI&lt;/strong&gt; initiatives struggle. The gap is no longer model capability. &lt;strong&gt;The gap is architecture&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Designing sustainable AI systems is not simply about &lt;strong&gt;choosing the biggest LLM or writing longer prompts.&lt;/strong&gt; Production-grade AI requires disciplined engineering around &lt;strong&gt;context&lt;/strong&gt; management, &lt;strong&gt;memory&lt;/strong&gt; &lt;strong&gt;orchestration&lt;/strong&gt;, &lt;strong&gt;retrieval&lt;/strong&gt; optimization, &lt;strong&gt;tool&lt;/strong&gt; governance, observability, &lt;strong&gt;cost-aware&lt;/strong&gt; execution, latency reduction, and stateful orchestration.&lt;/p&gt;

&lt;p&gt;In many ways, enterprise AI is becoming &lt;strong&gt;less about prompts and more about&lt;/strong&gt; &lt;strong&gt;distributed systems design for probabilistic computing&lt;/strong&gt;. Here are the architectural principles that consistently separate scalable enterprise AI platforms from expensive prototypes.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Dynamic Model Routing Beats Static Model Binding
&lt;/h2&gt;

&lt;p&gt;One of the earliest mistakes teams make is statically binding each workflow to a fixed model, such as assigning a small model for chat, a large model for coding, and a separate model for summarization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is simple: users are unpredictable.&lt;/strong&gt; A conversation can instantly shift from a simple greeting ("Hello"), to a highly complex task ("Debug this Kubernetes deployment"), to a structural request ("Summarize this architecture document").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A statically bound architecture forces a lose-lose scenario:&lt;/strong&gt; it either overuses expensive frontier models for trivial work, or it &lt;strong&gt;sends complex reasoning tasks to lightweight models that fail&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern: Intelligent Model Routing&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Instead of binding workflows directly to models&lt;/strong&gt;, introduce a Model Router layer. Platforms like &lt;strong&gt;Microsoft Azure AI Foundry are increasingly embracing this direction by enabling multi-model orchestration, advanced routing, automated evaluation, and unified governance instead of forcing enterprises into a rigid, single-model strategy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The router dynamically analyzes the prompt's intent, complexity, cost constraints, and latency requirements&lt;/strong&gt; to choose the optimal execution model. This architecture dramatically reduces token spend, latency, and operational over-provisioning while preserving response quality.&lt;/p&gt;
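
&lt;p&gt;To make the routing idea concrete, here is a minimal sketch of a router in Python. Everything in it is an assumption for illustration: the tier names, the prices, and the keyword heuristic are placeholders, and a production router would more likely score prompts with a trained classifier or a cheap LLM call and also weigh latency and budget.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    """Hypothetical model tier; names and prices are placeholders."""
    name: str
    cost_per_1k_tokens: float


SMALL = ModelTier("small-chat-model", 0.0002)
LARGE = ModelTier("frontier-reasoning-model", 0.01)

# Crude complexity signals (assumed for this sketch only).
COMPLEX_HINTS = ("debug", "kubernetes", "stack trace", "architecture", "refactor")


def complexity_score(prompt: str) -> int:
    """Score a prompt by keyword hits plus a bonus for long prompts."""
    text = prompt.lower()
    hits = sum(1 for hint in COMPLEX_HINTS if hint in text)
    length_bonus = 1 if len(text.split()) >= 40 else 0
    return hits + length_bonus


def route(prompt: str, threshold: int = 1) -> ModelTier:
    """Send a prompt to the large tier only when it looks complex."""
    return LARGE if complexity_score(prompt) >= threshold else SMALL


if __name__ == "__main__":
    for message in ("Hello", "Debug this Kubernetes deployment"):
        print(message, "goes to", route(message).name)
```

&lt;p&gt;The point of the sketch is the shape, not the heuristic: the workflow asks the router for a model on every turn instead of holding a fixed binding.&lt;/p&gt;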

&lt;h2&gt;
  
  
  2. Multi-Turn AI Requires Memory Architecture, Not Chat History Dumps
&lt;/h2&gt;

&lt;p&gt;A surprisingly common anti-pattern is taking the entire raw conversation history and appending it back to the model with every new turn. This creates massive token waste, slower inference, context dilution, and &lt;strong&gt;"lost in the middle" failures.&lt;/strong&gt; Conversely, resetting context every turn destroys conversational continuity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern: Split Memory Architecture&lt;/strong&gt;&lt;br&gt;
Enterprise AI systems must separate memory into distinct, managed layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-Term Memory (STM):&lt;/strong&gt; Tracks the &lt;strong&gt;immediate conversation state&lt;/strong&gt;, active tasks, and localized workflow context. This is implemented using sliding windows, rolling buffers, or real-time summaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long-Term Memory (LTM):&lt;/strong&gt; Stores persistent user &lt;strong&gt;preferences&lt;/strong&gt;, historical entities, &lt;strong&gt;prior decisions&lt;/strong&gt;, and &lt;strong&gt;cross-session knowledge&lt;/strong&gt;. This layer is backed by vector databases, graph memory, and structured enterprise stores.&lt;/p&gt;

&lt;p&gt;The objective is not to remember everything; the objective is to retrieve only what matters right now. That distinction changes the entire cost structure of enterprise AI.&lt;/p&gt;
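
&lt;p&gt;A rough sketch of the split, under stated assumptions: the class names are hypothetical, the short-term layer is a plain sliding window, and the long-term layer uses keyword overlap as a stand-in for a real vector or graph store.&lt;/p&gt;

```python
from collections import deque


class ShortTermMemory:
    """Sliding window that keeps only the most recent turns."""

    def __init__(self, max_turns: int = 4):
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self.turns.append((role, text))


class LongTermMemory:
    """Toy keyword store standing in for a vector or graph database."""

    def __init__(self):
        self.facts = []

    def remember(self, fact: str) -> None:
        self.facts.append(fact)

    def recall(self, query: str, k: int = 2) -> list:
        """Return up to k facts that share at least one word with the query."""
        words = set(query.lower().split())
        scored = [(len(words.intersection(f.lower().split())), f) for f in self.facts]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in ranked[:k] if score != 0]


def build_context(stm: ShortTermMemory, ltm: LongTermMemory, user_message: str) -> str:
    """Assemble a prompt from recalled facts plus the recent window only."""
    recalled = ltm.recall(user_message)
    lines = ["Relevant facts:"] + ["- " + fact for fact in recalled]
    lines += ["Recent turns:"] + [role + ": " + text for role, text in stm.turns]
    lines.append("user: " + user_message)
    return "\n".join(lines)
```

&lt;p&gt;Older turns fall out of the window automatically, and only facts that match the current message are pulled back in, so the prompt size stays bounded no matter how long the conversation runs.&lt;/p&gt;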

&lt;h2&gt;
  
  
  3. Tool Explosion, Progressive Disclosure, and AgentSkills
&lt;/h2&gt;

&lt;p&gt;Modern enterprise agents frequently integrate with &lt;strong&gt;Jira, ServiceNow, SAP, Salesforce, SharePoint, internal APIs, and Model Context Protocol (MCP) servers&lt;/strong&gt;. A naive implementation exposes every available tool schema directly inside the system prompt.&lt;/p&gt;

&lt;p&gt;This becomes catastrophic at scale. The model spends valuable attention and &lt;strong&gt;token overhead processing massive JSON schemas&lt;/strong&gt;, &lt;strong&gt;unused tools&lt;/strong&gt;, and redundant API signatures instead of focusing on the user’s task. This fragmentation of attention introduces Context Rot, where the model loses focus because its reasoning capabilities are diluted across too many competing instructions and structural definitions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Progressive Tool Disclosure &amp;amp; AgentSkills&lt;/strong&gt;&lt;br&gt;
To prevent tool overload and context degradation from compromising model performance, production systems must adopt a dual-layer strategy that shifts the weight from raw text prompts to dynamic execution boundaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Progressive Tool Disclosure&lt;/strong&gt;&lt;br&gt;
Instead of dumping all tools into the context window upfront, only expose tool schemas that are relevant to the current stage of the active task. As the orchestration layer manages the execution graph, it filters and feeds the model a minimal, highly targeted subset of tools. This minimizes prompt size, context pollution, tool confusion, and hallucinated tool usage.&lt;/p&gt;
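
&lt;p&gt;One way to picture this is a registry that tags each tool with the workflow stages where it applies. The sketch below assumes a hypothetical registry, hypothetical tool names, and simplified schemas; it is not the API of any particular agent framework.&lt;/p&gt;

```python
# Hypothetical registry: each tool lists the workflow stages where it is useful.
TOOL_REGISTRY = {
    "jira_search_issues": {"stages": {"triage"}, "schema": {"query": "string"}},
    "jira_create_issue": {"stages": {"triage", "resolution"}, "schema": {"summary": "string"}},
    "servicenow_get_incident": {"stages": {"triage"}, "schema": {"incident_id": "string"}},
    "sap_post_invoice": {"stages": {"billing"}, "schema": {"amount": "number"}},
}


def tools_for_stage(stage: str, limit: int = 5) -> list:
    """Return only the tool definitions relevant to the current stage."""
    selected = [
        {"name": name, "schema": spec["schema"]}
        for name, spec in sorted(TOOL_REGISTRY.items())
        if stage in spec["stages"]
    ]
    return selected[:limit]


if __name__ == "__main__":
    # Only the billing tool is exposed while the agent is in the billing stage.
    print([tool["name"] for tool in tools_for_stage("billing")])
```

&lt;p&gt;The orchestration layer would call something like &lt;code&gt;tools_for_stage&lt;/code&gt; before each model call and pass only that subset, rather than the full registry.&lt;/p&gt;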

&lt;p&gt;&lt;strong&gt;AgentSkills: Procedural Knowledge as Reusable Skills&lt;/strong&gt;&lt;br&gt;
An important evolution in enterprise AI is the shift toward AgentSkills, &lt;strong&gt;where procedural knowledge is abstracted into reusable, executable skill sets rather than static text&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of repeatedly injecting large, verbose step-by-step instructions into system prompts to explain standard enterprise workflows—such as employee onboarding, compliance validation, or ticket processing—you package these workflows as encapsulated, server-side skill abstractions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smaller Initial Prompts:&lt;/strong&gt; The system prompt only needs to reference high-level skill capabilities, radically reducing baseline token consumption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic Execution:&lt;/strong&gt; By packaging logic into modular skills, you shield the model from processing the underlying boilerplate code or flat API inputs until the skill is actively invoked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal-Driven Task Decomposition:&lt;/strong&gt; Instead of relying on one giant, monolithic prompt to navigate a multi-step process, you provide a clear Goal. The orchestration layer breaks this goal into localized tasks, invoking the precise AgentSkills required for each isolated step.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Context Rot Cannot Be Solved with Bigger Prompts
&lt;/h2&gt;

&lt;p&gt;Many teams attempt to solve AI reliability problems by packing more instructions, edge cases, and examples into the prompt. Eventually, the prompt morphs into an unmaintainable specification document. &lt;strong&gt;This causes Context Rot&lt;/strong&gt;. The model loses focus because attention becomes fragmented across too many competing instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern: Goal-Driven Task Decomposition&lt;/strong&gt;&lt;br&gt;
Instead of relying on one giant, monolithic prompt, shift the responsibility to the orchestration layer. Provide the system with a clear Goal, and let the agent and model dynamically decompose that goal into smaller, localized tasks that execute, validate, and continue in isolated loops.&lt;/p&gt;

&lt;p&gt;This approach isolates context, ensures higher reasoning accuracy, reduces hallucination risk, and simplifies observability. Orchestration frameworks such as &lt;strong&gt;LangGraph, Semantic Kernel, and AutoGen become incredibly valuable here&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Observability is Non-Negotiable in Agentic Systems
&lt;/h2&gt;

&lt;p&gt;Traditional applications fail deterministically; agentic systems fail probabilistically. When an AI system hallucinates in production, finding the root cause requires answering a complex question: “Which specific context, tool, memory, or routing decision caused this outcome?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without deep observability, debugging is nearly impossible. Your core infrastructure must capture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt versions and LLM execution graphs.&lt;/li&gt;
&lt;li&gt;Exact tool invocation inputs, outputs, and latency metrics.&lt;/li&gt;
&lt;li&gt;Model routing decisions and token consumption.&lt;/li&gt;
&lt;li&gt;Retrieval results, cache hit ratios, and memory fetches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Distributed tracing, prompt telemetry, and agent step replays are no longer optional middleware—they are foundational components of a production-grade stack.&lt;/p&gt;
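
&lt;p&gt;As a minimal illustration of step-level telemetry, the sketch below records one structured entry per agent step using only the Python standard library. The class name, field names, and step kinds are assumptions for this example; a real deployment would typically export similar records to a tracing backend instead of keeping them in memory.&lt;/p&gt;

```python
import json
import time
import uuid
from contextlib import contextmanager


class AgentTrace:
    """Collects one structured record per agent step for later replay."""

    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.steps = []

    @contextmanager
    def step(self, kind: str, name: str, **attributes):
        record = {"trace_id": self.trace_id, "kind": kind, "name": name,
                  "attributes": attributes, "status": "ok"}
        started = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Keep the failure in the trace, then let it propagate.
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
            self.steps.append(record)

    def to_jsonl(self) -> str:
        """Serialize the trace as one JSON object per line."""
        return "\n".join(json.dumps(step, sort_keys=True) for step in self.steps)


trace = AgentTrace()
with trace.step("routing", "model_router", prompt_version="v3") as rec:
    rec["attributes"]["chosen_model"] = "small-chat-model"  # placeholder value
```

&lt;p&gt;Wrapping every routing decision, retrieval, and tool call in a step like this is what makes the question "which decision caused this outcome?" answerable after the fact.&lt;/p&gt;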

&lt;h2&gt;
  
  
  6. Vector Databases Need Strategic Thinking
&lt;/h2&gt;

&lt;p&gt;Choosing a vector storage solution solely based on convenience is a common pitfall. While extensions like pgvector can work perfectly fine for small prototypes, enterprise-scale semantic retrieval demands a specialized, highly optimized approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Retrieval Pipeline&lt;/strong&gt;&lt;br&gt;
Achieving high-quality Retrieval-Augmented Generation (RAG) is less about the underlying database and more about the architecture of your retrieval pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o9exlab79u6vnezqaqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3o9exlab79u6vnezqaqm.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good retrieval quality comes from a combination of robust chunking strategies, embedding alignment, metadata filtering, cross-encoder re-ranking, and context compression.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Living Documents Need Incremental Vectorization
&lt;/h2&gt;

&lt;p&gt;Enterprise knowledge bases (wikis, policies, contracts, and product catalogs) are constantly evolving. &lt;strong&gt;Re-vectorizing an entire document corpus after every minor update is an operational bottleneck that drains compute resources and drives up embedding costs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Pattern: Incremental Embedding Pipelines&lt;/strong&gt;&lt;br&gt;
Implement deterministic hashing (such as MD5 or SHA-256) on individual document chunks.&lt;/p&gt;

&lt;p&gt;When a document updates, &lt;strong&gt;chunk it and compare the new hashes against your existing vector store.&lt;/strong&gt; You only vectorize and update the specific chunks that have actually mutated. This results in lower embedding costs, faster ingestion, reduced compute usage, and smaller synchronization windows.&lt;/p&gt;
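
&lt;p&gt;The hash comparison can be sketched in a few lines. This version assumes SHA-256 over whitespace-normalized chunk text and a hypothetical &lt;code&gt;plan_sync&lt;/code&gt; helper that only decides what to embed and what to delete; the actual embedding call and vector store writes are left out.&lt;/p&gt;

```python
import hashlib


def chunk_hash(chunk: str) -> str:
    """SHA-256 of the chunk text after collapsing whitespace."""
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_sync(stored_hashes: set, new_chunks: list) -> dict:
    """Compare the new chunk hashes with the hashes already in the store."""
    new_by_hash = {chunk_hash(chunk): chunk for chunk in new_chunks}
    to_embed = [chunk for digest, chunk in new_by_hash.items() if digest not in stored_hashes]
    to_delete = sorted(stored_hashes - set(new_by_hash))
    unchanged = len(new_by_hash) - len(to_embed)
    return {"embed": to_embed, "delete": to_delete, "unchanged": unchanged}


if __name__ == "__main__":
    old = ["Refund window is 30 days.", "Shipping takes 5 days."]
    new = ["Refund window is 45 days.", "Shipping takes 5 days."]
    plan = plan_sync({chunk_hash(chunk) for chunk in old}, new)
    print(len(plan["embed"]), "chunk to embed,", plan["unchanged"], "unchanged")
```

&lt;p&gt;Normalizing whitespace before hashing is a design choice in this sketch: it keeps purely cosmetic reformatting from triggering a re-embed.&lt;/p&gt;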

&lt;h2&gt;
  
  
  8. Semantic Caching is the Hidden Cost Weapon
&lt;/h2&gt;

&lt;p&gt;Most enterprise prompts are highly repetitive. Users frequently ask similar questions, trigger identical retrieval requests, and run the same automated workflows. Recomputing these identical requests from scratch every time wastes valuable resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg41zmhlbvuarzrhnuqqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg41zmhlbvuarzrhnuqqo.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dual-Layer Semantic Caching&lt;/strong&gt;&lt;br&gt;
To optimize performance, deploy a dual-layer semantic caching strategy that functions as a high-speed, localized vector lookup:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt-Level Cache:&lt;/strong&gt; Intercepts and matches semantically similar incoming user intents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool-Level Cache:&lt;/strong&gt; Intercepts repetitive enterprise API and database calls triggered by agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching can dramatically reduce both latency and token usage. It can be applied to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt responses&lt;/li&gt;
&lt;li&gt;Retrieval outputs&lt;/li&gt;
&lt;li&gt;Tool-calling results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, semantic caching behaves like a lightweight similarity-based memory layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache invalidation matters&lt;/li&gt;
&lt;li&gt;Stale responses must be avoided&lt;/li&gt;
&lt;li&gt;TTL and refresh policies are critical&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;⚠️ Critical Warning on Cache Invalidation: Caching without proper invalidation is incredibly dangerous. Delivering a stale AI response is often worse than a slow response. You must implement robust Time-To-Live (TTL) policies, event-driven cache invalidation, and business-aware expiration logic to ensure your AI never delivers outdated information with high confidence.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
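
&lt;p&gt;Here is a small sketch of a prompt-level semantic cache with TTL expiry and an invalidation hook. It is illustrative only: the bag-of-words &lt;code&gt;embed&lt;/code&gt; function is a toy stand-in for a real embedding model, and the class name, threshold, and TTL values are assumptions.&lt;/p&gt;

```python
import math
import time
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = []  # each entry: (vector, response, stored_at)

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response, self.clock()))

    def get(self, prompt: str):
        """Return a cached response for a similar prompt, or None on a miss."""
        now = self.clock()
        # TTL-based invalidation: drop expired entries before matching.
        self.entries = [e for e in self.entries if self.ttl_seconds > now - e[2]]
        query = embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def invalidate_all(self) -> None:
        """Event-driven invalidation hook, for example when source data changes."""
        self.entries.clear()
```

&lt;p&gt;The similarity threshold controls how aggressively near-duplicate prompts are served from cache, and the TTL plus the invalidation hook are the safeguards against stale answers.&lt;/p&gt;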

&lt;h2&gt;
  
  
  9. Fine-Tuning is Often Overused
&lt;/h2&gt;

&lt;p&gt;Fine-tuning sounds attractive because it promises to inject domain expertise, reduce prompt sizes, and enforce strict formatting consistency. However, many enterprises underestimate the long-term operational burden, which includes complex dataset curation, model drift management, dedicated GPU costs, ongoing retraining pipelines, and versioning challenges.&lt;/p&gt;

&lt;p&gt;Most importantly, fine-tuned models remain static; they cannot access real-time enterprise data without external retrieval systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Strategic Reality&lt;/strong&gt;&lt;br&gt;
For the vast majority of enterprise use cases, optimizing RAG, implementing semantic caching, refining chunking strategies, and establishing robust memory design delivers a significantly higher ROI than fine-tuning.&lt;/p&gt;

&lt;p&gt;Fine-tuning should be strictly reserved for specialized output formats (like custom JSON structures), highly constrained styling behaviors, domain-specific generation languages, or unique reasoning patterns. Keep the model foundational, and keep the architecture modular.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning introduces additional operational complexity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Curated datasets&lt;/li&gt;
&lt;li&gt;GPU infrastructure&lt;/li&gt;
&lt;li&gt;MLOps / LLMOps pipelines&lt;/li&gt;
&lt;li&gt;Monitoring and evaluation&lt;/li&gt;
&lt;li&gt;Governance and retraining&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Because of this, I generally recommend exhausting higher-ROI optimizations first:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt engineering&lt;/li&gt;
&lt;li&gt;RAG&lt;/li&gt;
&lt;li&gt;Memory&lt;/li&gt;
&lt;li&gt;Routing&lt;/li&gt;
&lt;li&gt;Caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning makes the most sense when enterprises require:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly specialized behavior&lt;/li&gt;
&lt;li&gt;Strict response formats&lt;/li&gt;
&lt;li&gt;Domain-specific language&lt;/li&gt;
&lt;li&gt;Consistent deterministic outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  10. Chunking Strategy is More Important Than Most Teams Realize
&lt;/h2&gt;

&lt;p&gt;Many RAG failures are not caused by the model; they are caused by poor chunking. If your chunks are too large, retrieval becomes incredibly noisy. If they are too small, core semantic meaning breaks. If they are poorly structured, the contextual coherence collapses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunking is not merely splitting text based on fixed character counts; it is the art of preserving semantic meaning boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Useful Mental Model: &lt;strong&gt;Chunking is like cutting an elephant into LEGO pieces. The shape of the piece matters just as much as its overall size.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An optimal chunking strategy must explicitly account for document hierarchies, semantic transitions, structural tables, code blocks, headers, and metadata relationships. Optimizing your chunking methodology will almost always yield a greater improvement in retrieval quality than switching to a larger LLM.&lt;/p&gt;
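
&lt;p&gt;To contrast with fixed character counts, the following sketch splits on headings first and then packs whole paragraphs into chunks. It assumes Markdown-style "#" headings and blank-line paragraph breaks, and the function name and size limit are hypothetical; tables, code blocks, and richer metadata would need extra handling.&lt;/p&gt;

```python
def chunk_by_structure(text: str, max_chars: int = 400) -> list:
    """Split on headings, then pack whole paragraphs into chunks up to max_chars."""
    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Emit the buffered paragraphs as one chunk tagged with its heading.
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # a chunk never straddles two sections
            heading = block.lstrip("#").strip()
            continue
        current_size = sum(len(paragraph) for paragraph in buffer)
        if buffer and current_size + len(block) > max_chars:
            flush()
        buffer.append(block)
    flush()
    return chunks
```

&lt;p&gt;Each chunk keeps its section heading as metadata, which can later feed metadata filtering at retrieval time. A paragraph longer than the limit is kept whole here rather than cut mid-sentence.&lt;/p&gt;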

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dusxo8pfkwio08m24f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86dusxo8pfkwio08m24f.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought: Enterprise AI is a Systems Engineering Discipline
&lt;/h2&gt;

&lt;p&gt;The industry initially treated &lt;strong&gt;GenAI/Agentic-AI&lt;/strong&gt; as a &lt;strong&gt;prompt engineering problem.&lt;/strong&gt; Today, it has &lt;strong&gt;clearly evolved into a memory architecture, distributed systems, retrieval engineering, cost optimization, and workflow orchestration challenge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The winning enterprise AI platforms will not necessarily be the ones deploying the largest standalone models. They will be the ones that &lt;strong&gt;build better orchestration, superior memory management, deep observability, resilient retrieval pipelines, and highly optimized context engineering layers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In production AI systems, architecture eventually matters more than prompts.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary Checklist for AI Architects
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpb1867aoj188ovz8ayd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpb1867aoj188ovz8ayd.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>You Were Trained. But Are You Ready to Serve?</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sat, 16 May 2026 14:10:26 +0000</pubDate>
      <link>https://dev.to/sreeni5018/you-were-trained-but-are-you-ready-to-serve-10j4</link>
      <guid>https://dev.to/sreeni5018/you-were-trained-but-are-you-ready-to-serve-10j4</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The gap between building an LLM and running it in production, and what it teaches us about our own careers.&lt;/p&gt;

&lt;p&gt;We have all met that person. Top of their class. Brilliant in theory. Deep, encyclopedic knowledge in their field. And yet, somehow, they struggle the moment real work lands on their desk. They freeze when faced with ambiguous problems. They slow down under pressure, failing to deliver at the level everyone expected.&lt;/p&gt;

&lt;p&gt;The world of machine learning has a name for this exact failure mode. It isn’t a training problem. It’s a &lt;strong&gt;serving problem&lt;/strong&gt;. Once you see it through this lens, you will never look at your education, your career, or your daily workflow the same way again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 1: You Are a Model.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;College Was Your Training Run.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In machine learning, a model begins as a blank slate: an empty architecture with no knowledge, no instincts, and no ability to recognize patterns.&lt;/p&gt;

&lt;p&gt;Then, training begins. The model is fed enormous amounts of data (text, images, signals), and it fails constantly. Each failure produces an error signal. That error signal flows backward through the network, making tiny adjustments to the model's internal parameters: its &lt;strong&gt;weights and biases&lt;/strong&gt;.&lt;/p&gt;
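
&lt;p&gt;To make the error-signal loop concrete, here is a minimal sketch of gradient descent on a single weight and bias. The toy data (points on the line y = 2x + 1), the learning rate, and the step count are illustrative assumptions, not values from any real training run.&lt;/p&gt;

```python
# Toy training loop: fit y = w*x + b to points generated from y = 2x + 1.
# The data, learning rate, and step count are illustrative assumptions.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]


def train(steps=2000, lr=0.02):
    w, b = 0.0, 0.0  # the "blank slate": no knowledge yet
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            error = (w * x + b) - y               # error signal for one example
            grad_w += 2.0 * error * x / len(data)  # gradient of mean squared error
            grad_b += 2.0 * error / len(data)
        w -= lr * grad_w  # a tiny adjustment to the weight
        b -= lr * grad_b  # and to the bias
    return w, b


w, b = train()
print(round(w, 3), round(b, 3))
```

&lt;p&gt;Each pass nudges &lt;code&gt;w&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; a little closer to the values that generated the data. Real models do the same thing across billions of parameters.&lt;/p&gt;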

&lt;p&gt;Repeat this process millions of times, and something remarkable happens. The model stops failing randomly and starts recognizing structure. It builds intuition, develops defaults, and becomes capable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is exactly what formal education does to you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every lecture, every textbook chapter, every exam you failed and had to retake, every piece of stinging feedback from a mentor—each one was a &lt;strong&gt;gradient update&lt;/strong&gt;. It was a small error signal flowing back through your thinking, adjusting your internal parameters. Your weights and biases are your professional instincts: how you approach a problem, what tool you reach for first, and how you reason under pressure.&lt;/p&gt;

&lt;p&gt;College built those slowly, painfully, and iteratively. Training was never truly about the grade; it was about adjusting the weights.&lt;/p&gt;

&lt;p&gt;But remember: this phase is long and controlled. The data is curated, and the environment is safe. The answers exist somewhere, and someone is grading your output against them. Training is preparation, not performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 2: Your Degree Is Your Domain-Specific Fine-Tune.
&lt;/h3&gt;

&lt;p&gt;After general pre-training, machine learning models go through a second phase called fine-tuning. The base model already has broad capabilities: it understands language, logic, and patterns. Fine-tuning narrows that capability toward a specific domain.&lt;/p&gt;

&lt;p&gt;A model fine-tuned on medical data learns to reason about symptoms and diagnoses. One fine-tuned on legal documents learns to navigate argument and precedent. It’s the same base architecture, but a completely different specialization.&lt;/p&gt;

&lt;p&gt;Your degree is your fine-tune. You stopped being a general learner and became domain-specific. Your configuration was set, and your weights were adjusted for a particular problem space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A medical student's parameters are tuned to healthcare.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A software engineer's are tuned to systems and logic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A finance major's are tuned to risk, capital, and market behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the time you walk across that graduation stage, your architecture is locked in. You are no longer a blank model trained on everything broadly; you are a specialized model trained deeply on something specific.&lt;/p&gt;

&lt;p&gt;That is the value your institution produced, and specific is exactly what the real world hires for.&lt;/p&gt;

&lt;p&gt;But here is the thing nobody tells you at graduation: &lt;strong&gt;Fine-tuning is not the finish line&lt;/strong&gt;. It is just the end of the controlled phase. The real test begins somewhere else entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 3: Getting the Job Is Deployment.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;And Deployment Changes Everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In machine learning, when a model finishes training and fine-tuning, it gets deployed into production. This is called &lt;strong&gt;model serving&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model is now live. Real users send real requests. The environment is absolutely nothing like training. There is no curated dataset, no answer key, and no controlled batch of problems neatly designed to be solvable. There are just requests—unpredictable, varied, and arriving concurrently at any time. The model must handle them fast, reliably, and accurately.&lt;/p&gt;

&lt;p&gt;When you land your first job, you have been deployed. And the rules change completely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsd70hn5eyaotqfej5o9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpsd70hn5eyaotqfej5o9.png" alt=" " width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model serving is the most critical phase of the entire pipeline.&lt;/strong&gt; It is where value is actually created: not in the research notebook, but in production, under real load, handling requests the model has never seen before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;A model that trains beautifully but collapses in production is entirely worthless.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gmqaw5683o86rel3926.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gmqaw5683o86rel3926.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: The Uncomfortable Truth of Brilliant People Who Cannot Perform
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;We have all witnessed it:&lt;/strong&gt; the student who aced every exam but freezes the moment a project doesn't fit a known template. The top graduate who cannot handle ambiguity. The deeply knowledgeable professional who always seems behind, overwhelmed, and bottlenecked on every task.&lt;/p&gt;

&lt;p&gt;This is not an intelligence failure, nor is it a lack of knowledge. In machine learning terms, &lt;strong&gt;this is a well-trained model with broken serving infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The weights are good, the training was solid, and the fine-tuning was real. But when the model hit production, when unseen requests started arriving in real time with no answer key, the infrastructure around it simply couldn't handle the load. Requests queued up, memory was wasted, and output slowed to a crawl. The model was capable, but the serving layer was not.&lt;/p&gt;

&lt;p&gt;Training quality and serving quality are two completely separate problems. A brilliant model can fail in production, and a brilliant person can fail at work for the exact same reason.&lt;/p&gt;

&lt;p&gt;This is the gap nobody talks about in education. Schools optimize entirely for training quality: better lectures, better exams, better grades. Nobody teaches you how to serve. Nobody teaches you how to handle requests you’ve never seen, how to manage your cognitive resources under concurrent load, or how to build the execution infrastructure that turns what you know into what you consistently deliver.&lt;/p&gt;

&lt;p&gt;In machine learning, two frameworks represent exactly this divide. One is built for training and research; the other is built for production serving. Understanding what separates them changes everything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 5: Hugging Face Transformers vs. vLLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Framework 1: Hugging Face, the Brilliant Student Who Works Alone
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face Transformers is the gold standard for research&lt;/strong&gt;, experimentation, fine-tuning, and prototyping. If you want to load a &lt;strong&gt;state-of-the-art&lt;/strong&gt; model and iterate on an idea, it’s extraordinary.&lt;/p&gt;

&lt;p&gt;But when you take a Hugging Face model and naively deploy it to serve real user traffic, engineering bottlenecks surface fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Static batching:&lt;/strong&gt; A naive serving loop groups requests into a fixed batch and runs that batch until its longest sequence finishes. If requests arrive unevenly or vary in length, the GPU idles, throughput drops, and users wait.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory pre-allocation:&lt;/strong&gt; A naive setup reserves a fixed, contiguous block of GPU memory per request, sized for the maximum possible sequence length, even if the request is short. Most of that memory is wasted, so you run out of memory far too early under real load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No shared caching:&lt;/strong&gt; If a hundred users start with the same long system prompt, attention states are recomputed a hundred times from scratch with no reuse.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Pipeline Jams:&lt;/strong&gt; A single long generation occupies a batch slot, blocking faster, shorter requests behind it.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Human Equivalent:&lt;/strong&gt; This is the brilliant professional who works deeply but can only handle &lt;strong&gt;one task at a time&lt;/strong&gt;. They take on a problem, give it everything, finish it completely, and &lt;em&gt;then&lt;/em&gt; pick up the next. They never build systems, and they don't document solutions for reuse, so every new project starts from scratch. They are outstanding in a controlled environment, but entirely overwhelmed the moment volume, concurrency, and unpredictability arrive simultaneously.&lt;/p&gt;

&lt;p&gt;Hugging Face isn't wrong; it is perfectly designed for its purpose. The mistake is using a research tool as a production serving engine, assuming that being well-trained is the same as being ready to serve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Framework 2: vLLM, the Same Model Built to Serve
&lt;/h3&gt;

&lt;p&gt;vLLM is an open-source inference engine built with a single purpose: serving large language models in production at scale. It doesn’t change the model’s weights or retrain anything. It takes the exact same model that runs in Hugging Face and serves it in a way optimized for real traffic, memory constraints, and throughput requirements.&lt;/p&gt;

&lt;p&gt;The results are dramatic: the vLLM team's published benchmarks report &lt;strong&gt;up to 24x higher throughput&lt;/strong&gt; than Hugging Face Transformers for the same model on the same hardware, simply because the serving layer was optimized.&lt;/p&gt;

&lt;p&gt;Four core engineering innovations make this possible, and each has a direct equivalent in how high-performing people operate in the real world:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. PagedAttention vs. Focused Attention
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In ML:&lt;/strong&gt; Traditional serving pre-allocates one massive block of GPU memory per request, like reserving an entire hotel floor for a single guest. Most of it sits empty. vLLM's PagedAttention manages the KV cache in small, dynamic, non-contiguous pages. Memory is allocated only as needed and released immediately upon completion, resulting in near-zero waste. This is how vLLM handles dramatically more concurrent traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In You:&lt;/strong&gt; High performers do not hold every open project and pending email simultaneously in their active working memory. They "page in" what the current task actually needs, process it, and release it. People who carry everything at once feel constantly busy, but their output is fragmented, slower, and lower quality. Focused attention isn't a soft skill—it’s memory management.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
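
&lt;p&gt;The paging idea can be sketched with a toy allocator. This is not vLLM's implementation: the class name &lt;code&gt;PagedKVCache&lt;/code&gt;, its methods, and the tiny page size are hypothetical, chosen only to show memory being handed out one page at a time and returned as soon as a request completes.&lt;/p&gt;

```python
class PagedKVCache:
    """Toy block allocator: KV memory is handed out one small page at a time.

    Hypothetical illustration only; names and sizes are assumptions.
    """

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical page ids
        self.block_tables = {}  # request id -> physical pages, in logical order
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, req_id):
        table = self.block_tables.setdefault(req_id, [])
        stored = self.lengths.get(req_id, 0)
        if stored == len(table) * self.block_size:  # all current pages are full
            if not self.free_blocks:
                raise MemoryError("no free KV pages left")
            table.append(self.free_blocks.pop(0))   # grab one more page on demand
        self.lengths[req_id] = stored + 1

    def release(self, req_id):
        # Return every page to the pool as soon as the request completes.
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):  # requests "a" and "b" grow in lockstep
    cache.append_token("a")
    cache.append_token("b")

print(cache.block_tables)      # each request owns scattered, non-contiguous pages
cache.release("a")
print(len(cache.free_blocks))  # the pages held by "a" are back in the pool
```

&lt;p&gt;Because the two requests grow at the same time, their pages end up interleaved in physical memory, and neither one reserves space it has not used yet.&lt;/p&gt;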

&lt;h3&gt;
  
  
  2. Continuous Batching vs. Pipeline Thinking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In ML:&lt;/strong&gt; Instead of waiting for an entire batch to run to completion (static batching), vLLM uses continuous batching. The moment any slot completes its generation, a new request is slotted in immediately. The GPU stays busy, and throughput skyrockets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In You:&lt;/strong&gt; Effective professionals design their workflow so it never idles. While one deliverable is in review, the next is already in motion. While waiting on a response, another task is being processed. This isn't frantic multitasking; it is deliberate sequencing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
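
&lt;p&gt;A small simulation makes the difference visible. It counts decode steps for the same queue of requests under both policies. The function names, the request lengths, and the batch size are hypothetical, and the model is simplified so that every running request produces exactly one token per step.&lt;/p&gt;

```python
def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished slot is refilled from the queue before the next step."""
    queue = list(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        while queue and batch_size > len(running):
            running.append(queue.pop(0))  # fill every free slot
        steps += 1                        # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]  # drop finished requests
    return steps


# Assumed workload: mostly short generations with two long ones mixed in.
lengths = [2, 8, 2, 2, 2, 8, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

&lt;p&gt;With static batching, every short request paired with a long one holds its slot idle until the long one finishes. With continuous batching, that slot is reused right away, so the same work completes in fewer steps.&lt;/p&gt;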

&lt;h3&gt;
  
  
  3. KV Cache Reuse vs. Your Body of Prior Work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In ML:&lt;/strong&gt; In enterprise applications, requests constantly repeat the same system prompt. Hugging Face recomputes those attention states from scratch every single time. vLLM uses prefix caching to compute those states once, store them, and allow subsequent requests to retrieve the cache instantly instead of recomputing. Latency drops off a cliff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In You:&lt;/strong&gt; Every problem you have solved and documented, every framework you've built, and every decision log, template, or post-mortem you’ve written down is your personal KV cache. You don't start from scratch on a new task; you retrieve, adapt, and ship. Professionals who never build this cache spend their entire careers recomputing things they solved years ago.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
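
&lt;p&gt;Here is a rough sketch of the compute-once, reuse-many idea. The &lt;code&gt;PrefixCache&lt;/code&gt; class is hypothetical, and &lt;code&gt;_compute_states&lt;/code&gt; is a stand-in for the expensive attention prefill. In a real engine the states for the user message also depend on the prefix; this toy ignores that and only counts how many tokens get processed.&lt;/p&gt;

```python
class PrefixCache:
    """Toy prefix cache: states for a shared system prompt are computed once."""

    def __init__(self):
        self.store = {}           # prefix tokens -> cached states
        self.computed_tokens = 0  # how many tokens went through "prefill"

    def _compute_states(self, tokens):
        # Stand-in for the expensive attention computation (an assumption).
        self.computed_tokens += len(tokens)
        return tuple(len(t) for t in tokens)

    def prefill(self, system_prompt, user_message):
        key = tuple(system_prompt)
        if key not in self.store:  # first request pays for the prefix
            self.store[key] = self._compute_states(system_prompt)
        prefix_states = self.store[key]  # later requests reuse it
        suffix_states = self._compute_states(user_message)
        return prefix_states + suffix_states


system_prompt = ["You", "are", "a", "helpful", "assistant"]
cache = PrefixCache()
for i in range(100):  # a hundred users share the same system prompt
    cache.prefill(system_prompt, ["question", str(i)])

without_cache = 100 * (len(system_prompt) + 2)
print(cache.computed_tokens, without_cache)
```

&lt;p&gt;The shared prefix is paid for once, and every later request only pays for its own new tokens.&lt;/p&gt;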

&lt;p&gt;&lt;strong&gt;KV Cache Joke&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35eh0edn235xes6wq292.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F35eh0edn235xes6wq292.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  4. High Throughput vs. Output Matching Capability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In ML:&lt;/strong&gt; The combined effect of these innovations means more requests handled per second and a lower time-to-first-token on the exact same hardware. The model didn’t get smarter; the infrastructure got optimized.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In You:&lt;/strong&gt; This translates to more output per unit of energy. Not by working longer hours or magically knowing more, but by removing the friction between what you are capable of and what you actually deliver.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4bh7knfkmlwia3ev8s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw4bh7knfkmlwia3ev8s3.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Point: Build the Model.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Then Build the Inference Engine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your education trained you&lt;/strong&gt;. &lt;strong&gt;Your discipline fine-tuned you.&lt;/strong&gt; &lt;strong&gt;Your first job deployed you into production&lt;/strong&gt;. Those phases matter deeply, and the years you put into them are real. But they only produced a capable model, not an optimized serving layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production is where the game is actually played:&lt;/strong&gt; concurrent demands, zero answer keys, immediate deadlines, and real stakes. This is where training quality stops being the differentiator, and serving infrastructure takes over.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The person who implements &lt;strong&gt;PagedAttention&lt;/strong&gt; (focused, uncluttered cognitive management) processes each task more clearly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The person who practices &lt;strong&gt;Continuous Batching&lt;/strong&gt; (keeping their pipeline moving steadily) delivers consistently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The person who builds a &lt;strong&gt;KV Cache&lt;/strong&gt; (documenting and templating solutions) never wastes time recomputing the past.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hugging Face gets you running; vLLM gets you scaling. Your degree got you deployed, but how you serve is how you are remembered.&lt;/p&gt;

&lt;p&gt;The question was never whether you were trained well enough. The question is whether your infrastructure is ready for production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
