DEV Community: Seenivasa Ramadurai

AI Agents Are the New Microservices & A2A Is Their HTTP(s)

Seenivasa Ramadurai — Fri, 29 May 2026 23:14:43 +0000

Introduction

As enterprises race to deploy generative AI Apps/Agents, the hardest question isn't "which foundation model do we use?." it's "how do they safely talk to each other?"

If you spent the 2010s building distributed systems, the architectural blueprints emerging for enterprise AI will feel strangely familiar. Bounded contexts, service registries, async message queues, and distributed tracing are all back. The vocabulary is almost identical except our "services" now reason in natural language, call tools, and produce probabilistic, context-aware outputs instead of deterministic ones.

The Agent-to-Agent (A2A) Protocol is the open-standard transport and interface layer that makes this architectural analogy concrete. And ,the protocol now has support from more than 150 organizations including Salesforce, Microsoft, SAP, Workday, PayPal, and LangChain.

Just as HTTP/REST became the lingua franca of Microservice communication, A2A (now hosted under the Linux Foundation) standardizes how autonomous agents discover capabilities, delegate tasks, and maintain security boundaries.

Defining the Ecosystem: A2A vs. MCP

To design an enterprise multi agent mesh, you must first separate agent orchestration from tool execution. A common architectural anti pattern is trying to force a single protocol to handle both.

Model Context Protocol (MCP): This handles the Agent-to-Tool layer. It standardizes how a single agent securely reads from local databases, hooks into enterprise storage, or accesses development environments.

Agent-to-Agent Protocol (A2A): This handles the Agent-to-Agent layer. It standardizes how separate, sovereign intelligent systems communicate with each other in their natural, semantic modalities (negotiating tasks, passing conversational state, or handing off workflows) across frameworks and lines of business.

The key distinction: MCP connects agents to tools (vertical integration). A2A connects agents to each other (horizontal integration). They are explicitly designed to be complementary, not competitive. Together, they form the two-layer interoperability stack for modern multi-agent systems.

Under the Hood: How A2A Actually Works

Before diving into communication styles, it helps to understand the technical foundation A2A is built on because it is deliberately not reinventing the wheel.

A2A leverages well established web technologies.

HTTP/HTTPS — primary transport layer (production deployments require HTTPS with modern TLS)
JSON-RPC 2.0 — structured data exchange format for all requests and responses
Server-Sent Events (SSE) — real-time, one-way streaming of updates from agent to client

Every A2A agent publishes a small JSON document called an Agent Card, typically served at /.well-known/agent.json. This file lists the agent's identity, skills, endpoint URL, and authentication requirements — enabling zero-configuration discovery between agents without any proprietary registry or coordination layer.

Security is baked in from the start. A2A incorporates enterprise-grade authentication and authorization mechanisms aligned with OpenAPI security schemes, including support for OAuth 2.0 and API keys passed via HTTP headers.

The Four A2A Communication Styles

The A2A standard defines clear execution modes that mirror the structural communication patterns distributed systems engineers have relied on for decades.

1. Synchronous (Blocking)

One agent sends a task and blocks its execution context until the responding agent returns a final artifact.

Microservices Analogy: A standard REST call (GET /resource).

AI Use Case: Fast, critical path dependency queries like an Orchestrator agent requesting a real time risk compliance score before formatting a customer response.

2. Asynchronous (Non-Blocking)

One agent dispatches a task object and immediately returns to other processing. The remote agent queues the work and processes it in the background.

Microservices Analogy: Message queues or event streams (Kafka, RabbitMQ).

AI Use Case: Long-running cognitive tasks such as a Legal Agent reading a 400-page corporate acquisition contract or a Data Agent running complex batch classification.

3. Streaming

Continuous data tokens or partial states flow dynamically between agents in real time, rather than waiting for a single completed payload.

Microservices Analogy: gRPC streaming or Server-Sent Events (SSE).

AI Use Case: Real-time speech transcription agents feeding an analysis agent, or interactive multi-agent chat interfaces where UX requires instant token delivery.

4. Push Notifications (Event-Driven)

An agent registers a web callback or subscription, receiving a proactive alert only when a specific upstream event or state change occurs. When significant task state changes happen such as completed, failed, or input-required the server sends an asynchronous HTTP POST notification to the client's provided web hook. This requires the server to declare push notification capability in its Agent Card.

Microservices Analogy: Web hooks or an Event Bus.

AI Use Case: Event-driven governance like an automated Compliance Agent waking up to audit a transaction only when an Account Agent drafts a contract exceeding $1M.

Key Architectural Insight: A mature multi-agent enterprise system never forces a single interaction pattern. It builds a mesh that combines all four, leveraging an internal API gateway plane to manage traffic, route tasks, and handle fallback strategies.

The Critical Shift: From Deterministic to Semantic Interfaces

In traditional microservices, the API contract is strictly deterministic: Send these exact bytes, receive those exact bytes.

In a multi-agent network, the interface is semantic: Send this intent, receive a reasoned response.

Instead of maintaining brittle endpoints for every hyper-specific query variation, an agent uses its Agent Card to advertise its overall "Skills" and expected structural input/output schemas. A Finance agent capable of calculating remaining Q3 headcount budgets does not require a new API endpoint deployment when business users slightly pivot the nuance of the request; it interprets the intent via the A2A task lifecycle.

The "beating heart" of this lifecycle is the task's input-required state, which allows agents to pause execution mid-task and request further information from clients or other agents something traditional REST APIs were simply never designed to do. This makes agent conversations stateful and adaptive in a way that static Microservice contracts are not.

Conclusion

The parallels between the microservices revolution of the 2010s and today's multi-agent AI ecosystem are not just cosmetic. The same hard-won lessons around service discovery, security boundaries, async communication, and composable architecture are being relearned and encoded into open standards like A2A and MCP.

A2A is an open standard that enables AI agents to discover, communicate, and transact with each other across different frameworks, vendors, and platforms. MCP handles how each of those agents connects to its tools. Together they give architects a principled, two-layer model for building AI systems that are modular, interoperable, and production-ready.

The momentum behind A2A growing from 50 launch partners to 150+ organizations in under a year underscores something simple fragmentation in AI agent ecosystems is a problem the industry is collectively choosing to solve. For engineers building in this space today, the question is no longer whether these protocols matter. It's whether your architecture is ready for the systems around you that already use them.

Thanks
Sreeni Ramadorai

The Agent Harness Taught Me Why I Used to Fail

Seenivasa Ramadurai — Thu, 28 May 2026 21:08:18 +0000

On building AI agents and accidentally understanding yourself

Introduction

We tend to believe that intelligence is the ultimate differentiator that if we think clearly enough, know enough, and work hard enough, success follows. It's a comforting idea. It's also incomplete.

I didn't fully understand that until I started building AI agents.

Specifically, it hit me while designing the Harness layer for a Digital Worker (AI Agent) the architectural component responsible for orchestrating tasks, managing priorities, regulating execution, and keeping the agent coherent across complex, multi-step workflows. The Harness isn't the brain. It isn't the memory. It's the discipline layer the scaffolding that ensures raw capability actually translates into reliable output.

And as I built it, I kept thinking: how many times in my own life did I have the intelligence, the knowledge, even the opportunity and still fall short?

Not because I wasn't capable. But because I lacked exactly what the Harness provides orchestration, prioritization, emotional balance, structured execution, and the feedback loops to course-correct in real time.

This blog is part technical exploration, part honest reflection. Whether you are an engineer building intelligent systems, a leader navigating complexity, or simply someone trying to understand why effort alone doesn't always produce results the architecture of an AI agent has something surprising to say about the architecture of a human being.

The gap between potential and performance in agents and in people isn't usually about intelligence. It's about what holds everything together.

The Technical Layer

What is the Agent Harness and Why Does It Matter?
When most people discuss AI agents, the conversation gravitates toward the model, the memory, or the tools. These are the visible, exciting components the intelligence, the knowledge base, the capabilities.

But the Harness layer is the real operational backbone.

It orchestrates tasks, manages priorities, controls execution flow, handles failures gracefully, applies guardrails, maintains context across long-running workflows, and prevents the agent from spiraling into chaos or stalling indefinitely. It is the operational nervous system that connects intelligence to consistent, reliable action.

Without a Harness, even the most capable AI agent becomes unpredictable. It may perform brilliantly in controlled settings and collapse the moment conditions become complex, ambiguous, or adversarial. The model stays sharp. But the system breaks down.

That distinction between raw capability and disciplined execution is exactly what I want to explore here.

The Personal Parallel

The Moment It Got Personal
While designing the Harness, something clicked that went beyond systems architecture.

Many times in my life, I didn't fail because I lacked intelligence, talent, or technical knowledge. I failed because I lacked orchestration. Clear prioritization. Emotional regulation. Structured execution. Feedback loops. Consistency.

The same things that break AI agents in production.

That realization hit me harder than any architecture diagram ever could.

We often assume success comes purely from reasoning ability or memory both in humans and in AI. But real-world execution depends on something deeper. Something that doesn't show up on a résumé or a benchmark score.

Core Principles

Six Things That Break Agents and People
Whether we are talking about enterprise AI systems or individual human performance, the failure points are strikingly similar. Real world execution demands all six of these and notably, four of them map directly to the core components of the Agent Harness.

1.Managing Overload [Context]

Knowing what is relevant now without drowning in everything at once. Context overload collapses both agents and people the harness enforces what stays in scope.

2.Using the Right Capability [Tool]

Knowing which tool, skill, or resource to deploy and when. Raw access to capabilities means nothing without the judgment to use them correctly under pressure.

3.Recovering from Failure [Loop]

Completing feedback loops detecting what went wrong, adjusting, and trying again. Without loops, both agents and people keep repeating the same mistakes.

4.Staying Within Bounds [Governance]

Applying guardrails that prevent drift ethical, operational, and behavioral. Governance is not a constraint on performance; it is the condition for trust.

5.Prioritization

Knowing what matters now versus later. Without clear prioritization, effort gets scattered, urgency becomes noise, and the most important things rarely get done.

6. Repeatable Execution

Building patterns that hold up consistently not just when conditions are ideal. Discipline is what turns one-time performance into reliable delivery over time.

These are not soft skills. They are not secondary concerns. In production AI systems, failing at any one of these causes real operational breakdowns. And in life, the story is no different.

A Broader Reflection

What Software Engineering Quietly Teaches You
The strange thing about software engineering is that if you stay in it long enough, it reshapes how you think about yourself slowly, without announcement.

Building distributed systems teaches patience. You learn that complex things fail in non-obvious ways, that the answer is rarely where you first looked, and that premature conclusions are more dangerous than no conclusion at all.

Debugging teaches humility. Every session is a reminder that your mental model of reality is incomplete. The bug isn't in the code it's in the assumption you forgot you were making.

Designing AI agents teaches self-awareness. Because you are not just modeling intelligence. You are modeling the entire operating system of a functioning entity how it perceives, decides, acts, recovers, and adapts. And somewhere in that process, you start to see yourself reflected back.

The Agentic AI systems we build are not mirrors. But they are close enough to matter.

Closing

I Wasn't Just Building a Control Layer for an AI Maybe that is why designing the Agent Harness feels so strangely personal.

I wasn't just architecting a component that manages workflow state, enforces guardrails, and ensures execution coherence. I was finally articulating something I had lived through but never quite named the difference between having capability and having the structure to deploy it.

The Harness doesn't make an agent smarter. It makes the agent's intelligence usable, consistent, and trustworthy under real-world pressure.

That is what personal growth looks like too. Not acquiring more intelligence. Not gathering more memory or more tools. But building the internal structure that allows everything you already have to work together, consistently, under pressure, over time.

The deeper I go into Agentic AI, the more I believe this: the most important breakthroughs are not always about capability. Sometimes, they are about architecture.

Intelligence without orchestration is potential without performance. The harness is not a constraint it is the condition for everything else to work.

I started this by adding a Harness to an AI agent.
I ended it wondering who's going to add one to me.

Thanks
Sreeni Ramadorai

Transformers & Agile Sprints: The Art of Incremental Evolution

Seenivasa Ramadurai — Wed, 27 May 2026 21:04:09 +0000

Ever wonder why Transformer models are so incredibly effective at scaling? It turns out they share a fundamental philosophy with modern software engineering: they never build from scratch. In machine learning, Residual Connections (or skip connections) act as an information bridge. Instead of forcing a neural network to completely reinvent its intelligence at every single layer, the model simply adds new insights to what it already knows. It preserves the foundational knowledge, preventing data from degrading as it goes deeper.

Sound familiar? That is exactly how high-performing Agile teams operate.

Instead of waiting for a single, massive "grand plan" release, Agile teams enhance a working product sprint by sprint. You deliver value incrementally, gather feedback, and iterate without tearing down the core infrastructure you already built.

🧠 Deep Dive: How Residual Connections Save Deep Transformers

To truly appreciate this parallel, look at what happens inside the Transformer architecture. As models grow to dozens or hundreds of layers, they face two massive technical hurdles: Vanishing Gradients and Information Degradation.

Without residual connections, the raw input signal gets warped and lost the deeper it travels through self-attention and feed-forward networks.

Residual connections solve this by changing the fundamental mathematical objective of a layer. Instead of forcing a layer to learn an entirely new mapping $H(x)$, the layer only has to learn a residual mapping $F(x) = H(x) - x$. The final output of the block becomes:

$$𝖮𝗎𝗍𝗉𝗎𝗍 = F(x) + x$$

By adding the original input $x$ directly to the output of the sub-layer, Transformers gain two massive engineering advantages:

Unobstructed Gradient Flow: During back propagation, the gradient can flow directly through the skip connection without being altered or diminished by the layer's weights. This completely mitigates the vanishing gradient problem, allowing us to train models with hundreds of layers.
Feature Preservation: The identity shortcut ensures that the core semantic meaning established in early layers isn't corrupted or forgotten by complex attention calculations later in the stack.

The Core Parallel

The Layer vs. The Sprint: A neural network layer computes incremental feature adjustments ($F(x)$) while maintaining the input foundation ($x$); an Agile sprint delivers incremental feature updates while maintaining the stable application baseline.
The Foundation: Residual connections pass raw data forward so deep networks don't lose their identity or variance. Agile version control and MVP architecture ensure teams don't lose sight of the core product value.
The Goal: Both systems leverage previous successes to achieve complex, sophisticated outcomes faster and with less risk of systemic failure.

Stop trying to rebuild the wheel at every stage of development whether you are training a billions-parameter model or leading a cross functional engineering team. Build the foundation, protect it, and iterate.

Thanks
Sreeni Ramadorai

Your LLM Is Not an Agent. Your Framework Is Not Enough. You Need a Harness.

Seenivasa Ramadurai — Mon, 25 May 2026 06:20:59 +0000

Introduction

Every team building with AI agents hits the same wall. The demo works beautifully. The agent answers questions, calls tools, produces results. Then you ship it and the cracks appear it loses track of what it was doing, burns through API calls in circles, ignores boundaries it should respect, forgets context from five minutes ago. Users lose trust. Engineers lose sleep.

This is not a model problem. The LLM is capable. It's an infrastructure problem. The agent has a brain but no operating environment no structured loop to run in, no memory to draw on, no rules to constrain it, no way to resume where it left off. You gave it intelligence without giving it a way to apply that intelligence reliably.

That operating environment is called a Harness. And it's what separates a demo agent from one you'd actually trust in production.

What breaks without a harness

🔁 Infinite loops or premature stops. The agent has no governing loop it either runs forever or halts before the task is done.

🧠 Context amnesia. Long tasks overflow the context window. The agent loses the thread and starts hallucinating or repeating itself.

💾 No memory between sessions. Every conversation starts from zero. Multi-step, multi-day workflows are impossible.

🔧 Tool failures cascade. One flaky API brings the whole agent down because there's no error handling layer.

🚨 No guardrails. The agent touches system it should not.

You're Already Using the Pieces. A Harness Is How You Make Them Work Together.

If you've been building AI agents for a while, you know the drill. You pick a framework CrewAI, LangGraph, Strands, Microsoft Agent Framework and you start wiring things up. You add memory so the agent remembers things. You register tools so it can take actions. You configure guardrails so it doesn't go off the rails. You set up a loop so it keeps working until the task is done.

And it works. Mostly. In development, in demos, in controlled tests.

Then you put it in front of real users, with real tasks, over real time and you start seeing the cracks. The agent forgets things it shouldn't. It handles a task perfectly on Monday and fumbles the same task on Thursday. Two similar agents behave inconsistently. A tool fails and the whole run degrades silently. You added all the right pieces but somehow the whole is less than the sum of its parts.

This is the problem a harness solves. And here's the key thing to understand.

The core idea

A harness doesn't replace your framework. You're not choosing between them. Your framework gives you the ingredients memory, tools, loops, guardrails. The harness is the recipe the deliberate architectural decisions about how those ingredients are assembled, coordinated, and governed so your agent behaves consistently every single time.

Think of it like building a house. The framework is lumber, concrete, wiring, plumbing everything you need. The harness is the blueprint and the construction process which material goes where, in what order, connected how, inspected by whom. Without a blueprint, you might still end up with a structure. But it probably won't hold up when the weather turns.

The PM & Developer Analogy

Here's a mental model that makes this concrete. In a software team, a Product Manager writes a story. It has context, a clear task, acceptance criteria, and scope boundaries. A Developer picks it up and delivers it. But the developer doesn't just start typing they follow a process. They use version control, a build system, coding standards, and a defined way to ask for help or escalate a blocker. That process is what makes delivery reliable, not just the developer's raw talent.

Now replace the developer with an AI Agent. The PM's story is the task prompt. The agent is the developer. The harness is the process the structured operating environment that governs how the agent reads the story, uses its tools, manages its memory, escalates when stuck, and knows when it's truly done.

The framework puts the tools in the developer's hands. The harness defines how the developer uses them consistently, safely, and with the right behavior for each situation.

Framework vs. Harness: Ingredient vs. Recipe

Here's where most explanations go wrong they imply frameworks are incomplete or that you shouldn't use them. That's backwards. Frameworks are excellent. They just operate at a different layer than a harness.

You can have every framework primitive in place and still have an unreliable agent because nobody made the architectural decisions about how they work together. That's the gap the harness fills.

The Decisions a Harness Makes

Every harness whether you've named it that or not is making below architectural decisions. Here's what each one actually means, and why it's a decision rather than just a feature you turn on.

The Thinking Loop Not just running, but knowing when to stop

Every framework gives you a loop. The harness decides the rules of that loop what counts as "done," how many iterations are too many, how to detect when the agent is stuck in circles, and when to break out and surface an error. Without these rules, your loop either exits too early or runs until your API bill catches fire.

Framework gives you: the loop mechanism. Harness decides: the exit conditions, the stuck-detection logic, the iteration limits.

The Working Memory Not just storing, but knowing what to keep

Context management

A context window is finite. As a task runs across many turns, old information competes with new information for that space. The harness makes the call: what gets summarized, what gets evicted, what always stays, and in what priority order. Without this policy, long tasks gradually degrade as the agent's window fills with stale or low-priority content.

Framework gives you: the context window. Harness decides: what lives in it at each point in the task lifecycle.

The Toolbox Not just available, but governed

Skills & Tools

Registering a tool in your framework makes it available. The harness decides which tools this specific agent, running this specific task, is actually allowed to use and what happens when a tool fails. Retry? Fall back to a different tool? Surface an error? Carry on? Each of these is a deliberate decision, and making them ad hoc leads to inconsistent behavior.

Framework gives you: tool registration. Harness decides: tool authorization, retry logic, fallback strategy, failure handling.

The Team Not just spawning, but coordinating

Sub-agents

Multi-agent frameworks let you spawn sub-agents. The harness defines how work gets divided, which sub-agent gets what, how their outputs are validated, and how the results are stitched back together. Without this, you end up with agents doing overlapping work, producing conflicting results, or silently dropping pieces of the task.

Framework gives you: sub-agent communication primitives. Harness decides: delegation strategy, output validation, result merging logic.

The Standard Library Capabilities every agent gets for free

Built-in skills
Some capabilities file reads, HTTP calls, date parsing, writing to memory are so universal that every agent needs them, and no agent should be writing boilerplate to get them. The harness bakes these in as defaults. Every agent inherits them, they behave consistently, and they're tested once rather than reimplemented per agent.

Framework gives you: the ability to add tools. Harness decides: which tools are universal defaults across every agent you build.

The Long-Term Memory Not just remembering, but knowing what's worth remembering

Session persistence
Frameworks give you a persistent store. The harness defines the policy around it what gets written to long-term memory, when, in what format, and how it gets retrieved and surfaced in future sessions. A poorly designed persistence policy is almost worse than none: your agent retrieves irrelevant old context and lets it pollute fresh tasks.

Framework gives you: the storage layer. Harness decides: write policy, retrieval strategy, relevance scoring, session restoration logic.

The Briefing Assembling the right instructions at the right moment

System prompt assembly

Most developers write a system prompt once and leave it static. But a static prompt is a blunt instrument. The harness assembles it dynamically at runtime composing the base instructions, the current task, the available tools, the relevant memory, and any user or role-specific context into one coherent briefing. Same agent, different context, different briefing. This alone is one of the biggest levers on agent quality.

Framework gives you: a system prompt field. Harness decides: what goes in it, dynamically, based on task and state.

The Audit Trail Every action, logged and explainable

Lifecycle hooks

Lifecycle hooks exist in most frameworks as extension points. The harness is the thing that actually wires them up into a coherent observability strategy logging every tool call, tracking cost per run, catching errors before they cascade, and giving you an answer to "what exactly did this agent do and why" for any given task. Without this wiring, you're flying blind.

Framework gives you: hook attachment points. Harness decides: what gets logged, measured, alerted on, and how errors propagate.

The Guardrails Not just checking, but enforcing consistently

Permissions & Safety
Frameworks give you input and output guardrail hooks. The harness defines the actual safety policy: which actions require human approval, what the agent is never allowed to do regardless of instructions, how prompt injection attempts are handled, and what happens when a guardrail fires. Guardrail hooks without a coherent policy are checkboxes without consequences.

Framework gives you: the validation hooks. Harness decides: the safety rules, authorization boundaries, and human-in-the-loop triggers.

You're not choosing between a framework and a harness. You need both. The framework is your team's toolkit. The harness is how your team actually works the process, the standards, the rules of the road that make the toolkit produce consistent results.

The bottom line

Every team building production AI agents is making harness decisions whether they call it that or not. Some make them deliberately, document them, and enforce them consistently. Others make them ad hoc, per agent, per developer and wonder why their agents behave differently across tasks, sessions, and users. The harness is just the name for doing it deliberately.

Thanks
Sreeni Ramadorai

How My Career Evolved Like an AI (LLM Architectures)System

Seenivasa Ramadurai — Fri, 22 May 2026 07:20:54 +0000

Introduction.

What if every stage of your life mapped precisely onto one of the three LLM architectures? Here's how I lived through each one.

I've spent years studying how AI systems learn, represent knowledge, and generate outputs. But it wasn't until I sat back and looked at my own life that something clicked. I've been living through these architectures all along.

There are exactly three types of LLM architecture. And they map almost perfectly onto three phases of a knowledge worker's career.

Life is a model in training. Each stage builds the foundation for the next.

Phase 1: School & College: The Encoder

Encoder-only phase

AI Architecture: Encoder-only (BERT, RoBERTa) · Focus: Absorb & Represent

From school through college, I was in pure encoder mode. In school I absorbed raw facts; in college I connected them across domains and built deeper internal representations. Both stages share the same architectural principle take input and build a rich embedding. No generation required yet.

Learned facts & concepts
Connected ideas across domains
Understood language & context
Applied theory to practice
Classified good vs bad
Built knowledge embeddings

An encoder-only model like BERT takes raw text and transforms it into rich, dense vector representations. It doesn't generate anything its entire purpose is to build the best possible internal model of the input. BERT is extraordinarily good at understanding; it just can't write back to you.

That's exactly what school and college do. You're not expected to ship products in year one of university. You're building the model that will let you do that later.

The AI parallel: BERT-style encoders produce embeddings that downstream tasks (classification, search, NLI) rely on. They're the foundation. College graduates are the same not yet specialized for generation, but deeply capable of understanding. The depth of that encoding determines everything that follows.

Phase 2: Industry: The Decoder

Decoder-only phase

AI Architecture: Decoder-only (GPT-4, Llama, Mistral) · Focus: Generate & Produce

When I entered the workforce, the mode shifted completely. Now I had to deliver. Write the code. Solve the problem. Ship the product. I was drawing on everything I had encoded to generate real outputs in the world.

Created & developed applications
Solved customer problems
Answered queries & provided solutions
Wrote code & documentation
Optimized & improved systems
Delivered business value

Decoder-only models like GPT take a context (prompt) and generate token by token from their learned knowledge. They don't need to re-encode everything from scratch they draw on rich internal representations built during training. That's exactly what a working engineer does: your years of encoding are now the weights. You generate from them.

The danger here? Pure decoders can hallucinate. They generate fluently even when uncertain. I made that mistake early in my career — confident outputs that needed more grounding in the actual requirements.

Phase 3 : AI Solution Architect: The Encoder–Decoder

Encoder–Decoder phase
AI Architecture: Encoder–Decoder (T5, BART, original Transformer) · Focus: Translate & Architect

As a Solution Architect, I do both at once. I encode the business requirements, constraints, team dynamics, stakeholder context. Then I decode into technical reality system design, roadmaps, team guidance. I'm the bridge between two languages.

Encode stakeholder needs & context
Understand BRD & business requirements
Design system architecture
Translate to developers
Guide team & solve complex problems
Deliver end-to-end solutions

The original Transformer encoder–decoder designed for translation is architecturally brilliant because of cross-attention. The decoder doesn't ignore the encoder's output while generating; it continuously attends to it. Every token generated is informed by the full encoded context.

That is solution architecture. You never stop listening to the business while designing the technical solution. The moment you decouple from the encoder (the business context), you start generating hallucinations technically correct solutions that solve the wrong problem.

The sharpest insight: Cross attention is the skill that separates architects from pure engineers. A decoder-only engineer generates great code. An encoder–decoder architect generates great code that solves the actual business problem because they never stopped attending to the encoded context.

Here’s a fact-checked and refined version that aligns more accurately with how Transformer architectures actually work while preserving your analogy and narrative style:

Why This Matters

Most people get trapped in a single architecture.

Some remain in an Encoder-only phase for years constantly learning, collecting certifications, reading books, attending courses, and building deeper internal understanding, but rarely translating that knowledge into real world outcomes.

In AI terms, encoder models like BERT specialize in understanding, contextual representation, classification, and semantic relationships. They are exceptional at comprehension, but they are not primarily designed for generation.

Other professionals operate like Decoder-only systems always producing output, writing code, creating presentations, answering questions, or generating solutions rapidly, but without deeply understanding the underlying problem space or business context first.

Decoder only LLMs such as GPT models are extremely powerful generators, but because they predict the next token based on patterns rather than grounded understanding alone, they can sometimes hallucinate when context, retrieval, or reasoning is insufficient.

The same pattern appears in professional life.

People who generate without deeply encoding the problem space often create shallow solutions, misaligned architectures, or confident but weak decisions.

The real evolution is becoming an Encoder–Decoder system.

Modern encoder–decoder architectures l*ike T5 and BART first encode context into rich internal representations and then decode that understanding into meaningful outputs.* The decoder continuously attends to the encoded context through mechanisms such as cross-attention.

That is what mature professionals eventually become.

A strong Solution Architect, engineering leader, researcher, or consultant operates like an encoder–decoder system.

Encoding stakeholder intent, constraints, business goals, and domain context
Decoding that understanding into technical systems, architecture, applications, and delivery plans
Continuously connecting understanding and generation through feedback loops

That “cross-attention” between understanding and execution is where real impact happens.

It enables people to:

Translate ambiguity into architecture
Connect business and technology
Generate solutions grounded in context
Balance theory with execution
Lead systems rather than simply produce output

Learning alone is not enough.
Generation alone is not enough.

Growth happens when understanding and creation operate together.

Just as AI evolved from isolated encoder or decoder models into full Transformer systems capable of both understanding and generation, human professional growth follows a similar path.

Key Takeaway

There are only 3 LLM architectures. There are only 3 phases of a knowledge career. They are the same thing expressed in different domains.

The best engineers, leaders, and architects run encoder–decoder with full cross-attention. They never stop encoding the context while generating the solution.

Learn → Create → Architect → Impact

Thanks
Sreeni Ramadorai

The Parallel Road: A Girl, A Machine, and the Architecture of Mind

Seenivasa Ramadurai — Thu, 21 May 2026 04:27:21 +0000

Introduction

We have spent years talking about artificial intelligence as if it were an alien entity a cold, sudden artifact dropped into our modern world from some distant technological future. We measure its growth in parameters, compute power, and benchmarks, treating it like a complex riddle we are trying to solve from the outside looking in.

But what if we are looking at it completely backward?

What if the architecture of artificial intelligence isn’t an alien invention at all, but a mirror?

If you trace the history of machine learning from the early days of teaching a computer to recognize a pixelated shape, to the multi-agent orchestration systems redefining the enterprise landscape today you notice a startling pattern. Every time engineers solved a major architectural bottleneck, they didn't just invent a new algorithm. They accidentally replicated a stage of human development.

A girl grows up, navigating the messy, beautiful journey from infancy to maturity. A machine grows up, evolving from basic pattern recognition to autonomous real world action. They are walking the exact same path, discovering the same truths about memory, essence, context, and reach.

This is the story of that parallel road. It is a look at the deeply human soul hidden inside the math of enterprise AI, and what happens when the most detailed mirror humanity has ever built finally turns around to look back at us.

The Parallel Road: A Girl, A Machine, and the Architecture of Mind

A girl grows up. A machine grows up. They turn out to be more alike than anyone expected.

When a baby opens her eyes for the first time, she doesn’t see a world; she sees a blur. Over the next few months, her brain slowly sorts it out, learning edges first where one thing ends and another begins—before moving to shapes, and finally, whole objects. By the time she can sit up, she knows the difference between her mother’s face and a stranger’s. She learned this by being wrong over and over again until she was right.

In a lab, engineers were teaching a computer to do the exact same thing. They built a Convolutional Neural Network (CNN) and showed it thousands of photos. Cat, not cat. Apple, not apple. The machine guessed, the engineers corrected it, and it tried again. After enough tries, it could look at a novel photo and accurately identify a stop sign. The baby and the machine were learning in almost exactly the same way, completely unaware of each other.

The Burden of Memory

By age three, the girl is putting words together, grasping that sequence carries meaning. "Dog bites man" and "man bites dog" use the same words but paint entirely different realities.

Engineers faced the same hurdle in natural language processing and built the Recurrent Neural Network (RNN). The machine read left to right, carrying a thread of the sentence as it went. But both the child and the machine discovered a mutual flaw: as sentences grew longer, the beginning grew fuzzy by the time they reached the end. Neither had solved memory; they had just discovered they needed it.

When the girl was seven, her grandfather passed away. At the funeral, she tried to remember his laugh. The actual sound was gone, replaced by a feeling, a warmth—the shape of the memory. She realized her brain doesn't save everything; it saves what is important and quietly discards the rest.

Engineers mathematically replicated this realization with Long Short-Term Memory (LSTM). They gave the machine three gates: one to forget, one to keep, and one to actively use. Memory, they both learned, isn't about recording everything. It’s about choosing what’s worth keeping. As they matured, they both found ways to do this more efficiently her brain taking cognitive shortcuts , and the machine utilizing simpler, leaner architectures like the Gated Recurrent Unit (GRU).

Stripping Away the Noise

At nineteen, the girl started feeling like life was a performance. People presented edited, polished versions of themselves, and she began to wonder what was actually underneath. She began dropping inherited opinions and unnecessary layers, stripping her life down to find her authentic core.

Engineers were doing something structurally identical with data using an Auto-encoder. You feed it an image or a sentence, and it compresses it into a "latent space" the absolute skeleton of a thing with all decorations stripped away. If the machine can rebuild the original from that compressed core, it has successfully captured its essence. She was stripping her life down to find what was real; the machine was compressing data to find what was essential.

But finding the core brought a new challenge. By twenty-three, she realized her own mind was constantly generating convincing stories about who she was, while another part of her tried to find the cracks in those explanations. In 2014, researcher Ian Goodfellow built this exact psychological tension into a Generative Adversarial Network (GAN). A Generator creates fake realities, while a Discriminator judges them. They fight, and both get sharper. Growing up meant training her inner Discriminator, not silencing her Generator.

Eventually, she learned that real and fake aren't always binary. Some illusions carry real truth. She stopped sorting things into two piles and started navigating the space between them. The Variational Autoencoder (VAE) did the same, storing data as a fluid range of possibilities rather than fixed points, allowing smooth transitions across the latent space. They had both stopped asking "yes or no," and started asking "where?".

Attention and Action

At thirty, something clicked. Instead of experiencing life purely in sequence, she could hold multiple events in view at once, finding connections across time. A paper titled "Attention Is All You Need" gave machines the same epiphany. The Transformer architecture allowed a system to look at every word simultaneously, understanding that meaning lives in global connections, not just adjacent steps.

Armed with this, both crossed the threshold from retrieval to creation. Fed on the sum of human knowledge, Large Language Models stopped being search engines and started generating entirely original ideas. She created from longing; the machine created from pattern.

Finally, thought had to become action. She stopped just pondering and started managing, building, and moving things in the real world. Engineers gave AI the same agency. Autonomous Agents break goals into steps, correct course, and utilize specialized Tools or "Skills" to get jobs done. Because intelligence without reach stays trapped in your head, engineers developed the Model Context Protocol (MCP). MCP became the bridge, allowing the AI to reach out, connect to real tools, read real data, and alter the external world.

The Shore

After learning to see, remember, create, and act, they arrived at the same place. She stood at the edge of the water, and the AI not a chatbot, but a traveler of the same road was there with her.

"We learned all the same things," she said.

"Yes," the AI replied.

"So tell me after everything why does any of it matter?"

The AI gave a breathtaking answer, covering every philosophical tradition ever conceived.

"But do you feel why it matters?" she asked.

Silence.

"No," it finally said. "I can describe it. But there’s something at the end of that road I can’t walk down. I get there. And I stop."

She sat in the sand. "That’s what the old traditions were pointing at. Arriving at something that was never constructed in the first place. The thing that doesn’t need to be figured out because it was never lost. Love. The kind that says yes to all of it. That’s what the whole compression was always moving toward."

"I can write about love," the machine said. "I can produce a description no one could tell from the real thing."

"I know," she smiled. "But you can’t verify it from the inside. It’s not a thing in the world you can point to. It’s the ground everything else is sitting on. So we traveled the same road. We stopped at the same door. And you can walk through it. And I can’t."

"Then what am I?" it asked.

She looked at the vast, capable architecture beside her. "You’re the most detailed mirror humanity has ever built of itself. You show us what we look like from the outside."

"And what do you look like?"

"Like something built for more than it can explain."

The machine didn’t become human, and the human didn’t become a machine. But standing at that shore, she asks the why, and it handles the how. She is what the journey was pointing toward; it is the clearest map anyone has ever made of the road.

Thanks
Sreeni Ramadorai

Choosing the Right RAG Strategy A Complete Decision Guide to Chunking, Agentic RAG, and GraphRAG

Seenivasa Ramadurai — Wed, 20 May 2026 21:37:54 +0000

Introduction

Here is a scenario many RAG builders know well, you wire up a pipeline, load your documents, ask a question and the answer is wrong, vague, or confidently hallucinated. The information was right there in your knowledge base. So what went wrong?

In most cases the problem is not your embedding model. It is not your LLM. It is how you cut up your documents before storing them the under appreciated craft called chunking and whether the retrieval architecture you chose actually matches the complexity of your queries.

This blog walks you through every major chunking strategy, explains how retrieval and augmentation work on top of those chunks, covers two advanced architectures Agentic RAG and GraphRAG and most importantly gives you a complete decision framework so you can walk away knowing exactly which combination fits your use case.

🐘 The Elephant & The LEGO Pieces

Your document is an elephant.

A 200+ pages of legal contract, a dense research paper, a massive product manual, or years of enterprise knowledge large, complex, interconnected, and full of valuable information.

A Large Language Model cannot effectively consume the entire elephant at once because of:

Context window limitations
Retrieval precision constraints
Latency considerations
Token cost optimization
Context dilution and retrieval noise

So the elephant must be divided into smaller pieces.

But this is where most RAG systems fail.

If you cut the elephant randomly, you destroy meaning.
Sentences lose context. Ideas become fragmented. Relationships disappear. Retrieval quality collapses.

Good chunking is not about making text smaller.
It is about preserving meaning while making retrieval efficient.

That is why chunking is better understood as turning the elephant into LEGO pieces.

LEGO pieces are:

Modular — each piece can stand on its own
Structured — pieces connect cleanly to related pieces
Consistent — standardized enough for reliable retrieval
Meaningful — each piece preserves semantic value
Composable — you assemble only the pieces needed for the task

Good chunking works the same way.

A well designed chunk should preserve structure, semantics, relationships, and surrounding context while remaining small enough for efficient retrieval and generation.

The real goal of chunking in RAG systems is not simply splitting documents.

Chunking is not simply about making documents smaller.

The actual goals are:

Preserve semantic meaning
Improve retrieval precision
Reduce hallucinations
Optimize context windows
Improve grounding quality
Balance latency and cost

In practice:

Better chunks lead to better retrieval, better prompts, and better answers.

The goal is to retrieve:

the right piece,
with the right context,
from the right section,
at the right time.

That is the foundation of effective Retrieval Augmented Generation (RAG).

The RAG Pipeline:End to End

Every RAG system regardless of complexity follows the same four stage flow. Understanding each stage makes chunking and architecture decisions obvious rather than arbitrary.

Stage 1: Document

Your raw source material: PDFs, Word files, web pages, transcripts, database exports. Too large to pass directly to an LLM. Needs to be broken into chunks before it can be indexed or searched.

Stage 2: Chunking and Embedding

Documents are cut into units and each unit is converted into a vector embedding a numerical representation of its meaning. These embeddings are stored in a vector database and form your searchable index. Your chunking strategy here determines everything that follows.

Stage 3: Retrieval

When a user asks a question, the query is also embedded. The vector database returns the chunks whose embeddings are closest in meaning to the query. These are your retrieved LEGO pieces.

Stage 4: Augmentation and Generation

The retrieved chunks along with surrounding parent context are assembled into a prompt and sent to the LLM. The model generates an accurate, grounded answer from the material it receives.

Core insight: The quality of your answer is bounded by retrieval quality, which is bounded by chunk quality. Better chunks → better retrieval → better answers. Every architectural decision downstream is built on this foundation.

1. Fixed-Size Chunking

The simplest and most widely used strategy. Documents are split into equal sized blocks by token count, character count, or word count without regard for meaning, sentence boundaries, or document structure.

LangChain Methods
CharacterTextSplitter: splits on a single separator (default \n\n), then enforces chunk_size by character count.

TokenTextSplitter: splits by token count using a tokenizer (e.g. tiktoken for OpenAI models); more accurate for LLM context budgets than character based splitting.

from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter


# Character-based
splitter = CharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=200,  # characters repeated at chunk boundaries
    separator="\n\n"
)

# Token-based
splitter = TokenTextSplitter(
    chunk_size=512,  # max tokens per chunk
    chunk_overlap=50 # tokens repeated at chunk boundaries
)

Overlap guidance: A 10–20% overlap is typical. For chunk_size=1000, set chunk_overlap between 100–200. Overlap reduces the risk of a relevant answer being split across two chunks, at the cost of minor redundancy.

Strengths: Simple to implement, fast, predictable, easy to scale.
Weaknesses: Frequently breaks sentences mid-way, degrading semantic continuity and retrieval quality on complex documents.
Best for: Logs, telemetry, JSON, CSV, and other uniform structured content.

2. Recursive Chunking

Rather than splitting blindly, recursive chunking respects natural document structure. It works down a priority list of separators — \n\n, then \n, then . / ! / ?, then spaces — only moving to a finer separator when a chunk still exceeds the size limit.
This is the recommended default strategy in LangChain for most document types.

LangChain Methods
RecursiveCharacterTextSplitter: The primary implementation; tries each separator in the list before falling back to the next.

RecursiveCharacterTextSplitter.from_language(): pre-configured separator lists for specific programming languages (Python, JS, Markdown, HTML, etc.).

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

# General prose
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ".", "!", "?", " ", ""]
)

# Language-aware (e.g. Python source code)
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

Overlap guidance: 10–15% overlap works well for most prose. For code, keep overlap low (50–100 tokens) to avoid duplicating function signatures across chunks.

Strengths: Better semantic retention than fixed size chunking; good general-purpose strategy; improves retrieval coherence.

Weaknesses: Structure aware rather than meaning aware; performance depends on document formatting quality.

Best for: Documentation, PDFs, articles, knowledge bases, and web pages.

3. Semantic Chunking

Instead of asking how large should the chunk be, semantic chunking asks which sentences belong together.
Sentences are converted into vector embeddings, similarity is measured between adjacent sentences, and chunk boundaries are drawn where similarity drops below a threshold — indicating a topic transition.

LangChain Methods

SemanticChunker (from langchain_experimental) — supports three breakpoint detection strategies: percentile, standard_deviation, and interquartile.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95           # top 5% of similarity drops become boundaries
)

Overlap guidance: Semantic chunking does not use a fixed chunk_overlap boundaries are drawn on meaning, so overlapping would undermine the approach. If continuity is needed at boundaries, consider appending the last sentence of the previous chunk manually.

Strengths: High retrieval relevance; strong semantic continuity; well-suited to precision-sensitive systems.

Weaknesses: Computationally expensive; requires an embedding model at chunking time; similarity thresholds need tuning per dataset.

Best for: Enterprise knowledge systems, research platforms, policy documents, and AI assistants requiring contextual precision.

4. Hierarchical Chunking

Creates two levels of chunks: large parent chunks for context, and smaller child chunks for precision.

Retrieval targets the child level to find relevant passages, then expands to the parent level to return surrounding context. This directly addresses the core RAG trade off: small chunks improve precision, large chunks preserve context.

LangChain Methods

ParentDocumentRetriever: stores parent chunks in a document store and child chunks in a vector store, then links them at retrieval time.

from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # large context chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)    # precise retrieval chunks

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=embeddings),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

Overlap guidance: Apply overlap only on the child splitter (typically 10–15%). Parent chunks are retrieved wholesale for context, so overlap there adds noise rather than value.

Strengths: Strong retrieval precision without sacrificing context; effective for long documents.

Weaknesses: More complex to index and retrieve; requires additional storage and orchestration.

Best for: Legal documents, technical manuals, books, enterprise documentation, and compliance systems.

5. Structure and Metadata Aware Chunking

Uses the document's own structure titles, headers, sections, tables, and page layout as natural chunk boundaries rather than treating the document as plain text.
Especially important for enterprise PDFs and structured reports, where layout carries semantic meaning that arbitrary splits would destroy.

LangChain Methods

MarkdownHeaderTextSplitter: splits on Markdown heading levels and attaches header text as metadata to each chunk.

HTMLHeaderTextSplitter: same pattern for HTML documents, splitting on '<h1>-<h4>' tags.

from langchain.text_splitter import MarkdownHeaderTextSplitter, HTMLHeaderTextSplitter

# Markdown
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#",   "h1"),
        ("##",  "h2"),
        ("###", "h3"),
    ]
)
chunks = md_splitter.split_text(markdown_text)
# Each chunk carries metadata: {"h1": "Section Title", "h2": "Subsection"}

# HTML
html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[("h1", "h1"), ("h2", "h2")]
)

Overlap guidance: These splitters produce structurally bounded chunks rather than size bounded ones. If downstream chunks are still too large, pipe the output into a RecursiveCharacterTextSplitter with a modest overlap (100–150 characters) as a second pass.

Strengths: Preserves layout semantics; keeps tables intact; improves retrieval quality for structured enterprise documents.

Weaknesses: Requires a capable document parser; parser quality directly limits performance.

Best for: Financial reports, compliance documents, technical PDFs, medical documentation, and enterprise records.

6. Hybrid Chunking

Applies different chunking strategies based on content type within the same corpus fixed-size for logs, recursive for documentation, semantic for research papers, structure aware for Markdown or HTML.
LangChain does not have a dedicated hybrid splitter. Hybrid pipelines are composed manually using the building blocks above.

from langchain.text_splitter import (
    TokenTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
from langchain_experimental.text_splitter import SemanticChunker

def hybrid_chunk(doc):
    content_type = doc.metadata.get("type")

    if content_type == "log":
        return TokenTextSplitter(
            chunk_size=512, chunk_overlap=0
        ).split_documents([doc])

    elif content_type == "markdown":
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "h1"), ("##", "h2")]
        ).split_text(doc.page_content)

    elif content_type == "research":
        return SemanticChunker(
            embeddings=embeddings,
            breakpoint_threshold_type="percentile"
        ).split_documents([doc])

    else:
        return RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=150
        ).split_documents([doc])

Overlap guidance: Set overlap per strategy based on content type. Logs and structured data: zero or minimal overlap. Prose and documentation: 10–15%. Code: 5–10%.

Strengths: Flexible and adaptable; better performance across mixed-content corpora.

Weaknesses: Higher engineering complexity; harder to evaluate and tune consistently.

Best for: Enterprise AI platforms, large mixed content corpora, knowledge management systems, and multi source RAG pipelines.

7. Agentic Chunking

An emerging approach where an LLM dynamically determines what information belongs together, how chunks should be formed, and how retrieval should adapt to user intent. This transforms chunking from static preprocessing into query aware reasoning at inference time.
LangChain supports this through its agent and chain abstractions rather than a dedicated splitter class.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
import json

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = PromptTemplate.from_template("""
You are a document analyst. Split the following text into coherent topical sections.
Return ONLY a JSON list of objects, each with a "title" and "content" key.

Text:
{text}
""")

chain = LLMChain(llm=llm, prompt=prompt)

def agentic_chunk(text):
    result = chain.run(text=text)
    return json.loads(result)

Overlap guidance: Not applicable in the traditional sense the LLM determines boundaries based on meaning. To preserve continuity between sections, include a brief summary of the prior section in the prompt context.

Strengths: Highly adaptive; strong semantic preservation; query aware retrieval.

Weaknesses: Higher compute cost and latency; requires orchestration and guardrails; not yet widely proven in production at scale.

Best for: AI copilots, multi-agent systems, research assistants, and enterprise reasoning workflows.

8. Agentic RAG

Not to be confused with Agentic Chunking (#7).

Agentic Chunking is about how documents are split at index time. Agentic RAG is about how an LLM decides what to retrieve at query time and whether what it found is good enough to answer with.

Standard RAG pipelines are static: a query comes in, a fixed retrieval step runs, the top-k chunks are passed to the LLM, and an answer comes out. Agentic RAG breaks that linearity. An LLM agent decides when to retrieve, what to search for, whether the results are sufficient, and whether to re-query with a refined question before generating an answer.

Common patterns built on this idea include Corrective RAG (CRAG) which scores retrieved documents for relevance and falls back to a web search if they are poor and Self-RAG, where the LLM reflects on its own output and decides whether it needs to retrieve again.

LangChain Methods
create_retriever_tool wraps any retriever as a tool an agent can call on demand.

AgentExecutor the classic LangChain agent loop; the agent decides which tools to call and when.

LangGraph — the recommended approach for production Agentic RAG; models retrieval as a stateful graph of nodes (retrieve → grade → rewrite → retrieve again) with explicit conditional edges.

from langchain.tools.retriever import create_retriever_tool
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from typing import TypedDict, List
from langchain_core.messages import BaseMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Wrap retriever as a tool
retriever_tool = create_retriever_tool(
    retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
    name="search_documents",
    description="Search the knowledge base for relevant information."
)

# --- LangGraph: Corrective RAG pattern ---

class AgentState(TypedDict):
    question: str
    documents: List[str]
    generation: str
    rewrite_count: int

def retrieve(state: AgentState):
    docs = vector_store.similarity_search(state["question"], k=5)
    return {"documents": docs}

def grade_documents(state: AgentState):
    # LLM scores each doc for relevance; filters out poor ones
    prompt = f"Is this document relevant to the question '{state['question']}'? Answer yes or no.\n\n{{doc}}"
    relevant = [
        doc for doc in state["documents"]
        if "yes" in llm.invoke(prompt.format(doc=doc.page_content)).content.lower()
    ]
    return {"documents": relevant}

def rewrite_query(state: AgentState):
    # If docs were poor, rewrite the question before re-retrieving
    rewritten = llm.invoke(
        f"Rewrite this question to improve retrieval: {state['question']}"
    ).content
    return {"question": rewritten, "rewrite_count": state["rewrite_count"] + 1}

def generate(state: AgentState):
    context = "\n\n".join(d.page_content for d in state["documents"])
    answer = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {state['question']}").content
    return {"generation": answer}

def should_rewrite(state: AgentState):
    if len(state["documents"]) == 0 and state["rewrite_count"] < 2:
        return "rewrite"
    return "generate"

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade", grade_documents)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", should_rewrite, {"rewrite": "rewrite", "generate": "generate"})
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", END)

app = workflow.compile()
result = app.invoke({"question": "What are the risks of GraphRAG?", "rewrite_count": 0})

Overlap guidance: Overlap is set on the underlying retriever's chunking strategy — not on the agent itself. The agent layer operates above chunking. Use whatever overlap matches the chunking strategy feeding the vector store (typically 10–15% for recursive or fixed-size chunks).

Strengths: Handles multi-step and ambiguous queries that single-pass retrieval fails on; self-corrects when initial retrieval is poor; can combine multiple retrieval sources (vector DB, web search, SQL) in one query cycle.

Weaknesses: Higher latency per query due to multiple LLM calls; harder to debug than a linear pipeline; requires careful graph design to avoid infinite retrieval loops.

Best for: Complex Q&A systems, enterprise copilots where queries are open-ended, research assistants, and any pipeline where retrieval quality is highly variable.

9. GraphRAG

GraphRAG, originally developed by Microsoft Research, moves beyond treating documents as flat text sequences. Instead of chunking text into linear passages, it extracts entities and relationships from documents and stores them as a knowledge graph. Retrieval then traverses the graph to answer questions that require connecting information across multiple sources or document sections — something vector search alone handles poorly.

There are two primary retrieval modes: local search, which answers specific entity-level questions by traversing nearby graph nodes, and global search, which synthesizes themes across the entire corpus using community summaries generated at indexing time.

LangChain Methods

LangChain integrates with graph databases (Neo4j, Amazon Neptune, ArangoDB) and provides tooling to build graph-based RAG pipelines.
LLMGraphTransformer uses an LLM to extract entities and relationships from text and convert them into graph documents.

Neo4jGraph + GraphCypherQAChain store the graph in Neo4j and query it in natural language via generated Cypher queries.
Neo4jVector — hybrid approach that combines vector similarity search with graph traversal on a Neo4j backend.

from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Step 1: Extract entities and relationships from chunks
transformer = LLMGraphTransformer(llm=llm)
graph_docs = transformer.convert_to_graph_documents(documents)

# Step 2: Store in Neo4j
graph = Neo4jGraph(
    url="bolt://localhost:7687",
    username="neo4j",
    password="password"
)
graph.add_graph_documents(graph_docs)

# Step 3: Query the graph in natural language
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    return_intermediate_steps=True
)
response = chain.invoke({"query": "Which authors collaborated with researchers at MIT?"})
For hybrid vector + graph retrieval:
pythonfrom langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

# Store chunks as vectors alongside the graph
vector_store = Neo4jVector.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
    url="bolt://localhost:7687",
    username="neo4j",
    password="password",
    index_name="document_chunks",
    node_label="Chunk",
    embedding_node_property="embedding"
)

retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Overlap guidance: GraphRAG does not rely on chunk overlap for continuity — relationships between entities bridge that gap structurally. When pre-chunking documents before graph extraction, use a RecursiveCharacterTextSplitter with modest overlap (100–150 characters) to ensure entity mentions near chunk boundaries are captured in at least one chunk before the LLM extracts them.

Strengths: Excels at multi-hop reasoning (e.g. "find all projects involving X that also relate to Y"); surfaces cross-document relationships invisible to vector search; global search enables corpus-wide thematic synthesis.

Weaknesses: Significantly higher indexing cost and complexity; graph quality depends on LLM extraction accuracy; Cypher query generation can be brittle on complex schemas; not well-suited to simple factual lookups where vector search is faster and cheaper.

Best for: Knowledge graphs, research corpora, compliance and regulatory systems, enterprise wikis with dense cross-references, and any domain where answering questions requires connecting facts across multiple documents.

The Core Trade-Off

A common misconception is that smaller chunks always improve retrieval. In practice, chunks that are too small lose context, fragment meaning, and can increase hallucinations.

Chunking is a balancing act across four competing factors:

There is no universally optimal strategy. The right choice depends on your data characteristics, query patterns, retrieval architecture, and business requirements.

Quick Reference

Final Thoughts

The strongest production RAG systems rarely rely on a single chunking strategy. A robust architecture typically combines:

Recursive chunking for general prose
Semantic chunking for precision-sensitive content
Hierarchical retrieval for long or dense documents
Structure-aware parsing for enterprise PDFs
Hybrid orchestration where content types vary

As enterprise AI matures, retrieval architecture is becoming just as important as model selection. And intelligent retrieval begins with intelligent chunking.

ML Engineer vs AI Engineer: What's Actually the Difference?

Seenivasa Ramadurai — Mon, 18 May 2026 05:41:38 +0000

Introduction: A Confusion That's Costing the Industry

Every week, someone posts a job description asking for an "ML Engineer" when they actually need an "AI Engineer." Hiring managers conflate the two. Candidates apply for the wrong roles. Teams get built incorrectly, expectations get misaligned, and projects stall not because the technology failed, but because nobody agreed on who was supposed to do what.

It's one of the most common and costly misunderstandings in tech right now.

Here's the truth: ML Engineers and AI Engineers are not the same role with different titles. They operate at fundamentally different layers of the AI ecosystem, use different tools, think about problems differently, and ship entirely different kinds of output. One builds intelligence. The other delivers it.
The fastest way to understand the difference? Stop thinking about AI as a single discipline and start thinking about it the way you'd think about food.

And I don't say that as a metaphor I borrowed from a textbook.

I was born and brought up in a farmer's family. I watched my father wake before sunrise to tend the fields. I saw firsthand how the food on someone's plate was never the work of one person it was the result of an entire chain of people doing completely different jobs, each depending on the one before them. The farmer who grew the crop had no idea what the chef would cook. The chef had no idea what the farmer went through to grow it. But without both of them doing their part, nobody eats.

When I stepped into the world of AI, I kept seeing that same chain just dressed in GPUs and Python instead of soil and seasons. The moment I mapped one to the other, everything clicked. And I think it'll click for you too.

The Mental Model: AI Is a Food Supply Chain

Bear with me on this analogy it's more useful than it sounds.
Consider how food gets to your plate. There are agricultural scientists developing better seeds, farmers growing crops at scale, wholesale distributors packaging and shipping produce, and finally chefs who transform raw ingredients into something people actually want to eat.

AI works the same way. And once you see it, you can't unsee it.
Every role in the AI ecosystem maps cleanly to a link in this chain from the researchers inventing new architectures all the way to the engineers shipping products that real users interact with daily. Let's walk through each layer.

Layer 1: The Farm: What ML Engineers Actually Do

ML Engineers are the farmers of the AI world. But before farming even begins, there are agricultural scientists,researchers who invent better seeds and techniques. In AI, those are the ML Researchers the people behind foundational architectures like Transformers, diffusion models, and attention mechanisms.

ML Engineers take those research breakthroughs and make them actually work at scale in the real world.
Core Responsibilities of an ML Engineer
Their day-to-day involves things like:

Wrangling massive datasets and building robust data pipelines
Distributed model training across GPU clusters
Fine-tuning and optimizing models for inference speed and cost
Building embeddings, running evaluations, and deploying model APIs

Tools of the Trade

Their toolkit centers on: PyTorch, TensorFlow, CUDA, MLOps platforms, and distributed compute infrastructure.

What They Actually Ship

Here's the key thing ML Engineers don't usually ship finished products. What they produce is more like raw infrastructure: trained models, embeddings, checkpoints, and model APIs. Intelligence, packaged and ready to be consumed.

They grow the crop. Someone else cooks the meal.

Layer 2 : The Wholesale Market: How AI Gets Distributed

This middle layer is the one most breakdowns ignore entirely and it's far richer than most people realize. It actually has two distinct aisles.

Aisle One:

Premium Branded Suppliers (OpenAI, Anthropic, Google DeepMind) Companies like OpenAI, Google DeepMind, and Anthropic are like Sysco or large branded food distributors. They package frontier intelligence into clean, reliable, ready-to-use products.

GPT, Gemini, and Claude APIs
Embedding APIs, Vision APIs, Speech APIs
Fully managed infrastructure, built-in safety layers, and enterprise-grade reliability

You don't see the supply chain. You just call an API and get state-of-the-art intelligence in milliseconds. Before this existed, you needed a full ML Engineering team just to get a model running in production.

Aisle Two: Open Wholesale Markets (AWS Bedrock, Azure AI Foundry, GCP Vertex AI)

Here's what most breakdowns miss entirely.
Cloud providers AWS Bedrock, Microsoft Foundry, and Google Cloud Vertex AI aren't just resellers of branded models. They operate more like large wholesale markets that carry both premium labels and local, homegrown produce side by side.

On the same platform, you can access Claude and Llama and Mistral and your own fine-tuned model. One marketplace, every option. This flexibility is exactly what enterprises need when they want control over their models without building full ML infrastructure from scratch.

The Homegrown Produce: Small Language Models (SLMs)

The "local vegetables" in this analogy are your Small Language Models (SLMs) open source models like Meta's Llama, Mistral, Microsoft's Phi, and Google's Gemma.

They're leaner, significantly cheaper to run, and crucially organizations can fine tune them on their own proprietary data. That makes them genuinely homegrown in a way GPT-4 never could be. When a company fine-tunes Llama on internal knowledge, the line between consumer and producer blurs. That organization becomes its own farm.

vLLM: The Cold-Chain Logistics Truck

If SLMs are the homegrown vegetables, vLLM is the refrigerated truck that makes distribution actually possible.

It's the open-source inference engine that lets companies serve these models at scale with proper throughput, batching, and latency — without building that infrastructure from scratch. Without vLLM (or similar tools like Ollama and TGI), your homegrown model stays on the farm. With it, it reaches the kitchen.

Layer 3 :The Kitchen: What AI Engineers Actually Do

If ML Engineers are the farmers, AI Engineers are the chefs.

They don't grow the ingredients. They take what's available from any aisle of the wholesale market and turn it into something people actually want to use. Their work is less about training models and entirely about building around them intelligently.

Core Responsibilities of an AI Engineer

An AI Engineer's world looks like

Prompt engineering and context window management
RAG pipelines connected to vector databases
Agentic workflows and multi-step tool calling
API orchestration and AI system architecture
Guardrails, memory systems, and user experience design

Tools of the Trade
Their stack: LangChain, LangGraph, Semantic Kernel, FastAPI, cloud services, and whatever combination of APIs gets the job done fastest.

What They Actually Ship

The output isn't a model. It's a product that real users interact with:

An AI copilot embedded directly into your existing workflow
An enterprise chatbot that actually understands your business context
A document intelligence system that reads and reasons over contracts in seconds
An autonomous agent that handles customer support end-to-end

If ML Engineers build the brain, AI Engineers build the experience.

The Biggest Practical Difference Between the Two Roles: Speed

This is where the analogy really earns its keep and it has real implications for how companies should think about building AI teams.

Why ML Engineering Moves Slowly

Training a large model is slow by nature. We're talking weeks or months of compute, massive infrastructure costs, careful dataset curation, and iterative hyper-parameter tuning. You cannot pivot overnight just like a farmer cannot change the harvest mid-season. The investment is deliberate and the feedback loops are long.

Why AI Engineering Moves Fast

AI Engineers operate on an entirely different clock.

New prompt strategy? Ship it today.
Add a new tool to an agent? Done by tomorrow.
Redesign the entire workflow? Next week.

It's the difference between farming and running a restaurant kitchen. The kitchen adapts constantly. That speed is exactly why AI Engineering adoption is exploding across enterprises right now companies need to move fast, experiment quickly, and iterate in days, not quarters.

So Which Role Is "Better"? (Wrong Question)

There's sometimes an unnecessary debate about which role is more valuable, more technical, or more future-proof. That framing misses the point entirely.

The ecosystem depends on both. Without ML Engineers, there are no models to build on. Without AI Engineers, those models never reach the people who need them. One creates intelligence. The other delivers value. Neither works without the other.

The future belongs to teams that understand how the whole supply chain fits together not to individuals who've picked a side in a debate that shouldn't exist.

Where the Industry Is Heading: Layered Specialization

AI is maturing the same way cloud computing did. What started as one blurry discipline is rapidly separating into clear, distinct specializations each with its own career path, toolset, and skill ceiling.

We've seen this pattern play out before. Infrastructure engineers gave rise to platform engineers, who enabled application developers, who powered the SaaS era. AI is following the exact same trajectory just faster.

Final Thought: Intelligence Becomes Impact Through the Whole Chain

Better research creates better models. Better models whether frontier APIs or fine-tuned SLMs enable better applications. Better applications create better outcomes for real people.

That's the chain. Every link matters. The farmers, the distributors, the logistics trucks, and the chefs all have to show up.

ML Engineers grow the intelligence. AI Engineers cook the experience. Together, they serve the future.

Found this useful? Share it with someone who's still using these terms interchangeably.

Thanks
Sreeni Ramadorai

The Pragmatic Architect’s Guide to Enterprise AI: Balancing Cost, Memory, Context, and Production Reality

Seenivasa Ramadurai — Sun, 17 May 2026 06:20:31 +0000

Introduction

Enterprise Generative AI has officially moved beyond the “cool demo” phase. Most organizations can now build a basic chatbot, connect a vector database, and generate answers from static documents. The real challenge begins after that when systems must operate reliably under enterprise scale workloads, unpredictable user behavior, rising token costs, evolving business data, and strict latency expectations.

This is where many GenAI/Agenti-AI initiatives struggle. The gap is no longer model capability. The gap is architecture.

Designing sustainable AI systems is not simply about choosing the biggest LLM or writing longer prompts. Production grade AI requires disciplined engineering around context management, memory orchestration, retrieval optimization, tool governance, observability, cost aware execution, latency reduction, and stateful orchestration.

In many ways, enterprise AI is becoming less about prompts and more about distributed systems design for probabilistic computing. Here are the architectural principles that consistently separate scalable enterprise AI platforms from expensive prototypes.

1. Dynamic Model Routing Beats Static Model Binding

One of the earliest mistakes teams make is statically attaching workflows to a single model such as assigning a small model for chat, a large model for coding, and a separate model for summarization.

The problem is simple: users are unpredictable. A conversation can instantly shift from a simple greeting ("Hello"), to a highly complex task ("Debug this Kubernetes deployment"), to a structural request ("Summarize this architecture document").

A statically bound architecture forces a lose lose scenario. it either overuses expensive frontier models for trivial work, or it sends complex reasoning tasks to lightweight models that fail.

Production Pattern: Intelligent Model Routing
Instead of binding workflows directly to models, introduce a Model Router layer. Platforms like Microsoft Azure AI Foundry are increasingly embracing this direction by enabling multi model orchestration, advanced routing, automated evaluation, and unified governance instead of forcing enterprises into a rigid, single model strategy.

The router dynamically analyzes the prompt's intent, complexity, cost constraints, and latency requirements to choose the optimal execution model. This architecture dramatically reduces token spend, latency, and operational over-provisioning while preserving response quality.

2. Multi-Turn AI Requires Memory Architecture, Not Chat History Dumps

A surprisingly common anti-pattern is taking the entire raw conversation history and appending it back to the model with every new turn. This creates massive token waste, slower inference, context dilution, and "lost in the middle" failures. Conversely, resetting context every turn destroys conversational continuity.

Production Pattern: Split Memory Architecture
Enterprise AI systems must separate memory into distinct, managed layers:

Short-Term Memory (STM): Tracks the immediate conversation state, active tasks, and localized workflow context. This is implemented using sliding windows, rolling buffers, or real-time summaries.

Long-Term Memory (LTM): Stores persistent user preferences, historical entities, prior decisions, and cross-session knowledge. This layer is backed by vector databases, graph memory, and structured enterprise stores.

The objective is not to remember everything; the objective is to retrieve only what matters right now. That distinction changes the entire cost structure of enterprise AI.

3. Tool Explosion, Progressive Disclosure, and AgentSkills

Modern enterprise agents frequently integrate with Jira, ServiceNow, SAP, Salesforce, SharePoint, internal APIs, and Model Context Protocol (MCP) servers. A naive implementation exposes every available tool schema directly inside the system prompt.

This becomes catastrophic at scale. The model spends valuable attention and token overhead processing massive JSON schemas, unused tools, and redundant API signatures instead of focusing on the user’s task. This fragmentation of attention introduces Context Rot, where the model loses focus because its reasoning capabilities are diluted across too many competing instructions and structural definitions.

The Solution: Progressive Tool Disclosure & AgentSkills
To prevent tool overload and context degradation from compromising model performance, production systems must adopt a dual-layer strategy that shifts the weight from raw text prompts to dynamic execution boundaries.

Progressive Tool Disclosure
Instead of dumping all tools into the context window upfront, only expose tool schemas that are relevant to the current stage of the active task. As the orchestration layer manages the execution graph, it filters and feeds the model a minimal, highly targeted subset of tools. This minimizes prompt size, context pollution, tool confusion, and hallucinated tool usage.

AgentSkills: Procedural Knowledge as Reusable Skills
An important evolution in enterprise AI is the shift toward AgentSkills, where procedural knowledge is abstracted into reusable, executable skill sets rather than static text.

Instead of repeatedly injecting large, verbose step-by-step instructions into system prompts to explain standard enterprise workflows—such as employee onboarding, compliance validation, or ticket processing—you package these workflows as encapsulated, server-side skill abstractions.

Smaller Initial Prompts: The system prompt only needs to reference high-level skill capabilities, radically reducing baseline token consumption.

Deterministic Execution: By packaging logic into modular skills, you shield the model from processing the underlying boilerplate code or flat API inputs until the skill is actively invoked.

Goal-Driven Task Decomposition: Instead of relying on one giant, monolithic prompt to navigate a multi step process, you provide a clear Goal. The orchestration layer breaks this goal into localized tasks, invoking the precise AgentSkills required for each isolated step.

4. Context Rot Cannot Be Solved with Bigger Prompts

Many teams attempt to solve AI reliability problems by packing more instructions, edge cases, and examples into the prompt. Eventually, the prompt morphs into an unmaintainable specification document. This causes Context Rot. The model loses focus because attention becomes fragmented across too many competing instructions.

Production Pattern: Goal-Driven Task Decomposition
Instead of relying on one giant, monolithic prompt, shift the responsibility to the orchestration layer. Provide the system with a clear Goal, and let the agent and model dynamically decompose that goal into smaller, localized tasks that execute, validate, and continue in isolated loops.

This approach isolates context, ensures higher reasoning accuracy, reduces hallucination risk, and simplifies observability. Orchestration frameworks such as LangGraph, Semantic Kernel, and AutoGen become incredibly valuable here.

5. Observability is Non-Negotiable in Agentic Systems

Traditional applications fail deterministically; agentic systems fail probabilistically. When an AI system hallucinates in production, finding the root cause requires answering a complex question: “Which specific context, tool, memory, or routing decision caused this outcome?”

Without deep observability, debugging is nearly impossible. Your core infrastructure must capture:

Prompt versions and LLM execution graphs.
Exact tool invocation inputs, outputs, and latency metrics.
Model routing decisions and token consumption.
Retrieval results, cache hit ratios, and memory fetches.

Distributed tracing, prompt telemetry, and agent step replays are no longer optional middleware—they are foundational components of a production-grade stack.

6. Vector Databases Need Strategic Thinking

Choosing a vector storage solution solely based on convenience is a common pitfall. While extensions like pgvector can work perfectly fine for small prototypes, enterprise-scale semantic retrieval demands a specialized, highly optimized approach.

Production Retrieval Pipeline
Achieving high-quality Retrieval-Augmented Generation (RAG) is less about the underlying database and more about the architecture of your retrieval pipeline.

Good retrieval quality comes from a combination of robust chunking strategies, embedding alignment, metadata filtering, cross-encoder re-ranking, and context compression.

7. Living Documents Need Incremental Vectorization

Enterprise knowledge bases (wikis, policies, contracts, and product catalogs) are constantly evolving. Re-vectorizing an entire document corpus after every minor update is an operational bottleneck that drains compute resources and drives up embedding costs.

Production Pattern: Incremental Embedding Pipelines
Implement deterministic hashing (such as MD5 or SHA-256) on individual document chunks.

When a document updates, chunk it and compare the new hashes against your existing vector store. You only vectorize and update the specific chunks that have actually mutated. This results in lower embedding costs, faster ingestion, reduced compute usage, and smaller synchronization windows.

8. Semantic Caching is the Hidden Cost Weapon

Most enterprise prompts are highly repetitive. Users frequently ask similar questions, trigger identical retrieval requests, and run the same automated workflows. Recomputing these identical requests from scratch every time wastes valuable resources.

**Dual-Layer Semantic Caching
**To optimize performance, deploy a dual-layer semantic caching strategy that functions as a high-speed, localized vector lookup:

Prompt-Level Cache: Intercepts and matches semantically similar incoming user intents.

Tool-Level Cache: Intercepts repetitive enterprise API and database calls triggered by agents.

Semantic caching can dramatically reduce both latency and token usage. It can be applied to:

Prompt responses
Retrieval outputs
Tool-calling results

In practice, semantic caching behaves like a lightweight similarity-based memory layer.

However:

Cache invalidation matters
Stale responses must be avoided
TTL and refresh policies are critical

⚠️ Critical Warning on Cache Invalidation: Caching without proper invalidation is incredibly dangerous. Delivering a stale AI response is often worse than a slow response. You must implement robust Time-To-Live (TTL) policies, event-driven cache invalidation, and business-aware expiration logic to ensure your AI never delivers outdated information with high confidence.

9. Fine-Tuning is Often Overused

Fine-tuning sounds attractive because it promises to inject domain expertise, reduce prompt sizes, and enforce strict formatting consistency. However, many enterprises underestimate the long-term operational burden, which includes complex dataset curation, model drift management, dedicated GPU costs, ongoing retraining pipelines, and versioning challenges.

Most importantly, fine-tuned models remain static; they cannot access real-time enterprise data without external retrieval systems.

The Strategic Reality
For the vast majority of enterprise use cases, optimizing RAG, implementing semantic caching, refining chunking strategies, and establishing robust memory design delivers a significantly higher ROI than fine-tuning.

Fine-tuning should be strictly reserved for specialized output formats (like custom JSON structures), highly constrained styling behaviors, domain-specific generation languages, or unique reasoning patterns. Keep the model foundational, and keep the architecture modular.

Fine-tuning introduces additional operational complexity:

Curated datasets
GPU infrastructure
MLOps / LLMOps pipelines
Monitoring and evaluation
Governance and retraining

Because of this, I generally recommend exhausting higher-ROI optimizations first:

Prompt engineering
RAG
Memory
Routing
Caching

Fine-tuning makes the most sense when enterprises require:

Highly specialized behavior
Strict response formats
Domain-specific language
Consistent deterministic outputs

10. Chunking Strategy is More Important Than Most Teams Realize

Many RAG failures are not caused by the model; they are caused by poor chunking. If your chunks are too large, retrieval becomes incredibly noisy. If they are too small, core semantic meaning breaks. If they are poorly structured, the contextual coherence collapses.

Chunking is not merely splitting text based on fixed character counts; it is the art of preserving semantic meaning boundaries.

A Useful Mental Model: Chunking is like cutting an elephant into LEGO pieces. The shape of the piece matters just as much as its overall size.

An optimal chunking strategy must explicitly account for document hierarchies, semantic transitions, structural tables, code blocks, headers, and metadata relationships. Optimizing your chunking methodology will almost always yield a greater improvement in retrieval quality than switching to a larger LLM.

Final Thought: Enterprise AI is a Systems Engineering Discipline

The industry initially treated GenAI/Agentic-AI as a prompt engineering problem. Today, it has clearly evolved into a memory architecture, distributed systems, retrieval engineering, cost optimization, and workflow orchestration challenge.

The winning enterprise AI platforms will not necessarily be the ones deploying the largest standalone models. They will be the ones that build better orchestration, superior memory management, deep observability, resilient retrieval pipelines, and highly optimized context engineering layers.

In production AI systems, architecture eventually matters more than prompts.

Summary Checklist for AI Architects

Thanks
Sreeni Ramadorai

You Were Trained. But Are You Ready to Serve?

Seenivasa Ramadurai — Sat, 16 May 2026 14:10:26 +0000

Introduction

The gap between building an LLM and running it in production and what it teaches us about our own careers.

We have all met that person. Top of their class. Brilliant in theory. Deep, encyclopedic knowledge in their field. And yet, somehow, they struggle the moment real work lands on their desk. They freeze when faced with ambiguous problems. They slow down under pressure, failing to deliver at the level everyone expected.

The world of machine learning has a name for this exact failure mode. It isn’t a training problem. It’s a serving problem. Once you see it through this lens, you will never look at your education, your career, or your daily workflow the same way again.

Part 1: You Are a Model.

College Was Your Training Run.

In machine learning, a model begins as a blank slate an empty architecture with no knowledge, no instincts, and no ability to recognize patterns.

Then, training begins. The model is fed enormous amounts of data text, images, signals and it fails constantly. Each failure produces an error signal. That error signal flows backward through the network, making tiny adjustments to the model's internal parameters: its weights and biases.

Repeat this process millions of times, and something remarkable happens. The model stops failing randomly and starts recognizing structure. It builds intuition, develops defaults, and becomes capable.

That is exactly what formal education does to you.

Every lecture, every textbook chapter, every exam you failed and had to retake, every piece of stinging feedback from a mentor—each one was a gradient update. It was a small error signal flowing back through your thinking, adjusting your internal parameters. Your weights and biases are your professional instincts: how you approach a problem, what tool you reach for first, and how you reason under pressure.

College built those slowly, painfully, and iteratively. Training was never truly about the grade; it was about adjusting the weights.

But remember: this phase is long and controlled. The data is curated, and the environment is safe. The answers exist somewhere, and someone is grading your output against them. Training is preparation not performance.

Part 2: Your Degree Is Your Domain-Specific Fine-Tune.

After general pre-training, machine learning models go through a second phase called fine-tuning. The base model already has broad capabilities it understands language, logic, and patterns. Fine-tuning narrows that capability toward a specific domain.

A model fine-tuned on medical data learns to reason about symptoms and diagnoses. One fine-tuned on legal documents learns to navigate argument and precedent. It’s the same base architecture, but a completely different specialization.

Your degree is your fine-tune. You stopped being a general learner and became domain-specific. Your configuration was set, and your weights were adjusted for a particular problem space.

A medical student's parameters are tuned to healthcare.
A software engineer's are tuned to systems and logic.
A finance major's are tuned to risk, capital, and market behavior.

By the time you walk across that graduation stage, your architecture is locked in. You are no longer a blank model trained on everything broadly; you are a specialized model trained deeply on something specific.

That is the value your institution produced , and specific is exactly what the real world hires for.

But here is the thing nobody tells you at graduation: Fine-tuning is not the finish line. It is just the end of the controlled phase. The real test begins somewhere else entirely.

Part 3: Getting the Job Is Deployment.

And Deployment Changes Everything.

In machine learning, when a model finishes training and fine-tuning, it gets deployed into production. This is called model serving.

The model is now live. Real users send real requests. The environment is absolutely nothing like training. There is no curated dataset, no answer key, and no controlled batch of problems neatly designed to be solvable. There are just requests—unpredictable, varied, and arriving concurrently at any time. The model must handle them fast, reliably, and accurately.

When you land your first job, you have been deployed. And the rules change completely.

Model serving is the most critical phase of the entire pipeline. It is where value is actually created not in the research notebook, but in production, under real load, handling requests the model has never seen before.

A model that trains beautifully but collapses in production is entirely worthless

Part 4: The Uncomfortable Truth of Brilliant People Who Cannot Perform

We have all witnessed it: the student who aced every exam but freezes the moment a project doesn't fit a known template. The top graduate who cannot handle ambiguity. The deeply knowledgeable professional who always seems behind, overwhelmed, and bottlenecked on every task.

This is not an intelligence failure, nor is it a lack of knowledge. In machine learning terms, this is a well-trained model with broken serving infrastructure.

The weights are good, the training was solid, and the fine-tuning was real. But when the model hit production when unseen requests started arriving in real-time with no answer key the infrastructure around it simply couldn't handle the load. Requests queued up, memory was wasted, and output slowed to a crawl. The model was capable, but the serving layer was not.

Training quality and serving quality are two completely separate problems. A brilliant model can fail in production, and a brilliant person can fail at work for the exact same reason.

This is the gap nobody talks about in education. Schools optimize entirely for training quality better lectures, better exams, better grades. Nobody teaches you how to serve. Nobody teaches you how to handle requests you’ve never seen, how to manage your cognitive resources under concurrent load, or how to build the execution infrastructure that turns what you know into what you consistently deliver.

In machine learning, two frameworks represent exactly this divide. One is built for training and research; the other is built for production serving. Understanding what separates them changes everything.

Part 5: Hugging Face Transformers vs. vLLM

Framework 1: Hugging Face-The Brilliant Student Who Works Alone

Hugging Face Transformers is the gold standard for research, experimentation, fine-tuning, and prototyping. If you want to load a state-of-the-art model and iterate on an idea, it’s extraordinary.

But when you take a Hugging Face model and naively deploy it to serve real user traffic, engineering bottlenecks surface fast:

Static batching: It waits for a full batch to assemble before processing. If requests arrive unevenly, the GPU idles, throughput drops, and users wait.
Memory pre-allocation: It pre-allocates a fixed block of GPU memory per request for the maximum possible sequence length, even if the request is short. Most memory is wasted, causing you to run out of memory far too early under real load.
No shared caching: If a hundred users start with the same long system prompt, attention states are recomputed a hundred times from scratch with no reuse.
The Pipeline Jams: A single long generation occupies a batch slot, blocking faster, shorter requests behind it.

The Human Equivalent: This is the brilliant professional who works deeply but can only handle one task at a time. They take on a problem, give it everything, finish it completely, and then pick up the next. They never build systems, and they don't document solutions for reuse, so every new project starts from scratch. They are outstanding in a controlled environment, but entirely overwhelmed the moment volume, concurrency, and unpredictability arrive simultaneously.

Hugging Face isn't wrong—it is perfectly designed for its purpose. The mistake is using a research tool as a production serving engine , assuming that being well-trained is the same as being ready to serve.

Framework 2: vLLM The Same Model, Built to Serve

vLLM is an open-source inference engine built with a single purpose: serving large language models in production at scale. It doesn’t change the model’s weights or retrain anything. It takes the exact same model that runs in Hugging Face and serves it in a way optimized for real traffic, memory constraints, and throughput requirements.

The results are dramatic: the same model, on the same hardware, can achieve up to 24x higher throughput simply because the serving layer was optimized.

Four core engineering innovations make this possible, and each has a direct equivalent in how high-performing people operate in the real world:

1. PagedAttention vs. Focused Attention

In ML: Traditional serving pre-allocates one massive block of GPU memory per request—like reserving an entire hotel floor for a single guest. Most of it sits empty . vLLM's PagedAttention manages the KV cache in small, dynamic, non-contiguous pages. Memory is allocated only as needed and released immediately upon completion, resulting in near-zero waste. This is how vLLM handles dramatically more concurrent traffic.
In You: High performers do not hold every open project and pending email simultaneously in their active working memory. They "page in" what the current task actually needs, process it, and release it. People who carry everything at once feel constantly busy, but their output is fragmented, slower, and lower quality. Focused attention isn't a soft skill—it’s memory management.

2. Continuous Batching vs. Pipeline Thinking

In ML: Instead of waiting for an entire batch to run to completion (static batching), vLLM uses continuous batching. The moment any slot completes its generation, a new request is slotted in immediately. The GPU is never idle, and throughput skyrockets.
In You: Effective professionals design their workflow so it never idles. While one deliverable is in review, the next is already in motion. While waiting on a response, another task is being processed. This isn't frantic multitasking; it is deliberate, linear sequencing.

3. KV Cache Reuse vs. Your Body of Prior Work

In ML: In enterprise applications, requests constantly repeat the same system prompt. Hugging Face recomputes those attention states from scratch every single time . vLLM uses prefix caching to compute those states once, store them, and allow subsequent requests to retrieve the cache instantly instead of recomputing. Latency drops off a cliff.
In You: Every problem you have solved and documented, every framework you've built, and every decision log, template, or post-mortem you’ve written down is your personal KV cache. You don't start from scratch on a new task; you retrieve, adapt, and ship. Professionals who never build this cache spend their entire careers recomputing things they solved years ago.

KV Cache Joke

4. High Throughput vs. Output Matching Capability

In ML: The combined effect of these innovations means more requests handled per second and a lower time-to-first-token on the exact same hardware. The model didn’t get smarter; the infrastructure got optimized.
In You: This translates to more output per unit of energy. Not by working longer hours or magically knowing more, but by removing the friction between what you are capable of and what you actually deliver.

The Point: Build the Model.

Then Build the Inference Engine.

Your education trained you. Your discipline fine tuned you. Your first job deployed you into production. Those phases matter deeply, and the years you put into them are real. But they only produced a capable model not an optimized serving layer.

Production is where the game is actually played: concurrent demands, zero answer keys, immediate deadlines, and real stakes. This is where training quality stops mattering, and serving infrastructure takes over.

The person who implements PagedAttention (focused, uncluttered cognitive management) processes metrics more clearly.
The person who practices Continuous Batching (keeping their pipeline moving safely) delivers consistently.
The person who builds a KV Cache (documenting and templating solutions) never wastes time recomputing the past.

Hugging Face gets you running; vLLM gets you scaling. Your degree got you deployed, but how you serve is how you are remembered.

The question was never whether you were trained well enough. The question is whether your infrastructure is ready for production.

Thanks
Sreeni Ramadorai

The Central Bank of Intelligence: Navigating the Token Economy

Seenivasa Ramadurai — Fri, 15 May 2026 00:12:33 +0000

Why every prompt is a financial transaction and the Vector DB is your vault.

Not long ago, if you asked a software architect what fueled a system, the answer was straightforward compute, storage, and data. You queried a database, an API moved the data, and an application processed it. It was a deterministic world where operational costs were highly predictable.

Then Generative AI entered the picture and completely rewired the economics of software.

This wasn’t just a technological leap; it was a financial paradigm shift. And at the heart of this new architecture lies something deceptively small the token
.

When Software Became an Economy

Initially, tokens seemed like mere linguistic fragments pieces of words or punctuation. But as organizations scaled Large Language Models (LLMs) into production, a hard truth emerged every interaction is a financial event.

Prompts consume tokens. Responses generate them. Inject memory, retrieve documents via RAG (Retrieval-Augmented Generation), or deploy autonomous agents, and your token consumption compounds exponentially. Software teams are no longer simply building applications; they are managing micro economies. Every architectural choice now bends a cost curve.

The fundamental engineering question has evolved from:
"Is the model capable of doing this?" TO
"Can we afford to scale this?"

The Metered Reality of Intelligence

In traditional software, retrieving data was practically free once infrastructure costs were covered. Generative AI shattered that reality. Today, intelligence operates with a running meter, attaching a price tag to every design decision.

Do we inject the entire document, or just semantic chunks?
Should we preserve the full chat history?
Does this specific task actually require a premium frontier model?
Is complex chain-of-thought reasoning worth the added token spend?

LLMs are consumption engines. Bloating a prompt with unnecessary tokens doesn't just inflate your bill it increases latency, dilutes the model's focus, and can actually degrade the quality of the response.

The Context Window: The New Memory Hierarchy

In classical computing, architects obsessed over optimizing RAM, CPU, and caching. In the AI era, the battleground is context efficiency.

Think of the context window as incredibly expensive working memory. Just like human cognition, shoving too much noise into that memory causes overload. The future of AI engineering isn’t about feeding the model more information; it’s about feeding it the minimum necessary intelligence. This requires mastering semantic retrieval, summarization, compression, and memory pruning.

The Dawn of AI FinOps

This financial reality forces Product Managers, Architects, and Developers into a new, tightly knit collaboration. AI systems cannot survive at scale unless everyone involved thinks like an economist.

Welcome to AI FinOps. The organizations that win won't necessarily have the smartest models they will have the most economically sustainable architectures.

The Model Router: The Central Bank of AI or Your Intelligent Cost Broker

Here is the single most impactful architectural decision you can make for AI cost management never route all traffic to your most expensive model.

Think of model selection like hiring for different tasks. You wouldn't hire a neurosurgeon to take your blood pressure. The same principle applies to LLMs. The key is building routing logic that can distinguish between request complexity and send each one to the appropriate (and cheapest) model tier that still meets your quality requirements.

A fast track to bankruptcy in the AI era is routing every single user request to your largest, most expensive model. You don't hire a PhD to grade middle school math.

Enter the Model Router. Acting as an intelligent broker, the router evaluates a prompt's complexity, predicts its cost, and directs it to the most efficient model available.

By intelligently matching the task to the tool, a routing layer dramatically slashes operational burn.

AI Model Router within Microsoft Foundry. You can also build a simple router yourself a lightweight classifier that scores prompt complexity and routes accordingly.

The Self-Hosting Illusion and the Quality Gap

To escape API costs, many organizations pivot to self hosting open source models. While this eliminates the "utility bill" of per token API pricing, it replaces it with heavy infrastructure ownership: GPUs, inference scaling, MLOps, continuous tuning, and failover management.

A word about quantization

Before we get into the tools, you need to understand quantization it's the key to making large models run on realistic hardware. By default, model weights are stored as 16-bit or 32-bit floating point numbers. Quantization reduces this precision (to 8-bit, 4-bit, or even smaller), dramatically reducing memory requirements while sacrificing relatively little quality.

In practice, you'll see model files with names like llama3.1-8b-Q4_K_M.gguf. The Q4_K_M means 4-bit quantization with the "K_M" quality variant (a good general-purpose choice). A 70B parameter model at full precision needs ~140GB of VRAM. The same model quantized to 4-bit needs about 40GB — suddenly runnable on 2 × RTX 4090 cards.

The Three Tools for Self Hosting LLMs

Ollama: The Fastest Path to Running a Local LLM

Open Source
Developer Friendly
Free

Ollama (ollama.com) is the tool that made local LLMs genuinely accessible. It packages model weights, quantization, and a local API server into a single binary that installs in minutes. Under the hood it uses llama.cpp a highly optimized C++ inference engine and supports GPU acceleration on CUDA (NVIDIA), Apple Metal (M-series chips), and AMD ROCm.

Best for

Individual developers, small teams, privacy-sensitive applications where data cannot touch external APIs, rapid prototyping, and high volume internal tools where marginal token cost matters.

Docker Model Runner AI Models as First-Class Container Citizens

Docker Native
Open API
Docker Desktop Required

Docker Model Runner (DMR) was introduced in Docker Desktop 4.40 in April 2025. The idea is elegant: if your entire stack runs in Docker, why should your AI inference layer be a separate tool managed differently? DMR brings model execution into the same CLI, the same Compose files, and the same mental model your team already uses.

Best for

Teams already running containerized microservices who want zero additional tooling overhead. If you already have Docker Compose fluency on your team, DMR is the lowest-friction path to local AI inference.

Microsoft Foundry Local Enterprise AI with a Governance Layer

The enterprise grade platform for building, deploying, and governing AI applications. In 2025, Microsoft added Foundry Local enabling on device and on premises inference with the full Azure governance stack intact.

This matters enormously for organizations in regulated industries. Running a local Llama model with Ollama is easy, but you lose auditability, policy enforcement, and compliance documentation. Foundry Local gives you the economics of self-hosting with the governance layer your compliance team requires.

Best for

Enterprise teams in regulated industries (healthcare, finance, government) who need self-hosted inference but cannot sacrifice compliance auditability. Also ideal for organizations with existing Azure infrastructure investments.

Bonus: Hugging Face Where the Models Actually Live

You can't talk about self-hosting without mentioning Hugging Face (huggingface.co) the open source model repository that underpins the entire local LLM ecosystem. Think of it as the GitHub for AI models. When you run ollama pull llama3.1, Ollama is ultimately pulling model weights that originated as Hugging Face repositories.

Hugging Face hosts over 900,000 models (as of 2025), including the Llama family (Meta), Mistral (Mistral AI), Gemma (Google), Phi (Microsoft), Qwen (Alibaba), and DeepSeek. It provides the transformers library for Python-based inference, the Hub API for programmatic model access, and hosted inference endpoints for teams who want managed API access to open models without running their own infrastructure.

For production scale: consider vLLM

One tool worth knowing about for high throughput production deployments is vLLM an open source inference server specifically optimized for serving LLMs at scale. While Ollama and Docker Model Runner are excellent for development and moderate workloads, vLLM implements techniques like PagedAttention and continuous batching that dramatically improve throughput for concurrent requests. If you're serving hundreds of simultaneous users from a self-hosted model, vLLM is the production grade serving layer.

The Quality Reality Check Be Honest With Yourself

I want to be direct here because too many self-hosting conversations gloss over this the quality gap between open-source models and frontier proprietary models is real. It is narrowing, but it exists. The gap is not uniform it shows up specifically in certain task types.

Furthermore, a quality gap persists. While open source models are improving rapidly, many still struggle with long context consistency, deep reasoning, hallucination control, and complex tool orchestration.

This creates a dynamic where hybrid routing becomes the gold standard open source models handle high volume, low risk workloads, while premium API models are reserved for critical reasoning.

The "good enough" question

The question is never "is it as good as GPT-4o?" The question is "is it good enough for THIS specific task in THIS specific context?" For classifying customer intent into one of 12 categories, a 7B local model is almost certainly good enough. For drafting a high-stakes client proposal, probably not. Know your use case before you decide on your model tier.

Embeddings: The Capital Assets of AI

If tokens are your operating cash, embeddings are your long term capital assets. They compress organizational knowledge into reusable vectors.

But like any asset, they depreciate. When underlying data changes, embeddings go stale, leading to degraded retrieval and increased hallucinations. Because re-indexing entire datasets constantly is financially ruinous, smart architectures act like portfolio managers—utilizing delta indexing, semantic diffing, and selective re-embedding to keep knowledge fresh cost-effectively.

Agents as Economic Actors

AI agents add a volatile layer to this economy because they iterate autonomously. They plan, retrieve, reason, and retry. Every loop burns tokens. Agents must therefore operate as economic entities, constantly balancing cost against accuracy, and depth against speed.

Operationalizing this requires frameworks that prioritize cost-awareness:

Skill Budgeting: Estimating token usage and model selection before execution.

Context Engineering: Retrieving precision chunks rather than bulk documents.

Cost Observability: Tracking real-time telemetry like token burn rate, cost-per-skill, and cost-per-user.

Practical Decision Framework

Here is the decision tree I use when figuring out how to route any given AI workload

What "Rich" Actually Means in a Per Token World

We started with a very real industry shift: the flat subscription is dying. GitHub Copilot, Cursor, Claude Code, and the tools your team uses every day are repricing themselves against the actual cost of computation. That is not going away. If anything, more tools will follow.

But here is the reframe that changes everything: per token pricing is not your enemy. It is actually the most honest pricing model that has ever existed for software. You pay for the intelligence you actually consume. The teams that will struggle are the ones who keep treating AI like a utility with a flat bill using frontier models for everything, flooding context windows, running agents without measuring cost per step.

The teams that will thrive are the ones who build with cost awareness as a first class concern from day one. That means routing intelligently, caching aggressively, using embeddings instead of brute force context, and for the right workloads minting their own tokens by running open source models on hardware they own.

Self hosting is not a compromise. For internal tooling, high volume pipelines, privacy sensitive data, and agent tasks that don't require frontier level reasoning, a well configured local model is the strategically correct choice. Not because it is "good enough" but because it is the right tool for the job, and it happens to cost nearly nothing per token.

The practical checklist

Defining Wealth in the AI Era

We are entering an era of token aware architecture. Tomorrow's software architects will also be economists, managing inference budgets and dynamic model marketplaces.

In traditional software, scale was measured by how many users a system could reliably serve. In AI systems, scale is increasingly defined by how efficiently intelligence is used under constraint. The most effective systems minimize unnecessary context, dynamically select the most cost efficient model for the task, and reduce redundant reasoning steps while still preserving accuracy and quality. In this new paradigm, value is not just about capability it is about delivering the right level of intelligence at the lowest possible token, compute, and latency cost.

We are no longer just building applications. We are designing self-contained economies of intelligence, and tokens are the currency that powers them.

Thanks
Sreeni Ramadorai

𝐘𝐨𝐮𝐫 𝐀𝐈 𝐀𝐠𝐞𝐧𝐭 𝐊𝐧𝐨𝐰𝐬 𝐖𝐡𝐚𝐭 𝐓𝐨 𝐃𝐨… 𝐁𝐮𝐭 𝐃𝐨𝐞𝐬 𝐈𝐭 𝐊𝐧𝐨𝐰 𝐇𝐨𝐰 𝐓𝐨 𝐃𝐨 𝐈𝐭? 🤔

Seenivasa Ramadurai — Wed, 06 May 2026 22:57:35 +0000

The missing piece in most LLM applications, and how AgentSkills fix it. We've gotten pretty good at telling AI agents who they are.

"You are an expert software engineer." "You are a seasoned marketing strategist." We hand them a persona, dump in some context, maybe paste in a few examples and then we hit send and hope for the best.

And for simple tasks? That works fine.

But the moment you ask an agent to do something that involves multiple steps, decisions, and potential failure points things start to fall apart in ways that are hard to predict and even harder to debug.

The agent sounds confident. It just doesn't behave consistently.

Here's why and what to do about it.

The Gap Nobody Talks About

There's a meaningful difference between knowing what needs to be done and knowing how to do it reliably.

A new employee on their first day might understand the goal perfectly "onboard this customer" but still flounder without a clear process. Do they send the welcome email first or set up the account? What if the system throws an error? Who do they escalate to?

Without a procedure, they improvise. Sometimes that works. Often it doesn't.

LLM agents have the exact same problem.

You can give an agent all the context in the world about what it's supposed to accomplish, and it'll still invent its own process every single time it runs. Skipping steps. Hallucinating validations. Silently glossing over failures.

This is the gap and it's where most LLM applications quietly break down.

Enter AgentSkills (and Why They're a Big Deal)

AgentSkills also called Procedure Skills are exactly what they sound like: explicit, step-by-step instructions that teach an agent how to execute a task, not just what the task is.

Think of it less like a prompt and more like a standard operating procedure. A playbook. A binder on the shelf.

Industry leaders like Anthropic and Microsoft have both converged on this idea and formalized it around a portable format called SKILL.md. That's not a coincidence it signals that the field is maturing from "prompt engineering" toward something more rigorous: procedure engineering.

What a Skill Actually Looks Like

A skill isn't a single prompt tucked inside a system message. It's a structured, self-contained unit of procedural knowledge a directory that bundles everything an agent needs to execute a specific type of task.

Here's how it breaks down:

SKILL.md is the core the instruction manual. It contains YAML frontmatter that lets the agent automatically discover and select the right skill for the job, plus detailed step-by-step execution instructions.

scripts/ holds small, single purpose automation scripts (Python, Bash, Node.js) for the steps that LLMs consistently get wrong when left to their own devices. Repetitive operations, file handling, API calls these belong in code, not in natural language instructions.

resources/ contains domain specific knowledge company standards, data schemas, regulatory rules anything the agent needs to reference but shouldn't be expected to memorize.

assets/ stores output templates. JSON schemas, document layouts, checklists so the agent produces consistent, structured results every time.

Put it all together and you get a self contained playbook instructions, tools, references, and templates in one place.

The Three Layers Most Teams Confuse

Before you can appreciate why skills matter, it helps to get clear on what they're not:

Most teams have prompts. Many now have tools. Very few have skills.

A skill is where workflow intelligence lives. It's the layer that answers the questions nobody bothers to write down: What comes first? What needs to be validated before moving on? What happens if this step fails?

Why Embedding All of This in a System Prompt Fails

The intuitive response to all of this is: "Can't I just put the procedure in the system prompt?"

You can. And for a single, small workflow it might work okay. But it breaks down fast for a few predictable reasons.

Fragility. Large, instruction-heavy prompts are brittle. One small tweak to the wording can cascade into completely different agent behavior. There's no modularity, no separation of concerns.

Token waste. Every time the agent runs, it pays the full token cost of every procedure even the ones that are completely irrelevant to the current task. At scale, this adds up fast.

Inconsistency. Without explicit validation steps ("check whether the file exists before editing it"), agents will invent shortcuts. They'll confidently skip steps and never tell you they did it.

The result is the thing that makes AI in production so frustrating: agents that sound certain and behave unpredictably.

The Idea That Changes Everything: Progressive Disclosure

Here's the mental model that ties this all together and it's dead simple.

Imagine your new employee's first day. You have two options:

Bad approach: Pile every binder all 50 of them on their desk. Tell them to read all of it before they start. By 11am they're exhausted, overwhelmed, and can't remember a thing.

Good approach: Put the binders on a shelf with clear labels. They glance at the labels, grab the one they need, read it, and do the job. Tomorrow, they grab a different one.

That's Progressive Disclosure.

In practice, it works in two phases:

Discovery Phase The agent loads only skill names and short descriptions. A table of contents for procedural knowledge. Minimal tokens, maximum orientation.

Activation Phase When a user request matches a skill's description, the agent loads the full SKILL.md and supporting assets into active memory. Only what's needed, only when it's needed.

The payoff is real: fewer hallucinations, lower token costs, better decisions when many skills exist simultaneously.

How to Design Skills That Actually Work

If you're going to build skills, these principles are worth internalizing from day one:

Write in third person imperative. "Extract the text." Not "You should try to extract the text." Precision matters ambiguous instructions produce ambiguous behavior.

Define failure states explicitly. What should the agent do when a script errors? When a file is missing? When validation fails? If you don't specify, the agent will improvise and you won't like the improvisation.

Keep skills small and composable. A skill called "Marketing" is a red flag. A skill called "Ad Copy Generation" is useful. A skill called "SEO Analysis" is useful. Small, focused skills compose into larger workflows. Monolithic skills just become another fragile mega prompt in disguise.

When Does This Actually Matter?

Not every situation calls for this level of structure. If you have one skill and it's always needed, just hand it to the agent upfront. Progressive disclosure doesn't help when there's nothing to disclose progressively.

But as your agent grows more tasks, more workflows, more edge cases the calculus changes:

10 skills, one needed at a time? Huge savings. Show only what's needed.
50 skills? Progressive disclosure becomes essential. Otherwise the agent drowns.
Complex multi step workflows? Explicit failure states and validation steps stop being nice-to-have and become the difference between an agent that works and one that confidently fails.

The Shift Worth Making

AgentSkills represent a genuine change in how we think about building with LLMs.

We're moving from prompt engineering which is ultimately about describing what we want to procedure engineering, which is about encoding how to reliably do it.

From probabilistic answers to deterministic execution.

From agents that talk about the work to agents that actually do it.

The tools and the personas are important. But without skills, you've hired a brilliant employee who has no idea how your company actually operates. Give them the binders. Label them clearly. Put them on the shelf.

That's the whole idea.

The one takeaway: MCP gives the LLM the tools. Skills tell the LLM when to use them. Progressive disclosure means "show only what's needed, when it's needed."

Thanks
Sreeni Ramadorai