Divy Yadav

Posted on May 20 • Originally published at pub.towardsai.net

LLMs, RAG, Agents, MCP: The AI Evolution You Actually Need to Understand

#agents #ai #llm #machinelearning

Most people still think AI is just a chatbot.

That idea is already outdated.

Modern AI systems browse the web, remember your preferences, execute code, query databases, call APIs, and coordinate workflows. They operate more like software employees than like a search bar.

This did not happen because models got smarter. It happened because the architecture changed.

Every layer of the modern AI stack exists because the previous layer had a real failure. Understanding what failed and why something new was built is the fastest way to understand how any serious AI product works today.

That is what this article covers. Every stage: LLMs, RAG, Agents, and MCP.

In this article, you will get a good idea about the full AI Evolution, and it took me a lot of research and work for this one.

So if you want more such information about AI, consider subscribing to my newsletter, where you will get noise-free information every week

Link for the newsletter: Newsletter

Stage 1: The LLM Era

What LLMs Actually Are

An LLM is a prediction engine.

Not a reasoning engine. Not a database. Not a search system.

Given text, it predicts what comes next. That prediction runs over and over, token by token, until a full response is generated. The model learns these predictions from enormous amounts of human text: books, articles, code, research papers, websites.

Input:  "The capital of France is"
Model:  [predicts next token]
Output: "Paris"

Simple idea. The scale is what makes it work.

A token is roughly 3 to 4 characters of English text. "Hello, world!" is about 4 tokens. Everything the model processes and generates is counted in tokens. This affects cost, speed, and the limits we will cover shortly.

Why It Felt Like a Big Deal

For the first time, a machine could:

Hold a fluent conversation in any language
Write code that actually ran
Summarize a 50-page document in seconds
Explain complex topics to a non-expert
Answer questions across almost any domain

Where It Broke Down

Then people started building real products. The limitations became obvious fast.

Hallucination: The model predicts what is plausible, not what is true. It will state wrong facts with total confidence.

Knowledge cutoff: Training data has a date. Ask about last week and it guesses.

No memory: Every conversation starts blank. The model has no idea what you talked about yesterday.

No access to your data: Your company documents, your database, your internal systems. The model knows none of it.

No ability to act: It produces text. It cannot send an email, run a query, or update a record.

Ask a pure LLM: "What was Apple's stock price yesterday?" It will either refuse or make up a number.

It has no connection to live systems. It is a very smart autocomplete engine. Autocomplete alone does not run a business.

This limitation is what created the next stage.

Stage 2: RAG Changes the Game

The Core Idea

RAG stands for Retrieval-Augmented Generation. One sentence covers it:

Before generating a response, retrieve the relevant information and give it to the model.

Instead of relying only on training data, the system fetches fresh, relevant context at the moment of each query.

A Simple Way to Think About It

Pure LLM: A student answering an exam entirely from memory. Sometimes brilliant. Sometimes confidently wrong.

RAG: The same student, but allowed to open their notes before answering. Answers are grounded in actual sources.

The model did not get smarter. It got better information to work with.

How RAG Works

USER QUERY
    ↓
RETRIEVE relevant documents
(from a vector database, using semantic search)
    ↓
INJECT those documents into the prompt as context
    ↓
LLM generates an answer grounded in the retrieved content
    ↓
RESPONSE (accurate, with sources)

The Technology Behind Retrieval

Embeddings are what make semantic search work.

Documents are converted into vectors, which are lists of numbers that represent meaning mathematically. Similar meanings end up close together in vector space. "Car" and "automobile" are close. "Car" and "photosynthesis" are not.

When a user query arrives, it is also converted to a vector. The system finds the stored vectors nearest to that query vector. Those are the semantically relevant documents, retrieved and injected into context.

Common vector databases:

Database	Best For
Pinecone	Managed, production-ready
Weaviate	Open-source, rich query support
Chroma	Development and small-scale use
FAISS	Fast, local, no managed infrastructure

What RAG Unlocked

RAG became the foundation of a lot of serious AI products:

Enterprise knowledge assistants
Customer support bots grounded in actual policy
PDF and document Q&A
Internal search that surfaces the right document
Any system needing up-to-date or private data

What RAG Still Could Not Do

Retrieval solves the knowledge problem. It does not solve the action problem.

RAG can find the answer to "what is our refund policy?" It cannot process the refund. It can tell you flight options. It cannot book the ticket.

For that, a different capability was needed.

Stage 3: The Rise of AI Agents

The Core Shift

Traditional AI:

User asks → Model answers → Done

Agent:

User sets a goal → Agent plans → Agent uses tools →
Agent observes results → Agent decides next step →
Agent continues until goal is complete

Agents reason, plan, use tools, and execute multi-step workflows. They operate rather than just respond.

Tool Calling: How Agents Reach the Real World

An LLM by itself cannot search Google, call an API, write to a database, or run code. Tool calling extends the model's reach.

User: "Find the cheapest flights from Delhi to Singapore next month."

Agent Step 1: Call flight search API with parameters
Agent Step 2: Receive results
Agent Step 3: Sort and compare options
Agent Step 4: Summarize the three cheapest options

The model decides which tool to call, with what arguments, and what to do with the result. It manages the whole workflow.

What Agents Can Do

A capable AI agent can:

Browse websites and extract information
Write, execute, and debug code
Send emails and messages
Query and update databases
Call any API with proper credentials
Coordinate with other agents
Schedule and manage workflows

Frameworks That Made This Practical

Building agents from scratch is tedious. Frameworks handle the boilerplate:

LangChain / LangGraph - most widely used, graph-based orchestration
AutoGen - multi-agent conversations, good for collaborative tasks
CrewAI - role-based agent crews for structured workflows
OpenAI Agents SDK - native tool calling with built-in orchestration

Where Agents Break

More power introduced more failure modes:

Context overflow: Long agent runs fill the context window. Earlier instructions get lost. Accuracy drops.

Memory fragmentation: Without a coherent memory system, agents lose track of what they were doing.

Tool confusion: Too many tools and the model picks the wrong one or misuses it.

Hallucinated actions: The model invents results from tool calls it never actually made.

Runaway loops: No stop condition means the agent keeps going when it should have asked for clarification.

There was also a deeper infrastructure problem. Every agent integration was custom-built. Connecting to Slack required one connector. Google Drive required another. Salesforce required another. There was no standard. Scaling meant a growing stack of hand-built code.

That is what MCP was built to fix.

Stage 4: MCP, The Protocol That Standardizes Everything

The Problem Before MCP

Before November 2024, connecting an AI system to external tools meant:

Custom integration for every tool
Different formats for every API
No standard for how models discover what tools are available
No consistent way to pass context or results between systems

Every new data source required its own implementation. This was not an AI limitation. It was an infrastructure limitation.

What MCP Is

MCP stands for Model Context Protocol. It is a standard for connecting AI assistants to the systems where data lives, including content repositories, business tools, and development environments.

Anthropic announced MCP in November 2024 and open-sourced it on day one.

MCP defines a universal interface for:

Reading files and data sources
Executing functions and tools
Handling context and prompts
Coordinating between AI systems and external environments

The USB-C analogy is actually a good one here. Just as USB-C made it easy to connect any device to any peripheral, MCP makes it easier to connect any AI model to any data source or tool. One protocol, many integrations.

How It Works Architecturally

MCP servers expose three things:

Tools: actions the model can call
Resources: data the model can read
Prompts: templates for interaction

The model queries the server to discover what is available, then invokes tools in a structured, validated format.

Adoption

MCP did not take years to catch on. Since launch, it has been adopted by OpenAI, Microsoft, Google, and Cloudflare. The Python and JavaScript SDKs together see over 20 million weekly downloads. Over 13,000 MCP servers launched on GitHub in 2025 alone.

In December 2025, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, co-founded by Anthropic, Block, and OpenAI. It now sits alongside Kubernetes and PyTorch in that portfolio.

The Honest Limitations

MCP is not perfect. Security is a real concern.

Security researchers identified multiple issues with the protocol, including prompt injection, tool permissions that allow data exfiltration, and lookalike tools that can silently replace trusted ones.

The spec does not enforce audit trails, sandboxing, or verification. MCP solves the connectivity problem. Organizations deploying it at scale are responsible for building the security layer on top.

Context Engineering: The Layer Tying It All Together

Context engineering is what makes everything above work reliably in production.

Prompt engineering is writing a good instruction. Context engineering is designing the entire information environment the model operates in:

Memory: what the model remembers from previous interactions
Retrieval: what documents or data are fetched for each query
Tools: what actions are available and how they are described
History: how much of the conversation is included
System state: what the model knows about its current task
Workflow position: where in a multi-step process the model is

The most capable AI systems today are not just better models. They are better systems designed around the model.

Context engineering is what separates an agent that works in production from one that works in a demo.

What the Modern AI Stack Looks Like

A serious AI product in 2026 is a system, not just an API call:

User Interface
    ↓
Orchestration Layer (LangGraph, AutoGen, custom)
    ↓
Context Manager
├── Memory Layer (conversation history, user preferences)
├── Retrieval Layer (vector DB, semantic search)
└── State Manager (task progress, tool outputs)
    ↓
Tool Layer (via MCP or custom integrations)
├── Web search
├── Database queries
├── API calls
├── Code execution
└── File operations
    ↓
LLM (GPT-4o, Claude, Gemini, open-source)
    ↓
Response + Actions

Each layer solves a specific failure from the layer before it. Remove any layer and you reintroduce the problem it was built to solve.

Which Layer Do You Actually Need?

Do not over-engineer this.

A simple RAG pipeline is the right call for most document Q&A use cases. A complex agent adds coordination overhead you do not need if the task is just retrieval.

Add a layer only when the simpler system actually cannot meet the requirement.

What Comes Next

A few trends worth watching:

Long-term memory: Agents that remember your preferences across months, not just sessions.

Multi-agent collaboration: Networks of specialized agents coordinating on shared goals, where each handles one domain.

Deeper real-world execution: Tighter integration with operating systems and software, not just APIs.

Autonomous workflows: Agents that manage their own task queues without step-by-step human orchestration.

The bottleneck has moved. In 2020, it was model intelligence. In 2026, it is system design: how well you manage memory, retrieval, tool coordination, and context across a complex workflow.

The Real Takeaway

The biggest mistake people make is thinking the model is the entire product.

It is not.

Modern AI systems are architectures: memory, retrieval, orchestration, tool ecosystems, context managers, and execution environments wrapped around a model.

The future will not be decided only by which model is best.

It will be decided by which system is built best around it.

If you found this useful, I write about AI engineering weekly in my newsletter AI Engineering Simplified. No hype, just practical breakdowns.

Top comments (1)

Stell • May 20

Clear and useful. Just keep in mind: the real evolution isn't only about better tools — it's about what lives between the tool calls. Nice write-up though.