LLM, RAG, AGENT And MCP

🧠 Modern AI Systems – A Practical, End-to-End Mental Model

Goal: Understand how LLMs, RAG, AI Agents, and MCP fit together to build real production AI systems, not demos.

Modern AI is not one model.
It is a system of responsibilities.

LLM   → understands & reasons with language
RAG   → retrieves correct knowledge
Agent → decides & performs actions
MCP   → standardizes context & tools

Each layer exists because the previous one cannot solve real-world problems alone.


1️⃣ LLM (Large Language Model) – The Brain

What an LLM actually is (no marketing)

An LLM is a probabilistic model trained to predict:

“What token comes next?”

That’s it.

It does not:

  • think like a human
  • reason independently
  • “know” facts by default

It predicts patterns extremely well.


Mental model (important)

🧠 A very powerful autocomplete engine

Example:

"The capital of France is ___"

→ “Paris”

Not because it understands geography,
but because that sequence appears frequently in training data.


What LLMs are good at

✅ Natural language understanding
✅ Writing & summarization
✅ Translation
✅ Code generation
✅ Reasoning within provided context


Critical weaknesses

❌ Hallucination (confidently wrong answers)
❌ No access to private/company data
❌ No long-term memory
❌ Knowledge cutoff
❌ Cannot take actions (no APIs, no workflows)

👉 LLMs alone are not usable in production systems.

This leads to the next layer.


2️⃣ RAG (Retrieval-Augmented Generation) – Giving the Brain Memory

Problem RAG solves

“How can the LLM answer questions using our private, up-to-date data without retraining it?”


Core idea (simple)

Instead of:

User → LLM → Answer (may hallucinate)

We do:

User → Retrieve relevant data → LLM → Answer grounded in data

LLM = language & reasoning
RAG = knowledge retrieval


How RAG works (step by step)

1. Prepare data

  • Split documents into chunks
  • Convert text → vectors (embeddings)
  • Store in a vector database

2. User asks a question

  • Question → vector

3. Similarity search

  • Retrieve the most relevant chunks

4. Prompt the LLM

SYSTEM:
Answer only using the provided context.

CONTEXT:
[retrieved documents]

USER:
What is the refund policy?

5. LLM answers

  • Grounded
  • Auditable
  • Much lower hallucination risk
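
To make steps 2–5 concrete, here is a minimal TypeScript sketch of the query path. The embed, vectorStore, and callLlm helpers are hypothetical placeholders for your embedding model, vector database client, and LLM provider:

// Hypothetical helpers standing in for real SDK calls
declare function embed(text: string): Promise<number[]>;
declare const vectorStore: { search(v: number[], opts: { topK: number }): Promise<{ text: string }[]> };
declare function callLlm(messages: { role: "system" | "user"; content: string }[]): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 2. Question → vector
  const queryVector = await embed(question);

  // 3. Similarity search: fetch the most relevant chunks
  const chunks = await vectorStore.search(queryVector, { topK: 5 });

  // 4. Prompt the LLM with the retrieved context only
  const context = chunks.map((c) => c.text).join("\n---\n");
  return callLlm([
    { role: "system", content: "Answer only using the provided context. If the answer is not there, say you don't know." },
    { role: "user", content: `CONTEXT:\n${context}\n\nQUESTION:\n${question}` },
  ]);
}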

Why RAG is mandatory in real systems

Without RAG:

  • AI makes things up
  • Legal and compliance risk
  • Users lose trust

With RAG:

  • Accurate answers
  • Data can be updated anytime
  • No model retraining

RAG tooling ecosystem

Embedding models

  • OpenAI embeddings
  • Cohere
  • SentenceTransformers
  • BGE / E5 / Instructor

Vector databases

  • Pinecone
  • Weaviate
  • Qdrant
  • Milvus
  • FAISS
  • MongoDB Atlas Vector Search
  • Elasticsearch (vector)

Frameworks

  • LangChain
  • LlamaIndex
  • Haystack

RAG limitation

RAG can:
✔ answer questions

RAG cannot:
❌ decide what to do
❌ call APIs
❌ run workflows

That requires the next layer.


3️⃣ AI Agent – Giving the Brain Hands & Goals

Problem agents solve

“I don’t just want answers – I want the AI to do things.”

Example:

“Check my order, see if it’s delayed, open a ticket, notify me.”

This is multi-step work.


What an AI Agent is

An AI Agent =

LLM
+ tools
+ memory
+ decision loop

Core agent loop (critical concept)

1. Observe (input & state)
2. Reason (LLM)
3. Choose action
4. Execute tool
5. Observe result
6. Repeat until goal achieved

This is often called a ReAct loop (Reason + Act).
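
A minimal sketch of that loop in TypeScript. The llm.decide call and the tools registry are hypothetical stand-ins for your model call and tool implementations; the useful part is the shape of the loop and the hard step limit:

// Hypothetical stand-ins for the LLM call and the available tools
declare const llm: {
  decide(history: string[]): Promise<
    | { type: "final_answer"; text: string }
    | { type: "tool_call"; tool: string; args: Record<string, unknown> }
  >;
};
declare const tools: Record<string, (args: Record<string, unknown>) => Promise<string>>;

async function runAgent(goal: string, maxSteps = 8): Promise<string> {
  const history = [`GOAL: ${goal}`];

  for (let step = 0; step < maxSteps; step++) {
    // 2. Reason: ask the LLM what to do next, given everything observed so far
    const decision = await llm.decide(history);

    // 3–4. Choose the action and execute the tool
    if (decision.type === "final_answer") return decision.text;
    const observation = await tools[decision.tool](decision.args);

    // 5. Observe the result and loop again
    history.push(`ACTION: ${decision.tool}(${JSON.stringify(decision.args)})`, `OBSERVATION: ${observation}`);
  }
  return "Stopped: max steps reached before the goal was completed.";
}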


Example agent tools

Tool               Purpose
search_docs        RAG search
get_order_status   Backend API
create_ticket      CRM / Support
send_email         Notification
write_db           Memory / Logging

When to use (and not use) agents

Use agents when:

  • Multi-step reasoning is required
  • Tools must be orchestrated
  • Decisions depend on outcomes

Do NOT use agents for:

  • Static FAQs
  • Simple Q&A
  • Single-step tasks

Agents are:

  • Slower
  • More expensive
  • Harder to debug

Agent tooling ecosystem

Frameworks

  • LangChain Agents
  • OpenAI Assistants
  • AutoGen
  • CrewAI
  • Semantic Kernel

Execution

  • REST / gRPC
  • Function calling
  • Webhooks

Memory

  • Redis
  • PostgreSQL
  • Vector databases
  • In-memory stores

4️⃣ MCP (Model Context Protocol) – The Nervous System

This is architecture-level, not prompt engineering.


The scaling problem MCP solves

As systems grow:

  • Prompts duplicated everywhere
  • Tools defined inconsistently
  • Context assembled differently per service
  • Agents break when tools change
  • Models tightly coupled to apps

This becomes prompt spaghetti 🍝


What MCP is (plain English)

MCP is a protocol that standardizes how models:

  • discover tools
  • receive context
  • access resources

Think of it as:

📡 A REST API for LLM context and capabilities


Mental model

LLM / Agent
   ↓
MCP Server
   ↓
Tools | Data | Memory | Capabilities

The model does not guess what it can do.
It discovers capabilities via MCP.
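
As a rough illustration, each capability an MCP server exposes is a named tool described by a JSON Schema, which clients can list at runtime. The exact wire format is defined by the MCP spec; the shape below is only indicative:

// Indicative shape of a tool definition an MCP server could advertise
const getOrderStatusTool = {
  name: "get_order_status",
  description: "Look up the current status of an order by its ID",
  inputSchema: {
    type: "object",
    properties: { orderId: { type: "string" } },
    required: ["orderId"],
  },
};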


Why MCP matters

With MCP:

  • Clean architecture
  • Tool discoverability
  • Model-agnostic systems
  • Easier testing & maintenance

Without MCP:

  • Hidden coupling
  • Fragile agents
  • Hard-to-replace models

MCP ecosystem

  • OpenAI MCP
  • Anthropic MCP
  • Custom MCP servers
  • Integrations: databases, filesystems, APIs, GitHub

5️⃣ Real-World Project – End-to-End System

Project: AI Customer Support Assistant (E-commerce)


Requirements

  • Answer policy questions
  • Check order status
  • Handle refunds
  • Escalate to humans when needed

Architecture

Chat UI
  ↓
Backend API (e.g. NestJS)
  ↓
MCP Server
  ↓
Agent
  ↓
RAG + Business Tools

Component responsibilities

LLM

  • Language understanding & reasoning

RAG

  • Product docs
  • Refund & shipping policies
  • FAQs in vector DB

Agent

  • Decide when to search
  • Call order APIs
  • Create tickets
  • Escalate issues

MCP

  • Defines tools:

    • search_knowledge_base
    • get_order_status
    • create_support_ticket
  • Provides clean, consistent context


Example user flow

User:

“My order hasn’t arrived. What should I do?”

Agent:

  1. Retrieve shipping policy (RAG)
  2. Call order status API
  3. Evaluate delay
  4. Decide next action
  5. Respond or open ticket

No hallucination
No hardcoded prompts
Fully scalable


🧠 Final Mental Model (memorize this)

LLM   → understands language
RAG   → retrieves truth
Agent → performs actions
MCP   → organizes everything

🎯 Target Project (One Project, Many Levels)

AI Knowledge & Action Assistant for a Company

  • Answers questions from internal docs
  • Can take actions (create tickets, generate reports)
  • Safe, auditable, scalable

Stack:

  • Frontend: Next.js
  • Backend: NestJS
  • AI: LLM + RAG + Agents + MCP
  • Infra: Docker, Env-based config

PHASE 0 – Mental Model (Day 0)

Before writing code, understand this flow:

UI → API → AI Core → Tools → Result → UI

Everything you build later fits somewhere here.

If you don’t know where a piece belongs, don’t code it.


PHASE 1 – LLM Basics (Beginner)

⏱ Time: 1–2 days
🎯 Goal: “I can talk to an LLM safely via backend”


1.1 What you build

A simple chat API:

POST /chat
{
  "message": "Explain SOLID principles"
}

Response:

LLM text

1.2 Architecture (minimal but correct)

Next.js
  ↓
NestJS Controller
  ↓
AI Service
  ↓
LLM Provider (OpenAI / Gemini / Claude)
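
A minimal sketch of that stack in NestJS, assuming the official openai Node SDK (the model name and system prompt are placeholders; a Gemini or Claude SDK slots into AiService the same way):

// Minimal Phase 1 surface area (controller + service shown together for brevity)
import { Body, Controller, Injectable, Post } from "@nestjs/common";
import OpenAI from "openai";

@Injectable()
export class AiService {
  // The API key lives only on the backend, read from the environment
  private client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

  async chat(message: string): Promise<string> {
    const res = await this.client.chat.completions.create({
      model: "gpt-4o-mini", // placeholder model name
      messages: [
        { role: "system", content: "You are a concise technical assistant." },
        { role: "user", content: message },
      ],
      max_tokens: 500, // basic token limit
    });
    return res.choices[0]?.message?.content ?? "";
  }
}

@Controller("chat")
export class ChatController {
  constructor(private readonly ai: AiService) {}

  @Post()
  async chat(@Body("message") message: string) {
    return { reply: await this.ai.chat(message) };
  }
}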

1.3 Key lessons here (VERY important)

✅ Backend owns AI calls

Never call LLM directly from Next.js.

Why:

  • API key security
  • Rate limiting
  • Observability
  • Cost control

✅ Prompt ≠ Message

Start separating:

  • system prompt
  • user prompt

This prepares you for agents later.
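
For example, keep the system prompt as a named constant and build the message list in one place, instead of concatenating strings ad hoc (illustrative sketch):

// One place that decides what the model is told, separate from what the user typed
const SYSTEM_PROMPT = "You are a support assistant. Be concise and never invent data.";

function buildMessages(userMessage: string) {
  return [
    { role: "system" as const, content: SYSTEM_PROMPT },
    { role: "user" as const, content: userMessage },
  ];
}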


1.4 Common beginner mistakes

❌ Hardcoding API keys
❌ No timeout handling
❌ No token limits
❌ Trusting LLM output blindly


Exit criteria

✔ You can explain what an LLM can & cannot do
✔ You understand tokens & costs
✔ You never expose LLM keys to frontend


PHASE 2 – RAG (Intermediate Foundation)

⏱ Time: 3–5 days
🎯 Goal: “My AI answers using MY data”


2.1 What you add

  • Document ingestion
  • Embeddings
  • Vector search

2.2 Architecture upgrade

User Question
  ↓
Vector Search
  ↓
Relevant Chunks
  ↓
LLM (context injected)
  ↓
Answer + Sources

2.3 What you actually build

Backend

  • /documents/upload
  • /documents/index
  • /ask

Storage

  • Raw files (S3 / local)
  • Vector DB

2.4 Chunking (don’t skip this)

Bad chunking = bad AI.

Rules:

  • 300–800 tokens per chunk
  • Overlap ~10–20%
  • Keep semantic meaning intact
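
A naive version of that chunker, as a sketch. It counts characters as a rough proxy for tokens; a real pipeline would use a tokenizer and prefer splitting on headings or paragraphs so meaning stays intact:

// Naive overlapping chunker (character counts approximate token counts)
function chunkText(text: string, chunkSize = 2000, overlap = 300): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}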

2.5 Prompt discipline (critical)

Your prompt should say:

“Answer ONLY using provided context.
If missing, say you don’t know.”

This single rule prevents 80% of hallucinations.


Common RAG failures

❌ Stuffing too much context
❌ No metadata filtering
❌ No source citation
❌ Treating vector DB as magic


Exit criteria

✔ AI answers correctly from internal docs
✔ Hallucination rate is low
✔ You can swap vector DB without rewriting logic


PHASE 3 – Structured AI Core (Pre-Agent)

⏱ Time: 2–3 days
🎯 Goal: “AI logic is modular and testable”


3.1 Why this phase exists

If you jump straight to agents:

💥 You will create an un-debuggable mess

So first: structure the AI core.


3.2 Introduce these concepts

  • Prompt templates
  • Output schemas (JSON)
  • AI “use cases”

Example:

AnswerQuestionUseCase
SummarizeDocUseCase
ExtractTasksUseCase

Each one:

  • Has input
  • Has prompt
  • Has expected output
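
A sketch of what one such use case can look like, assuming the zod library for the output schema and a hypothetical callLlm wrapper around your provider:

import { z } from "zod";

declare function callLlm(prompt: string): Promise<string>; // hypothetical provider wrapper

// Output schema: the only shape the rest of the system accepts
const TaskList = z.object({
  tasks: z.array(z.object({ title: z.string(), dueDate: z.string().optional() })),
});

export async function extractTasksUseCase(docText: string) {
  // Input + prompt template
  const prompt = `Extract action items from the document below as JSON matching {"tasks":[{"title":"...","dueDate":"..."}]}.\n\n${docText}`;

  // Expected output: parse + validate, so bad responses fail loudly (and can be retried)
  const raw = await callLlm(prompt);
  return TaskList.parse(JSON.parse(raw));
}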

3.3 This unlocks later

  • Tool calling
  • Agents
  • Validation
  • Retries

Exit criteria

✔ AI responses are structured
✔ You can validate outputs
✔ You can test AI logic without UI


PHASE 4 – AI Agents (Action Layer)

⏱ Time: 4–7 days
🎯 Goal: “AI can plan and act, not just talk”


4.1 What changes conceptually

From:

Request → LLM → Response

To:

Goal → Think → Act → Observe → Repeat

4.2 What you build

Agent with:

  • Goal
  • Memory
  • Tool registry
  • Stop conditions

4.3 Example Agent

Goal:

“Create a weekly report and open tasks”

Tools:

  • search_docs
  • create_jira_ticket
  • generate_markdown

4.4 Critical safety rules

  • Max steps
  • Max tokens
  • Tool allowlist
  • Read-only vs write tools

This is non-negotiable in production.
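
One way to make those rules explicit is a small limits object that the agent loop consults before every tool call (names below are placeholders, not a framework API):

// Safety limits the agent loop checks on every step
interface AgentLimits {
  maxSteps: number;
  maxTotalTokens: number;
  allowedTools: string[]; // tool allowlist
  writeTools: string[];   // tools with side effects that need approval
}

const limits: AgentLimits = {
  maxSteps: 10,
  maxTotalTokens: 20_000,
  allowedTools: ["search_docs", "generate_markdown", "create_jira_ticket"],
  writeTools: ["create_jira_ticket"],
};

function assertToolAllowed(tool: string, humanApproved: boolean): void {
  if (!limits.allowedTools.includes(tool)) throw new Error(`Tool not in allowlist: ${tool}`);
  if (limits.writeTools.includes(tool) && !humanApproved) throw new Error(`Approval required for: ${tool}`);
}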


Common agent failures

❌ Infinite loops
❌ Too much autonomy
❌ No human approval
❌ No logs


Exit criteria

✔ Agent completes tasks reliably
✔ You can stop it at any time
✔ Every action is logged


PHASE 5 – MCP (Production Tooling Layer)

⏱ Time: 3–5 days
🎯 Goal: “Safe, scalable tool integration”


5.1 What MCP gives you

  • Tool discovery
  • Strong schemas
  • Permission control
  • Replaceable tools

5.2 Architecture

Agent
 ↓
MCP Client
 ↓
MCP Server
 ↓
Tool Implementations
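
Conceptually, the client side of that chain does two things: discover tools, then call them by name with schema-checked arguments. The interface below is illustrative only; the actual MCP SDK API differs in details:

// Conceptual MCP client (illustrative interface, not the real SDK)
interface McpClient {
  listTools(): Promise<{ name: string; description: string; inputSchema: object }[]>;
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

async function runWithMcp(client: McpClient) {
  const tools = await client.listTools(); // tool discovery, no hardcoding
  console.log(tools.map((t) => t.name));  // e.g. ["search_docs", "create_jira_ticket"]
  return client.callTool("search_docs", { query: "refund policy" });
}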

5.3 Why MCP matters in production

Without MCP:

  • Hardcoded tools
  • Unsafe execution
  • Tight coupling

With MCP:

  • Clean contracts
  • Auditing
  • Enterprise-ready

Exit criteria

✔ Tools are schema-defined
✔ Permissions are enforced
✔ Agents can’t “invent” tools


PHASE 6 – Production Hardening

⏱ Time: ongoing
🎯 Goal: “This won’t wake me up at 3AM”


6.1 Mandatory production features

🔐 Security

  • API auth
  • Tool permissions
  • Input sanitization

📊 Observability

  • Prompt logs
  • Token usage
  • Agent step traces

💰 Cost control

  • Token budgets
  • Rate limits
  • Model tiers

6.2 Human-in-the-loop

For risky actions:

  • Show plan
  • Ask approval
  • Then execute
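
A sketch of that gate: the agent proposes an action, a human approves or rejects it, and only then does anything execute (the tool registry and approval callback are hypothetical):

declare const tools: Record<string, (args: Record<string, unknown>) => Promise<unknown>>;

interface ProposedAction {
  tool: string;
  args: Record<string, unknown>;
  reason: string; // the "plan" shown to the human
}

async function executeWithApproval(
  action: ProposedAction,
  requestApproval: (a: ProposedAction) => Promise<boolean>,
) {
  // 1–2. Show the plan and wait for a decision
  if (!(await requestApproval(action))) return { status: "rejected" as const };

  // 3. Only then execute
  const result = await tools[action.tool](action.args);
  return { status: "executed" as const, result };
}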

PHASE 7 – Scaling & Multi-Agent

⏱ Time: advanced
🎯 Goal: “AI team, not AI bot”

Examples:

  • Planner agent
  • Executor agent
  • Reviewer agent

Each has one responsibility.


FINAL MENTAL MODEL (Memorize This)

Phase 1: LLM → Brain
Phase 2: RAG → Knowledge
Phase 3: Structure → Discipline
Phase 4: Agent → Action
Phase 5: MCP → Safety
Phase 6: Production → Survival
