lou
Understanding the Agentic AI Ecosystem: Prompts, Memory, RAG, MCP, and Tool-Using LLMs

Gen AI is a type of AI capable of generating new data (images, text, audio, etc.) based on its training data.

The main idea is: you use a large amount of data, for example images of children, and you train a model on it. Once it’s trained, you give it a prompt (you ask it a question). In that prompt you can ask it, for example: “generate for me a new image of a child, inspired by the training data”. And now these models are capable of understanding prompts and generating what we request.

When we speak about AI, we basically have two big families:

  • discriminative AI
  • generative AI

Discriminative AI

Discriminative AI is what we know traditionally. Like when you build a classification model: you will use a dataset that contains dogs and cats, you label the images, and you train the model so it can predict if the animal is a cat or a dog.

This is done with neural networks: you train them on data so they learn the mapping between input and output. Once training is done, when you give the model a new image, it predicts the label.

And we can also do regression (predicting a continuous value instead of a class).

Generative AI

But with Gen AI, instead of training a model only to classify, you train it on a large dataset (cats, dogs, everything), and then you can ask it prompts like: “generate a new dog next to a cat”. The image is new, it has never been in the dataset, but it follows the patterns the model learned during training.

So Gen AI is a type of AI capable of generating new data based on a large dataset provided during training.


A quick timeline (why everything exploded)

In 2014, generative AI had a big milestone with GANs (Generative Adversarial Networks), which became popular for image generation. And you probably know this famous website that generates faces of people who don’t exist: thispersondoesnotexist. (This kind of thing is based on GAN-style approaches.)

Then in 2017, Google researchers published Attention Is All You Need, which introduced the Transformer architecture and self-attention. This is one of the main reasons why language models became so powerful for NLP (Natural Language Processing). From there, everything accelerated.

In 2018, Google released BERT (Transformer-based).
Then OpenAI built the GPT line (Generative Pre-trained Transformer), and we started seeing large language models becoming mainstream.

Later, the big public explosion came with ChatGPT (and especially the versions that made people realize “wait… it actually understands what I mean”). After that we got a crazy wave: GPT-4, Gemini, LLaMA, Mistral, DeepSeek, and many others. Now we have models that can generate text, images, audio, and even video.

So what’s important for us is to understand these different model types we can use:

  • LLMs that take text in and give text out (classic chat models)
  • multimodal models that take text + image and can answer about images
  • models that generate images from text (like DALL·E, Stable Diffusion)
  • OCR (Optical Character Recognition) models that extract text from images

Why the AI jump happened (3 reasons)

1) Transformers

Transformers made it possible to understand language syntax + semantics much better than older approaches. That’s why models can answer complex questions and follow instructions.

2) Data

We have a massive amount of data available (web-scale). That’s what enabled training large models.

3) Compute (GPUs / TPUs / cloud)

Transformers are heavy: it’s a lot of matrix computation, and it takes time. So we needed acceleration. GPUs (like NVIDIA) are perfect for parallel computation: many cores, parallel operations, faster training.

That’s why companies training LLMs need GPUs (or TPUs like Google’s). And because of cloud computing, we can access high-performance computing (HPC) without owning a full data center.

And these are the main reasons behind the jump in AI.


Prompts and prompt engineering

You have LLMs and you want to build an app on top of them. You build the app, and your app needs to query the LLM.

So you send a prompt. The prompt is basically the question/instructions you send to the model. The model understands the prompt and generates a response, and the response is sent back to your app.

Using Gen AI in production means doing prompt engineering.

There are two things:

  • what is a prompt
  • what is prompt engineering

Prompts

A prompt is a group of instructions (text) that you send to an LLM to do a certain task. The way you ask the question changes the outcome. There are best practices.

Prompt engineering

Prompt engineering is about:

  • creating prompts
  • reviewing prompts (yes there are metrics and evaluation techniques)
  • deploying prompts (building apps that use these prompts to solve a specific domain problem)

Prompt structure: system message + user message (+ examples)

A prompt usually has 3 parts:

1) System message

This is the instruction that explains to the model the role + the task + constraints.

For example for sentiment analysis, the system message can assign a role:

“You are an analyst. Deduce if the sentiment is positive, neutral, or negative. Output JSON.”

This system message is usually fixed. It’s the part you design and improve.

2) Few-shot examples (optional)

Examples of input/output to guide the model.

For example:

  • “I’m really satisfied” → {"sentiment":"positive"}
  • “I am not a fan” → {"sentiment":"negative"}
  • “I will get back to you” → {"sentiment":"neutral"}

These examples make the model more consistent and more precise.

If you give:

  • 0 examples: zero-shot
  • 1 example: one-shot
  • more than 1: few-shot

In general, the more relevant examples you give, the more stable the output becomes (within the limit of the context window).

3) User message

This is the user input. The user writes a comment/question, and your app injects it into the prompt.

So it’s like:
“Here are the instructions (system message), here are examples (optional), now answer the user question.”
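Putting the three parts together, here's a minimal sketch of such a prompt as a chat-style message list (the sentiment task and JSON labels are just the running example from above; the user comment is made up):

```python
# Sketch of the 3-part prompt as a chat-style message list.
# The system message and few-shot examples are fixed by the app;
# only the last user message changes per request.
user_comment = "The delivery was late but the product is great."

messages = [
    {"role": "system", "content": (
        "You are an analyst. Deduce if the sentiment is positive, "
        "neutral, or negative. Output JSON."
    )},
    # few-shot examples (optional)
    {"role": "user", "content": "I'm really satisfied"},
    {"role": "assistant", "content": '{"sentiment": "positive"}'},
    {"role": "user", "content": "I am not a fan"},
    {"role": "assistant", "content": '{"sentiment": "negative"}'},
    # the actual user message, injected by the app
    {"role": "user", "content": user_comment},
]

print(len(messages))  # 6: 1 system + 4 example messages + 1 user message
```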


Params: temperature, max tokens, context window

When we send a prompt to an LLM, we also send parameters. The most common ones:

Temperature

Depends on the task.

  • If you want precision (classification, extraction, strict formatting), set temperature close to 0.

    • Same question multiple times gives nearly the same answer.
  • If you want creativity (story, brainstorming), temperature closer to 1.

    • Same question gives different variations.

Example:

  • reading blood test results → temperature 0
  • fantasy short story → temperature close to 1

Max tokens + tokenization

LLMs don’t see text like humans. They tokenize it.

OpenAI tokenizer:
https://platform.openai.com/tokenizer

The main idea: in NLP we create a vocabulary (dictionary) of tokens. Tokenization can be simple (split by spaces), but modern tokenizers are more advanced. Each LLM uses its own tokenizer. In Python, OpenAI has tiktoken.

So text becomes tokens, each token becomes an ID, and the model works on sequences of token IDs.
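To make the idea concrete, here's a deliberately naive tokenizer (split on spaces, build a vocabulary, map tokens to IDs). Real tokenizers like tiktoken use subword algorithms instead of whole words, but the text → tokens → IDs pipeline is the same:

```python
# Naive whitespace tokenizer: build a vocabulary, then map text to token IDs.
# Real LLM tokenizers (e.g. tiktoken) use subword units instead of whole words.
corpus = "the cat sat on the mat"
vocab = {token: i for i, token in enumerate(dict.fromkeys(corpus.split()))}

def encode(text):
    return [vocab[t] for t in text.split()]

def decode(ids):
    inverse = {i: t for t, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

print(vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(encode("the cat sat"))          # [0, 1, 2]
print(decode(encode("the cat sat")))  # the cat sat
```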

This matters because billing is often based on token count, and also because the model has a maximum limit: the context window (the maximum tokens it can accept in one prompt). That’s why in prompt engineering we also try to economize tokens.

Chain-of-thought prompting (and agents)

Another concept people talk about is helping the model “think step by step”. This idea is heavily related to agentic systems (we’ll get there).


Experiment locally with Ollama (open source)

To experiment, let’s install Ollama:

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

Once installed, you can call the local API at port 11434. More on tool calling here:
Ollama tool calling docs

We will use the qwen3 model.

Run:

```shell
ollama run qwen3
```

Then you can send requests to the local API from your own HTTP client.
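For example, here's a minimal sketch in Python (standard library only) against Ollama's /api/chat endpoint. The model name, prompt, and temperature value are illustrative, and it obviously needs `ollama run qwen3` running locally:

```python
import json
import urllib.request

# Sketch of a call to Ollama's local /api/chat endpoint (port 11434).
def build_payload(prompt, model="qwen3", temperature=0):
    return {
        "model": model,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
        "messages": [{"role": "user", "content": prompt}],
    }

def ask_ollama(prompt, **kwargs):
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Requires Ollama running locally:
# print(ask_ollama("Explain tokenization in one sentence."))
```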


Fine-tuning vs RAG

Sometimes few-shot prompting isn’t enough. In that case, people do fine-tuning: you take a base model, and you retrain it on your own dataset (domain-specific input/output pairs).

Example in a pharmaceutical context: you want the AI to know what certain meds are used for. You prepare data like this:

```json
{
  "messages": [
    {
      "role": "user",
      "content": "Patient takes Doliprane 1000 mg for fever and headache. What medication is this?"
    },
    {
      "role": "assistant",
      "content": "Doliprane contains paracetamol and is commonly used to treat fever and mild to moderate pain."
    }
  ]
}
```

That gives you a more specialized model.

Fine-tuning is expensive because you are updating model weights (parameters). There are techniques like LoRA where you don’t fully modify the base weights, you add lightweight adapters (additional matrices) so the base model stays intact and you still get specialization with less compute.
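A quick back-of-the-envelope for why LoRA is cheaper: instead of updating a full d×k weight matrix W, you freeze W and train two small matrices B (d×r) and A (r×k) with a small rank r, so the effective weights become W + BA. The dimensions below are illustrative:

```python
# Illustrative LoRA parameter count: freeze W (d×k), train B (d×r) and A (r×k).
d, k, r = 4096, 4096, 8

full_finetune_params = d * k   # every weight of the matrix is updated
lora_params = d * r + r * k    # only the two small adapter matrices are trained

print(full_finetune_params)                # 16777216
print(lora_params)                         # 65536
print(lora_params / full_finetune_params)  # 0.00390625 → ~0.4% of the weights
```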

RAG (Retrieval-Augmented Generation)

Instead of changing the model, you can keep the model as-is and provide it with relevant context at question time.

In an enterprise system, you have unstructured documents: PDFs, text files, sometimes images, audio, JSON, etc.

Because the LLM has a context window, you can’t just dump millions of pages. So we split documents into chunks (paragraphs, pages, sections). Then we transform these chunks into vectors using an embedding model.

Each chunk becomes:

  • a vector (numbers representing semantic meaning)
  • plus metadata (document name, page number, etc)

We store vectors in a vector store / database (many DBs support vector fields now).

Then when the user asks a question:

  1. the question is embedded into a vector (same embedding model)
  2. we do semantic search (similarity search) against the chunk vectors
  3. we retrieve the most relevant chunks = the context
  4. we send to the LLM: system message + context + user question
  5. the LLM answers using only that context

This is why similarity metrics matter. A classic one is cosine similarity: we compare vector directions. If the cosine is close to 1, vectors are similar (meaning is close). Databases like pgvector allow vector fields and similarity queries.

So RAG is: non-structured data → chunks → embeddings → vector DB → retrieve context → answer grounded in context.
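Here's a toy end-to-end sketch of the retrieval part. The embed() below is a fake bag-of-words "embedding" just to keep the example self-contained; a real pipeline would call an embedding model and store the vectors in a vector DB like pgvector:

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words counts. A real pipeline would call an
# embedding model and store vectors in a vector DB (e.g. pgvector).
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Compare vector directions: close to 1 means the meanings are close
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunks with their vectors and metadata, as stored in the vector store
chunks = [
    {"text": "Refunds are processed within 14 days.",
     "meta": {"doc": "policy.pdf", "page": 3}},
    {"text": "Our office is open Monday to Friday.",
     "meta": {"doc": "faq.txt", "page": 1}},
]
for c in chunks:
    c["vector"] = embed(c["text"])

def retrieve(question, k=1):
    qv = embed(question)  # same embedding model as the chunks
    return sorted(chunks, key=lambda c: cosine(qv, c["vector"]), reverse=True)[:k]

top = retrieve("How long do refunds take?")[0]
print(top["text"], top["meta"])  # the policy.pdf chunk wins the similarity search
```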

And to reduce hallucinations, in the system message for RAG we often say something like:
“Answer using the provided context only. If the answer is not in the context, respond with ‘I don’t know’.”

A lot of enterprise systems use RAG to create internal chatbots. And you can also do multimodal RAG.


Multimodal RAG (text + audio + image + video)

How does it work with different modalities?

  • For audio/video: usually you do speech-to-text transcription first, then chunk the text, embed it, and store the vectors.
  • For images: you can either:

    • extract text with OCR (if the image contains text), then embed,
    • or generate a textual description/caption for the image with a vision-capable model, then embed that description.

So the vector DB still stores vectors representing semantic meaning, and the metadata helps you locate the source (document, page, timestamp, file name, etc). When you retrieve context, you also retrieve metadata, and you can instruct the LLM to cite where it came from.

This is why multimodal RAG is powerful: you can ask a question and the answer might come from a PDF, an audio lecture, an image, or a video transcript.

So now with RAG, an enterprise system can build an internal chatbot that responds based on internal data.


Agentic AI ecosystem

To explain how it works, we have a user who asks a question to an agentic app.

Suppose you want to ask the agent to send an email to the whole student body: tell them that on a certain date we will have exams, and before that there will be mock exams. In the email, you ask them to prepare specific topics, and to bring their machines on the mock exam dates.

The user doesn’t write the long detailed instructions. In a real product, the app prepares that. The user writes something simple like:

“Send the email to the students for the mock exams.”

When you send that to the agent, the agent is autonomous: it has a goal to achieve. But to make it do the right thing, you still need to describe the task clearly, so you give it a prompt.

The agent prompt usually has:

  • system message
  • optional few-shot examples
  • user message

In our example we don’t need few-shot examples. The system message is the detailed instruction text (written by the app / prompt engineering), and the user message is the short line: “send the email to the students for mock exams.”

The agent has a goal: automate a workflow. That’s why prompt engineering matters: writing prompts, reviewing prompts, testing them, deploying them.

Memory: because LLMs are stateless

LLMs are stateless. When an agent asks the LLM a question, the LLM is basically a function doing matrix computation: text becomes tokens, tokens become vectors internally, and it outputs a response.

If you ask the LLM:

  • Q1: “my name is lou” → it replies “hello lou”
  • Q2 (later, without giving it Q1 again): “what’s my name?” → it will say “I don’t know”, because it does not remember.

So if you want it to remember, you have to include the conversation history (or summary) in the prompt each time.

That’s why ChatGPT is not just “the LLM”. ChatGPT is an application (agentic app). It stores the conversation in a database, and every time you ask a question, it loads the memory and injects it into the prompt, then sends it to the LLM.

So memory = persistence. Session history stored in a DB. And you can store it in different ways:

  • raw conversation (full history)
  • summarized memory (semantic memory): store short summaries instead of full logs
  • structured memory tables, keeping only the most recent turns, etc., depending on your design
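Here's a tiny sketch of that "load memory, inject it, call the model" loop. The fake_llm below is a stub standing in for a real chat model; the point is that it only "knows" what is inside the messages of the current call:

```python
# fake_llm is a stub standing in for a real chat model: it only "knows"
# what is inside the messages passed to this single call.
def fake_llm(messages):
    text = " ".join(m["content"] for m in messages).lower()
    if "what's my name" in text:
        for m in messages:
            if "my name is" in m["content"].lower():
                return "Your name is " + m["content"].split()[-1] + "."
        return "I don't know."
    return "Hello!"

history = []  # in a real app this is persisted per session, e.g. in a DB

def ask(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_llm(history)  # the full history is injected on every call
    history.append({"role": "assistant", "content": reply})
    return reply

ask("my name is lou")
print(ask("what's my name?"))  # → Your name is lou.
```

Without the history injection (calling fake_llm with only the second question), the model would have no way to answer: that is what "stateless" means.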

So agents need memory, but they also need tools.

Tools: how the agent interacts with the world

What is a tool? It’s a function your app exposes to the agent to consult data or do actions.

Examples:

  • a tool to search PDFs using RAG (vector DB)
  • a tool to read Google Sheets
  • a tool to do web search
  • a tool to connect to IoT sensors
  • a tool to publish on social media
  • a tool to use Google Maps
  • a tool to send emails (Outlook / Gmail)
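Concretely, a tool is usually exposed to the model as a JSON schema (name, description, parameters); this is the shape used by OpenAI-style and Ollama tool calling. A sketch for the email tool (the field descriptions are illustrative, and the actual send_email function is whatever your app implements):

```python
# A tool as the model sees it: a JSON schema describing its name,
# purpose, and parameters. The actual send_email function lives in your app.
send_email_tool = {
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "Send an email to a list of recipients.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "array", "items": {"type": "string"},
                       "description": "Recipient email addresses"},
                "subject": {"type": "string"},
                "body": {"type": "string", "description": "HTML body"},
            },
            "required": ["to", "subject", "body"],
        },
    },
}
```

The model never runs the function itself: it emits a call with arguments matching this schema, and your app executes it.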

MCP: Model Context Protocol

Now the question is: how do we make it easy for an agent to use tools?

There is a protocol called MCP (Model Context Protocol). With MCP, you can create a server that exposes tools (functions) and gives the agent a standard way to discover and call them.

So when we create an agent, we give it the MCP server addresses, and the agent can use anything that is exposed there, even if tools are implemented with different technologies.

It’s like a new web service style for AI tools.

We spoke before about web services like SOAP, REST, GraphQL, and gRPC in earlier posts.

Now MCP helps create tool servers that are easily exploitable by agentic AI.
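Under the hood, MCP messages are JSON-RPC 2.0: the client discovers what a server exposes with tools/list, then invokes a tool with tools/call. Roughly like this (the tool name and arguments are made-up, borrowed from our running example):

```python
import json

# MCP is JSON-RPC 2.0: the client discovers tools, then calls them.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_student_emails",        # a tool this server exposes
        "arguments": {"sheet": "students"},  # illustrative arguments
    },
}

print(json.dumps(call_request, indent=2))
```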

The flow in our example (mock exam email)

So we send the user request to the agent.

The agent asks the LLM: “respond to this request”. The LLM sees: “I need to send an email to all students”, so it knows it needs:

  1. the student list
  2. the exam planning details
  3. then draft the email
  4. then send it

So the LLM tells the agent to call a tool to get the student list. The agent calls the tool (Google Sheets). It retrieves all student emails, and sends that back to the LLM as observations.

Then the LLM says: “I need the exam planning.” The agent calls the planning tool (another Google Sheet), retrieves the plan, and sends it back to the LLM.

Now the LLM generates the email content (subject + HTML body), and asks the agent to send it using the email tool (Outlook/Gmail). The agent sends it and confirms success.

This kind of app is what we call “agentic”, and a common pattern used here is ReAct:

  • Reasoning (LLM)
  • Action (tools)
  • Observations (environment/tool outputs)

(figure: the ReAct loop)

What’s the principle?
At first you ask the agent. It reasons (using an LLM). The LLM decides the steps and the tools needed. The agent calls tools, retrieves data, sends observations back to the LLM, and the loop continues until the agent can complete the final action.
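Here's a stripped-down version of that loop in code. scripted_llm is a stub playing the reasoning role (a real agent would call an actual LLM), and the tools are fake versions of the Google Sheets and email tools from the example:

```python
# Minimal ReAct-style loop: reason (LLM) → act (tool) → observe, repeat.
# scripted_llm is a stub standing in for a real model's decisions.
def scripted_llm(goal, observations):
    if "students" not in observations:
        return {"action": "get_students", "args": {}}
    if "planning" not in observations:
        return {"action": "get_planning", "args": {}}
    return {"action": "send_email", "args": {
        "to": observations["students"],
        "body": f"Mock exams: {observations['planning']}",
    }, "final": True}

tools = {
    "get_students": lambda **kw: ["a@uni.edu", "b@uni.edu"],       # fake Sheets tool
    "get_planning": lambda **kw: "mock exams on May 10",           # fake Sheets tool
    "send_email": lambda to, body: f"sent to {len(to)} students",  # fake email tool
}

def run_agent(goal):
    observations = {}
    while True:
        step = scripted_llm(goal, observations)         # reasoning
        result = tools[step["action"]](**step["args"])  # action
        if step.get("final"):
            return result
        key = step["action"].removeprefix("get_")
        observations[key] = result                      # observation

print(run_agent("send the email to the students for mock exams"))
# → sent to 2 students
```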

Multi-agent (agent-to-agent)

Sometimes one agent is not enough. So you can have specialized agents (each one has tools or skills for a specific domain). If an agent doesn’t have the right tools, it can communicate with another agent using an agent-to-agent protocol (often HTTP style).

So:

  • MCP helps access tools
  • agent-to-agent helps agents communicate between each other

And if you want voice interaction, you can also have real-time communication (RTC): user speaks in any language, a real-time model transcribes, sends the text to the agent, and then the agent responds (text-to-speech back).

That’s the full ecosystem: prompts + memory + tools + MCP + agents + ReAct loops, and optionally multi-agent collaboration.
