Agentic AI: a tech lead's glossary
Study notes from courses on agentic AI (Pluralsight, among others) and other references, organized as a glossary I wish I'd had on day one.
Every dev I know is using AI tools, and most of us are fuzzy on the words behind them. Where does a transformer fit in? What does MCP actually solve? Is "agentic AI" a real thing or just rebranded chatbots?
This is my map of the territory: machine learning at the bottom, agents and MCP at the top, and the concepts in between — tokens, memory, tools, RAG, vector databases. Built for lookup, not for a single read-through. If a term is fuzzy in your head, jump to it.
Foundations
Machine Learning
Normal coding: you write the rules. "If the email subject says 'free money,' mark it as spam."
Machine learning flips that. You don't write the rules — the program finds them by trial and error. You give it thousands of emails labeled "spam" or "not spam." For each one, the program guesses the label, compares its guess to the correct answer, and nudges a bunch of internal numbers (its weights) to be a little less wrong. Do that millions of times across thousands of examples, and those weights settle into something useful.
What does a weight actually look like? Picture the spam detector keeping a running score for each email. See the word "viagra"? Add 0.8 to the score. Unknown sender? Add 0.5. Lots of exclamation marks? Add 0.2. If the total clears a threshold, the email gets tagged as spam. Those numbers — 0.8, 0.5, 0.2 — are weights. Training is the process of finding what each one should be.
What comes out of training isn't code. It's the model — a file holding all those weights. A simple spam filter might have thousands. An LLM has billions. Hand the model a new email, and it produces a verdict: spam or not.
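A toy sketch of what a file of weights means in practice. The numbers are the ones from the paragraph above; a trained filter would have learned thousands of them:

```python
# Toy spam scorer: the "model" is nothing but these numbers.
# Values copied from the example above; training would have found them.
weights = {"viagra": 0.8, "unknown_sender": 0.5, "exclamation_marks": 0.2}
threshold = 1.0

def is_spam(features: dict[str, bool]) -> bool:
    # Add up the weight of every feature present in this email.
    score = sum(weights[name] for name, present in features.items() if present)
    return score > threshold

print(is_spam({"viagra": True, "unknown_sender": True, "exclamation_marks": False}))  # True: 1.3 > 1.0
print(is_spam({"viagra": False, "unknown_sender": True, "exclamation_marks": True}))  # False: 0.7
```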
Want to feel it for yourself? Try Google's Teachable Machine (teachablemachine.withgoogle.com). Train an image classifier in your browser with your webcam — show it a few photos of "happy face" vs "sad face" and watch it learn to tell them apart.
Where LLMs fit
A few useful distinctions before going further:
- AI is the broad field — anything where a machine does something we'd call intelligent.
- Machine learning is a subset of AI: systems that learn patterns from data instead of being explicitly programmed.
- Deep learning is a subset of ML that uses multi-layer neural networks.
- LLMs are a specific application of deep learning, built on the transformer architecture (introduced in 2017).
So: LLM ⊂ deep learning ⊂ ML ⊂ AI. Claude, ChatGPT, Gemini, Llama — all LLMs.
These terms get used interchangeably, but they're not the same thing. A spam filter and ChatGPT are both AI, both ML, but only one is an LLM.
Neural networks and deep learning
A neural network is one specific kind of ML algorithm. Loosely inspired by the brain: layers of nodes connected to each other, every connection carrying one of those weights from before. Input enters on one side, flows through the layers, a prediction comes out. The training loop is the same as before — guess, check, nudge the weights — there are just far more weights to nudge.
"Deep learning" is just a neural network with many layers — the deep refers to the layer count.
Why depth matters: each layer learns a more abstract pattern than the one below it. For image recognition, the first layer might detect edges; the next, shapes; the next, eyes and noses; the next, full faces. For text: letters, then words, then phrases, then meaning. Stack enough layers and the model captures complex patterns.
The deep learning boom of the 2010s happened because GPUs made training big networks practical, and there was enough digital data to feed them.
LLMs are deep neural networks. Specifically, ones built on an architecture called the transformer.
Transformers — and what makes an LLM different
Earlier language models processed text word by word, in order — slow to train, and they tended to lose track of context across long sentences. The transformer, introduced in a 2017 Google paper called Attention Is All You Need, fixed both problems with two changes:
- It looks at the whole input at once. No more reading sequentially. The model sees the full sequence and processes it in parallel — which also made training scale across GPU clusters.
- Attention pairs every word with the others that matter. Think of it like reading with a highlighter, where the highlighting is automatic — learned from training data. For each word in the input, attention scores the others by relevance, weights them, and blends the most important ones into how that word is processed.
Concrete example: in "The cat sat on the mat. It was tired," when the model reaches "it," attention scores "cat" high and "mat" low. That's how the model knows "it" refers back to "cat."
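A toy version of that scoring, with made-up 4-dimensional vectors standing in for the learned representation of each word (real models use hundreds of dimensions and learn these values during training):

```python
import numpy as np

# Made-up 4-dimensional vectors standing in for the learned representations of
# earlier words; real models use hundreds of dimensions, learned during training.
vectors = {
    "cat": np.array([0.9, 0.1, 0.8, 0.2]),
    "mat": np.array([0.1, 0.9, 0.1, 0.7]),
}
it = np.array([0.85, 0.15, 0.75, 0.2])   # the vector for "it"

# Score each earlier word against "it" (dot product), then softmax into weights.
scores = np.array([v @ it for v in vectors.values()])
weights = np.exp(scores) / np.exp(scores).sum()

for word, w in zip(vectors, weights):
    print(f"{word}: {w:.2f}")   # cat: 0.73, mat: 0.27, so "it" attends to "cat"
```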
Every well-known LLM since runs on this architecture: GPT, Claude, Gemini, Llama. Same core design, just bigger training data and more weights.
So what's actually different about an LLM compared to the spam filter from earlier?
- Generality. Old ML: one model per task. An LLM is a single model that handles summarization, code, translation, and reasoning.
- Generation vs. classification. Old ML predicts a label or a number ("spam" / "not spam"). LLMs produce text, which lets them do almost anything we can describe in words.
- Scale. Old ML: thousands to millions of weights, single GPU. LLMs: billions of weights, GPU clusters, weeks of training, millions of dollars per training run.
Tokens
A token is a chunk of text the model reads — sometimes a full word, sometimes a fragment. "Hello, world" might become three tokens: "Hello", ",", "world". The word "strawberry" might split into two: "straw", "berry". Each token is then mapped to a number, and those numbers are what the model actually processes.
Every model has its own tokenizer (the piece that does the chopping), so the same sentence in GPT, Claude, and Gemini can produce different token counts.
The practical consequence: context limits and pricing are measured in tokens, not words. A model's context window is how much text it can work with at once — its working memory. Rule of thumb: 1 English word ≈ 1.3 tokens.
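To watch the chopping happen, OpenAI's tiktoken library exposes the tokenizer its models use; other providers' tokenizers will split the same text differently:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4-era models

for text in ["Hello, world", "strawberry", "The context window is measured in tokens."]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens {pieces}")
```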
Every API charges two things separately: input tokens (everything the model reads — system prompt + conversation history + retrieved memory + tool definitions + the new user message) and output tokens (what the model generates, typically priced higher per token than input).
Every turn, the entire conversation history gets sent as input. By turn 10, the input includes 9 prior turns plus the new message. Token usage grows with chat length.
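A rough sketch of why that happens. The messages list is rebuilt and re-sent in full on every call (call_llm is a stand-in for whichever provider SDK you use):

```python
history = []  # the running conversation, re-sent in full every turn

def chat_turn(user_message: str, call_llm) -> str:
    history.append({"role": "user", "content": user_message})
    # The input for this call is the entire history, not just the new message,
    # so input tokens grow with every turn of the chat.
    reply = call_llm(messages=history)
    history.append({"role": "assistant", "content": reply})
    return reply
```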
Agents
Agentic AI
A regular LLM chat is one prompt in, one response out. Agentic AI turns that into a loop: the LLM picks an action, the system runs it, the result feeds back, the LLM picks again. Actions come from tools — APIs, code execution, file edits, database queries.
There's a middle ground between chat and a full agent:
- Workflow — a fixed pipeline of LLM calls. The path is hard-coded; the LLM just fills in the steps. Example: classify ticket → summarize → draft reply.
- Agent — no fixed path. The LLM is given a goal and a set of tools, picks the next action, observes the result, and picks again. The loop continues until the goal is reached or a stop condition fires (step limit, token limit, failure).
The key difference: who decided the next step — a human writing the pipeline, or the LLM at runtime?
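In code, a workflow is just straight-line calls; the path is visible in the source. This sketch assumes a call_llm helper standing in for your provider's SDK:

```python
def handle_ticket(ticket_text: str, call_llm) -> str:
    # The pipeline is fixed by the programmer; the LLM only fills in each step.
    category = call_llm(f"Classify this support ticket as billing, bug, or other:\n{ticket_text}")
    summary = call_llm(f"Summarize this ticket in two sentences:\n{ticket_text}")
    reply = call_llm(f"Draft a reply for a {category} ticket. Summary: {summary}")
    return reply
```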
The agent loop
An agent is an LLM in a loop with tools and memory. One cycle:
- The agent receives a task.
- The LLM picks what to do next.
- The agent runs a tool, calls an API, writes a file, etc.
- The agent observes the result.
- Back to step 2 with the new information.
You'll see this written different ways — perceive → reason → act → learn, or plan → act → adapt. Same loop.
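The same loop as a Python sketch. call_llm and the tools dict are stand-ins, and a real agent adds error handling, token budgets, and richer stop conditions:

```python
def run_agent(goal: str, tools: dict, call_llm, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": goal}]              # the task comes in
    for _ in range(max_steps):                                 # stop condition: step limit
        decision = call_llm(history, tools)                    # the LLM picks what to do next
        if decision["type"] == "answer":
            return decision["content"]                         # goal reached, loop ends
        result = tools[decision["tool"]](**decision["args"])   # run the chosen tool
        history.append({"role": "tool", "content": str(result)})  # observe the result
    return "Stopped: step limit reached."                      # otherwise, back to the top with new info
```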
Three components
An agent has three components:
- LLM — the reasoning engine.
- Memory — what the agent remembers across steps and sessions.
- Tools — what the agent can actually do (call APIs, run code, search files, edit data).
Strip any one and you're back to plain chat.
Four characteristics
- Autonomous execution. Runs without step-by-step human input.
- Goal-oriented. Works toward an objective, not a fixed script.
- Proactivity. Initiates actions on its own.
- Collaboration. Can work with other agents (multi-agent systems).
Memory
An LLM has no memory of its own. Each call to the model is independent — it doesn't remember anything from the previous call. Any continuity you experience in ChatGPT or Claude comes from the conversation history being passed back in every turn.
Memory is storage that gets auto-injected into the prompt. When the agent needs context, the system pulls relevant entries from storage and adds them to the prompt before the next LLM call.
How it works in practice:
- The agent stores facts somewhere — JSON file, relational database, vector database, a dedicated memory service.
- Each turn, the system retrieves whatever matches the current question (recent messages, facts about the user, prior decisions).
- The retrieved content gets prepended or inserted into the prompt.
- The LLM sees it as context, no different from anything else in the prompt.
Two kinds:
- Short-term — the conversation so far. The running list of messages passed each turn.
- Long-term — facts kept across sessions. ChatGPT's memory feature is the consumer example: tell it your favorite dish, come back next week, and it remembers — because "user's favorite dish is feijoada" got stored and gets injected the next time you ask.
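A minimal sketch of the injection step, with a plain dict as the store and call_llm standing in for the provider SDK; real systems retrieve with embeddings or a memory service rather than substring matching:

```python
# Stored in an earlier session; a real system would use a database or memory service.
long_term_memory = {"favorite dish": "The user's favorite dish is feijoada."}

def answer(question: str, short_term: list[str], call_llm) -> str:
    # Naive retrieval: pull any stored fact whose key appears in the question.
    relevant = [fact for key, fact in long_term_memory.items() if key in question.lower()]
    prompt = (
        "Facts about the user:\n" + "\n".join(relevant)
        + "\n\nConversation so far:\n" + "\n".join(short_term)
        + "\n\nUser: " + question
    )
    return call_llm(prompt)  # the LLM sees memory as ordinary prompt text
```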
Compressing history
Memory has a hard limit: the model's context window. Once a chat grows past that, something has to give — either older messages get dropped (the agent forgets) or they get compressed.
Agent systems handle this automatically. As the conversation approaches the context limit, the system makes a separate LLM call to summarize older messages, then carries the summary forward in place of the raw history. The LLM keeps working without hitting the wall.
Side effect: every future turn sends fewer tokens, which lowers cost.
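A sketch of that mechanism. The threshold, the keep-the-last-10 rule, and the summarization prompt are all arbitrary choices here; count_tokens and call_llm are stand-ins:

```python
def maybe_compress(history: list[dict], call_llm, count_tokens, limit: int = 100_000) -> list[dict]:
    if count_tokens(history) < 0.8 * limit:     # still well under the context window: do nothing
        return history
    old, recent = history[:-10], history[-10:]  # keep the newest messages verbatim
    summary = call_llm(
        "Summarize this conversation, keeping decisions and open questions:\n" + str(old)
    )                                           # a separate LLM call compresses the old part
    return [{"role": "system", "content": "Summary of earlier conversation: " + summary}] + recent
```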
Tools
A tool is something the agent can do beyond producing text. The LLM outputs "call tool X with these arguments," the system runs it, and the result feeds back into the next LLM call.
Common tools:
- File operations (create, edit, delete, search)
- Web search
- API calls
- PDF parsing
- Code execution
- Database queries
In API terms, this is called tool use (Anthropic) or function calling (OpenAI). You declare the available tools to the LLM up front; the model decides when to call them.
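Here's roughly what declaring a tool looks like with Anthropic's Python SDK. The get_weather tool is invented for illustration, and the exact field names are worth checking against the current docs:

```python
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",                       # invented tool, for illustration only
    "description": "Get the current weather for a city.",
    "input_schema": {                            # JSON Schema for the arguments
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",            # use whichever current model you have access to
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
)

# If the model chose to call the tool, the response holds a tool_use block with
# the tool name and arguments; your code runs the tool and sends the result back.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```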
Without tools, the agent only produces text.
MCP (Model Context Protocol)
LLMs use tools to take action — file operations, database queries, web searches. Historically, each LLM provider defined its own format for declaring tools (OpenAI function calling, Anthropic tool use, Google function declarations), so the same integration had to be rewritten for each.
MCP is an open protocol that standardizes how applications expose tools to LLMs. Anthropic released it in November 2024; OpenAI added support in early 2025.
Three pieces:
- MCP Server — exposes a set of tools. Examples: a file system server, a Postgres server, a GitHub server.
- MCP Client — sits inside the LLM application, forwards tool calls to servers, and returns the results.
- MCP Host — the application the user interacts with (Claude Desktop, Cursor, etc.). It runs the client.
Flow: the user talks to the host → the LLM decides to call a tool → the client routes the call to the right server → the server runs it → the result feeds back into the LLM.
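A minimal server, sketched with the FastMCP helper from the official Python SDK. The notes tool and its logic are invented, and the SDK's API may shift between versions:

```python
# pip install "mcp[cli]"   (the official Python SDK)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notes")      # the server a host like Claude Desktop connects to

@mcp.tool()
def search_notes(query: str) -> str:
    """Search my local notes for a phrase."""    # this docstring is the description the LLM sees
    # Invented logic; a real server would hit the file system or a database.
    return f"3 notes mention '{query}'"

if __name__ == "__main__":
    mcp.run()               # serves over stdio by default; the MCP client connects here
```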
Reference: modelcontextprotocol.io. Community-built servers are listed on GitHub.
RAG (Retrieval-Augmented Generation)
An LLM only knows what it was trained on, up to a cutoff date. Events from last week, your company's internal docs, today's stock prices — none of it is in the model.
RAG is the pattern of fetching relevant information from outside the model and injecting it into the prompt before the LLM generates its answer.
How it works:
- The system detects that the question needs information the model doesn't have.
- It queries an external source — web search, a vector database, a file system, an API.
- The relevant chunks of content come back.
- Those chunks get inserted into the prompt as context.
- The LLM answers using the fresh content.
Example: asking Claude "who won last weekend's F1 race?" — the model doesn't know. Enable web search, the system retrieves a result page, passes the content to the model, and the model answers from the retrieved text.
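Stripped to its core, the pattern is a retrieval step followed by prompt assembly. search_web and call_llm are stand-ins for whatever source and model you use:

```python
def answer_with_rag(question: str, search_web, call_llm) -> str:
    # Query an external source and take the top results (stand-in function).
    results = search_web(question, top_k=3)
    # Insert the retrieved chunks into the prompt as context.
    context = "\n\n".join(results)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # The LLM answers from the fresh content, not from its training data.
    return call_llm(prompt)
```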
RAG challenges
- Irrelevant retrieval — the system pulls back content that doesn't help, and the LLM answers off-topic.
- Chunking — long documents have to be split into pieces small enough to fit the prompt while preserving meaning.
- Access control — when the source contains private data, the system has to respect permissions. The model shouldn't see what the user shouldn't.
Vector databases
A vector database stores text (or images, or other data) as arrays of numbers, then lets you search by similarity instead of exact match.
The numbers come from an embedding model. Feed it text (or an image, or audio) and it returns a list of numbers — typically hundreds to thousands of them — that captures the meaning of the input. Think of each embedding as a coordinate in a high-dimensional space where similar things sit close together.
Two examples to make this concrete:
- The embeddings for "cat" and "dog" sit close together. "Helicopter" is far from both.
- "How do I reset my password?" and "I forgot my login" sit close together, even though they share no words.
A vector database stores those coordinates and answers queries by returning the entries closest to your input.
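You can see this with the sentence-transformers library; all-MiniLM-L6-v2 is one small, commonly used embedding model, and any other would show the same effect:

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")    # a small, widely used embedding model
sentences = [
    "How do I reset my password?",
    "I forgot my login",
    "helicopter maintenance schedule",
]
embeddings = model.encode(sentences)               # one vector per sentence (384 numbers each)

# Cosine similarity: closer to 1.0 means closer together in the embedding space.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)   # the login sentence scores far higher than the helicopter one
```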
Concrete use case: a company chatbot that answers questions from internal docs.
Setup (one time):
- Break the company wiki into chunks (paragraphs, sections, whatever fits).
- Run each chunk through an embedding model. Get back one vector per chunk.
- Store the chunks alongside their vectors in a vector database (Pinecone, Chroma, pgvector).
Query time (every user message):
- The user asks: "How do I file expense reports?"
- You embed the question with the same embedding model.
- The database returns the chunks whose vectors are closest to the question vector — say, the top 5.
- Those chunks get injected into the LLM prompt: "Use this context to answer: [chunks]. Question: How do I file expense reports?"
- The LLM answers using the retrieved content.
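Both steps, sketched with Chroma, which embeds the documents for you with a default model. The collection name and wiki chunks are invented:

```python
import chromadb  # pip install chromadb

client = chromadb.Client()                          # in-memory instance, enough for a sketch
collection = client.create_collection("company_wiki")

# Setup: store the chunks; Chroma embeds them with its default embedding model.
collection.add(
    ids=["expenses-1", "vacation-1"],
    documents=[
        "Expense reports are filed in the finance portal before the 5th of each month.",
        "Vacation requests go through your manager in the HR tool.",
    ],
)

# Query time: the question is embedded the same way and the closest chunks come back.
results = collection.query(query_texts=["How do I file expense reports?"], n_results=1)
print(results["documents"][0])    # the expense-report chunk, ready to inject into the prompt
```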
Common vector databases: Pinecone, Weaviate, Chroma, Milvus, pgvector (a Postgres extension).
To be continued
Part 1 covers the foundations: ML, transformers, tokens, agents, memory, tools, MCP, RAG, vector databases.
Part 2 will cover the applied side — agentic coding, global rules and CLAUDE.md, vibe coding vs context engineering, multi-agent systems, guardrails, observability, and CI/CD with agents.