
Tomás Garcia

The AI Development Stack: Fundamentals Every Developer Should Actually Understand

Most developers are already using AI tools daily — Copilot, Claude, ChatGPT. But when it comes to building with AI, there's a gap. Not in tutorials or API docs, but in the foundational mental model of how these systems actually work and fit together.

This is the stuff I wish someone had laid out clearly when I started building AI-powered features. Not the hype, not the theory — the practical fundamentals that change how you architect, debug, and think about AI systems.


Language Models: What's Actually Happening

A Language Model (LM) is a neural network that encodes statistical information about language. Intuitively, it tells you how likely a word is to appear in a given context. Given "my favorite color is ___", a well-trained LM should predict "blue" more often than "car."

The atomic unit here is the token — which can be a character, a word, or a subword (like "tion") depending on the model's tokenizer.

A Large Language Model (LLM) is just an LM trained on massive amounts of data using self-supervised learning. The key distinction isn't just scale — it's that at scale, capabilities emerge that were never explicitly programmed. An LM predicts the next token. An LLM does it at such scale that reasoning, coding, and creative abilities appear as emergent properties.

Foundation Model (FM) is the broadest term. It covers both LLMs (text-only) and Large Multimodal Models (LMMs), which can process text, images, video, audio, and 3D assets.


What Is an Agent?

An agent is a system that uses an LLM to operate in a loop: it reasons about what to do, takes action (tool calls, code execution, API calls), observes the result, and repeats until the task is complete.

The basic loop looks like this:

THINK — the agent receives the current context and decides what to do: respond directly, or call a tool.

ACT — if it decided to use a tool, it executes it (web search, DB query, API call).

OBSERVE — the result gets added to the context, and the cycle starts again.

The loop terminates when the model has enough information to give a final answer, or when an external limit is reached (max iterations, timeout).
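The loop above fits in a few lines of Python. This is a minimal sketch: `call_llm`, the decision format, and the tool registry are hypothetical stand-ins, not a real SDK.

```python
# Minimal think-act-observe loop. `call_llm` is any function that takes the
# current context and returns either a final answer or a tool call to make.
def run_agent(call_llm, tools, user_message, max_iterations=10):
    context = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):                # external limit
        decision = call_llm(context)               # THINK
        if decision["type"] == "final_answer":
            return decision["content"]
        tool_fn = tools[decision["tool"]]
        result = tool_fn(**decision["args"])       # ACT
        context.append({"role": "tool", "content": str(result)})  # OBSERVE
    return "Stopped: iteration limit reached"
```

Swapping in a real API client for `call_llm` and real functions for `tools` is essentially all that separates this sketch from a working agent.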

This is deceptively simple. But every meaningful AI product you've used — from Claude Code to Cursor to Devin — is some variation of this loop.


Tools: How LLMs Touch the Real World

A tool is an external function that the agent can invoke to interact with the world outside the LLM.

Here's what's important to understand: the LLM by itself only generates text. Tools are what let it do real things — fetch live information, read files, execute code, call APIs, write to a database.

Concrete examples:

web_search("dollar price today")
query_db("SELECT * FROM orders WHERE status = 'pending'")
send_email(to="client@mail.com", body="...")

Without tools, an LLM is a very sophisticated autocomplete. With tools, it becomes an agent that can actually operate in your environment.
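How does the model know a tool exists? You describe it: a name, a natural-language description, and a schema for the parameters. The exact field names vary by provider; this sketch follows the common function-calling convention.

```python
# A tool definition as it is typically presented to the model: the
# description and schema are what the LLM "reads" to decide when and how
# to call it. Field names are illustrative, not a specific provider's API.
web_search_tool = {
    "name": "web_search",
    "description": "Search the web and return the top results as text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "The search query"},
        },
        "required": ["query"],
    },
}
```

The description matters as much as the schema: it's the only thing the model has to decide whether this tool fits the current step.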


Context: The Model's Working Memory

Context is all the information the agent has "in memory" at a given moment to generate a coherent response. Think of it as a text box the model reads in its entirety on every call.

It contains:

  • System prompt — base instructions defining the model's behavior
  • Documents — reference material injected for the task
  • User message — the actual request
  • Previous responses — conversation history
  • Tool results — outputs from tool executions

The context has a hard limit called the context window, measured in tokens. Anything that doesn't fit in that window, the model simply doesn't see.

This is why context design matters so much when building agents. The system prompt, the conversation history you preserve, what you include and what you drop — all of that directly impacts response quality, latency, and cost.
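In practice, "context" is just an ordered list the model reads top to bottom on every call. A sketch of assembling it (the message shape is illustrative, not a specific provider's format):

```python
# Everything the model sees on one call, assembled into a single list.
# Order matters: the system prompt and reference material come first,
# then history, then the current request and any tool outputs.
def build_context(system_prompt, documents, history, user_message, tool_results):
    messages = [{"role": "system", "content": system_prompt}]
    for doc in documents:
        messages.append({"role": "system", "content": f"Reference:\n{doc}"})
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    for result in tool_results:
        messages.append({"role": "tool", "content": result})
    return messages
```

Everything you drop from this list is invisible to the model; everything you keep costs tokens, latency, and money on every single call.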


Memory: Beyond the Context Window

Memory is the mechanism that allows an agent to access information beyond its context window.

Two concrete examples you're probably already using:

Claude.ai — at the start of every conversation, the context is empty. What it "remembers" from past chats exists only because Anthropic injects a summary of previous conversations into the context before you start typing.

Claude Code — when you're working on a project, it reads files like CLAUDE.md, the directory tree, and relevant codebase files. It doesn't "know" them from memory — it loads them into context when needed, via tools.

The key insight: there is no magic persistence. Everything the model "remembers" was explicitly loaded into the context window for that specific call.
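That insight can be made concrete in a few lines. This is a toy sketch, not how any particular product implements memory: "remembering" is just storing a summary and prepending it to the next conversation's context.

```python
# Toy memory: a store keyed by user, injected explicitly at conversation start.
memory_store = {}  # user_id -> summary of past conversations

def end_conversation(user_id, summary):
    # In a real system the summary itself would be produced by the LLM.
    memory_store[user_id] = summary

def start_conversation(user_id, system_prompt):
    context = [{"role": "system", "content": system_prompt}]
    if user_id in memory_store:
        context.append({
            "role": "system",
            "content": f"Summary of past conversations: {memory_store[user_id]}",
        })
    return context
```

Nothing here lives "inside" the model. Delete the store and the agent has amnesia, no matter how many conversations came before.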


Prompting: The Developer's Primary Interface

Prompting is the skill of giving instructions to an LLM to get the output you want. It's the primary interface between you and the model.

What the LLM receives isn't just what the user types. A complete message typically includes:

  • System prompt — base instructions defining behavior, role, constraints, response format
  • User prompt — the user's message
  • Context — conversation history, tool results, relevant documents, retrieved memory
  • Available tools — the list of functions the agent can invoke, with their descriptions and parameters

All of that together is what the LLM "reads" before generating its response.

Core techniques:

Zero-shot — you ask directly without examples.
"Translate this text to English"

Few-shot — you provide examples of expected behavior before the question.
"Input: 'loved it' → Sentiment: positive. Input: 'disgusting' → Sentiment: negative. Input: 'it was okay' → Sentiment:"

Chain of thought — you ask the model to reason step by step before answering.
"Think step by step before responding"

A practical rule: the model doesn't guess your intent; it only predicts the next token. The clearer and more specific the prompt, the more predictable and useful the output.
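Few-shot prompting is just string assembly: the examples go before the query so the model's next-token prediction is steered toward the demonstrated format.

```python
# Building the few-shot sentiment prompt from the example above.
def few_shot_prompt(examples, query):
    lines = [f"Input: '{text}' -> Sentiment: {label}" for text, label in examples]
    lines.append(f"Input: '{query}' -> Sentiment:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("loved it", "positive"), ("disgusting", "negative")],
    "it was okay",
)
```

The prompt ends exactly where you want the model to continue; with two labeled examples in view, the most likely next tokens are a sentiment label in the same format.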


Evals: Testing in a Non-Deterministic World

An LLM is not a deterministic function. The same input can produce different outputs on every run.

This breaks something fundamental for developers: you can't write an assert on an LLM's response.

# this doesn't work with LLMs
assert llm.respond("Capital of France?") == "Paris"
# it might respond "Paris.", "The capital is Paris", "París"...

The conceptual distinction matters:

Test — verifies that a function produces an exact, predictable output given an input. Pass or fail. Works when the system is deterministic.

Eval — measures how good a response is according to one or more criteria: relevance, coherence, correctness, tone. Produces a score, not a boolean.
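A toy eval makes the difference tangible: instead of asserting exact equality, you score the response against criteria and compare the score to a threshold.

```python
# A toy rubric eval for "Capital of France?": scores 0..1 across two
# criteria instead of demanding an exact string match.
def eval_capital_answer(response):
    score = 0.0
    if "paris" in response.lower():
        score += 0.8   # correctness criterion
    if len(response.split()) <= 10:
        score += 0.2   # conciseness criterion
    return score

assert eval_capital_answer("The capital is Paris") >= 0.8   # passes
assert eval_capital_answer("I don't know") < 0.5            # fails
```

"Paris.", "The capital is Paris", and "París" would all have failed the exact-match assert; a criteria-based score handles the variation. Real evals replace these hand-written checks with semantic similarity or an AI judge, but the shape is the same: score, then threshold.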

For most open-ended tasks, a perfect reference answer doesn't exist. This led to AI-as-a-Judge, where one AI model evaluates the output of another. It's popular because it's fast, scalable, and can evaluate subjective criteria like creativity or coherence without needing reference text.

But it has known limitations: AI judges have biases like position bias (favoring the first response in a comparison) and verbosity bias (preferring longer answers even when they contain errors).


Guardrails: The Safety Net You Need

Guardrails protect the system both from malicious inputs and problematic outputs.

They operate in two layers:

Input Guardrails prevent prompt injection attacks and filter sensitive data (PII) before it reaches external APIs.

Output Guardrails verify the model's responses for toxicity, factual inconsistencies, and format errors — typically using a fast classifier or an AI judge before showing the response to the user.

The reasoning is straightforward: since the LLM is probabilistic, you can't guarantee it will always behave as expected. Guardrails implement checks at both ends.

User input
    ↓
[Input Guardrail]  ← PII, prompt injection, malicious content
    ↓
   LLM
    ↓
[Output Guardrail] ← toxicity, hallucinations, bad formatting
    ↓
Response

The trade-off: guardrails add latency to every response. It's a cost worth paying for production systems, but you need to be intentional about what you check and how.
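An input guardrail can be as simple as redaction before the text leaves your system. Production guardrails use dedicated classifiers; the regexes below are an illustrative sketch for obvious PII only.

```python
import re

# Minimal input guardrail: redact emails and US-style SSNs before the
# user's text is sent to an external API. Illustrative patterns only --
# real PII detection needs much more than two regexes.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact_pii(text):
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

This sits at the `[Input Guardrail]` step of the diagram above; an output guardrail is the same idea pointed at the model's response instead of the user's input.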


MCP (Model Context Protocol): The USB-C of AI Tools

Before MCP, if you wanted an agent to use an external tool — say, search Notion, query a database, or read a Google Drive file — you had to implement that integration yourself: authentication, request formatting, error handling, and then describe it to the LLM in the system prompt so it knew how to use it.

The problem: every agent, every LLM, every app was reimplementing the same integrations from scratch.

MCP is a standard interface between agents and external tools — it defines how an LLM discovers, invokes, and receives results from tools, regardless of who implemented them.

Two components:

MCP Server — exposes tools to the agent. Can be local (a process running on your machine) or remote (a cloud service). Implements the concrete tools: read files, query APIs, execute code.

MCP Client — the agent or app that consumes the tools. Connects to the server, discovers available tools, and invokes them during the think-act-observe loop.

[Agent / MCP Client]
    ↓  "what tools do you have?"
[MCP Server]
    ↓  "I have: read_file, search_notion, query_db"
[Agent]
    ↓  calls read_file("README.md")
[MCP Server]
    ↓  returns the content
[Agent]  ← adds result to context and continues

In Claude Code, for example, Claude Code itself acts as the MCP Client. You can add MCP Servers with a simple command — claude mcp add server-name — and from that moment Claude Code has access to whatever tools that server exposes. A Postgres MCP Server gives Claude Code the ability to query your database directly during a development session.
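Under the hood, MCP messages are JSON-RPC 2.0. A sketch of the two exchanges from the diagram — discovery and invocation — with transport and handshake details omitted (the method names follow the MCP spec; the surrounding plumbing here is simplified):

```python
import json

# "What tools do you have?" -- the client asks the server to list its tools.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# "Call read_file" -- the client invokes one of the discovered tools.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "read_file", "arguments": {"path": "README.md"}},
}

wire_message = json.dumps(call_request)  # what actually crosses the transport
```

The point of the standard is that this shape is identical whether the server wraps Postgres, Notion, or your filesystem — the client never needs integration-specific code.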


RAG (Retrieval-Augmented Generation): Grounding Responses in Your Data

The problem: the LLM's knowledge is limited to its training data. It knows nothing about your codebase, your internal docs, real-time data, or anything after its knowledge cutoff date.

RAG is the pragmatic alternative: instead of teaching the model, you pass it the relevant information in context right before it responds.

The flow:

User question
    ↓
[Search] ← finds the most relevant fragments
           in a vector database (fed with document chunks)
    ↓
[Augmented context] ← question + relevant fragments
    ↓
   LLM
    ↓
Response grounded in those documents

The three components:

Ingestion — documents are split into fragments (chunks) and converted into vectors (embeddings) that represent their semantic meaning. Stored in a vector database.

Retrieval — when a question arrives, it's also converted into a vector and the most semantically similar fragments are retrieved from the database.

Generation — the retrieved fragments are injected into the LLM's context along with the question, and the model responds based on that information.
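All three components in one toy sketch. Real systems use learned embeddings and a vector database; a bag-of-words vector stands in here so the mechanics stay visible.

```python
import math
from collections import Counter

# Toy "embedding": word counts instead of a learned vector.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ingestion: chunk the documents and store (chunk, vector) pairs.
chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is located in Madrid.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Retrieval: embed the question, rank chunks by similarity.
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Generation: inject the retrieved chunks into the prompt.
question = "how long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Swap `embed` for a real embedding model and `index` for a vector database, and this is the full RAG pipeline.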

When to use RAG:

  • Chatbots over internal documentation or knowledge bases
  • Assistants that need real-time information (news, prices, live data)
  • Q&A over code, contracts, reports — any data the model doesn't know
  • Reducing hallucinations by anchoring responses to concrete sources


Developer Interfaces: How You Actually Use LLMs

An LLM can be consumed in different ways depending on the use case:

Web — the most accessible form. Go to a URL, type, get a response. Ideal for exploring, iterating on prompts, or one-off tasks. No code required. Examples: Claude.ai, ChatGPT, Gemini.

API — the programmatic form. You make an HTTP request and get the response in your code. It's the foundation of any product or agent you build. Gives you full control over the prompt, model, parameters, and integration with your system.

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "claude-sonnet-4-20250514",
       "max_tokens": 1024,
       "messages": [{"role": "user", "content": "Hello"}]}'

CLI (Terminal) — command-line tools that wrap the API and let you interact with the LLM from your terminal, integrated into your development workflow. The most relevant example today is Claude Code: an agent that runs in your terminal, has access to your codebase, can read and write files, execute commands, and operates in the think-act-observe loop we already covered.

IDE — extensions that integrate the LLM directly into your editor. The model sees your code in context and can suggest, complete, refactor, or explain without leaving the environment. Examples: Cursor, GitHub Copilot, or the Claude extension for VS Code.


Putting It All Together

None of these concepts exist in isolation. When you use Claude Code to refactor a function, here's what's actually happening: the LLM is processing your request within a context window loaded with your system prompt, codebase files (loaded via tools), and conversation memory. It operates in an agent loop — thinking, acting, observing. The tools it uses to read and write your files might come through MCP servers. If it's pulling in documentation, that might be RAG at work. And somewhere in the pipeline, guardrails are ensuring the outputs are safe.

Understanding these fundamentals doesn't just help you use AI tools better — it's the foundation for building them.
