Jayaprasanna Roddam

Posted on May 19

Gen AI at a glance

1. What is an LLM?

LLM = Large Language Model

At its core, it's a program that answers one question repeatedly:

"Given everything I've seen so far, what word comes next?"

That's it. It's a next-token predictor. Everything else — reasoning, coding, conversation — is an emergent behavior that falls out of doing this extremely well at massive scale.

It's trained by reading a huge chunk of the internet, books, code, etc., and learning the statistical patterns of language. Not rules. Not logic trees. Just patterns, at an incomprehensible scale.

2. What is a Token?

A token is the unit of text the LLM works with. It's not exactly a word — it's more like a word-chunk.

"I am learning GenAI"
→ ["I", " am", " learning", " Gen", "AI"]  ← 5 tokens

Why tokens and not letters or words?

Letters → too granular, sequences get too long
Words → the vocabulary explodes (every tense, plural, compound word)
Tokens → a sweet middle ground (~50,000 common tokens cover most English)

Why you care as a developer:

APIs charge per token
Models have a context window limit (e.g., 200k tokens for Claude) — that's how much text it can "see" at once
~1 token ≈ 0.75 words, so 200k tokens ≈ a ~150,000 word novel

3. How Does the LLM Actually Learn? (Training, simply)

During training, the model sees billions of sentences. For each one, it tries to predict the next token, checks if it was right, and slightly adjusts its internal numbers (called weights) to do better next time.

Do this a few trillion times across a cluster of GPUs for weeks, and you get a model that has compressed a huge amount of human knowledge into those weights.

The model is essentially a giant mathematical function:

f(text input) → probability distribution over what comes next

Pick the most likely token. Append it. Feed it back in. Repeat. That's generation.

4. What is an Embedding?

This is where it gets interesting.

Somewhere inside the LLM, before it predicts the next token, it converts every token into a vector — a list of numbers. Like this:

"cat"  → [0.21, -0.54, 0.87, 0.03, ... ]  (1536 numbers)
"dog"  → [0.19, -0.51, 0.85, 0.01, ... ]  (1536 numbers)
"car"  → [0.80,  0.22, -0.40, 0.60, ... ]  (1536 numbers)

The magic: similar meaning = similar numbers. Cat and dog are close together in this 1536-dimensional space. Car is far away.

This is called an embedding — a dense numerical representation of meaning.

You can now do math on meaning:

king - man + woman ≈ queen   ← the famous example

Why you care: Embeddings are what power search that understands intent, not just keywords. They're the foundation of RAG.

5. Vector Databases

Now you have millions of documents, each converted to an embedding (a list of ~1536 numbers). You need to store them and find the most similar one to a query — fast.

A regular database can't do this. SELECT * WHERE embedding = ? doesn't work for "find the closest vector."

A vector database is purpose-built for nearest-neighbor search in high-dimensional space.

Query: "how do I reset my password?"
→ Convert to embedding: [0.33, -0.12, 0.76, ...]
→ Search vector DB for closest stored embeddings
→ Returns: your docs about login, authentication, account recovery

Popular ones: Pinecone, Weaviate, Qdrant, pgvector (Postgres extension), Chroma (local/dev).

6. Now RAG Makes Complete Sense

Put it all together:

Step 1 — Indexing (done once, offline):
  Your docs → split into chunks → embed each chunk → store in vector DB

Step 2 — Querying (every time a user asks something):
  User question → embed it → find top-K similar chunks in vector DB
  → inject those chunks into the LLM prompt → LLM answers

The LLM never "knows" your data permanently. It's given the relevant pieces each time, right in the prompt. That's the "augmented" part of RAG.

Why not just put all your docs in the context window?

Your codebase might be millions of tokens — too big
Cost: you're paying per token
Performance: LLMs get worse with very long contexts ("lost in the middle" problem)
RAG picks only the relevant chunks

7. The Prompt & Context Window

When you talk to an LLM, you're not having a real "conversation" — you're constructing a text document (the prompt) and the model predicts what comes after it.

A prompt has several parts:

[System prompt]     ← Instructions, persona, rules ("You are a helpful assistant...")
[Past messages]     ← The conversation history, manually included
[RAG context]       ← Relevant docs retrieved for this query
[User message]      ← What the user just said
[Assistant: ]       ← Model starts generating here

The context window is how much of this total document the model can "see." Everything outside it is simply forgotten. There's no persistent memory — that's why apps have to re-inject conversation history every single time.

8. Temperature & Sampling

When the model outputs a probability distribution over the next token, how do you pick?

Greedy: always pick the most likely → deterministic but repetitive
Temperature: controls randomness
- temp = 0 → nearly deterministic, factual
- temp = 1 → balanced
- temp = 2 → creative/chaotic

This is why you can ask the same question twice and get different answers.

9. Fine-tuning vs. Prompting vs. RAG

These are the three levers for customizing LLM behavior:

	What it changes	Cost	When to use
Prompting	Nothing permanent, just guides generation	Free	Default choice
RAG	What context the model sees	Low-Medium	Private/fresh data
Fine-tuning	The model's actual weights	High	Style, tone, specialized domain behavior

A common mistake is reaching for fine-tuning when RAG would work. Fine-tune when you need the model to behave differently. RAG when you need it to know something.

10. Agents & Tool Use

An LLM by itself just generates text. An agent is when you give the LLM tools and let it decide when to use them.

The loop:

1. User asks something
2. LLM thinks: "I need to search for that" → emits a tool call (structured JSON)
3. Your code executes the tool (runs a search, queries a DB, calls an API)
4. Result is injected back into the context
5. LLM thinks again with new information
6. Repeat until it has enough to answer

Tool use is just the LLM outputting structured data instead of prose, and your code catching it and running something.

The Full Stack, Visualized

┌─────────────────────────────────────────┐
│              Your App / Chat UI         │
└──────────────┬──────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│            Orchestration Layer          │
│  (builds prompts, manages memory,       │
│   routes tool calls, handles RAG)       │
└──────┬──────────────────┬───────────────┘
       │                  │
┌──────▼──────┐   ┌───────▼───────┐
│  Vector DB  │   │  MCP Servers  │
│ (your docs) │   │ (Slack, Jira, │
│             │   │  code, APIs)  │
└─────────────┘   └───────────────┘
       │                  │
┌──────▼──────────────────▼───────────────┐
│                  LLM                    │
│   (tokens → embeddings → attention      │
│    → next token prediction)             │
└─────────────────────────────────────────┘