In the previous post, we talked about context windows. The model has a fixed-size desk and everything has to fit on it at once. When too much is on the desk, things in the middle get missed.
I ended that post with a promise: what if there was a way to give the model just the right piece, at the right time, from a document you've never even pasted in?
That's this post. We're giving the model a search system.
The problem: your document is too long
You have a 2000-page document. An employee handbook, a product manual, internal documentation. You need one specific answer from it.
You can't paste the whole thing into the model's context window. And even if you found a model with a window big enough, we learned what happens: attention degrades, things in the middle get missed, and the model answers confidently from the wrong section.
So you need something different. A step that happens before the model sees anything. Something that finds the 2-3 paragraphs that actually answer your question, and passes only those to the model.
That's retrieval. The full technique is called RAG: Retrieval-Augmented Generation. Search first, then generate.
Three words, one loop
Let's break the name down. Each word is a step.
Retrieval. Go find relevant information. Think of it like checking the index of a textbook before diving into a chapter. You don't re-read the whole book. You find the right page first.
Augmented. Add that retrieved info to the prompt. You're supplementing the model's built-in knowledge with fresh, specific context. Like handing someone a cheat sheet right before they answer a question.
Generation. The model writes its response, but with the retrieved context sitting right there in the conversation. It generates an answer grounded in your actual data, not just its training. "Grounded" means the model has real evidence to point to. It's not guessing from memory. It's answering from something you gave it.
The whole loop in one sentence: find the right chunks of information, stuff them into the prompt, let the model answer using that context. That's it. That's RAG.
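To make the loop concrete before any real models show up, here's a toy sketch with no API calls at all. Everything in it is a stand-in I made up for illustration: `toy_retrieve` just counts shared words, `fake_generate` is a placeholder for a model call, and the three sample chunks are invented. The real pipeline later in this post swaps these for an embeddings model and a foundation model.

```python
def toy_retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Retrieval: rank chunks by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(question: str, retrieved: list[str]) -> str:
    """Augmented: put the retrieved text into the prompt, ahead of the question."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_generate(prompt: str) -> str:
    """Generation: placeholder for a model call. It only reports the prompt size."""
    return f"(a model would answer here, using {len(prompt)} characters of prompt)"


# Invented sample chunks, not taken from any real document.
chunks = [
    "Vacation days do not carry over into the next calendar year.",
    "Expense reports are due by the fifth of each month.",
    "Laptops must be encrypted before leaving the office.",
]
question = "Do vacation days carry over?"

retrieved = toy_retrieve(question, chunks, top_k=1)
prompt = augment(question, retrieved)
print(fake_generate(prompt))
```

Word overlap is a crude way to search, and it's exactly the part that embeddings replace further down. The shape of the loop stays the same.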
And if you're thinking "wait, isn't this just enterprise search?" you're not wrong. Enterprise search has been finding relevant passages in documents for decades, and tools like Elasticsearch, Amazon Kendra, and SharePoint search do exactly that. The retrieval part isn't new. What's new is the last step: instead of showing you a results page to read for yourself, a foundation model reads the evidence and writes the answer. To put it simply, RAG is enterprise search with a language model at the end of the pipeline.
The setup: onboarding docs for a fictional company
Imagine you just joined a new company and on the first day they hand you a bunch of documents. Employee handbook, benefits guide, leave policy, expense rules, engineering onboarding, IT security. Six documents with thousands of lines. All the answers are in there somewhere, but you'd have to read all of them to find what you need.
I've got a fictional company here, PineRidge Solutions. These are their onboarding docs.
The goal: I type a question like "how many vacation days do I get?" or "what's the parental leave top-up?" and the system finds the right section and answers from it.
I'm building this in Kiro IDE, and for the models, I'm using Amazon Bedrock, the same tool we've been using for the last four posts. Except now, instead of the Playground in the AWS Console, I'm calling it from my code.
Please note, I'm using Bedrock here, but this same pattern works with any embeddings model, running locally or in the cloud. Ollama locally, OpenAI, Cohere, whatever. The pipeline is the same. The model is just a plug.
All the code mentioned in this post is available in my GitHub repo here.
Three steps to build. Chunk, embed, retrieve. Let's go.
Step 1: Chunk the document
Before anyone can search these documents, they need to be broken into smaller pieces. Chunks. Usually a few paragraphs each.
Why? Because the goal is to return just the relevant section, not everything. If I keep each document as one giant block, the search will return entire files when I only need a paragraph.
How you split matters. Too large, and you're back to the "too much context" problem. Too small, and you might cut an answer in half.
Let's take a simple example.
Say the leave policy has three sentences: "The standard vacation policy grants 15 days per year. However, employees in their first year receive only 10 days. These days do not carry over into the next calendar year."
If I chunk without overlap, I might split after the second sentence. The next chunk starts with "These days do not carry over into the next calendar year." Now if someone asks "do my vacation days carry over?" the system retrieves that chunk. It answers "these days do not carry over." But which days? The standard 15? The first-year 10? The word "these" has lost its referent. The chunk is meaningless on its own.
With overlap, the last sentence of chunk one repeats at the start of chunk two. Both chunks make sense independently.
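Here's that three-sentence example as a small sketch, so you can see both splits side by side. `chunk_sentences` is a hypothetical helper written just for this illustration. It works on a list of sentences, unlike the paragraph-based chunker used for the actual demo.

```python
def chunk_sentences(sentences: list[str], size: int = 2, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of `size`, repeating `overlap` sentences between neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start : start + size]))
        if start + size >= len(sentences):
            break  # the final sentence is already covered
    return chunks


policy = [
    "The standard vacation policy grants 15 days per year.",
    "However, employees in their first year receive only 10 days.",
    "These days do not carry over into the next calendar year.",
]

no_overlap = chunk_sentences(policy, size=2, overlap=0)
with_overlap = chunk_sentences(policy, size=2, overlap=1)

print("Without overlap, chunk 2:", no_overlap[1])
print("With overlap, chunk 2:   ", with_overlap[1])
```

Without overlap, the second chunk is the orphaned "These days..." sentence. With one sentence of overlap, the second chunk starts with the first-year rule, so "these days" has something to refer to.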
Here's the code. Read each file, split into pieces:
```python
import os


def chunk_docs_paragraph(folder: str) -> list[dict]:
    """Paragraph-based chunking with 1 paragraph of overlap."""
    chunks = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".md"):
            continue
        with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
            text = f.read()
        # Paragraphs are separated by blank lines; drop empty ones.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for i in range(len(paragraphs)):
            start = max(0, i - 1)  # 1 paragraph overlap
            chunk_text = "\n\n".join(paragraphs[start : i + 1])
            chunks.append({"text": chunk_text, "source": filename})
    return chunks
```
From six onboarding documents, I get about 150 chunks. Each one is roughly a paragraph or two. A self-contained piece of text.
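Since chunk size matters so much, a quick size report is a cheap sanity check before moving on. This is a sketch, not something from the repo: `chunk_stats` is a hypothetical helper, the two sample chunks are made up, and it only assumes the list-of-dicts shape with a `"text"` key that the chunker returns.

```python
from statistics import mean


def chunk_stats(chunks: list[dict]) -> dict:
    """Summarise chunk sizes in words so outliers are easy to spot."""
    lengths = [len(c["text"].split()) for c in chunks]
    if not lengths:
        return {"count": 0, "min_words": 0, "mean_words": 0.0, "max_words": 0}
    return {
        "count": len(lengths),
        "min_words": min(lengths),
        "mean_words": round(mean(lengths), 1),
        "max_words": max(lengths),
    }


# Made-up chunks in the same shape the chunker returns.
sample = [
    {"text": "Vacation days do not carry over.", "source": "leave-policy.md"},
    {"text": "Submit expenses within 30 days of purchase.", "source": "expenses.md"},
]
stats = chunk_stats(sample)
print(stats)
```

A very large `max_words` hints at a document with no blank lines between paragraphs. A very small `min_words` usually means headings are being treated as paragraphs of their own.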
Step one done. Now I need to make these searchable.
Step 2: Turn chunks into embeddings
Here's the concept that makes the whole thing work. Each chunk gets turned into a set of numbers called an embedding.
The name is a literal mathematical term. You're taking text and placing it into a space made of numbers. In that space, distance has meaning. Two chunks about similar things end up close together. Two chunks about different topics end up far apart. "Parental leave top-up" and "salary during maternity leave" would be near each other numerically, even though the actual words are completely different. That's what makes this useful: an embedding captures meaning, not exact words.
Think of it like a library's index card system. The card doesn't contain the whole book. It captures enough about the content to help you find the right book when someone asks.
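To see what "distance has meaning" looks like in numbers, here's a toy sketch. The three-number vectors below are invented by hand purely for illustration. Real embeddings come from a model and have hundreds or thousands of dimensions. The similarity measure is cosine similarity, the same one used for the search step later on.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 is similar, near 0.0 is unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings". These are NOT real model output.
toy_embeddings = {
    "parental leave top-up": [0.9, 0.1, 0.0],
    "salary during maternity leave": [0.8, 0.2, 0.1],
    "laptop encryption policy": [0.0, 0.1, 0.9],
}

close = cosine_similarity(
    toy_embeddings["parental leave top-up"],
    toy_embeddings["salary during maternity leave"],
)
far = cosine_similarity(
    toy_embeddings["parental leave top-up"],
    toy_embeddings["laptop encryption policy"],
)
print(f"leave vs. maternity pay: {close:.2f}")
print(f"leave vs. laptop policy: {far:.2f}")
```

The two leave-related phrases share no words, yet their vectors point in almost the same direction. That's the property the search step relies on.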
A specialised model called an embeddings model does this conversion for us. It's not the same model that generates your answer. It's a different model for a different job. The embeddings model is small and fast. It turns text into searchable numbers.
```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")


def embed_text(text: str) -> list[float]:
    """Call Titan Embeddings V2 to get a 1024-dim vector."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    result = json.loads(response["body"].read())
    return result["embedding"]
```
Each chunk now has a numerical fingerprint. That's my searchable index.
You'll also hear the term "vector" a lot. It just means an ordered list of numbers, which you can think of as coordinates pointing to a spot in that space. An embedding is the concept; a vector is the format it's stored in.
Right now these vectors are sitting in a Python list on my laptop. If I close this script, they're gone. For this demo, I'm caching them to a local file so I don't re-embed every time I run the script. But for a production system with thousands of documents, you'd store them somewhere proper. AWS recently launched Amazon S3 Vectors, which is literally what it sounds like: S3 built for storing and searching vectors natively. There's also OpenSearch Serverless, pgvector if you want Postgres, or Amazon Bedrock Knowledge Bases which handles the whole pipeline as a managed service.
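The local caching can be as simple as saving a NumPy array to disk. Here's a minimal sketch of that idea, and not necessarily how the repo does it. `load_or_embed` and the `embeddings.npy` file name are my own placeholders, and the embedding function is passed in as `embed_fn` so the same helper works with any embeddings model.

```python
import os
from typing import Callable

import numpy as np


def load_or_embed(
    chunks: list[dict],
    embed_fn: Callable[[str], list[float]],
    cache_path: str = "embeddings.npy",  # hypothetical file name
) -> np.ndarray:
    """Return one embedding row per chunk, reusing a cached file when it matches."""
    if os.path.exists(cache_path):
        cached = np.load(cache_path)
        # Crude staleness check: only the number of rows is compared.
        if cached.shape[0] == len(chunks):
            return cached
    vectors = np.array([embed_fn(c["text"]) for c in chunks], dtype=np.float32)
    np.save(cache_path, vectors)
    return vectors
```

The row-count check only catches added or removed chunks. If you edit a document without changing the number of chunks, delete the cache file so the embeddings get rebuilt.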
Step two done. Now, the search.
Step 3: Retrieve and generate
Someone asks a question. The question gets embedded with the same model. Same kind of numbers. Then we compare the question's numbers against all the chunk numbers. The closest matches are my search results.
This is semantic search. It matches by meaning, not by exact words.
If the handbook says "remote work policy" and I ask about "working from home rules," it catches the match because the meaning is close.
```python
import numpy as np


def retrieve(question: str, chunks: list[dict], embeddings: np.ndarray, top_k: int = 3):
    """Find the top-K most relevant chunks via cosine similarity."""
    q_vec = np.array(embed_text(question))
    scores = []
    for i in range(len(chunks)):
        score = np.dot(q_vec, embeddings[i]) / (
            np.linalg.norm(q_vec) * np.linalg.norm(embeddings[i])
        )
        scores.append(score)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[idx] for idx in top_indices]
```
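The loop above is easy to read, which is why I kept it. If the number of chunks grows, the same cosine similarity can be computed for every chunk in one matrix operation. This is an optional sketch: `top_k_cosine` is a hypothetical name, it takes an already-embedded question vector, and it assumes none of the vectors are all zeros.

```python
import numpy as np


def top_k_cosine(q_vec: np.ndarray, embeddings: np.ndarray, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (row index, cosine score) pairs for the top_k rows most similar to q_vec."""
    q_unit = q_vec / np.linalg.norm(q_vec)
    row_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit_rows = embeddings / row_norms
    scores = unit_rows @ q_unit  # one dot product per chunk
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Tiny hand-made example: row 0 points the same way as the question.
demo_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
demo_question = np.array([1.0, 0.0])
demo_top = top_k_cosine(demo_question, demo_embeddings, top_k=2)
print(demo_top)
```

Returning the scores along with the indices is handy later, when you want to see how confident a match actually was.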
The top 3 chunks are my evidence. Now I pass them to a generation model alongside the question. Titan did the embeddings. Claude does the answering.
```python
def generate_answer(question: str, retrieved: list[dict]) -> str:
    """Pass retrieved chunks + question to Claude."""
    context = "\n\n---\n\n".join(
        f"[Source: {r['source']}]\n{r['text']}" for r in retrieved
    )
    prompt = (
        f"Use ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = bedrock.converse(
        modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```
I asked: "What's the RRSP matching policy?"
The system retrieved the right section from the benefits guide. The answer came back grounded in the actual policy document: dollar-for-dollar match up to 5% of base salary, starts after 90 days, vesting schedule. Not from the model's training data, from the company's files. And I can see exactly which chunks were used to build that answer. That's my citation. I can point to the source.
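Because every chunk carries its `source`, turning that into a visible citation is a few lines. A minimal sketch, assuming the same list-of-dicts shape as above. `format_with_sources` is a hypothetical helper, and the file names and chunk text in the example are placeholders.

```python
def format_with_sources(answer: str, retrieved: list[dict]) -> str:
    """Append the distinct source files, in retrieval order, to the answer text."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    sources = list(dict.fromkeys(r["source"] for r in retrieved))
    return f"{answer}\n\nSources: {', '.join(sources)}"


# Placeholder data for illustration only.
example_retrieved = [
    {"source": "benefits-guide.md", "text": "..."},
    {"source": "benefits-guide.md", "text": "..."},
    {"source": "employee-handbook.md", "text": "..."},
]
cited = format_with_sources("Dollar-for-dollar match up to 5% of base salary.", example_retrieved)
print(cited)
```

This cites at the file level. Citing the exact paragraph would need the chunker to record a position alongside the file name.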
The full pipeline. Chunk, embed, retrieve, generate. Running on my laptop. About 60 lines of Python. And it works.
Where it breaks: a quick preview
So this works great when retrieval finds the right piece. But watch this.
I asked: "How many vacation days do I get as a senior engineer?" Retrieval actually works. It finds the vacation table from the benefits guide. But the model says "I don't know which level a senior engineer is." The right information was retrieved, but the answer needed two pieces of context that aren't in the same chunk: what level maps to "senior engineer," and how many days that level gets.
That's the kind of thing that breaks. Retrieval succeeded, but the answer still failed.
This is not a hallucination in the way we talked about in the hallucinations post. The model didn't invent something from nothing. It was given real text from the real document, and it was honest about what it couldn't determine from that evidence. The retrieved chunks simply didn't contain everything needed to answer the question.
When a RAG system gives you a bad answer, the question to ask is: "what chunk did it retrieve?" Not "why is the model wrong?"
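In practice, asking "what chunk did it retrieve?" means printing the ranked chunks before blaming the model. Here's a rough sketch of such a report. `retrieval_report` is a hypothetical helper, the scores are assumed to be precomputed similarity scores (one per chunk), and the sample chunks and numbers are invented.

```python
def retrieval_report(
    chunks: list[dict],
    scores: list[float],
    top_k: int = 3,
    preview_chars: int = 80,
) -> list[str]:
    """Build one line per top-ranked chunk: rank, score, source file, text preview."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)[:top_k]
    lines = []
    for rank, (score, chunk) in enumerate(ranked, start=1):
        preview = " ".join(chunk["text"].split())[:preview_chars]  # collapse whitespace
        lines.append(f"{rank}. {score:.3f} [{chunk['source']}] {preview}")
    return lines


# Invented chunks and scores, for illustration only.
sample_chunks = [
    {"text": "Expense claims need a receipt.", "source": "expenses.md"},
    {"text": "Vacation days\nby level.", "source": "benefits.md"},
]
report = retrieval_report(sample_chunks, scores=[0.12, 0.91], top_k=2)
print("\n".join(report))
```

If the right chunk is missing from this list, the fix is in chunking or retrieval. If it's there and the answer is still wrong, look at what else the answer needed.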
We'll diagnose and fix this properly in the next post.
Key takeaways
If you're just getting started: RAG is how you get AI to answer questions about your documents without pasting everything into the chat. It searches first, then answers from what it finds. Three steps: chunk, embed, retrieve. The model never sees the full document. Just the pieces that match your question.
If you're more on the builder side: RAG is a pipeline with independently tunable steps. Chunking strategy, embedding model, retrieval method, and generation model each affect quality on their own. Also worth noting: different models for different jobs in the same pipeline. Titan Embeddings for search (fast, cheap). Claude for generation (smart, conversational). You'll see this pattern everywhere in AI systems.
What's next
So far, this works when retrieval finds the right piece. But what happens when the chunks are too small and the answer gets cut in half? What if the question needs information scattered across multiple sections, so retrieval succeeds but the answer still fails?
Next post, we break this thing on purpose. Then we fix it. And I'll walk through the full toolkit of strategies that make retrieval actually reliable.
Ride along.
This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.








