Rohini Gaonkar for AWS

Posted on Jun 11 • Edited on Jun 15

How to make AI answer questions about your documents

#ai #aws #tutorial #rag

In the previous post, we talked about context windows. The model has a fixed-size desk and everything has to fit on it at once. When too much is on the desk, things in the middle get missed.

I ended that post with a promise: what if there was a way to give the model just the right piece, at the right time, from a document you've never even pasted in?

That's this post. We're giving the model a search system.

The problem: your document is too long

You have a 2000-page document. An employee handbook, a product manual, internal documentation. You need one specific answer from it.

You can't paste the whole thing into the model's context window. And even if you found a model with a window big enough, we learned what happens: attention degrades, things in the middle get missed, and the model answers confidently from the wrong section.

So you need something different. A step that happens before the model sees anything. Something that finds the 2-3 paragraphs that actually answer your question, and passes only those to the model.

That's retrieval. The full technique is called RAG: Retrieval-Augmented Generation. Search first, then generate.

Retrieval-Augmented Generation

Let's break the name down. Each word is a step.

Retrieval.
Go find relevant information. Think of it like checking the index of a textbook before diving into a chapter. You don't re-read the whole book. You find the right page first.

Augmented.
Add that retrieved info to the prompt. You're supplementing the model's built-in knowledge with fresh, specific context. Like handing someone a cheat sheet right before they answer a question.

Generation.
The model writes its response, but with the retrieved context sitting right there in the conversation. It generates an answer grounded in your actual data, not just its training. "Grounded" means the model has real evidence to point to. It's not guessing from memory. It's answering from something you gave it.

The whole loop in one sentence: find the right chunks of information, stuff them into the prompt, let the model answer using that context. That's it. That's RAG.
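As a rough sketch of that loop (not the code from this post), the three steps can be wired together with two pluggable functions. The names `rag_answer`, `retrieve_fn`, and `generate_fn` are made up for illustration, and the stubs stand in for a real search step and a real model call.

```python
from typing import Callable


def rag_answer(
    question: str,
    retrieve_fn: Callable[[str], list[str]],
    generate_fn: Callable[[str], str],
) -> str:
    """Retrieve -> augment -> generate, with both outer steps injected."""
    # Retrieval: find the few chunks that look relevant to the question
    chunks = retrieve_fn(question)

    # Augmented: put the retrieved text into the prompt
    context = "\n\n".join(chunks)
    prompt = (
        "Use ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Generation: the model answers with that context in front of it
    return generate_fn(prompt)


# Stubs for illustration only: a hard-coded "search" and a fake "model"
def fake_retrieve(question: str) -> list[str]:
    return ["The standard vacation policy grants 15 days per year."]


def fake_generate(prompt: str) -> str:
    return f"(model reply based on a {len(prompt)}-character prompt)"


answer = rag_answer("How many vacation days do I get?", fake_retrieve, fake_generate)
print(answer)
```

Swapping the stubs for a real retriever and a real model call leaves the middle step unchanged, which is the point of the pattern.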

And if you're thinking "wait, isn't this just enterprise search?" you're not wrong.

Enterprise search tools like Elasticsearch, Kendra, and SharePoint search have been finding relevant passages in documents for years. The retrieval part isn't new. What's new is the last step: instead of showing you a results page to read for yourself, a foundation model reads the evidence and writes the answer.

To put it simply, RAG is enterprise search with a language model at the end of the pipeline.

The setup: onboarding docs for a fictional company

Imagine you just joined a new company and on the first day they hand you a bunch of documents. Employee handbook, benefits guide, leave policy, expense rules, engineering onboarding, IT security. Six documents with thousands of lines. All the answers are in there somewhere, but you'd have to read all of them to find what you need.

I've got a fictional company here, PineRidge Solutions. These are their onboarding docs.

The goal: I type a question like "how many vacation days do I get?" or "what's the parental leave top-up?" and the system finds the right section and answers from it.

I'm building this in Kiro IDE, and for the models, I'm using Amazon Bedrock, the same tool we've been using for the last four posts. Except now, instead of the Playground in AWS Console, I'm calling it through my code.

Please note, I'm using Bedrock here, but this same pattern works with any embeddings model, locally or in the cloud. Ollama locally, OpenAI, Cohere, whatever. The pipeline is the same. The model is just a plug.

All the code mentioned in this post is available in my GitHub repo here.

Three steps to build. Chunk, embed, retrieve. Let's go.

Step 1: Chunk the document

Before anyone can search these documents, they need to be broken into smaller pieces. Chunks. Usually a few paragraphs each.

Why? Because the goal is to return just the relevant section, not everything. If I keep each document as one giant block, the search will return entire files when I only need a paragraph.

How you split matters. Too large, and you're back to the "too much context" problem. Too small, and you might cut an answer in half.

Let's take a simple example.

Say the leave policy has three sentences: "The standard vacation policy grants 15 days per year. However, employees in their first year receive only 10 days. These days do not carry over into the next calendar year."

If I chunk without overlap, I might split after the second sentence. The next chunk starts with "These days do not carry over into the next calendar year."

Now if someone asks "do my vacation days carry over?" the system retrieves that chunk. It answers "these days do not carry over." But which days? The standard 15? The first-year 10? The word "these" has lost its referent. The chunk is meaningless on its own.

With overlap, the last sentence of chunk one repeats at the start of chunk two. Both chunks make sense independently.

Here's the code:

import os

def chunk_docs_paragraph(folder: str) -> list[dict]:
    """Paragraph-based chunking with 1 paragraph of overlap."""
    chunks = []

    for filename in sorted(os.listdir(folder)):
        if not filename.endswith(".md"):
            continue

        with open(os.path.join(folder, filename), "r", encoding="utf-8") as f:
            text = f.read()

        # Split document into paragraphs (separated by blank lines)
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

        for i in range(len(paragraphs)):
            # Include 1 paragraph of overlap for context continuity
            start = max(0, i - 1)  
            chunk_text = "\n\n".join(paragraphs[start : i + 1])

            # Store the chunk text and which file it came from (for citations)
            chunks.append({"text": chunk_text, "source": filename})

    return chunks

The function loops through every markdown file in the folder, reads it, and splits on blank lines to get paragraphs. Then for each paragraph, it includes one paragraph of overlap, the one before it, so nothing gets lost at the boundary. Each chunk gets stored with the text and which file it came from, so later I know where the answer originated.

From six onboarding documents, I get about 150 chunks. Each one is roughly a paragraph or two. A self-contained piece of text.
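To see the overlap effect on the three-sentence vacation example from earlier, here is a toy sentence-level version. It is a sketch for illustration, not the chunker used in this post: `chunk_sentences` and its `size` and `overlap` parameters are hypothetical names.

```python
def chunk_sentences(sentences: list[str], size: int = 2, overlap: int = 0) -> list[str]:
    """Group sentences into chunks of `size`, repeating `overlap` sentences."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start : start + size]))
        if start + size >= len(sentences):
            break  # the last sentence is already covered
    return chunks


policy = [
    "The standard vacation policy grants 15 days per year.",
    "However, employees in their first year receive only 10 days.",
    "These days do not carry over into the next calendar year.",
]

no_overlap = chunk_sentences(policy, size=2, overlap=0)
with_overlap = chunk_sentences(policy, size=2, overlap=1)

print(no_overlap[-1])    # the carry-over sentence on its own
print(with_overlap[-1])  # the carry-over sentence plus the one before it
```

The cost of overlap is duplication: the same sentence is stored and embedded more than once, so the chunk count grows.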

Step one done. Now I need to make these searchable.

Step 2: Turn chunks into embeddings

Here's the concept that makes the whole thing work. Each chunk gets turned into a set of numbers called an embedding.

The name is a literal mathematical term. You're taking text and placing it into a space made of numbers. In that space, distance has meaning. Two chunks about similar things end up close together. Two chunks about different topics end up far apart.

"Parental leave top-up" and "salary during maternity leave" would be near each other numerically, even though the actual words are completely different. That's what makes this useful: an embedding captures meaning, not exact words.

Think of it like a library's index card system. The card doesn't contain the whole book. It captures enough about the content to help you find the right book when someone asks.
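To make "distance has meaning" concrete, here is a toy example with hand-made 2-D coordinates. The numbers are invented purely for illustration. Real embeddings come from a model and have far more dimensions. `toy_vectors` and `nearest` are hypothetical names.

```python
import math

# Hand-made 2-D "embeddings", for illustration only.
toy_vectors = {
    "parental leave top-up": (0.9, 0.1),
    "salary during maternity leave": (0.8, 0.2),
    "laptop password rules": (0.1, 0.9),
}


def nearest(label: str) -> str:
    """Return the other label whose point is closest to `label`'s point."""
    origin = toy_vectors[label]
    others = [name for name in toy_vectors if name != label]
    # math.dist is the straight-line (Euclidean) distance between two points
    return min(others, key=lambda name: math.dist(origin, toy_vectors[name]))


print(nearest("parental leave top-up"))
```

The two leave-related phrases share no keywords, but their points sit next to each other, so a nearest-neighbour lookup pairs them up.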

A specialised model called an embeddings model does this conversion for us. It's not the same model that generates your answer. It's a different model for a different job. The embeddings model is small and fast. It turns text into searchable numbers.

import boto3, json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list[float]:
    """Call Titan Embeddings V2 to get a 1024-dim vector."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    result = json.loads(response["body"].read())
    return result["embedding"]

Each chunk now has a numerical fingerprint. That's my searchable index.

Now you'll hear the term "vector" a lot. It just means an ordered list of numbers. Think of it as coordinates: a point in that number space, with a direction from the origin.

An embedding is the concept, a vector is the format it's stored in.

Right now these vectors are sitting in a Python list on my laptop. If I close this script, they're gone. For this demo, I'm caching them to a local file so I don't re-embed every time I run the script. But for a production system with thousands of documents, you'd store them somewhere proper. AWS recently launched Amazon S3 Vectors, which is literally what it sounds like: S3 built for storing and searching vectors natively. There's also OpenSearch Serverless, pgvector if you want Postgres, or Amazon Bedrock Knowledge Bases which handles the whole pipeline as a managed service.
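A minimal sketch of that kind of local cache, assuming a plain JSON file is good enough for a demo. This is not the caching code from the repo: `load_or_embed`, the file naming, and the hash-based cache key are my own assumptions, and `fake_embed` stands in for a real embeddings call.

```python
import hashlib
import json
import os
import tempfile
from typing import Callable


def load_or_embed(
    texts: list[str],
    embed_fn: Callable[[str], list[float]],
    cache_dir: str = ".",
) -> list[list[float]]:
    """Return one vector per text, reusing a JSON cache file when one exists."""
    # Key the cache on the chunk texts so edited documents get re-embedded
    digest = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()[:16]
    cache_path = os.path.join(cache_dir, f"embeddings-{digest}.json")

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return json.load(f)

    vectors = [embed_fn(t) for t in texts]
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(vectors, f)
    return vectors


# Demo with a fake embedder so nothing is sent to a real model
calls = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_embed(["a", "bb"], fake_embed, tmp)
    second = load_or_embed(["a", "bb"], fake_embed, tmp)  # served from the cache file

print(len(calls))  # counts how many times the fake embedder actually ran
```

In the pipeline from this post, the call would look something like `load_or_embed([c["text"] for c in chunks], embed_text)`.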

Step two done. Now, the search.

Step 3: Retrieve and Generate

Someone asks a question. The question gets embedded with the same model. Same kind of numbers. Then we compare the question's numbers against all the chunk numbers. The closest matches are my search results.

This is semantic search. It matches by meaning, not by exact words.

If the handbook says "remote work policy" and I ask about "working from home rules," it catches the match because the meaning is close.

import numpy as np

def retrieve(question: str, chunks: list[dict], embeddings: np.ndarray, top_k: int = 3):
    """Find the top-K most relevant chunks via cosine similarity."""

    # Embed the question into the same vector space as our chunks
    q_vec = np.array(embed_text(question))

    # Compare question vector against every chunk vector
    scores = []
    for i in range(len(chunks)):

        # Cosine similarity = dot product / (magnitude_a * magnitude_b)
        score = np.dot(q_vec, embeddings[i]) / (
            np.linalg.norm(q_vec) * np.linalg.norm(embeddings[i])
        )
        scores.append(score)

    # Sort by score descending, take top K
    top_indices = np.argsort(scores)[::-1][:top_k]

    return [chunks[idx] for idx in top_indices]

The retrieve function takes the question and embeds it with the same Titan model, so it lands in the same number space as the chunks. Then it compares the question's vector against every chunk's vector using cosine similarity, which measures how closely two vectors point in the same direction. A score of 1 means they point the same way, 0 means unrelated, and a negative score means they point in opposite directions. It sorts by score and returns the top 3.
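To get a feel for the score itself, here is cosine similarity on a few tiny hand-picked vectors, written in plain Python. It is a sketch for intuition only, and it assumes neither vector is all zeros (that would divide by zero).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: length does not affect the score
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# At right angles: nothing in common
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Pointing in opposite directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 3), round(perpendicular, 3), round(opposite, 3))
```

Because the score ignores length, a short question and a long chunk can still match closely as long as they point the same way.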

The top 3 chunks are my evidence. Now I pass them to a generation model alongside the question. Titan did the embeddings. Claude does the answering.

def generate_answer(question: str, retrieved: list[dict]) -> str:
    """Pass retrieved chunks + question to Claude."""

    # Format retrieved chunks with their source for traceability
    context = "\n\n---\n\n".join(
        f"[Source: {r['source']}]\n{r['text']}" for r in retrieved
    )

    # System-style instruction followed by context and question
    prompt = (
        f"You are answering questions about PineRidge Solutions' company policies. "
        f"Use ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # Call Claude via Bedrock's Converse API
    response = bedrock.converse(
        modelId="us.anthropic.claude-haiku-4-5-20251001-v1:0",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

The function generate_answer. It takes the retrieved chunks, labels each one with which file it came from, and builds a prompt. The prompt tells Claude: "You're answering questions about PineRidge company policies. Use ONLY the context below. If the answer isn't there, say so." Then it passes the context and the question to Claude via Bedrock's Converse API and returns the response.
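One possible variation, sketched here as an assumption rather than what the repo does: the Converse API also accepts a separate `system` field and an `inferenceConfig`, so the instruction can live outside the user message. Building the request as a plain dict also makes it easy to inspect before sending. `build_converse_request`, the `maxTokens` and `temperature` values, and the sample file name are illustrative choices.

```python
def build_converse_request(question: str, retrieved: list[dict], model_id: str) -> dict:
    """Build keyword arguments for bedrock.converse(...) without calling it."""
    # Same source-labelled context format as generate_answer
    context = "\n\n---\n\n".join(
        f"[Source: {r['source']}]\n{r['text']}" for r in retrieved
    )
    return {
        "modelId": model_id,
        # Instruction goes in the system field instead of the user message
        "system": [
            {
                "text": (
                    "You are answering questions about PineRidge Solutions' "
                    "company policies. Use ONLY the context provided. "
                    "If the answer isn't there, say so."
                )
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": [{"text": f"Context:\n{context}\n\nQuestion: {question}"}],
            }
        ],
        # Assumed values: a short, low-randomness answer
        "inferenceConfig": {"maxTokens": 512, "temperature": 0.0},
    }


# Sample chunk for the demo; the file name is hypothetical
sample_chunks = [
    {
        "source": "leave-policy.md",
        "text": "The standard vacation policy grants 15 days per year.",
    }
]
request = build_converse_request(
    "How many vacation days do I get?",
    sample_chunks,
    "us.anthropic.claude-haiku-4-5-20251001-v1:0",
)
print(request["messages"][0]["content"][0]["text"])

# With a Bedrock client in hand, the call would be:
# response = bedrock.converse(**request)
```

Printing the user message before sending it is a cheap way to confirm exactly which chunks the model is about to see.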

I asked: "What's the RRSP matching policy?"

The system retrieved the right section from the benefits guide. The answer came back grounded in the actual policy document: dollar-for-dollar match up to 5% of base salary, starts after 90 days, vesting schedule. Not from the model's training data, from the company's files. And I can see exactly which chunks were used to build that answer. That's my citation. I can point to the source.

The full pipeline. Chunk, embed, retrieve, generate. Running on my laptop. About 60 lines of Python. And it works.

Where it breaks: a quick preview

So this works great when retrieval finds the right piece. But watch this.

I asked: "How many vacation days do I get as a senior engineer?" Retrieval actually works. It finds the vacation table from the benefits guide. But the model says "I don't know which level a senior engineer is." The right information was retrieved, but the answer needed two pieces of context that aren't in the same chunk: what level maps to "senior engineer," and how many days that level gets.

That's the kind of thing that breaks. Retrieval succeeded, but the answer still failed. The model wasn't hallucinating. It was honest about what it couldn't determine from the evidence it had.

This is not a hallucination in the way we talked about in the hallucinations post. The model didn't invent something from nothing. It was given real text from the real document. But the retrieved chunks didn't contain everything needed to answer the question.

When a RAG system gives you a bad answer, the question to ask is: "what chunk did it retrieve?" Not "why is the model wrong?"
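A small helper can make that question easy to answer. This sketch assumes retrieval has been changed to return `(score, chunk)` pairs instead of bare chunks. The function name `format_retrieval_report`, the scores, and the sample text are all invented for the demo.

```python
def format_retrieval_report(
    question: str,
    scored_chunks: list[tuple[float, dict]],
    preview_chars: int = 80,
) -> str:
    """Summarise what was retrieved: rank, score, source file, and a preview."""
    lines = [f"Question: {question}"]
    for rank, (score, chunk) in enumerate(scored_chunks, start=1):
        # Flatten newlines so each chunk stays on one line of the report
        preview = chunk["text"][:preview_chars].replace("\n", " ")
        lines.append(f"{rank}. score={score:.3f} source={chunk['source']} | {preview}")
    return "\n".join(lines)


# Invented scores and chunk text, for illustration only
demo_results = [
    (0.71, {"source": "benefits-guide.md", "text": "Vacation days by level:\nL1: ..."}),
    (0.55, {"source": "leave-policy.md", "text": "These days do not carry over."}),
]

report = format_retrieval_report(
    "How many vacation days do I get as a senior engineer?", demo_results
)
print(report)
```

If the chunk that holds the missing piece never shows up in a report like this, the fix belongs in chunking or retrieval, not in the prompt.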

We'll diagnose and fix this properly in the next post.

Key takeaways

If you're just getting started: RAG is how you get AI to answer questions about your documents without pasting everything into the chat. It searches first, then answers from what it finds. Three steps: chunk, embed, retrieve. The model never sees the full document. Just the pieces that match your question.

If you're more on the builder side: RAG is a pipeline with independently tunable steps. Chunking strategy, embedding model, retrieval method, and generation model each affect quality on their own. Also worth noting: different models for different jobs in the same pipeline. Titan Embeddings for search (fast, cheap). Claude for generation (smart, conversational). You'll see this pattern everywhere in AI systems.

What's next

So this works great when retrieval finds the right piece. But what happens when the chunks are too small and the answer gets cut in half? What if the question needs information scattered across multiple sections? What if retrieval succeeds but the answer still fails because context is split across chunks?

Next post, we break this thing on purpose. Then we fix it. And I'll walk through the full toolkit of strategies that make retrieval actually reliable.

Ride along.

This post is part of the "Learning AI Out Loud" series, a cloud architect learning AI from first principles.


Top comments (6)

Rohini Gaonkar AWS • Jun 11

What's your chunking strategy? Paragraphs, sentences, fixed token count? Curious what's working for people.

Vinoth kumar • Jun 16

I recently found chonkie, which is optimised for chunking text in different formats. For long context, the Late chunker with a chunk size of 256 and a 50-token overlap works great. For general-purpose use, the Text chunker with a smaller standard chunk size is recommended. I would love to hear more about chunking strategy, because it directly affects how retrieval and answering are done.

Rohini Gaonkar AWS • Jun 16

Thank you for sharing about chonkie! You're right that chunking strategy directly shapes retrieval quality. That "these days" problem I showed, where a naive split loses the referent, is exactly the kind of thing smarter chunking strategies solve.

I'm actually covering this deeper in my next post, including when to use heading-based, semantic, and parent-child chunking. Stay tuned.