Navas Herbert

Posted on Jul 1

Day 2 of 3: We Added an LLM and the Project Became a Real Product

#ai #rag #python #fastapi

I opened Day 2 with the same diagram I drew on Day 1.

Question → [embed] → [search vectors] → top-k chunks → [stuff into prompt] → LLM → Answer + sources

I pointed at the left half - everything up to and including top-k chunks. "Yesterday we built all of this. We proved retrieval works, with no LLM involved. Distance numbers, real chunks, real search."

Then I pointed at the right half - [stuff into prompt] → LLM → Answer + sources.

"Today is this. Two new ideas, one new file. By the end of this session, your project stops being a search engine and becomes a product."

One New File: touch rag.py

Yesterday's project had three files. Today it gets one more.

That's it. One file. I wrote its job on the board before we touched it: "rag.py has exactly two jobs - build a prompt that forces the model to answer only from our chunks, and call an LLM with that prompt. It never touches ChromaDB. It never touches FastAPI. It just receives chunks and returns an answer."

The updated folder structure:

interdoc/
    |-- app/
        |---- ingestion.py
        |---- vectorstore.py
        |---- rag.py          ← new today
        |---- main.py

File 4: rag.py - The Heart of Day 2

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

I stopped immediately at line one. "Notice the API key never appears as text anywhere in our code. It lives in .env, which must be in .gitignore. If this project ever goes on GitHub, that key must never go with it."
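One way to make that rule harder to break is to fail at startup when the key is missing, rather than at the first API call. A minimal sketch; require_api_key is a hypothetical helper, not something we wrote in the session:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or raise with a clear message."""
    key = os.getenv(name)
    if not key:
        # Hypothetical guard: fail loudly at startup instead of at the first request.
        raise RuntimeError(f"{name} is not set. Add it to .env (and keep .env in .gitignore).")
    return key
```

In rag.py this could be called once, right after load_dotenv(), and its return value passed to OpenAI(api_key=...).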

build_prompt() - The Most Important Function of the Week

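def build_prompt(question: str, chunks: list[str], metadatas: list[dict]) -> str:
    blocks = []
    for i, (chunk, meta) in enumerate(zip(chunks, metadatas), start=1):
        blocks.append(f"[Source {i}: {meta['source']}, chunk {meta['chunk_index']}]\n{chunk}")
    context = "\n\n".join(blocks)
    # Instruction wording comes from the three lines quoted below; the Context/Question layout is an assumption.
    return (
        "Use the context below to answer the question.\n"
        "If context does not contain the answer, say so politely or say you don't have enough information - do not guess.\n"
        "Reference relevant source using [Source N].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )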

"Use the context below to answer the question" - curbs the model from reaching into its own training data and answering from memory.

"If context does not contain the answer, say so politely or say you don't have enough information - do not guess" - this is the safety valve against hallucination. I told the class to remember this line specifically. Every RAG system that gives confidently wrong answers is missing this instruction somewhere.

"Reference relevant source using [Source N]" - we are telling the model the exact citation format to use. We built the [Source 1: filename, chunk N] labels in the blocks loop above, so the model can actually reference them. The citation capability comes from us, not from the model.

One sentence in the prompt is the entire difference between a trustworthy system and a hallucinating one.
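To see what the model actually receives, here is the labeling loop from build_prompt() run on two made-up chunks. The helper name format_context, the file name, and the chunk text are all invented for illustration:

```python
def format_context(chunks: list[str], metadatas: list[dict]) -> str:
    """Label each chunk the same way the blocks loop in build_prompt() does."""
    blocks = []
    for i, (chunk, meta) in enumerate(zip(chunks, metadatas), start=1):
        blocks.append(f"[Source {i}: {meta['source']}, chunk {meta['chunk_index']}]\n{chunk}")
    return "\n\n".join(blocks)


# Invented sample data, not taken from the real document.
chunks = ["Small models can be fine-tuned cheaply.", "Evaluation used three benchmarks."]
metadatas = [
    {"source": "paper.pdf", "chunk_index": 4},
    {"source": "paper.pdf", "chunk_index": 11},
]
print(format_context(chunks, metadatas))
```

Each chunk arrives with its own [Source N: ...] header, which is what makes a citation like [Source 2] something the model can copy rather than invent.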

generate_answer() - Calling the Model

def generate_answer(question: str, search_results: dict) -> dict:
    chunks = search_results["documents"][0]
    metadatas = search_results["metadatas"][0]
    if not chunks:
        return {"answer": "No documents have been uploaded yet.", "sources": []}

    prompt = build_prompt(question, chunks, metadatas)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )

    answer_text = response.choices[0].message.content
    sources = [{"source": m["source"], "chunk_index": m["chunk_index"]} for m in metadatas]
    return {"answer": answer_text, "sources": sources}

search_results is exactly what search_chunks() from yesterday returns - we didn't change that function at all. Day 2 plugs into Day 1's output without touching it.
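The [0] indexing makes more sense once you see the shape of the results. ChromaDB's query() returns one inner list per query text, so a single question yields lists nested one level deep. A hand-written stand-in, assuming search_chunks() passes that dict through unchanged (all values are invented):

```python
# Hand-written stand-in for what search_chunks() returns for ONE question.
# Assumption: it is the raw ChromaDB query result, with one inner list per query.
search_results = {
    "documents": [["chunk text A", "chunk text B"]],
    "metadatas": [[
        {"source": "paper.pdf", "chunk_index": 4},
        {"source": "paper.pdf", "chunk_index": 11},
    ]],
    "distances": [[0.21, 0.37]],
}

# [0] selects the results for the first (and only) query.
chunks = search_results["documents"][0]
metadatas = search_results["metadatas"][0]

for chunk, meta in zip(chunks, metadatas):
    print(meta["source"], meta["chunk_index"], "->", chunk)
```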

The if not chunks guard at the top: "Always check before calling an expensive API. If nobody has uploaded anything yet, we return a clear message immediately instead of making an API call that would produce a useless response."

The sources list at the bottom: "Why do we return sources as a separate structured list instead of just leaving the citations inside the answer text? Because whoever builds a frontend on Day 3 can now display sources separately - highlight them, make them clickable, show the actual chunk text on hover. Buried inside an answer string, that's impossible."
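As a rough illustration of what the structured list buys a client, here is a sketch that renders the sources separately from the answer. render_response and the sample response are hypothetical, not part of the project:

```python
def render_response(response: dict) -> str:
    """Format an answer and its sources as separate sections of plain text."""
    lines = [response["answer"], "", "Sources:"]
    if not response["sources"]:
        lines.append("  (none)")
    for n, src in enumerate(response["sources"], start=1):
        lines.append(f"  {n}. {src['source']} (chunk {src['chunk_index']})")
    return "\n".join(lines)


# Invented response in the shape generate_answer() returns.
sample = {
    "answer": "The paper evaluates on three benchmarks [Source 1].",
    "sources": [{"source": "paper.pdf", "chunk_index": 11}],
}
print(render_response(sample))
```

No string parsing is needed to find the sources: the client just loops over a list.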

Adding /query to main.py

We didn't rewrite main.py. We added to it - keeping everything from Day 1 exactly as it was.

from pydantic import BaseModel
from app.vectorstore import search_chunks
from app.rag import generate_answer

class QueryRequest(BaseModel):
    question: str
    top_k: int = 4

@app.post("/query")
async def query_documents(request: QueryRequest):
    results = search_chunks(request.question, top_k=request.top_k)
    return generate_answer(request.question, results)

I pointed at the endpoint body. "Two lines of real logic. Retrieve, then generate. Every endpoint we've written follows the same shape: receive the request, delegate the work to the right specialist file, return the result. main.py still doesn't do any real work itself."

Someone asked why there's no response_model here the way the Grade Tracker endpoints had last week. Good question. What comes back from generate_answer() is a plain dict whose sources list varies in length, and we'd need a Pydantic model to enforce a strict shape. That's a Day 3 polish item, not today's concern.
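For reference, a sketch of what such a model might look like. The class names Source and QueryResponse are placeholders, not code we wrote in the session:

```python
from typing import List

from pydantic import BaseModel


class Source(BaseModel):
    source: str
    chunk_index: int


class QueryResponse(BaseModel):
    answer: str
    sources: List[Source]  # any length, including empty on the "no documents" path


# Validating a dict in the shape generate_answer() returns (sample values are invented).
resp = QueryResponse(**{
    "answer": "I don't have enough information to answer that.",
    "sources": [{"source": "paper.pdf", "chunk_index": 3}],
})
print(resp.sources[0].chunk_index)
```

The endpoint would then be declared as @app.post("/query", response_model=QueryResponse), and FastAPI would validate every response against it.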

First Test: Confirm /upload Still Works

Before touching /query, we re-uploaded the document.

chunks_created: 34 - small_language_research.pdf, 34 chunks stored and embedded. Day 1's pipeline still running clean. This step is not optional: if ChromaDB has no vectors to search, /query returns empty results and the LLM produces a confused non-answer. Upload first, always.

The Moment of the Session: We Tried to Break It

Now the part I look forward to most in every AI session - deliberately trying to make the system fail.

We opened /docs, expanded POST /query, clicked Try it out, and typed a question I knew was nowhere in the uploaded document:

{
  "question": "Where is company Datapulse ai headquarted?",
  "top_k": 3
}

The document was a language research paper. There is no "Datapulse AI" in it. No headquarters. Nothing even close.

The response:

I read this out loud and said: "Look at what just happened. Retrieval found the three chunks it considered most relevant to that question - and they're all from the language research paper, which has nothing to do with Datapulse. The LLM received those chunks, saw they contained no useful information, and said exactly that. It did not make something up. It refused."

Then the line I want everyone to remember: "When the model correctly refuses to answer something it doesn't know - that is not the system being unhelpful. That is the system working exactly as designed. A confident wrong answer would have been the actual failure."

The sources array is visible in the response even on a refusal. That's correct - the system is being transparent about what it searched, even when what it found wasn't enough to answer. On Day 3, when students build frontends, they can display those sources as "searched but insufficient" - which is genuinely more trustworthy than silence.

We ran two more stress tests before closing:

Asking something vague - a fuzzy, poorly phrased question about the same document. The answer came back, but it was less precise. We talked about why: retrieval quality depends on how closely the phrasing of the question matches the wording of the relevant chunk. Better questions get better retrieval. Prompt quality matters on both ends.
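A toy illustration of that closeness, using hand-made 3-dimensional vectors in place of real embeddings. Cosine distance is one common metric; whether the Day 1 collection uses it is an assumption here:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors: a real embedding model produces hundreds of dimensions.
chunk_vec = [0.9, 0.1, 0.0]
precise_question = [0.8, 0.2, 0.0]  # phrased close to the chunk's wording
vague_question = [0.4, 0.4, 0.6]    # fuzzier phrasing, drifts away from the chunk

print("precise:", round(cosine_distance(precise_question, chunk_vec), 3))
print("vague:  ", round(cosine_distance(vague_question, chunk_vec), 3))
```

The precise question lands much closer to the chunk than the vague one, which is the gap we saw in the answers.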

Uploading two documents on different topics, then asking a question only answerable from one - the sources in the response correctly pointed to the right file, not the irrelevant one. Vectors from separate documents live in the same space, and the closest ones win regardless of which file they came from.
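The same point as a sketch: chunks from two files sit in one list, and top-k is just the k smallest distances. The vectors and file names are invented, and math.dist (Euclidean distance) stands in for whatever metric the vector store actually uses:

```python
import math

# Invented 2-D "embeddings" for chunks from two unrelated files.
stored = [
    {"source": "language_paper.pdf", "chunk_index": 0, "vec": [0.9, 0.1]},
    {"source": "language_paper.pdf", "chunk_index": 1, "vec": [0.8, 0.3]},
    {"source": "cooking_notes.txt", "chunk_index": 0, "vec": [0.1, 0.9]},
    {"source": "cooking_notes.txt", "chunk_index": 1, "vec": [0.2, 0.8]},
]


def top_k(question_vec: list[float], k: int) -> list[dict]:
    """Return the k stored chunks closest to the question, ignoring which file they came from."""
    return sorted(stored, key=lambda c: math.dist(question_vec, c["vec"]))[:k]


# A question whose vector lands near the cooking chunks.
for hit in top_k([0.12, 0.9], k=2):
    print(hit["source"], hit["chunk_index"])
```

Nothing in top_k filters by file name; the right document wins purely because its chunks are nearest.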

What Exists Now

I said this slowly at the end of the session:

"Upload any PDF or text file. Ask any question in plain English. The system embeds your question, searches the stored vectors for the most relevant chunks, builds a grounded prompt from those chunks, calls an LLM with strict instructions to answer only from what it was given, and returns an answer with citations pointing to the exact source chunks. If the answer isn't in the documents, it says so. That's a real product."

One new file. rag.py. That's all today took to cross the line from "interesting search engine" to "document intelligence API."

I'm an AI engineer in Nairobi.
Follow along or drop your questions in the comments.

DEV Community