How I Built an AI-Powered Q&A System
Have you ever wished you could ask specific questions about a travel destination and get accurate, sourced answers? That's precisely what I set out to build, and in this article I'll walk you through creating a Retrieval-Augmented Generation (RAG) system for Kenya's tourism industry.
The Problem: AI That Makes Things Up
Large Language Models (LLMs) are impressive, but they have a fatal flaw: they confidently generate information that sounds right but might be completely wrong. Ask ChatGPT about the best time to visit the Maasai Mara, and it might give you a reasonable answer, or it might hallucinate facts about wildebeest migration patterns.
This is where RAG comes in. Instead of relying on what the AI "thinks" it knows, we give it a library of trusted documents and teach it to search through them before answering. Think of it as moving from a student who wings their exam to one who brings a cheat sheet with verified facts.
What We're Building
Our system ingests PDF documents about Kenyan tourism destinations (Maasai Mara, Mombasa, Mount Kenya, etc.) and provides a REST API where users can ask questions like the following:
- "What wildlife can I see at Maasai Mara?"
- "What are the best beaches in Mombasa?"
- "How difficult is it to climb Mount Kenya?"
The system will:
- Search through the PDF documents for relevant information
- Extract the most pertinent passages
- Use an LLM to generate a natural language answer based only on those passages
- Return the sources so users can verify the information
The Tech Stack
Here's what we're using and why:
- FastAPI: Lightning-fast Python web framework, perfect for building APIs
- Sentence Transformers: Converts text to embeddings (fancy math that makes similar text have similar numbers)
- ChromaDB: Stores and searches through those embeddings efficiently
- Groq: Blazingly fast LLM inference (seriously, it's ridiculously fast)
- pypdf: Extracts text from PDF documents
Architecture: The 30,000-Foot View
```
PDFs → Text Extraction → Chunking → Embeddings → Vector Database
                                                       ↓
User Query → Embedding → Similarity Search → Context → LLM → Answer
```
We have two main pipelines:
- Ingestion Pipeline (run once): Takes PDFs, breaks them into chunks, converts chunks to vectors, stores in a database.
- Query Pipeline (run every query): Takes question, converts to vector, finds similar chunks, sends to LLM for an answer.
Step 1: Document Ingestion — Teaching the System to Read
Let's start with the ingestion script. This is where the magic of preparing our knowledge base happens.
Extracting Text from PDFs
```python
from pypdf import PdfReader

def extract_text(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text + "\n"
    return text
```
Simple enough: we read each page and concatenate the text. But PDFs are notoriously tricky. Some have scanned images (which need OCR), some have weird encodings, and some have tables that don't extract well. For this project, I assumed clean, text-based PDFs. In production, you'd want more robust error handling.
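As one small example of that kind of robustness, here's a sketch of a page-cleaning helper (a hypothetical addition, not part of the project's code) that drops empty pages and scrubs common extraction junk before the text reaches the chunker:

```python
import re

def clean_page_text(raw):
    """Normalize text extracted from a single PDF page.

    Hypothetical helper: handles None/empty pages, removes soft hyphens
    left over from line wrapping, and collapses runs of whitespace.
    """
    if not raw:
        return ""
    text = raw.replace("\u00ad", "")        # strip soft hyphens
    text = re.sub(r"[ \t]+", " ", text)     # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()
```

You'd call this on each `page.extract_text()` result inside the loop; scanned PDFs would still need an OCR pass, which is out of scope here.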
The Chunking Strategy: Why Size Matters
```python
def chunk_text(text, size=300):
    words = text.split()
    chunks = []
    for i in range(0, len(words), size):
        chunks.append(" ".join(words[i:i+size]))
    return chunks
```
Why chunk at all? LLMs have context windows, and we can't feed them entire books. More importantly, smaller chunks mean more precise retrieval. If your document chunk is an entire chapter about Mombasa and someone asks about beaches, you'll retrieve all of Mombasa's beaches, hotels, restaurants and history. That's too much noise.
I chose 300 words per chunk through experimentation. Too small (100 words) and you lose context. Too large (1000 words) and your retrieval becomes imprecise.
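One refinement worth knowing about (not used in this project) is overlapping chunks: if a key sentence straddles a chunk boundary, a sliding window guarantees it appears whole in at least one chunk. A minimal sketch, assuming word-based chunking as above:

```python
def chunk_text_overlap(text, size=300, overlap=50):
    """Split text into word chunks where consecutive chunks share
    `overlap` words. Requires size > overlap. A variant of chunk_text,
    not part of the article's original code."""
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunk = " ".join(words[i:i + size])
        if chunk:
            chunks.append(chunk)
        if i + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

The cost is some duplicated storage and embedding work, which is why a simple non-overlapping split is a reasonable starting point.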
Embeddings
```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def normalize_embeddings(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return (embeddings / norms).tolist()
```
Here's where things get interesting. Embeddings convert text into high dimensional vectors (arrays of numbers). Similar text gets similar vectors. "The lion roared" and "The big cat made a loud sound" will have vectors that are close together in this mathematical space.
I chose BAAI/bge-small-en-v1.5 because:
- It's small (~33M parameters), so inference is fast
- It's good at semantic search tasks
- It's actively maintained and well documented
The normalization step is crucial. It converts vectors to unit length, which makes cosine similarity (how ChromaDB compares vectors) equivalent to the dot product, and the dot product is faster to compute.
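You can verify that equivalence with toy 2-D vectors standing in for real embeddings:

```python
import numpy as np

def normalize_embeddings(embeddings):
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms

# Toy vectors in place of real 384-dimensional embeddings.
vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_embeddings(vecs)

# After normalization every vector has length 1 ...
assert np.allclose(np.linalg.norm(unit, axis=1), 1.0)

# ... so cosine similarity reduces to a plain dot product.
cos = vecs[0] @ vecs[1] / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
dot = unit[0] @ unit[1]
assert np.isclose(cos, dot)
```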
Storing Everything in ChromaDB
```python
import chromadb

client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(
    name="travel_and_tourism",
    metadata={"description": "Multi PDF Tourism documents"}
)

collection.add(
    documents=all_chunks,
    embeddings=all_embeddings,
    ids=all_ids,
    metadatas=all_metadatas
)
```
ChromaDB is a vector database designed for this exact use case. It:
- Stores embeddings efficiently
- Provides fast similarity search
- Persists data to disk
- Has a simple Python API
The PersistentClient means our vectors survive restarts. We don't have to re-embed all our documents every time we start the server.
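The snippets above reference `all_chunks`, `all_ids`, and `all_metadatas` without showing how they're built. Here's a hedged sketch of that glue (the loop isn't in the article's snippets; `build_ingest_batch` and `docs` are names I'm assuming). `all_embeddings` would come from `model.encode(all_chunks)`, omitted here since it needs the model loaded:

```python
def build_ingest_batch(docs, chunker):
    """Build the parallel lists that collection.add expects.

    `docs` is a list of (filename, full_text) pairs and `chunker` is a
    function like chunk_text. IDs must be unique across the whole
    collection, so each one combines the filename with the chunk index.
    """
    all_chunks, all_ids, all_metadatas = [], [], []
    for filename, text in docs:
        for i, chunk in enumerate(chunker(text)):
            all_chunks.append(chunk)
            all_ids.append(f"{filename}-{i}")
            all_metadatas.append({"source": filename})
    return all_chunks, all_ids, all_metadatas
```

Storing the filename in the metadata is what makes source attribution possible later in the query pipeline.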
Step 2: The Query Pipeline
Now for the fun part: answering questions.
Converting Questions to Vectors
```python
def ask(question: str):
    query_embedding = model.encode([question])
    query_embedding = normalize_embeddings(query_embedding)
```
We use the same embedding model we used for documents. This is critical. If you embed documents with Model A and queries with Model B, the vector spaces won't align.
Similarity Search
```python
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3
)
docs = results["documents"][0]
metadatas = results["metadatas"][0]
```
ChromaDB finds the 3 most similar document chunks to our query. How does it know what's similar? It computes the distance between the query vector and every document vector, then returns the closest ones.
Why 3? Another Goldilocks number. Too few (1) and you might miss important context. Too many (10) and you'll include irrelevant information that confuses the LLM. I tested several values and found 3 provided the best balance.
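A further refinement (not in this project) is to filter by distance rather than trusting a fixed top-3: ChromaDB can also return per-result distances, and for an off-topic question even the closest chunks may be poor matches. A sketch, where the 0.6 cutoff is purely illustrative and the right value depends on your collection's distance metric:

```python
def filter_by_distance(docs, distances, max_distance=0.6):
    """Keep only chunks whose distance to the query is below a cutoff.

    Lets the system answer "I don't know" on off-topic questions
    instead of feeding the LLM weak matches. Hypothetical refinement;
    smaller distance means a closer match.
    """
    return [d for d, dist in zip(docs, distances) if dist < max_distance]
```

If the filtered list comes back empty, you can short-circuit and return a "no relevant information found" response without ever calling the LLM.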
The LLM
```python
import os
from groq import Groq

groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))

context = "\n\n".join(docs)
response = groq_client.chat.completions.create(
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    messages=[
        {"role": "system", "content": "Answer only using provided context"},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{question}"}
    ],
    temperature=0
)
answer = response.choices[0].message.content
```
This is where RAG shines. We give the LLM:
- A system instruction: "Only use the provided context" (reducing hallucinations)
- The retrieved context
- The user's question
The temperature=0 setting makes the model deterministic; the same input always produces the same output. This is crucial for reliability.
Why Groq? Speed. Seriously, it's fast. What takes OpenAI 3-4 seconds, Groq does in under a second. For user-facing applications, this matters.
Source Attribution
```python
sources = list({meta["source"] for meta in metadatas})
return answer, sources
```
We return the source PDFs used to generate the answer. This serves two purposes:
- Users can verify the information
- It builds trust in the system
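One small wrinkle: a set comprehension deduplicates but returns sources in arbitrary order. An order-preserving variant (a suggested tweak, not the article's code) keeps the most relevant source first, since ChromaDB returns chunks in similarity order:

```python
def unique_sources(metadatas):
    """Deduplicate source filenames while preserving retrieval order.

    dict.fromkeys keeps first-seen order, so the source of the
    best-matching chunk stays at the front of the list.
    """
    return list(dict.fromkeys(meta["source"] for meta in metadatas))
```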
Step 3: The FastAPI Layer
```python
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Travel and Tourism")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.post("/ask", response_model=QuestionResponse)
def ask_question(request: QuestionRequest):
    try:
        answer, sources = ask(request.question)
        return QuestionResponse(answer=answer, sources=sources)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
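The endpoint references `QuestionRequest` and `QuestionResponse` models that aren't shown in the snippets. A minimal version, with field names inferred from how they're used above, might look like:

```python
from pydantic import BaseModel

class QuestionRequest(BaseModel):
    question: str

class QuestionResponse(BaseModel):
    answer: str
    sources: list[str]
```

FastAPI uses these to validate incoming JSON (a request without a `question` string is rejected with a 422 before your handler runs) and to shape the response documented at `/docs`.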
FastAPI gives us:
- Automatic API documentation (visit `/docs` to see it)
- Request validation via Pydantic models
- Type hints that actually work
- Easy async support (though we're not using it here)
The CORS middleware allows frontend applications from any origin to call our API. In production, you'd restrict this to your specific domain.
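A production-leaning version of that middleware setup might look like the following, where the origin is a placeholder for your actual frontend domain:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI(title="Travel and Tourism")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://example.com"],  # placeholder: your frontend's origin
    allow_methods=["POST"],                 # /ask only needs POST
    allow_headers=["Content-Type"],
)
```

Locking down methods and headers alongside origins follows the principle of least privilege: the browser will refuse cross-origin calls that don't match.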
The Results: Does It Actually Work?
Let's test it:
Query: "What wildlife can I see at Maasai Mara?"
Response:
```json
{
  "answer": "At Maasai Mara, you can see the Big Five: lions, elephants, leopards, rhinos, and buffalo. The park is famous for the annual wildebeest migration between July and October, where millions of wildebeest, zebras, and gazelles cross the Mara River. You can also spot cheetahs, hyenas, giraffes, hippos, crocodiles, and over 450 bird species.",
  "sources": ["Maasai_Mara.pdf"]
}
```
Beautiful. The answer is specific, accurate, and sourced.
The Bigger Picture: Why RAG Matters
RAG represents a fundamental shift in how we build AI applications. Instead of:
- Fine-tuning models (expensive, time-consuming, static)
- Relying on model knowledge (outdated, prone to hallucination)
We can:
- Use any LLM as a reasoning engine
- Plug in our own knowledge dynamically
- Update information without retraining
- Provide source attribution for trust
This pattern works for:
- Customer support bots trained on company documentation
- Legal research tools searching case law
- Medical assistants referencing clinical guidelines
- Internal knowledge bases for enterprises
Conclusion
Building this RAG system taught me that the real challenge isn't the AI; it's the data pipeline, retrieval strategy, and user experience. The LLM is just the final step that ties everything together.
RAG won't solve all AI problems. But for question-answering over documents, it's incredibly powerful. And as embedding models improve, vector databases get faster, and LLMs become more capable, RAG systems will only get better.
Code Snippets
All code in this article is available in my GitHub repository [https://github.com/maureenmuthoni-hue/Travel_and_Tourism_RAG_System]. Feel free to star, fork, and adapt it for your own projects!