# Building RAGenius: A Production-Ready RAG System
Have you ever wanted to chat with your documents using AI? Whether it's PDFs, Excel spreadsheets, or JSON files, imagine having an intelligent assistant that can answer questions based on your entire document collection. That's exactly what I built with RAGenius!
## What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that combines the power of large language models (LLMs) with your own data. Instead of relying solely on the model's training data, RAG:
- Retrieves relevant information from your documents
- Augments the LLM prompt with this context
- Generates accurate, contextual answers
This approach dramatically reduces hallucinations and allows LLMs to answer questions about your specific domain knowledge.
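Conceptually, the whole loop fits in a few lines. Here's a minimal sketch of those three steps; `embed`, `vector_search`, and `llm_generate` are hypothetical placeholders, not RAGenius APIs:

```python
# Minimal RAG sketch: retrieve, augment, generate.
# embed(), vector_search(), and llm_generate() are hypothetical helpers.
def answer(question: str, top_k: int = 5) -> str:
    # 1. Retrieve: find the chunks most similar to the question
    chunks = vector_search(embed(question), k=top_k)

    # 2. Augment: splice the retrieved chunks into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

    # 3. Generate: let the LLM answer from the provided context only
    return llm_generate(prompt)
```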
## Why I Built RAGenius
While experimenting with various RAG implementations, I noticed most tutorials focused on simple, single-file demos. I wanted something more:
- **Production-ready** with proper error handling
- **Multi-format support** (PDF, Excel, JSON, DOCX, CSV, TXT)
- **Streaming responses** for better UX
- **REST API** for easy integration
- **Incremental updates** without rebuilding the entire index

Thus, RAGenius was born!
## Architecture Overview
RAGenius follows a clean, modular architecture:
```
Documents (PDF, Excel, JSON, etc.)
        |
        v
Data Loader (multi-format processing)
        |
        v
Chunking (smart text splitting)
        |
        v
Azure OpenAI (generate embeddings)
        |
        v
ChromaDB (vector storage)
        |
        v
RAG Engine (query + generate)
        |
        v
FastAPI (REST API)
```
## Tech Stack
- FastAPI: Lightning-fast API framework
- LangChain: Document processing and LLM orchestration
- ChromaDB: Vector database for embeddings
- Azure OpenAI: GPT-4 and embedding models
- Python 3.10+: Core language
- UV: Modern Python package manager
## Key Features Breakdown
### 1. Multi-Format Document Processing
One of the coolest features is the ability to handle various file types seamlessly:
```python
from src.data_loader import load_all_documents

# Automatically detects and loads all supported formats
docs = load_all_documents("data")
print(f"Loaded {len(docs)} documents")
```
The `data_loader.py` module uses a smart mapping system:

```python
# Loader classes live in langchain_community in recent LangChain versions
from langchain_community.document_loaders import (
    CSVLoader, Docx2txtLoader, JSONLoader,
    PyPDFLoader, TextLoader, UnstructuredExcelLoader,
)

LOADER_MAP = {
    ".pdf": PyPDFLoader,
    ".txt": lambda path: TextLoader(path, encoding="utf-8"),
    ".csv": CSVLoader,
    ".docx": Docx2txtLoader,
    ".json": JSONLoader,
    ".xlsx": UnstructuredExcelLoader,
}
```
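On top of that map, a `load_all_documents` helper can walk the data directory and dispatch each file to the matching loader. Here's a simplified sketch of the idea (not the exact implementation):

```python
from pathlib import Path

def load_all_documents(data_dir: str):
    """Walk data_dir and load every file that has a registered loader."""
    docs = []
    for path in Path(data_dir).rglob("*"):
        loader_cls = LOADER_MAP.get(path.suffix.lower())
        if loader_cls is None:
            continue  # skip unsupported or irrelevant files
        docs.extend(loader_cls(str(path)).load())
    return docs
```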
### 2. Smart Document Chunking
Not all text should be split the same way. RAGenius uses `RecursiveCharacterTextSplitter` with configurable parameters:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
```
Why overlap? It ensures context isn't lost at chunk boundaries - crucial for maintaining semantic coherence!
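Once the documents are loaded, splitting them is a single call:

```python
# Produces LangChain Document chunks, preserving each source's metadata
chunks = text_splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks")
```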
### 3. Vector Storage with ChromaDB
ChromaDB provides persistent, efficient vector storage:
```python
import chromadb

class ChromaVectorStore:
    def __init__(self, persist_directory="chromadb_store"):
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name="pdf_documents",
            metadata={"description": "PDF embeddings using Azure OpenAI"}
        )
```
Key benefit: Your embeddings persist across restarts - no need to re-process documents!
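Alongside the constructor, the store needs methods to add chunks and run similarity search. A simplified sketch of what those methods might look like on the same class (the batch embedding call is abstracted behind a hypothetical `self.embed` helper):

```python
    def add_documents(self, chunks):
        """Embed each chunk and upsert it into the collection."""
        texts = [c.page_content for c in chunks]
        self.collection.upsert(
            ids=[f"chunk_{i}" for i in range(len(texts))],  # naive IDs, for illustration only
            documents=texts,
            embeddings=self.embed(texts),  # hypothetical Azure OpenAI batch-embedding helper
        )

    def query(self, question: str, top_k: int = 5):
        """Return the top_k chunks most similar to the question."""
        return self.collection.query(
            query_embeddings=self.embed([question]),
            n_results=top_k,
        )
```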
### 4. Streaming RAG Responses
Modern UIs demand real-time feedback. RAGenius supports token-by-token streaming:
```python
async def stream_query(self, question: str, top_k: int = 5):
    """Async generator for true token streaming."""
    results = self.vectorstore.query(question, top_k=top_k)
    docs = results.get("documents", [[]])[0]

    if not docs:
        yield "No relevant context found."
        return

    context = "\n\n".join(docs)
    prompt = self._build_prompt(context, question)

    async for chunk in self.llm.astream([HumanMessage(content=prompt)]):
        token = getattr(chunk, "content", str(chunk))
        yield token
```
### 5. RESTful API with FastAPI
Three main endpoints power the system:
**Upload Documents**

```bash
curl -X POST "http://localhost:8000/rag/upload" \
  -F "files=@document.pdf" \
  -F "files=@spreadsheet.xlsx"
```

**Basic Query**

```bash
curl -X POST "http://localhost:8000/rag/basic" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is attention mechanism?", "top_k": 5}'
```

**Streaming Query**

```bash
curl -X POST "http://localhost:8000/rag/stream" \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain transformers", "top_k": 3}' \
  --no-buffer
```
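On the server side, the basic endpoint is little more than a thin wrapper around the RAG engine. A sketch of what it might look like (the request model and wiring are assumptions, not the exact RAGenius code):

```python
from fastapi import FastAPI
from pydantic import BaseModel

from src.search import RAGEngine

app = FastAPI()
rag = RAGEngine()

class QueryRequest(BaseModel):
    query: str
    top_k: int = 5

@app.post("/rag/basic")
def basic_query(request: QueryRequest):
    # Delegate retrieval + generation to the RAG engine
    return rag.query(request.query, top_k=request.top_k)
```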
## The RAG Pipeline in Action
Here's what happens when you ask a question:
1. **Query Embedding**: Your question is converted to a vector using Azure OpenAI
2. **Similarity Search**: ChromaDB finds the top-k most relevant document chunks
3. **Context Building**: Retrieved chunks are combined into a context window
4. **Prompt Construction**: The context and question are formatted into a prompt
5. **LLM Generation**: GPT-4 generates an answer based on the provided context
6. **Streaming Response**: Tokens are streamed back to the client in real time
```python
def query(self, question: str, top_k: int = 5):
    # Step 1 & 2: Retrieve relevant context
    results = self.vectorstore.query(question, top_k=top_k)
    docs = results.get("documents", [[]])[0]

    # Step 3: Build context
    context = "\n\n".join(docs)

    # Step 4: Construct prompt
    prompt = f"""
    Use the following context to answer the question.

    Context:
    {context}

    Question: {question}

    Answer:
    """

    # Step 5: Generate response
    response = self.llm.invoke([HumanMessage(content=prompt)])
    return {"answer": response.content}
```
## Performance Optimizations
### Chunking Strategy

- **Chunk Size**: 1000 characters, balancing context vs. precision
- **Overlap**: 200 characters, maintaining semantic continuity
- **Smart Separators**: Prioritizes paragraph breaks over word breaks

### Embedding Efficiency

- **Batch Processing**: Multiple chunks embedded in single API calls
- **Persistent Storage**: Embeddings cached in ChromaDB
- **Incremental Updates**: Add new documents without re-embedding existing ones (see the sketch below)
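One simple way to get incremental updates is to derive each chunk's ID from a hash of its content, then skip anything the collection already knows about. A sketch of that idea (not necessarily how RAGenius implements it; `embed` is a hypothetical batch-embedding helper):

```python
import hashlib

def chunk_id(text: str) -> str:
    """Deterministic ID: identical chunk text always maps to the same ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

ids = [chunk_id(c.page_content) for c in chunks]
existing = set(collection.get(ids=ids)["ids"])  # IDs already stored

# Embed and add only the chunks we have not seen before
new = [(i, c) for i, c in zip(ids, chunks) if i not in existing]
if new:
    collection.add(
        ids=[i for i, _ in new],
        documents=[c.page_content for _, c in new],
        embeddings=embed([c.page_content for _, c in new]),
    )
```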
### Query Optimization

- **Top-K Selection**: Default k=5 balances relevance and token usage
- **Temperature Control**: 0.7 provides creative yet grounded responses (configuration sketch below)
- **Async Operations**: Non-blocking streaming for better UX
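These knobs live wherever the Azure OpenAI clients are constructed. Roughly, using LangChain's Azure wrappers (deployment names are assumptions; endpoint and API key come from environment variables):

```python
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# Chat model: temperature 0.7, streaming enabled for token-by-token output
llm = AzureChatOpenAI(
    azure_deployment="gpt-4",  # assumed deployment name
    temperature=0.7,
    streaming=True,
)

# Embedding model used for both documents and queries
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",  # assumed deployment name
)
```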
## Challenges & Solutions
### Challenge 1: JSONLoader Complexity

**Problem**: `JSONLoader` requires a `jq_schema` parameter, which complicated uniform multi-format support.

**Solution**: Implemented dynamic loader selection with custom error handling:
```python
from pathlib import Path

def dynamic_loader(file_path: str):
    ext = Path(file_path).suffix.lower()
    loader_cls = LOADER_MAP.get(ext)
    if not loader_cls:
        raise ValueError(f"Unsupported file type: {file_path}")
    return loader_cls(file_path)
```
### Challenge 2: Streaming with FastAPI

**Problem**: The Server-Sent Events (SSE) format required careful handling of response headers and buffering.

**Solution**: Used `StreamingResponse` with proper headers:
```python
return StreamingResponse(
    stream_response(),
    media_type="text/event-stream",
    headers={
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "X-Accel-Buffering": "no"
    }
)
```
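The `stream_response` generator itself wraps the engine's async stream and frames each token as an SSE `data:` line. Assuming it is defined inside the streaming endpoint (so `rag` and `request` are in scope), it might look roughly like this:

```python
async def stream_response():
    # Forward each token from the RAG engine as a Server-Sent Event
    async for token in rag.stream_query(request.query, top_k=request.top_k):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # assumed sentinel so clients know the stream ended
```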
### Challenge 3: File Upload Memory Management

**Problem**: Large file uploads could cause memory issues.

**Solution**: Temporary directory with automatic cleanup:
```python
import os
import shutil
import uuid

# DATA_DIR is defined elsewhere in the module
temp_dir = os.path.join(DATA_DIR, f"temp_{uuid.uuid4().hex[:8]}")
try:
    # Process files
    docs = load_all_documents(temp_dir)
    vectorstore.add_documents(docs)
finally:
    if os.path.exists(temp_dir):
        shutil.rmtree(temp_dir)
```
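Before that block runs, the uploaded files are written into the temp directory. Streaming them to disk rather than reading each file fully into memory keeps memory usage flat; a hypothetical helper for that might look like:

```python
import os
import shutil

from fastapi import UploadFile

async def save_uploads(files: list[UploadFile], temp_dir: str) -> None:
    """Stream each uploaded file to disk instead of buffering it in memory."""
    os.makedirs(temp_dir, exist_ok=True)
    for upload in files:
        dest = os.path.join(temp_dir, upload.filename)
        with open(dest, "wb") as out:
            # copyfileobj copies in fixed-size chunks, keeping memory usage bounded
            shutil.copyfileobj(upload.file, out)
```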
## Lessons Learned
1. **Modular Design Pays Off**: Separating concerns (loading, embedding, storage, querying) made debugging and testing much easier.
2. **Async is Essential**: For streaming responses and file processing, async/await dramatically improves responsiveness.
3. **Error Handling Matters**: Production systems need comprehensive logging and graceful error recovery.
4. **Chunk Overlap is Critical**: Without overlap, important context can be lost at boundaries, leading to incomplete answers.
5. **Persistent Storage Rocks**: Because ChromaDB persists embeddings to disk, nothing needs to be re-indexed after a restart.
## Future Enhancements
Here's what's on the roadmap:
- [ ] Multi-LLM Support: OpenAI, Anthropic Claude, Cohere
- [ ] Web UI: React-based interface for document management
- [ ] Advanced Filtering: Metadata-based search refinement
- [ ] Cloud Storage Integration: S3, Azure Blob, Google Cloud Storage
- [ ] Conversation Memory: Multi-turn dialogue support
- [ ] Fine-tuned Embeddings: Domain-specific embedding models
- [ ] Kubernetes Manifests: Production-ready deployment configs
## Getting Started
Want to try RAGenius? It's super easy:
```bash
# Clone the repository
git clone https://github.com/AquibPy/RAGenius.git
cd RAGenius

# Install dependencies (using UV)
uv sync

# Set up environment variables
cp .env.example .env
# Add your Azure OpenAI credentials

# Start the server
uvicorn app:app --reload

# Visit http://localhost:8000/docs for API documentation
```
## Example Usage
### Python Script
```python
from src.search import RAGEngine

# Initialize
rag = RAGEngine()

# Query
result = rag.query(
    "What is the attention mechanism in transformers?",
    top_k=5
)
print(result["answer"])
```
### CLI
```bash
python main.py \
  --query "Explain BERT architecture" \
  --mode streaming
```
### API
```python
import requests

response = requests.post(
    "http://localhost:8000/rag/basic",
    json={"query": "What is machine learning?", "top_k": 3}
)
print(response.json()["answer"])
```
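For the streaming endpoint, the same call works with `stream=True`, printing chunks as they arrive (the exact framing of each chunk depends on the server's SSE formatting):

```python
import requests

with requests.post(
    "http://localhost:8000/rag/stream",
    json={"query": "Explain transformers", "top_k": 3},
    stream=True,  # don't buffer the whole response
) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```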
## Conclusion
Building RAGenius has been an incredible learning experience. It combines cutting-edge AI technologies with practical software engineering to create a tool that's actually useful in production environments.
The beauty of RAG systems is that they ground LLMs in reality, answering questions based on YOUR data, not just internet-scale training data. Whether you're building internal knowledge bases, customer support systems, or research tools, RAG is the way forward.
## Links
- GitHub: AquibPy/RAGenius
- Demo Video: [Coming soon!]
## Let's Connect!
I'd love to hear your thoughts and ideas! Feel free to:
- Star the repo if you find it useful
- Report bugs or request features via GitHub Issues
- Contribute through Pull Requests
- Connect with me on X or LinkedIn

Have you built any RAG systems? What challenges did you face? Drop a comment below!

Happy coding!