Why building production-ready RAG systems shouldn't feel like assembling IKEA furniture in the dark
The RAG Reality Check
Picture this: You've just convinced your team to build an AI-powered documentation search. The plan is simple—take your company's docs, throw them into a vector database, add some LLM magic, and boom: instant answers for your users.
Three weeks later, you're drowning in boilerplate. Your codebase looks like a Frankenstein monster stitched together from five different libraries. There's error handling scattered everywhere (or worse, nowhere). You're manually chunking documents, wrestling with embedding APIs, debugging vector similarity algorithms, and somehow your "simple" RAG pipeline has morphed into 2,000 lines of glue code.
Sound familiar?
This is where Ballerina walks in, takes a look at your mess, and says: "Why are you working so hard?"
What Makes RAG Systems Complex?
Before we dive into the solution, let's break down why RAG pipelines are deceptively complex:
The RAG Dance (and Why It's Tricky)
A typical RAG workflow involves:
- Document Loading: Read files from various sources (PDFs, markdown, databases)
- Chunking Strategy: Split documents intelligently without losing context
- Embedding Generation: Convert text chunks into vectors using an embedding model
- Vector Storage: Store embeddings efficiently with metadata
- Query Processing: When a user asks a question, embed their query
- Similarity Search: Find the most relevant document chunks
- Context Augmentation: Combine retrieved chunks with the user's query
- LLM Generation: Send the enriched prompt to an LLM for a final answer
In traditional frameworks, each of these steps requires:
- Choosing and configuring multiple libraries
- Writing integration code between incompatible APIs
- Manual error handling at every boundary
- Custom observability instrumentation
- Performance optimization from scratch
In practice, developers end up spending the bulk of their time on this plumbing and only a fraction on the actual AI logic.
Enter Ballerina: Network-Native Meets AI-Native
Ballerina was designed with a radical idea: What if distributed systems programming was actually... easy?
Now, with Ballerina's AI module, this philosophy extends to RAG pipelines. Instead of stitching together separate libraries for embeddings, vector stores, chunking, and LLMs, you get a unified, type-safe API that handles the complexity for you.
The "Aha!" Moment
Let's compare building a RAG system in two approaches:
Traditional Approach (Python + Multiple Libraries):
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
import os

# Configure everything separately
loader = TextLoader("./leave_policy.md")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_KEY"))
vectorstore = Chroma.from_documents(chunks, embeddings)
llm = OpenAI(openai_api_key=os.getenv("OPENAI_KEY"))

# Now try to make them work together...
query = "How many leave days can I carry forward?"
docs = vectorstore.similarity_search(query, k=10)
context = "\n".join([doc.page_content for doc in docs])
prompt = f"Context: {context}\n\nQuestion: {query}"
answer = llm(prompt)
Ballerina Approach:
import ballerina/ai;
import ballerina/io;

// Initialize the RAG system
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();

public function main() returns error? {
    // Load and ingest documents
    ai:DataLoader loader = check new ai:TextDataLoader("./leave_policy.md");
    ai:Document|ai:Document[] documents = check loader.load();
    check knowledgeBase.ingest(documents);

    // Query and get an answer
    string query = "How many leave days can I carry forward?";
    ai:QueryMatch[] matches = check knowledgeBase.retrieve(query, 10);
    ai:Chunk[] context = from ai:QueryMatch queryMatch in matches
        select queryMatch.chunk;

    ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);
    ai:ChatAssistantMessage answer = check modelProvider->chat(augmentedQuery);
    io:println("Answer: ", answer.content);
}
Notice what's different:
- No manual chunking configuration (handled intelligently by default)
- No explicit embedding API calls (abstracted away)
- No prompt engineering boilerplate (built-in augmentation)
- No manual context concatenation (query expressions handle it)
- Clean error handling with the check keyword
- Everything is type-safe
Building a Real RAG System: Employee Handbook Assistant
Let's build something practical: an AI assistant that answers questions about your company's employee handbook. We'll cover the complete journey from document ingestion to production deployment.
The Setup: Configuration Made Easy
First, let's understand Ballerina's provider pattern:
import ballerina/ai;
import ballerina/io;

// The VectorStore holds your embeddings
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();

// EmbeddingProvider converts text to vectors
// Uses your configured provider (OpenAI, Cohere, etc.) via the Ballerina VS Code command
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();

// KnowledgeBase combines storage + embeddings
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);

// ModelProvider handles LLM chat completions
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();
What's happening here?
Ballerina uses a configuration-based approach. Instead of hardcoding API keys and model names, you configure providers through Ballerina's VS Code extension or configuration files. This means:
- No secrets in code
- Easy switching between providers (OpenAI, Azure, local models)
- Environment-specific configurations
- Type-safe provider interfaces
Note: This example uses the default embedding provider and model provider implementations. To generate the necessary configuration, open the VS Code command palette (Ctrl + Shift + P or Cmd + Shift + P) and run the Configure default WSO2 Model Provider command to add your configuration to the Config.toml file.
Step 1: Document Ingestion - Simpler Than You Think
The hardest part of building a RAG system is usually getting data in. Ballerina's DataLoader abstractions make this trivial:
public function ingestEmployeeHandbook() returns error? {
    // Load a single document
    ai:DataLoader policyLoader = check new ai:TextDataLoader("./leave_policy.md");
    ai:Document|ai:Document[] policyDocs = check policyLoader.load();
    check knowledgeBase.ingest(policyDocs);
    io:println("✅ Leave policy ingested");

    // Load multiple documents from a directory
    ai:DataLoader benefitsLoader = check new ai:TextDataLoader("./benefits/");
    ai:Document|ai:Document[] benefitsDocs = check benefitsLoader.load();
    check knowledgeBase.ingest(benefitsDocs);
    io:println("✅ Benefits documentation ingested");

    // You can also load PDFs, Word docs, etc.
    ai:DataLoader handbookLoader = check new ai:PDFDataLoader("./handbook.pdf");
    ai:Document|ai:Document[] handbookDocs = check handbookLoader.load();
    check knowledgeBase.ingest(handbookDocs);
    io:println("✅ Employee handbook ingested");
}
The Magic Behind the Scenes:
When you call knowledgeBase.ingest(), Ballerina automatically:
- Chunks your documents using intelligent text splitting (respects sentence boundaries, maintains context)
- Generates embeddings for each chunk using your configured embedding provider
- Stores vectors with metadata in the vector store
- Handles errors gracefully with proper propagation
No manual chunking. No explicit embedding calls. Just load and ingest.
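To run the ingestion end to end, you can simply call it from main and do a quick retrieval smoke test. This is a minimal sketch that reuses the knowledgeBase declared in the setup; the sample query string and printout are purely illustrative:

public function main() returns error? {
    // Populate the knowledge base from the handbook files
    check ingestEmployeeHandbook();

    // Quick smoke test: make sure retrieval returns something for a sample query
    ai:QueryMatch[] matches = check knowledgeBase.retrieve("probation period", 3);
    io:println("Retrieved ", matches.length(), " chunks for the sample query");
}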
Step 2: The Query Pipeline - Where RAG Shines
Now comes the fun part: answering questions with context from your documents.
type RAGResponse record {|
    string answer;
    Source[] sources;
    float averageRelevance;
|};

type Source record {|
    string content;
    float relevance;
|};

function answerQuestion(string query, int topK = 5) returns RAGResponse|error {
    io:println("\n🔍 Query: ", query);

    // Step 1: Retrieve relevant chunks from the knowledge base
    ai:QueryMatch[] queryMatches = check knowledgeBase.retrieve(query, topK);

    // Step 2: Extract chunks for context
    ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
        select queryMatch.chunk;

    // Step 3: Augment the user query with retrieved context
    ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);

    // Step 4: Get an answer from the LLM
    ai:ChatAssistantMessage assistantMessage =
        check modelProvider->chat(augmentedQuery);

    // Step 5: Calculate average relevance for confidence scoring
    float totalRelevance = 0.0;
    foreach ai:QueryMatch queryMatch in queryMatches {
        totalRelevance += queryMatch.score;
    }
    float avgRelevance = queryMatches.length() > 0 ?
        totalRelevance / <float>queryMatches.length() : 0.0;

    // Step 6: Build the response with sources
    Source[] sources = from ai:QueryMatch queryMatch in queryMatches
        select {
            content: queryMatch.chunk.content,
            relevance: queryMatch.score
        };

    return {
        answer: assistantMessage.content,
        sources: sources,
        averageRelevance: avgRelevance
    };
}
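Wiring this into a program is a one-liner per step. A minimal sketch, reusing ingestEmployeeHandbook from earlier (the sample question is arbitrary):

public function main() returns error? {
    check ingestEmployeeHandbook();

    RAGResponse response = check answerQuestion("How many leave days can I carry forward?");
    io:println("Answer: ", response.answer);
    io:println("Confidence: ", response.averageRelevance);

    // Show which handbook passages backed the answer
    foreach Source src in response.sources {
        io:println("  [", src.relevance, "] ", src.content);
    }
}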
Breaking Down the Query Expression:
One of Ballerina's superpowers is query expressions. Look at this beauty:
ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
    select queryMatch.chunk;
This is not just syntactic sugar. It's a type-safe, functional way to transform data that:
- Reads like SQL
- Compiles to efficient code
- Maintains type safety throughout the pipeline
- Makes data transformations explicit and readable
Compare this to Python's list comprehensions or JavaScript's map: the intent stays just as clear, and the syntax scales naturally when you add filtering or ordering clauses.
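For instance, a where clause drops low-scoring matches without an extra loop. The 0.7 cutoff below is an arbitrary value for illustration and assumes your provider returns similarity scores in the 0-1 range:

ai:Chunk[] relevantChunks = from ai:QueryMatch queryMatch in queryMatches
    where queryMatch.score > 0.7  // hypothetical cutoff; tune for your embedding provider
    select queryMatch.chunk;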
Step 3: Understanding Query Augmentation
The ai:augmentUserQuery() function is doing heavy lifting behind the scenes:
ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);
This function takes your retrieved chunks and combines them with the user's query using a well-designed prompt template. It's essentially doing:
You are a helpful assistant. Answer the question based on the following context.
Context:
[Chunk 1 content]
[Chunk 2 content]
[Chunk 3 content]
Question: {user's query}
Answer:
But you don't have to worry about prompt engineering—Ballerina handles it with best practices baked in.
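If you ever want full control over the prompt, the equivalent assembly is plain string work. This fragment reuses the context and query variables from answerQuestion above and only approximates what the module does internally; the actual template may differ:

// Roughly what augmentUserQuery assembles for you
string[] chunkTexts = from ai:Chunk chunk in context
    select chunk.content.toString();
string contextBlock = string:'join("\n\n", ...chunkTexts);
string prompt = string `Answer the question based on the following context.

Context:
${contextBlock}

Question: ${query}`;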
Step 4: Making it Production-Ready with HTTP Service
Let's expose our RAG system as a REST API:
import ballerina/ai;
import ballerina/http;
import ballerina/io;
import ballerina/time;

// Initialize RAG components (same as before)
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();

type QueryRequest record {|
    string question;
    int topK = 5;
|};

type QueryResponse record {|
    string answer;
    Source[] sources;
    float confidence;
|};

type Source record {|
    string content;
    float relevance;
|};

service /api on new http:Listener(8080) {

    // Initialize the knowledge base on service startup
    function init() returns error? {
        io:println("🚀 Starting Employee Handbook Assistant...");
        // Ingest documents
        ai:DataLoader loader = check new ai:TextDataLoader("./documents/");
        ai:Document|ai:Document[] documents = check loader.load();
        check knowledgeBase.ingest(documents);
        io:println("✅ Knowledge base initialized");
    }

    // Query endpoint
    resource function post query(@http:Payload QueryRequest request)
            returns QueryResponse|http:InternalServerError {
        // Retrieve relevant chunks
        ai:QueryMatch[]|error queryMatches =
            knowledgeBase.retrieve(request.question, request.topK);
        if queryMatches is error {
            return {
                body: {
                    message: "Failed to retrieve context",
                    "error": queryMatches.message()
                }
            };
        }

        // Extract context chunks
        ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
            select queryMatch.chunk;

        // Augment the query and get an answer
        ai:ChatUserMessage augmentedQuery =
            ai:augmentUserQuery(context, request.question);
        ai:ChatAssistantMessage|error assistantMessage =
            modelProvider->chat(augmentedQuery);
        if assistantMessage is error {
            return {
                body: {
                    message: "Failed to generate answer",
                    "error": assistantMessage.message()
                }
            };
        }

        // Calculate confidence
        float totalRelevance = 0.0;
        foreach ai:QueryMatch queryMatch in queryMatches {
            totalRelevance += queryMatch.score;
        }
        float confidence = queryMatches.length() > 0 ?
            totalRelevance / <float>queryMatches.length() : 0.0;

        // Build sources
        Source[] sources = from ai:QueryMatch queryMatch in queryMatches
            select {
                content: queryMatch.chunk.content,
                relevance: queryMatch.score
            };

        return {
            answer: assistantMessage.content,
            sources: sources,
            confidence: confidence
        };
    }

    // Health check endpoint
    resource function get health() returns json {
        return {
            status: "healthy",
            "service": "Employee Handbook Assistant",
            timestamp: time:utcToString(time:utcNow())
        };
    }

    // Add new documents endpoint
    resource function post documents(http:Request req)
            returns http:Created|http:BadRequest|http:InternalServerError {
        string|http:ClientError filePath = req.getTextPayload();
        if filePath is http:ClientError {
            return <http:BadRequest>{body: "Invalid file path"};
        }

        ai:DataLoader|error loader = new ai:TextDataLoader(filePath);
        if loader is error {
            return <http:InternalServerError>{body: "Failed to create data loader"};
        }

        ai:Document|ai:Document[]|error documents = loader.load();
        if documents is error {
            return <http:InternalServerError>{body: "Failed to load documents"};
        }

        error? result = knowledgeBase.ingest(documents);
        if result is error {
            return <http:InternalServerError>{body: "Failed to ingest documents"};
        }

        return <http:Created>{
            body: {message: "Documents ingested successfully"}
        };
    }
}
What Makes This Special:
- Automatic JSON Binding: The @http:Payload annotation automatically deserializes JSON into Ballerina records
- Type-Safe Responses: Return types are checked at compile time
- Built-in Error Handling: Union types (|error) force you to handle failures
- Clean Service Definition: RESTful endpoints with clear resource functions
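With the service running, you can exercise it end to end from another Ballerina program. This is a quick local-testing sketch; the client record, base URL, and question are just assumptions for illustration:

import ballerina/http;
import ballerina/io;

type AssistantReply record {
    string answer;
    float confidence;
};

public function main() returns error? {
    http:Client handbookApi = check new ("http://localhost:8080");

    // POST a question to the /api/query endpoint and bind the JSON response
    AssistantReply reply = check handbookApi->post("/api/query",
        {question: "How many leave days can I carry forward?", topK: 3});

    io:println("Answer: ", reply.answer);
    io:println("Confidence: ", reply.confidence);
}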