Why building production-ready RAG systems shouldn't feel like assembling IKEA furniture in the dark
The RAG Reality Check
Picture this: You've just convinced your team to build an AI-powered documentation search. The plan is simple—take your company's docs, throw them into a vector database, add some LLM magic, and boom: instant answers for your users.
Three weeks later, you're drowning in boilerplate. Your codebase looks like a Frankenstein monster stitched together from five different libraries. There's error handling scattered everywhere (or worse, nowhere). You're manually chunking documents, wrestling with embedding APIs, debugging vector similarity algorithms, and somehow your "simple" RAG pipeline has morphed into 2,000 lines of glue code.
Sound familiar?
This is where Ballerina walks in, takes a look at your mess, and says: "Why are you working so hard?"
What Makes RAG Systems Complex?
Before we dive into the solution, let's break down why RAG pipelines are deceptively complex:
The RAG Dance (and Why It's Tricky)
A typical RAG workflow involves:
- Document Loading: Read files from various sources (PDFs, markdown, databases)
- Chunking Strategy: Split documents intelligently without losing context
- Embedding Generation: Convert text chunks into vectors using an embedding model
- Vector Storage: Store embeddings efficiently with metadata
- Query Processing: When a user asks a question, embed their query
- Similarity Search: Find the most relevant document chunks
- Context Augmentation: Combine retrieved chunks with the user's query
- LLM Generation: Send the enriched prompt to an LLM for a final answer
In traditional frameworks, each of these steps requires:
- Choosing and configuring multiple libraries
- Writing integration code between incompatible APIs
- Manual error handling at every boundary
- Custom observability instrumentation
- Performance optimization from scratch
In practice, developers end up spending the bulk of their time on this plumbing and only a fraction on the actual AI logic.
Enter Ballerina: Network-Native Meets AI-Native
Ballerina was designed with a radical idea: What if distributed systems programming was actually... easy?
Now, with Ballerina's AI module, this philosophy extends to RAG pipelines. Instead of stitching together separate libraries for embeddings, vector stores, chunking, and LLMs, you get a unified, type-safe API that handles the complexity for you.
The "Aha!" Moment
Let's compare building a RAG system in two approaches:
Traditional Approach (Python + Multiple Libraries):
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
import os

# Configure everything separately
loader = TextLoader("./leave_policy.md")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_KEY"))
vectorstore = Chroma.from_documents(chunks, embeddings)
llm = OpenAI(openai_api_key=os.getenv("OPENAI_KEY"))

# Now try to make them work together...
query = "How many leave days can I carry forward?"
docs = vectorstore.similarity_search(query, k=10)
context = "\n".join([doc.page_content for doc in docs])
prompt = f"Context: {context}\n\nQuestion: {query}"
answer = llm(prompt)
Ballerina Approach:
import ballerina/ai;
import ballerina/io;

// Initialize the RAG system
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();

public function main() returns error? {
    // Load and ingest documents
    ai:DataLoader loader = check new ai:TextDataLoader("./leave_policy.md");
    ai:Document|ai:Document[] documents = check loader.load();
    check knowledgeBase.ingest(documents);

    // Query and get an answer
    string query = "How many leave days can I carry forward?";
    ai:QueryMatch[] matches = check knowledgeBase.retrieve(query, 10);
    ai:Chunk[] context = from ai:QueryMatch queryMatch in matches
        select queryMatch.chunk;

    ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);
    ai:ChatAssistantMessage answer = check modelProvider->chat(augmentedQuery);
    io:println("Answer: ", answer.content);
}
Notice what's different:
- No manual chunking configuration (handled intelligently by default)
- No explicit embedding API calls (abstracted away)
- No prompt engineering boilerplate (built-in augmentation)
- No manual context concatenation (query expressions handle it)
- Clean error handling with the check keyword
- Everything is type-safe
Building a Real RAG System: Employee Handbook Assistant
Let's build something practical: an AI assistant that answers questions about your company's employee handbook. We'll cover the complete journey from document ingestion to production deployment.
The Setup: Configuration Made Easy
First, let's understand Ballerina's provider pattern:
import ballerina/ai;
import ballerina/io;

// The VectorStore holds your embeddings
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();

// EmbeddingProvider converts text to vectors
// Uses your configured provider (OpenAI, Cohere, etc.) via the Ballerina VS Code command
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();

// KnowledgeBase combines storage + embeddings
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);

// ModelProvider handles LLM chat completions
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();
What's happening here?
Ballerina uses a configuration-based approach. Instead of hardcoding API keys and model names, you configure providers through Ballerina's VS Code extension or configuration files. This means:
- No secrets in code
- Easy switching between providers (OpenAI, Azure, local models)
- Environment-specific configurations
- Type-safe provider interfaces
Note: This example uses the default embedding provider and model provider implementations. To generate the necessary configuration, open the VS Code command palette (Ctrl + Shift + P or Cmd + Shift + P) and run the Configure default WSO2 Model Provider command to add your configuration to the Config.toml file.
Step 1: Document Ingestion - Simpler Than You Think
The hardest part of building a RAG system is usually getting data in. Ballerina's DataLoader abstractions make this trivial:
public function ingestEmployeeHandbook() returns error? {
    // Load a single document
    ai:DataLoader policyLoader = check new ai:TextDataLoader("./leave_policy.md");
    ai:Document|ai:Document[] policyDocs = check policyLoader.load();
    check knowledgeBase.ingest(policyDocs);
    io:println("✅ Leave policy ingested");

    // Load multiple documents from a directory
    ai:DataLoader benefitsLoader = check new ai:TextDataLoader("./benefits/");
    ai:Document|ai:Document[] benefitsDocs = check benefitsLoader.load();
    check knowledgeBase.ingest(benefitsDocs);
    io:println("✅ Benefits documentation ingested");

    // You can also load PDFs, Word docs, etc.
    ai:DataLoader handbookLoader = check new ai:PDFDataLoader("./handbook.pdf");
    ai:Document|ai:Document[] handbookDocs = check handbookLoader.load();
    check knowledgeBase.ingest(handbookDocs);
    io:println("✅ Employee handbook ingested");
}
The Magic Behind the Scenes:
When you call knowledgeBase.ingest(), Ballerina automatically:
- Chunks your documents using intelligent text splitting (respects sentence boundaries, maintains context)
- Generates embeddings for each chunk using your configured embedding provider
- Stores vectors with metadata in the vector store
- Handles errors gracefully with proper propagation
No manual chunking. No explicit embedding calls. Just load and ingest.
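To run the ingestion end to end, you can simply call it from main and do a quick retrieval smoke test. This is a minimal sketch that reuses the knowledgeBase declared in the setup; the sample query string and printout are purely illustrative:

public function main() returns error? {
    // Populate the knowledge base from the handbook files
    check ingestEmployeeHandbook();

    // Quick smoke test: make sure retrieval returns something for a sample query
    ai:QueryMatch[] matches = check knowledgeBase.retrieve("probation period", 3);
    io:println("Retrieved ", matches.length(), " chunks for the sample query");
}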
Step 2: The Query Pipeline - Where RAG Shines
Now comes the fun part: answering questions with context from your documents.
type RAGResponse record {|
    string answer;
    Source[] sources;
    float averageRelevance;
|};

type Source record {|
    string content;
    float relevance;
|};

function answerQuestion(string query, int topK = 5) returns RAGResponse|error {
    io:println("\n🔍 Query: ", query);

    // Step 1: Retrieve relevant chunks from the knowledge base
    ai:QueryMatch[] queryMatches = check knowledgeBase.retrieve(query, topK);

    // Step 2: Extract chunks for context
    ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
        select queryMatch.chunk;

    // Step 3: Augment the user query with retrieved context
    ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);

    // Step 4: Get an answer from the LLM
    ai:ChatAssistantMessage assistantMessage =
        check modelProvider->chat(augmentedQuery);

    // Step 5: Calculate average relevance for confidence scoring
    float totalRelevance = 0.0;
    foreach ai:QueryMatch queryMatch in queryMatches {
        totalRelevance += queryMatch.score;
    }
    float avgRelevance = queryMatches.length() > 0 ?
        totalRelevance / <float>queryMatches.length() : 0.0;

    // Step 6: Build the response with sources
    Source[] sources = from ai:QueryMatch queryMatch in queryMatches
        select {
            content: queryMatch.chunk.content,
            relevance: queryMatch.score
        };

    return {
        answer: assistantMessage.content,
        sources: sources,
        averageRelevance: avgRelevance
    };
}
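Wiring this into a program is a one-liner per step. A minimal sketch, reusing ingestEmployeeHandbook from earlier (the sample question is arbitrary):

public function main() returns error? {
    check ingestEmployeeHandbook();

    RAGResponse response = check answerQuestion("How many leave days can I carry forward?");
    io:println("Answer: ", response.answer);
    io:println("Confidence: ", response.averageRelevance);

    // Show which handbook passages backed the answer
    foreach Source src in response.sources {
        io:println("  [", src.relevance, "] ", src.content);
    }
}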
Breaking Down the Query Expression:
One of Ballerina's superpowers is query expressions. Look at this beauty:
ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
    select queryMatch.chunk;
This is not just syntactic sugar. It's a type-safe, functional way to transform data that:
- Reads like SQL
- Compiles to efficient code
- Maintains type safety throughout the pipeline
- Makes data transformations explicit and readable
Compare this to Python's list comprehensions or JavaScript's map: the intent stays just as clear, and the syntax scales naturally when you add filtering or ordering clauses.
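For instance, a where clause drops low-scoring matches without an extra loop. The 0.7 cutoff below is an arbitrary value for illustration and assumes your provider returns similarity scores in the 0-1 range:

ai:Chunk[] relevantChunks = from ai:QueryMatch queryMatch in queryMatches
    where queryMatch.score > 0.7  // hypothetical cutoff; tune for your embedding provider
    select queryMatch.chunk;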
Step 3: Understanding Query Augmentation
The ai:augmentUserQuery() function is doing heavy lifting behind the scenes:
ai:ChatUserMessage augmentedQuery = ai:augmentUserQuery(context, query);
This function takes your retrieved chunks and combines them with the user's query using a well-designed prompt template. It's essentially doing:
You are a helpful assistant. Answer the question based on the following context.
Context:
[Chunk 1 content]
[Chunk 2 content]
[Chunk 3 content]
Question: {user's query}
Answer:
But you don't have to worry about prompt engineering—Ballerina handles it with best practices baked in.
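If you ever want full control over the prompt, the equivalent assembly is plain string work. This fragment reuses the context and query variables from answerQuestion above and only approximates what the module does internally; the actual template may differ:

// Roughly what augmentUserQuery assembles for you
string[] chunkTexts = from ai:Chunk chunk in context
    select chunk.content.toString();
string contextBlock = string:'join("\n\n", ...chunkTexts);
string prompt = string `Answer the question based on the following context.

Context:
${contextBlock}

Question: ${query}`;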
Step 4: Making it Production-Ready with HTTP Service
Let's expose our RAG system as a REST API:
import ballerina/ai;
import ballerina/http;
import ballerina/io;
import ballerina/time;

// Initialize RAG components (same as before)
final ai:VectorStore vectorStore = check new ai:InMemoryVectorStore();
final ai:EmbeddingProvider embeddingProvider =
    check ai:getDefaultEmbeddingProvider();
final ai:KnowledgeBase knowledgeBase =
    new ai:VectorKnowledgeBase(vectorStore, embeddingProvider);
final ai:ModelProvider modelProvider = check ai:getDefaultModelProvider();

type QueryRequest record {|
    string question;
    int topK = 5;
|};

type QueryResponse record {|
    string answer;
    Source[] sources;
    float confidence;
|};

type Source record {|
    string content;
    float relevance;
|};

service /api on new http:Listener(8080) {

    // Initialize the knowledge base on service startup
    function init() returns error? {
        io:println("🚀 Starting Employee Handbook Assistant...");
        // Ingest documents
        ai:DataLoader loader = check new ai:TextDataLoader("./documents/");
        ai:Document|ai:Document[] documents = check loader.load();
        check knowledgeBase.ingest(documents);
        io:println("✅ Knowledge base initialized");
    }

    // Query endpoint
    resource function post query(@http:Payload QueryRequest request)
            returns QueryResponse|http:InternalServerError {
        // Retrieve relevant chunks
        ai:QueryMatch[]|error queryMatches =
            knowledgeBase.retrieve(request.question, request.topK);
        if queryMatches is error {
            return {
                body: {
                    message: "Failed to retrieve context",
                    "error": queryMatches.message()
                }
            };
        }

        // Extract context chunks
        ai:Chunk[] context = from ai:QueryMatch queryMatch in queryMatches
            select queryMatch.chunk;

        // Augment the query and get an answer
        ai:ChatUserMessage augmentedQuery =
            ai:augmentUserQuery(context, request.question);
        ai:ChatAssistantMessage|error assistantMessage =
            modelProvider->chat(augmentedQuery);
        if assistantMessage is error {
            return {
                body: {
                    message: "Failed to generate answer",
                    "error": assistantMessage.message()
                }
            };
        }

        // Calculate confidence
        float totalRelevance = 0.0;
        foreach ai:QueryMatch queryMatch in queryMatches {
            totalRelevance += queryMatch.score;
        }
        float confidence = queryMatches.length() > 0 ?
            totalRelevance / <float>queryMatches.length() : 0.0;

        // Build sources
        Source[] sources = from ai:QueryMatch queryMatch in queryMatches
            select {
                content: queryMatch.chunk.content,
                relevance: queryMatch.score
            };

        return {
            answer: assistantMessage.content,
            sources: sources,
            confidence: confidence
        };
    }

    // Health check endpoint
    resource function get health() returns json {
        return {
            status: "healthy",
            "service": "Employee Handbook Assistant",
            timestamp: time:utcToString(time:utcNow())
        };
    }

    // Add new documents endpoint
    resource function post documents(http:Request req)
            returns http:Created|http:BadRequest|http:InternalServerError {
        string|http:ClientError filePath = req.getTextPayload();
        if filePath is http:ClientError {
            return <http:BadRequest>{body: "Invalid file path"};
        }

        ai:DataLoader|error loader = new ai:TextDataLoader(filePath);
        if loader is error {
            return <http:InternalServerError>{body: "Failed to create data loader"};
        }

        ai:Document|ai:Document[]|error documents = loader.load();
        if documents is error {
            return <http:InternalServerError>{body: "Failed to load documents"};
        }

        error? result = knowledgeBase.ingest(documents);
        if result is error {
            return <http:InternalServerError>{body: "Failed to ingest documents"};
        }

        return <http:Created>{
            body: {message: "Documents ingested successfully"}
        };
    }
}
What Makes This Special:
- Automatic JSON Binding: The @http:Payload annotation automatically deserializes JSON into Ballerina records
- Type-Safe Responses: Return types are checked at compile time
- Built-in Error Handling: Union types (|error) force you to handle failures
- Clean Service Definition: RESTful endpoints with clear resource functions
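With the service running, you can exercise it end to end from another Ballerina program. This is a quick local-testing sketch; the client record, base URL, and question are just assumptions for illustration:

import ballerina/http;
import ballerina/io;

type AssistantReply record {
    string answer;
    float confidence;
};

public function main() returns error? {
    http:Client handbookApi = check new ("http://localhost:8080");

    // POST a question to the /api/query endpoint and bind the JSON response
    AssistantReply reply = check handbookApi->post("/api/query",
        {question: "How many leave days can I carry forward?", topK: 3});

    io:println("Answer: ", reply.answer);
    io:println("Confidence: ", reply.confidence);
}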