If you've ever tried using ChatGPT to answer questions about your company's docs or codebase, you know the pain: hallucinations, half-right answers, or just plain nonsense. Retrieval-Augmented Generation (RAG) is supposed to fix this, right? But what happens when you swap out the fancy OpenAI API for open-source models? That’s where things got interesting for me—and not always in a good way.
Why Go Open-Source for RAG?
I wanted to build a RAG pipeline that didn’t rely on proprietary APIs or cloud costs. Maybe you’ve hit rate limits, or your data can’t leave your network. Open-source LLMs promise freedom, but they come with their own quirks.
The thing is, RAG isn’t a magic bullet. The idea is simple: retrieve relevant context, shove it into an LLM, and hope for better answers. But when you switch to open-source models, you start seeing the seams.
Building the Pipeline: What Actually Works
I’ll walk through the stack I landed on, with code you can actually run. For context, I used sentence-transformers for embeddings and llama.cpp (via llama-cpp-python) for the LLM. I chose these because they’re popular, actively maintained, and don’t require a GPU (though you’ll want one if your docs are big).
Step 1: Chunking and Embedding Documents
You have to chop your docs up before embedding. If you feed giant blobs, retrieval gets fuzzy and slow. Here’s a basic way to chunk and embed using sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model (all-MiniLM-L6-v2 is small and fast)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example document (could be your README, docs, etc.)
doc = """
RAG combines retrieval and text generation.
It retrieves relevant info from your docs, then generates answers.
Open-source LLMs can be used instead of OpenAI.
"""

# Simple chunking: split by lines (one sentence per line here)
chunks = doc.strip().split('\n')

# Embed each chunk
embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)

# Store embeddings for later retrieval
chunk_db = dict(zip(chunks, embeddings))
```
Key lines explained:

- `SentenceTransformer('all-MiniLM-L6-v2')` loads a small, fast embedding model.
- `chunks` is just splitting by line/sentence. For real docs, you'll want smarter chunking (paragraphs, sliding windows).
- `embeddings` gives you a vector for each chunk—these are what you’ll use to find relevant context.
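As a rough sketch of what "smarter chunking" can look like, here's a sentence-window approach with overlap. The window and overlap sizes are arbitrary starting points, not tuned values, and the sentence splitter is deliberately naive:

```python
import re

def chunk_with_overlap(text, window=3, overlap=1):
    """Group sentences into overlapping windows.

    window:  sentences per chunk
    overlap: sentences shared between consecutive chunks
    """
    # Naive sentence split on ., !, ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    step = max(1, window - overlap)  # guard against window <= overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return chunks

# Five sentences, window of 3, overlap of 1 -> two chunks sharing "C."
print(chunk_with_overlap("A. B. C. D. E.", window=3, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still shows up intact in at least one chunk, which tends to make retrieval less brittle than hard cuts.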
Step 2: Retrieval — Matching Queries to Chunks
Now, when a user asks a question, you want to find the most relevant chunk(s). Cosine similarity is your friend.
```python
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(query, chunk_db, model, top_k=2):
    # Embed the query
    query_vec = model.encode([query])
    # Calculate similarities to each chunk
    chunk_texts = list(chunk_db.keys())
    chunk_vecs = np.array(list(chunk_db.values()))
    sims = cosine_similarity(query_vec, chunk_vecs)[0]
    # Get top_k chunks, best first
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [chunk_texts[i] for i in top_indices]

# Example usage
user_query = "How does RAG use open-source LLMs?"
context_chunks = retrieve_context(user_query, chunk_db, model)
print("Retrieved context:", context_chunks)
```
Key lines explained:

- `cosine_similarity` finds which chunks are closest to your query.
- `top_k` controls how many of the most relevant pieces you get back. If your docs are big, tune this.
- The retrieval step is surprisingly fast even on a laptop.
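One tweak worth considering: `top_k` always returns something, even when nothing in your docs is actually relevant. A minimum-similarity cutoff keeps junk out of the prompt. Here's a sketch; the `0.3` threshold is a made-up starting point you'd tune against your own data, not a recommended value:

```python
import numpy as np

def filter_by_similarity(chunk_texts, sims, top_k=2, min_sim=0.3):
    """Keep the top_k most similar chunks, dropping anything below min_sim."""
    order = np.argsort(sims)[::-1][:top_k]  # indices of best chunks, best first
    return [chunk_texts[i] for i in order if sims[i] >= min_sim]

# Example: the second chunk is barely related, so it gets dropped
texts = ["RAG pipelines", "cooking recipes"]
sims = np.array([0.82, 0.05])
print(filter_by_similarity(texts, sims))  # ['RAG pipelines']
```

Returning an empty list when nothing clears the bar is useful too: you can have the LLM say "I don't know" instead of hallucinating from unrelated context.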
Step 3: Generation — Using llama.cpp as the LLM
Here’s where things get real. Open-source LLMs are slower and have stricter input limits than OpenAI’s APIs. You often have to trim context or pick smaller models.
I used llama-cpp-python to run a local Llama 2 model. Here’s a basic generation example:
```python
from llama_cpp import Llama

# Initialize the model (point model_path at your downloaded weights,
# e.g. 'llama-2-7b.Q4_0.gguf')
llm = Llama(model_path="llama-2-7b.Q4_0.gguf", n_ctx=2048)

def rag_answer(query, context_chunks, llm):
    # Construct prompt (simple, but effective)
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # Generate, stopping at the first newline
    output = llm(prompt, max_tokens=200, stop=["\n"])
    return output['choices'][0]['text'].strip()

# Example usage
answer = rag_answer(user_query, context_chunks, llm)
print("LLM Answer:", answer)
```
Key lines explained:

- `Llama(model_path, n_ctx=2048)` loads the model with a 2048-token context window. If you go bigger, you need more RAM.
- The prompt is simple: paste the retrieved context, add the question, ask for an answer.
- `llm(prompt, max_tokens=200, stop=["\n"])` generates text, capped at 200 tokens and stopping at the first newline.
Heads up: Running Llama 2 locally is slower than the OpenAI API, especially on CPU. Small models (like 7B) are faster, but less capable. Don’t expect miracles.
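Since you often have to trim context to fit the window, here's one way to sketch a token budget. The 4-characters-per-token figure is a crude heuristic for English text, not a real tokenizer; for exact counts you'd tokenize with the model itself:

```python
def trim_context(chunks, max_tokens=1500, chars_per_token=4):
    """Greedily keep retrieved chunks until a rough token budget is spent.

    chars_per_token=4 is a crude English-text approximation, not a real
    tokenizer; swap in the model's own tokenizer for exact counts.
    """
    budget = max_tokens * chars_per_token  # budget expressed in characters
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

# With a 50-"token" budget (~200 chars), only two 100-char chunks fit
print(len(trim_context(["a" * 100, "b" * 100, "c" * 100], max_tokens=50)))
```

Because retrieval already returns chunks best-first, trimming greedily from the front drops the least relevant context, not the most relevant.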
What Surprised Me
I expected RAG to be plug-and-play. It’s not. Here are a few things that caught me off guard:
- Context window limits are tighter. Llama 2 supports up to 4096 tokens, and I loaded it with a 2048-token window here. You have to be careful not to overload it, or your prompt gets truncated.
- Prompt formatting matters more. OpenAI’s models are forgiving, but open-source LLMs really care how you phrase things. A small tweak can make or break your answers.
- Retrieval quality makes or breaks it. If your chunks are too big or too small, retrieval gets noisy. I spent a weekend fiddling with chunk sizes and overlap.
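On prompt formatting: if you use a chat-tuned checkpoint (e.g. `llama-2-7b-chat`), it expects the instruction template it was fine-tuned on, with `[INST]` tags and an optional `<<SYS>>` system block. Base models don't need this, so treat the sketch below as chat-variant-specific:

```python
def llama2_chat_prompt(system, context, question):
    """Build a prompt in the Llama-2-chat instruction format."""
    user_msg = f"Context:\n{context}\n\nQuestion: {question}"
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system}\n"
        "<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

prompt = llama2_chat_prompt(
    "Answer only from the provided context. If the answer isn't there, say so.",
    "RAG combines retrieval and text generation.",
    "What does RAG combine?",
)
print(prompt)
```

The system block is also where you pin down the answer style ("answer only from context") — with open-source models, leaving that implicit is exactly the kind of small tweak that makes or breaks your answers.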
Common Mistakes
Here are a few pitfalls I’ve seen (and fallen into myself):
- Ignoring token limits. You paste in a ton of context, but the model quietly ignores half of it. Always check your model’s max context length.
- Bad chunking strategy. If you just split by lines or random sizes, retrieval gets messy. Use semantic chunking—paragraphs, or even sentence windows with overlap.
- Unclear prompts. Open-source LLMs aren’t as robust as GPT-4. If your prompt doesn’t clearly separate context from question, or you don’t specify what kind of answer you want, you’ll get garbage.
Key Takeaways
- Open-source RAG pipelines are doable, but you need to tune chunking, retrieval, and prompts more carefully than with OpenAI.
- Context window size is a hard limit—don’t ignore it, or your answers will suffer.
- Retrieval quality directly impacts generation quality. Invest time in good chunking and embedding strategies.
- Running LLMs locally is slower and less “magic”—you trade API convenience for control and privacy.
- Prompt engineering is not optional: test and iterate to get reliable answers.
Building a RAG pipeline with open-source tools taught me a lot about the nitty-gritty details you never see in demos. If you’re willing to tinker, you’ll get a system that’s yours—and honestly, that’s pretty satisfying.