If you've ever tried using ChatGPT to answer questions about your company's docs or codebase, you know the pain: hallucinations, half-right answers, or just plain nonsense. Retrieval-Augmented Generation (RAG) is supposed to fix this, right? But what happens when you swap out the fancy OpenAI API for open-source models? That’s where things got interesting for me—and not always in a good way.
Why Go Open-Source for RAG?
I wanted to build a RAG pipeline that didn’t rely on proprietary APIs or cloud costs. Maybe you’ve hit rate limits, or your data can’t leave your network. Open-source LLMs promise freedom, but they come with their own quirks.
The thing is, RAG isn’t a magic bullet. The idea is simple: retrieve relevant context, shove it into an LLM, and hope for better answers. But when you switch to open-source models, you start seeing the seams.
Building the Pipeline: What Actually Works
I’ll walk through the stack I landed on, with code you can actually run. For context, I used sentence-transformers for embeddings and llama.cpp (via llama-cpp-python) for the LLM. I chose these because they’re popular, actively maintained, and don’t require a GPU (though you’ll want one if your docs are big).
Step 1: Chunking and Embedding Documents
You have to chop your docs up before embedding. If you feed giant blobs, retrieval gets fuzzy and slow. Here’s a basic way to chunk and embed using sentence-transformers:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model (all-MiniLM-L6-v2 is small and fast)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example document (could be your README, docs, etc.)
doc = """
RAG combines retrieval and text generation.
It retrieves relevant info from your docs, then generates answers.
Open-source LLMs can be used instead of OpenAI.
"""

# Simple chunking: split by lines (one sentence per line here)
chunks = doc.strip().split('\n')

# Embed each chunk
embeddings = model.encode(chunks)  # shape: (num_chunks, embedding_dim)

# Store embeddings for later retrieval
chunk_db = dict(zip(chunks, embeddings))
```
Key lines explained:

- `SentenceTransformer('all-MiniLM-L6-v2')` loads a small, fast embedding model.
- `chunks` is just splitting by line/sentence. For real docs, you'll want smarter chunking (paragraphs, sliding windows).
- `embeddings` gives you a vector for each chunk—these are what you’ll use to find relevant context.
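As a rough sketch of what "smarter chunking" can look like, here's a sentence-window approach with overlap. The window and overlap sizes are arbitrary starting points, not tuned values, and the sentence splitter is deliberately naive:

```python
import re

def chunk_with_overlap(text, window=3, overlap=1):
    """Group sentences into overlapping windows.

    window:  sentences per chunk
    overlap: sentences shared between consecutive chunks
    """
    # Naive sentence split on ., !, ? followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]
    step = max(1, window - overlap)  # guard against window <= overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break
    return chunks

# Five sentences, window of 3, overlap of 1 -> two chunks sharing "C."
print(chunk_with_overlap("A. B. C. D. E.", window=3, overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still shows up intact in at least one chunk, which tends to make retrieval less brittle than hard cuts.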
Step 2: Retrieval — Matching Queries to Chunks
Now, when a user asks a question, you want to find the most relevant chunk(s). Cosine similarity is your friend.
```python
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(query, chunk_db, model, top_k=2):
    # Embed the query
    query_vec = model.encode([query])
    # Calculate similarities to each chunk
    chunk_texts = list(chunk_db.keys())
    chunk_vecs = np.array(list(chunk_db.values()))
    sims = cosine_similarity(query_vec, chunk_vecs)[0]
    # Get top_k chunks, best first
    top_indices = np.argsort(sims)[-top_k:][::-1]
    return [chunk_texts[i] for i in top_indices]

# Example usage
user_query = "How does RAG use open-source LLMs?"
context_chunks = retrieve_context(user_query, chunk_db, model)
print("Retrieved context:", context_chunks)
```
Key lines explained:

- `cosine_similarity` finds which chunks are closest to your query.
- `top_k` controls how many of the most relevant pieces you get back. If your docs are big, tune this.
- The retrieval step is surprisingly fast even on a laptop.
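One tweak worth considering: `top_k` always returns something, even when nothing in your docs is actually relevant. A minimum-similarity cutoff keeps junk out of the prompt. Here's a sketch; the `0.3` threshold is a made-up starting point you'd tune against your own data, not a recommended value:

```python
import numpy as np

def filter_by_similarity(chunk_texts, sims, top_k=2, min_sim=0.3):
    """Keep the top_k most similar chunks, dropping anything below min_sim."""
    order = np.argsort(sims)[::-1][:top_k]  # indices of best chunks, best first
    return [chunk_texts[i] for i in order if sims[i] >= min_sim]

# Example: the second chunk is barely related, so it gets dropped
texts = ["RAG pipelines", "cooking recipes"]
sims = np.array([0.82, 0.05])
print(filter_by_similarity(texts, sims))  # ['RAG pipelines']
```

Returning an empty list when nothing clears the bar is useful too: you can have the LLM say "I don't know" instead of hallucinating from unrelated context.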
Step 3: Generation — Using llama.cpp as the LLM
Here’s where things get real. Open-source LLMs are slower and have stricter input limits than OpenAI’s APIs. You often have to trim context or pick smaller models.
I used llama-cpp-python to run a local Llama 2 model. Here’s a basic generation example:
```python
from llama_cpp import Llama

# Initialize the model (point model_path at your downloaded weights,
# e.g. 'llama-2-7b.Q4_0.gguf')
llm = Llama(model_path="llama-2-7b.Q4_0.gguf", n_ctx=2048)

def rag_answer(query, context_chunks, llm):
    # Construct prompt (simple, but effective)
    context = "\n".join(context_chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # Generate, stopping at the first newline
    output = llm(prompt, max_tokens=200, stop=["\n"])
    return output['choices'][0]['text'].strip()

# Example usage
answer = rag_answer(user_query, context_chunks, llm)
print("LLM Answer:", answer)
```
Key lines explained:

- `Llama(model_path, n_ctx=2048)` loads the model with a 2048-token context window. If you go bigger, you need more RAM.
- The prompt is simple: paste the retrieved context, add the question, ask for an answer.
- `llm(prompt, max_tokens=200, stop=["\n"])` generates text, capped at 200 tokens and stopping at the first newline.
Heads up: Running Llama 2 locally is slower than the OpenAI API, especially on CPU. Small models (like 7B) are faster, but less capable. Don’t expect miracles.
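Since you often have to trim context to fit the window, here's one way to sketch a token budget. The 4-characters-per-token figure is a crude heuristic for English text, not a real tokenizer; for exact counts you'd tokenize with the model itself:

```python
def trim_context(chunks, max_tokens=1500, chars_per_token=4):
    """Greedily keep retrieved chunks until a rough token budget is spent.

    chars_per_token=4 is a crude English-text approximation, not a real
    tokenizer; swap in the model's own tokenizer for exact counts.
    """
    budget = max_tokens * chars_per_token  # budget expressed in characters
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept

# With a 50-"token" budget (~200 chars), only two 100-char chunks fit
print(len(trim_context(["a" * 100, "b" * 100, "c" * 100], max_tokens=50)))
```

Because retrieval already returns chunks best-first, trimming greedily from the front drops the least relevant context, not the most relevant.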
What Surprised Me
I expected RAG to be plug-and-play. It’s not. Here are a few things that caught me off guard:
- Context window limits are tighter. Llama 2 supports up to 4096 tokens, and I loaded it with a 2048-token window here. You have to be careful not to overload it, or your prompt gets truncated.
- Prompt formatting matters more. OpenAI’s models are forgiving, but open-source LLMs really care how you phrase things. A small tweak can make or break your answers.
- Retrieval quality makes or breaks it. If your chunks are too big or too small, retrieval gets noisy. I spent a weekend fiddling with chunk sizes and overlap.
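On prompt formatting: if you use a chat-tuned checkpoint (e.g. `llama-2-7b-chat`), it expects the instruction template it was fine-tuned on, with `[INST]` tags and an optional `<<SYS>>` system block. Base models don't need this, so treat the sketch below as chat-variant-specific:

```python
def llama2_chat_prompt(system, context, question):
    """Build a prompt in the Llama-2-chat instruction format."""
    user_msg = f"Context:\n{context}\n\nQuestion: {question}"
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system}\n"
        "<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

prompt = llama2_chat_prompt(
    "Answer only from the provided context. If the answer isn't there, say so.",
    "RAG combines retrieval and text generation.",
    "What does RAG combine?",
)
print(prompt)
```

The system block is also where you pin down the answer style ("answer only from context") — with open-source models, leaving that implicit is exactly the kind of small tweak that makes or breaks your answers.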
Common Mistakes
Here are a few pitfalls I’ve seen (and fallen into myself):
- Ignoring token limits. You paste in a ton of context, but the model quietly ignores half of it. Always check your model’s max context length.
- Bad chunking strategy. If you just split by lines or random sizes, retrieval gets messy. Use semantic chunking—paragraphs, or even sentence windows with overlap.
- Unclear prompts. Open-source LLMs aren’t as robust as GPT-4. If your prompt doesn’t clearly separate context from question, or you don’t specify what kind of answer you want, you’ll get garbage.
Key Takeaways
- Open-source RAG pipelines are doable, but you need to tune chunking, retrieval, and prompts more carefully than with OpenAI.
- Context window size is a hard limit—don’t ignore it, or your answers will suffer.
- Retrieval quality directly impacts generation quality. Invest time in good chunking and embedding strategies.
- Running LLMs locally is slower and less “magic”—you trade API convenience for control and privacy.
- Prompt engineering is not optional: test and iterate to get reliable answers.
Building a RAG pipeline with open-source tools taught me a lot about the nitty-gritty details you never see in demos. If you’re willing to tinker, you’ll get a system that’s yours—and honestly, that’s pretty satisfying.