Retrieval-Augmented Generation (RAG) lets you bring your own data to an LLM and get answers grounded in that data. In this post I’ll show how I used the open-source nomic-embed-text-v2-moe model for semantic search in a Rails app, while still using OpenAI for generation.
🧠 What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by feeding them relevant chunks of your data before generating a response. Instead of fine-tuning, we give the model useful context.
Here's the basic pipeline:
[ User Question ]
↓
[ Embed the Question (Nomic) ]
↓
[ Vector Search in PgVector ]
↓
[ Retrieve Relevant Chunks ]
↓
[ Assemble Prompt ]
↓
[ Generate Answer with OpenAI ]
🧰 The Stack
- Rails – Backend framework, routes, controllers, and persistence
- Nomic Embedding Model (nomic-embed-text-v2-moe) – Generates the embeddings for documents and queries
- FastAPI – Lightweight Python server to serve embeddings
- PgVector – PostgreSQL extension to store and query vector data
- OpenAI GPT-4 / GPT-3.5 – For the final response generation
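The Rails snippets below lean on a few gems. Based on the code (the :vector migration type, the Faraday HTTP calls, and the client.chat(parameters:) interface), my best guess at the Gemfile additions is:

# Gemfile: assumed dependencies for the snippets in this post
gem "faraday"      # HTTP client used to call the local embedding service
gem "neighbor"     # adds the :vector column type and pgvector support to ActiveRecord
gem "ruby-openai"  # OpenAI client with the client.chat(parameters: ...) interface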
🛠 Step 1: Run the Nomic Model Locally (Optional but Fast)
You can run the nomic-embed-text-v2-moe model using sentence-transformers in a Python FastAPI app:
from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model once at startup
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe")

@app.post("/embed")
async def embed(req: Request):
    data = await req.json()
    input_text = data["input"]
    embedding = model.encode(input_text).tolist()
    return {"embedding": embedding}
This becomes your internal embedding API, replacing OpenAI’s /embeddings endpoint.
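Before wiring it into Rails, a quick smoke test from a Rails console confirms the response shape and vector size (a sketch; the 768 check matches the column size used in Step 2):

# Hit the local embedding service and inspect the result
response = Faraday.post(
  "http://localhost:8000/embed",
  { input: "hello world" }.to_json,
  "Content-Type" => "application/json"
)
vector = JSON.parse(response.body)["embedding"]
puts vector.length # => 768 for nomic-embed-text-v2-moe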
📄 Step 2: Chunk and Store Your Data
Split your content into short passages (~100–300 words), embed each passage via your FastAPI endpoint, and store the results in PostgreSQL with pgvector.
First enable the pgvector extension, then add a vector column via a migration:
psql -d your_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    add_column :documents, :embedding, :vector, limit: 768 # nomic-embed-text-v2-moe embedding size
  end
end
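With the column in place, ingestion boils down to chunk, embed, save. Here’s a minimal sketch; chunk_text and the docs/handbook.txt path are hypothetical, the Document model is assumed to declare has_neighbors :embedding (neighbor gem) so it accepts a plain array, and get_embedding is the helper from Step 3:

# Hypothetical helper: split text into ~200-word passages
def chunk_text(text, words_per_chunk: 200)
  text.split(/\s+/).each_slice(words_per_chunk).map { |words| words.join(" ") }
end

# Embed each chunk via the local Nomic service and persist it alongside the text
chunk_text(File.read("docs/handbook.txt")).each do |chunk|
  Document.create!(
    content: chunk,
    embedding: get_embedding(chunk) # 768-dimensional array from the FastAPI endpoint
  )
end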
🤖 Step 3: Embed User Queries via Nomic
In your Rails controller:
def get_embedding(text)
  response = Faraday.post(
    "http://localhost:8000/embed",
    { input: text }.to_json,
    "Content-Type" => "application/json"
  )
  JSON.parse(response.body)["embedding"]
end
Use the same model for both document and query embeddings.
🔍 Step 4: Perform Vector Search with PgVector
Search your documents for the closest matches using cosine distance:
# <=> is pgvector's cosine-distance operator (<-> is Euclidean/L2 distance);
# query_vector is an array of floats from our own embedding service
Document.order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'")).limit(5)
These top chunks become the context for the LLM.
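Wired into a controller action, Steps 3 and 4 fit together like this (a sketch; params[:question] and the top_chunks name are my own):

# Embed the user's question with the same model used for the documents,
# then pull the five closest passages by cosine distance
query_vector = get_embedding(params[:question])

top_chunks = Document
  .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
  .limit(5)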
🧾 Step 5: Build a Smart Prompt for OpenAI
Concatenate the top passages and feed them into OpenAI’s chat API:
client.chat(
  parameters: {
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are an assistant answering based on the provided context." },
      { role: "user", content: build_contextual_prompt(user_input, top_chunks) }
    ]
  }
)
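build_contextual_prompt isn’t shown in the original snippet; a minimal sketch of what it could look like (the prompt wording is my own):

# Join the retrieved passages into one prompt for the chat API
def build_contextual_prompt(question, chunks)
  context = chunks.map(&:content).join("\n---\n")
  <<~PROMPT
    Answer the question using only the context below.
    If the context doesn't contain the answer, say you don't know.

    Context:
    #{context}

    Question: #{question}
  PROMPT
end

Assuming the ruby-openai gem, the generated answer comes back in the response hash at response.dig("choices", 0, "message", "content").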
✅ Why Use Nomic for Embeddings?
- High-quality, open-source, multilingual
- No API rate limits or per-token costs when run locally or self-hosted
- Zero vendor lock-in at the embedding layer
- Great performance on MTEB and real-world retrieval
💡 Why I Still Use OpenAI for the LLM
The generation step is where OpenAI shines. Instead of replacing it prematurely, I decoupled the embedding stage. Now I can experiment, optimize, and even switch LLMs later if needed.
🧠 Takeaways
- RAG doesn’t need to be a heavyweight system.
- Open-source embeddings + OpenAI generation = powerful, flexible hybrid.
- PgVector + Rails makes vector search feel native and hackable.