Retrieval-Augmented Generation (RAG) lets you bring your own data to an LLM and get answers grounded in that data. In this post I’ll show how I used the open-source nomic-embed-text-v2-moe model for semantic search in a Rails app, while still using OpenAI for generation.
🧠 What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by feeding them relevant chunks of your data before generating a response. Instead of fine-tuning, we give the model useful context.
Here's the basic pipeline:
[ User Question ]
↓
[ Embed the Question (Nomic) ]
↓
[ Vector Search in PgVector ]
↓
[ Retrieve Relevant Chunks ]
↓
[ Assemble Prompt ]
↓
[ Generate Answer with OpenAI ]
🧰 The Stack
- Rails – Backend framework, routes, controllers, and persistence
- Nomic Embedding Model (nomic-embed-text-v2-moe) – Generates the embeddings for documents and queries
- FastAPI – Lightweight Python server to serve embeddings
- PgVector – PostgreSQL extension to store and query vector data
- OpenAI GPT-4 / GPT-3.5 – For the final response generation
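The Rails snippets below lean on a few gems. Based on the code (the :vector migration type, the Faraday HTTP calls, and the client.chat(parameters:) interface), my best guess at the Gemfile additions is:

# Gemfile: assumed dependencies for the snippets in this post
gem "faraday"      # HTTP client used to call the local embedding service
gem "neighbor"     # adds the :vector column type and pgvector support to ActiveRecord
gem "ruby-openai"  # OpenAI client with the client.chat(parameters: ...) interface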
🛠 Step 1: Run the Nomic Model Locally (Optional but Fast)
You can run the nomic-embed-text-v2-moe model using sentence-transformers in a Python FastAPI app:
from fastapi import FastAPI, Request
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Load the embedding model once at startup
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe")

@app.post("/embed")
async def embed(req: Request):
    data = await req.json()
    input_text = data["input"]
    embedding = model.encode(input_text).tolist()
    return {"embedding": embedding}
This becomes your internal embedding API, replacing OpenAI’s /embeddings endpoint.
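Before wiring it into Rails, a quick smoke test from a Rails console confirms the response shape and vector size (a sketch; the 768 check matches the column size used in Step 2):

# Hit the local embedding service and inspect the result
response = Faraday.post(
  "http://localhost:8000/embed",
  { input: "hello world" }.to_json,
  "Content-Type" => "application/json"
)
vector = JSON.parse(response.body)["embedding"]
puts vector.length # => 768 for nomic-embed-text-v2-moe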
📄 Step 2: Chunk and Store Your Data
Split your content into short passages (~100–300 words), embed each passage via your FastAPI endpoint, and store the results in PostgreSQL with pgvector.
First enable the pgvector extension, then add a vector column via a migration:
psql -d your_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
class AddEmbeddingToDocuments < ActiveRecord::Migration[7.1]
  def change
    add_column :documents, :embedding, :vector, limit: 768 # nomic-embed-text-v2-moe embedding size
  end
end
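With the column in place, ingestion boils down to chunk, embed, save. Here’s a minimal sketch; chunk_text and the docs/handbook.txt path are hypothetical, the Document model is assumed to declare has_neighbors :embedding (neighbor gem) so it accepts a plain array, and get_embedding is the helper from Step 3:

# Hypothetical helper: split text into ~200-word passages
def chunk_text(text, words_per_chunk: 200)
  text.split(/\s+/).each_slice(words_per_chunk).map { |words| words.join(" ") }
end

# Embed each chunk via the local Nomic service and persist it alongside the text
chunk_text(File.read("docs/handbook.txt")).each do |chunk|
  Document.create!(
    content: chunk,
    embedding: get_embedding(chunk) # 768-dimensional array from the FastAPI endpoint
  )
end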
🤖 Step 3: Embed User Queries via Nomic
In your Rails controller:
def get_embedding(text)
  response = Faraday.post(
    "http://localhost:8000/embed",
    { input: text }.to_json,
    "Content-Type" => "application/json"
  )
  JSON.parse(response.body)["embedding"]
end
Use the same model for both document and query embeddings.
🔍 Step 4: Perform Vector Search with PgVector
Search your documents for the closest matches using cosine distance:
# <=> is pgvector's cosine-distance operator (<-> is Euclidean/L2 distance);
# query_vector is an array of floats from our own embedding service
Document.order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'")).limit(5)
These top chunks become the context for the LLM.
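Wired into a controller action, Steps 3 and 4 fit together like this (a sketch; params[:question] and the top_chunks name are my own):

# Embed the user's question with the same model used for the documents,
# then pull the five closest passages by cosine distance
query_vector = get_embedding(params[:question])

top_chunks = Document
  .order(Arel.sql("embedding <=> '[#{query_vector.join(',')}]'"))
  .limit(5)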
🧾 Step 5: Build a Smart Prompt for OpenAI
Concatenate the top passages and feed them into OpenAI’s chat API:
client.chat(
  parameters: {
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are an assistant answering based on the provided context." },
      { role: "user", content: build_contextual_prompt(user_input, top_chunks) }
    ]
  }
)
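build_contextual_prompt isn’t shown in the original snippet; a minimal sketch of what it could look like (the prompt wording is my own):

# Join the retrieved passages into one prompt for the chat API
def build_contextual_prompt(question, chunks)
  context = chunks.map(&:content).join("\n---\n")
  <<~PROMPT
    Answer the question using only the context below.
    If the context doesn't contain the answer, say you don't know.

    Context:
    #{context}

    Question: #{question}
  PROMPT
end

Assuming the ruby-openai gem, the generated answer comes back in the response hash at response.dig("choices", 0, "message", "content").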
✅ Why Use Nomic for Embeddings?
- High-quality, open-source, multilingual
- No API rate limits or per-token costs when run locally or self-hosted
- Zero vendor lock-in at the embedding layer
- Great performance on MTEB and real-world retrieval
💡 Why I Still Use OpenAI for the LLM
The generation step is where OpenAI shines. Instead of replacing it prematurely, I decoupled the embedding stage. Now I can experiment, optimize, and even switch LLMs later if needed.
🧠 Takeaways
- RAG doesn’t need to be a heavyweight system.
- Open-source embeddings + OpenAI generation = powerful, flexible hybrid.
- PgVector + Rails makes vector search feel native and hackable.