Welcome back to the Ruby for AI series. Last time we built semantic search with pgvector. Now we're taking it further — building a full Retrieval-Augmented Generation (RAG) pipeline in Rails.
RAG is the pattern behind every "chat with your docs" product. Instead of hoping the LLM memorized the answer, you retrieve relevant context from your own data and augment the prompt with it. The LLM generates a response grounded in your actual content.
Let's build one from scratch.
The Architecture
Here's what our RAG pipeline looks like:
- Ingest — Upload documents, chunk them, generate embeddings, store in Postgres
- Retrieve — User asks a question, we find the most relevant chunks via vector search
- Generate — We send those chunks + the question to OpenAI and get a grounded answer
Simple. No magic. Let's wire it up.
Step 1: Document Ingestion
We need a Document model and a Chunk model. Documents get split into chunks because LLMs have context limits and smaller pieces retrieve better.
```bash
rails generate model Document title:string content:text status:string
rails generate model Chunk document:references content:text embedding:vector{1536}
rails db:migrate
```
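One detail that's easy to miss: for the vector search in the retrieval step to work, the Chunk model needs the neighbor gem's declaration. Assuming the neighbor gem from the last post, the models look like this:

```ruby
# app/models/chunk.rb
class Chunk < ApplicationRecord
  belongs_to :document
  has_neighbors :embedding # provided by the neighbor gem; enables nearest_neighbors
end

# app/models/document.rb
class Document < ApplicationRecord
  has_many :chunks, dependent: :destroy
end
```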
Now the chunking logic. There are fancy approaches, but overlapping fixed-size chunks work surprisingly well:
```ruby
# app/services/document_chunker.rb
class DocumentChunker
  CHUNK_SIZE = 500 # characters per chunk
  OVERLAP    = 100 # characters shared between consecutive chunks

  def initialize(document)
    @document = document
  end

  def call
    text = @document.content
    chunks = []
    start_pos = 0

    while start_pos < text.length
      chunk_text = text[start_pos, CHUNK_SIZE]
      break if chunk_text.strip.empty?

      chunks << @document.chunks.create!(content: chunk_text)
      start_pos += CHUNK_SIZE - OVERLAP
    end

    chunks
  end
end
```
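The sliding-window arithmetic is worth sanity-checking on its own. Here's the same logic as a pure function, decoupled from ActiveRecord (the function name is mine, not part of the service above):

```ruby
# Same sliding-window logic as DocumentChunker, as a plain function.
def chunk_text(text, size: 500, overlap: 100)
  chunks = []
  pos = 0
  while pos < text.length
    piece = text[pos, size]
    break if piece.strip.empty?
    chunks << piece
    pos += size - overlap # step forward by size minus overlap
  end
  chunks
end

# 1200 characters with the defaults: starts at 0, 400, 800 -> three chunks,
# and the last 100 characters of each chunk equal the first 100 of the next.
sample = (0...1200).map { |i| ((i % 26) + 97).chr }.join
pieces = chunk_text(sample)
```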
Step 2: Generate Embeddings for Chunks
We built the embedding service in the last post. Let's use it to process all chunks:
```ruby
# app/services/chunk_embedder.rb
class ChunkEmbedder
  def initialize(chunk)
    @chunk = chunk
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)

    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @chunk.content
      }
    )

    embedding = response.dig("data", 0, "embedding")
    @chunk.update!(embedding: embedding)
  end
end
```
Wire it together with an ingestion job:
```ruby
# app/jobs/ingest_document_job.rb
class IngestDocumentJob < ApplicationJob
  queue_as :default

  def perform(document_id)
    document = Document.find(document_id)
    document.update!(status: "processing")

    chunks = DocumentChunker.new(document).call
    chunks.each { |chunk| ChunkEmbedder.new(chunk).call }

    document.update!(status: "ready")
  end
end
```
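One chunk per API call works, but it's slow for large documents. The OpenAI embeddings endpoint accepts an array of inputs, so a batched variant cuts N requests down to one. A sketch (`BatchChunkEmbedder` is my name, and the client is injected so the mapping logic is testable without the network):

```ruby
# Sketch of a batched embedder: one API call for many chunks.
# `client` is anything that responds to #embeddings like ruby-openai's client.
class BatchChunkEmbedder
  def initialize(chunks, client:)
    @chunks = chunks
    @client = client
  end

  def call
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @chunks.map(&:content) # array input = one request
      }
    )

    # The API returns embeddings in input order; pair them back up.
    response.fetch("data").each_with_index do |item, i|
      @chunks[i].update!(embedding: item.fetch("embedding"))
    end

    @chunks
  end
end
```

In the job, `chunks.each { ... }` becomes `BatchChunkEmbedder.new(chunks, client: client).call`. For very large documents you'd still want to slice into batches to stay under the API's per-request input limits.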
Upload a document and kick it off:
```ruby
doc = Document.create!(title: "Rails Guide", content: long_text, status: "pending")
IngestDocumentJob.perform_later(doc.id)
```
Step 3: Retrieval
When a user asks a question, embed their query and find the closest chunks:
```ruby
# app/services/retriever.rb
class Retriever
  TOP_K = 5

  def initialize(query)
    @query = query
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)

    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @query
      }
    )

    query_embedding = response.dig("data", 0, "embedding")

    Chunk.nearest_neighbors(:embedding, query_embedding, distance: "cosine")
         .limit(TOP_K)
  end
end
```
This returns the five chunks closest to the query by cosine distance. pgvector does the heavy lifting in SQL.
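A side note on scale: with this setup, pgvector compares the query against every row. That's fine for thousands of chunks, but past that you'll want an approximate index. A migration sketch using pgvector's HNSW index, with the `add_index` options the Rails pgvector integration supports (trade-off: much faster queries, slightly approximate results):

```ruby
# db/migrate/xxxx_add_hnsw_index_to_chunks.rb
class AddHnswIndexToChunks < ActiveRecord::Migration[7.1]
  def change
    # Operator class must match the distance metric used at query time
    # (vector_cosine_ops here, since Retriever uses cosine distance).
    add_index :chunks, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end
```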
Step 4: Generation with Context
Now the actual RAG part — we build a prompt with the retrieved context and send it to the LLM:
```ruby
# app/services/rag_generator.rb
class RagGenerator
  def initialize(query, chunks)
    @query = query
    @chunks = chunks
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
    context = @chunks.map(&:content).join("\n\n---\n\n")

    response = client.chat(
      parameters: {
        model: "gpt-4o",
        messages: [
          {
            role: "system",
            content: "You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain the answer, say so. Do not make things up."
          },
          {
            role: "user",
            content: "Context:\n#{context}\n\n---\n\nQuestion: #{@query}"
          }
        ],
        temperature: 0.3
      }
    )

    response.dig("choices", 0, "message", "content")
  end
end
```
Notice temperature: 0.3 — we want factual, grounded responses, not creative ones.
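One refinement worth considering: the retrieved chunks go straight into the prompt, so context size grows without bound as TOP_K grows. A small helper (hypothetical, not part of the service above) that packs chunks greedily under a character budget keeps cost predictable:

```ruby
# Greedily keep chunk texts, in relevance order, until a character budget
# is hit. Chunks arrive sorted by similarity, so the best ones survive.
def build_context(chunk_texts, budget: 6_000)
  kept = []
  used = 0
  chunk_texts.each do |text|
    break if used + text.length > budget
    kept << text
    used += text.length
  end
  kept.join("\n\n---\n\n")
end
```

In `RagGenerator#call`, `@chunks.map(&:content).join(...)` would become `build_context(@chunks.map(&:content))`. A token-based budget would be more precise, but characters are a reasonable proxy to start with.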
Step 5: The Controller
Tie it all together:
```ruby
# app/controllers/questions_controller.rb
class QuestionsController < ApplicationController
  # Needed if CSRF protection is enabled and you're calling this
  # endpoint from curl or another non-browser client.
  skip_forgery_protection

  def create
    query = params[:query]
    chunks = Retriever.new(query).call
    answer = RagGenerator.new(query, chunks).call

    render json: {
      answer: answer,
      sources: chunks.map { |c| { document: c.document.title, excerpt: c.content.truncate(100) } }
    }
  end
end
```
```ruby
# config/routes.rb
resources :questions, only: [:create]
```
Test it:
```bash
curl -X POST http://localhost:3000/questions \
  -H "Content-Type: application/json" \
  -d '{"query": "How do migrations work in Rails?"}'
```
You get an answer grounded in your actual documents, plus source references so users can verify.
Key Takeaways
Chunk size matters. Too small and you lose context. Too big and you waste token budget. 300-500 characters is a good starting point. Experiment.
Overlap prevents lost context. Without overlap, a sentence split across two chunks might never get retrieved. 50-100 character overlap fixes this.
The system prompt is your guardrail. "Answer ONLY based on the provided context" prevents hallucination. Without it, the LLM will happily make things up.
This is the production architecture in miniature. The same pattern powers most RAG products you've used. The difference is in the details — better chunking strategies, re-ranking, hybrid search, error handling. But the bones are exactly this.
Next up: we'll build AI agents in Rails with tool use and function calling. The LLM stops just answering questions and starts doing things.