Welcome back to the Ruby for AI series. Last time we built semantic search with pgvector. Now we're taking it further — building a full Retrieval-Augmented Generation (RAG) pipeline in Rails.
RAG is the pattern behind every "chat with your docs" product. Instead of hoping the LLM memorized the answer, you retrieve relevant context from your own data and augment the prompt with it. The LLM generates a response grounded in your actual content.
Let's build one from scratch.
The Architecture
Here's what our RAG pipeline looks like:
- Ingest — Upload documents, chunk them, generate embeddings, store in Postgres
- Retrieve — User asks a question, we find the most relevant chunks via vector search
- Generate — We send those chunks + the question to OpenAI and get a grounded answer
Simple. No magic. Let's wire it up.
Step 1: Document Ingestion
We need a Document model and a Chunk model. Documents get split into chunks because LLMs have context limits and smaller pieces retrieve better.
```bash
rails generate model Document title:string content:text status:string
rails generate model Chunk document:references content:text embedding:vector{1536}
rails db:migrate
```
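One detail that's easy to miss: for the vector search in the retrieval step to work, the Chunk model needs the neighbor gem's declaration. Assuming the neighbor gem from the last post, the models look like this:

```ruby
# app/models/chunk.rb
class Chunk < ApplicationRecord
  belongs_to :document
  has_neighbors :embedding # provided by the neighbor gem; enables nearest_neighbors
end

# app/models/document.rb
class Document < ApplicationRecord
  has_many :chunks, dependent: :destroy
end
```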
Now the chunking logic. There are fancy approaches, but overlapping fixed-size chunks work surprisingly well:
```ruby
# app/services/document_chunker.rb
class DocumentChunker
  CHUNK_SIZE = 500 # characters per chunk
  OVERLAP    = 100 # characters shared between consecutive chunks

  def initialize(document)
    @document = document
  end

  def call
    text = @document.content
    chunks = []
    start_pos = 0

    while start_pos < text.length
      chunk_text = text[start_pos, CHUNK_SIZE]
      break if chunk_text.strip.empty?

      chunks << @document.chunks.create!(content: chunk_text)
      start_pos += CHUNK_SIZE - OVERLAP
    end

    chunks
  end
end
```
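The sliding-window arithmetic is worth sanity-checking on its own. Here's the same logic as a pure function, decoupled from ActiveRecord (the function name is mine, not part of the service above):

```ruby
# Same sliding-window logic as DocumentChunker, as a plain function.
def chunk_text(text, size: 500, overlap: 100)
  chunks = []
  pos = 0
  while pos < text.length
    piece = text[pos, size]
    break if piece.strip.empty?
    chunks << piece
    pos += size - overlap # step forward by size minus overlap
  end
  chunks
end

# 1200 characters with the defaults: starts at 0, 400, 800 -> three chunks,
# and the last 100 characters of each chunk equal the first 100 of the next.
sample = (0...1200).map { |i| ((i % 26) + 97).chr }.join
pieces = chunk_text(sample)
```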
Step 2: Generate Embeddings for Chunks
We built the embedding service in the last post. Let's use it to process all chunks:
```ruby
# app/services/chunk_embedder.rb
class ChunkEmbedder
  def initialize(chunk)
    @chunk = chunk
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)

    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @chunk.content
      }
    )

    embedding = response.dig("data", 0, "embedding")
    @chunk.update!(embedding: embedding)
  end
end
```
Wire it together with an ingestion job:
```ruby
# app/jobs/ingest_document_job.rb
class IngestDocumentJob < ApplicationJob
  queue_as :default

  def perform(document_id)
    document = Document.find(document_id)
    document.update!(status: "processing")

    chunks = DocumentChunker.new(document).call
    chunks.each { |chunk| ChunkEmbedder.new(chunk).call }

    document.update!(status: "ready")
  end
end
```
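One chunk per API call works, but it's slow for large documents. The OpenAI embeddings endpoint accepts an array of inputs, so a batched variant cuts N requests down to one. A sketch (`BatchChunkEmbedder` is my name, and the client is injected so the mapping logic is testable without the network):

```ruby
# Sketch of a batched embedder: one API call for many chunks.
# `client` is anything that responds to #embeddings like ruby-openai's client.
class BatchChunkEmbedder
  def initialize(chunks, client:)
    @chunks = chunks
    @client = client
  end

  def call
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @chunks.map(&:content) # array input = one request
      }
    )

    # The API returns embeddings in input order; pair them back up.
    response.fetch("data").each_with_index do |item, i|
      @chunks[i].update!(embedding: item.fetch("embedding"))
    end

    @chunks
  end
end
```

In the job, `chunks.each { ... }` becomes `BatchChunkEmbedder.new(chunks, client: client).call`. For very large documents you'd still want to slice into batches to stay under the API's per-request input limits.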
Upload a document and kick it off:
```ruby
doc = Document.create!(title: "Rails Guide", content: long_text, status: "pending")
IngestDocumentJob.perform_later(doc.id)
```
Step 3: Retrieval
When a user asks a question, embed their query and find the closest chunks:
```ruby
# app/services/retriever.rb
class Retriever
  TOP_K = 5

  def initialize(query)
    @query = query
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)

    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: @query
      }
    )

    query_embedding = response.dig("data", 0, "embedding")

    Chunk.nearest_neighbors(:embedding, query_embedding, distance: "cosine")
         .limit(TOP_K)
  end
end
```
This returns the five chunks closest to the query by cosine distance. pgvector does the heavy lifting in SQL.
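A side note on scale: with this setup, pgvector compares the query against every row. That's fine for thousands of chunks, but past that you'll want an approximate index. A migration sketch using pgvector's HNSW index, with the `add_index` options the Rails pgvector integration supports (trade-off: much faster queries, slightly approximate results):

```ruby
# db/migrate/xxxx_add_hnsw_index_to_chunks.rb
class AddHnswIndexToChunks < ActiveRecord::Migration[7.1]
  def change
    # Operator class must match the distance metric used at query time
    # (vector_cosine_ops here, since Retriever uses cosine distance).
    add_index :chunks, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end
```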
Step 4: Generation with Context
Now the actual RAG part — we build a prompt with the retrieved context and send it to the LLM:
```ruby
# app/services/rag_generator.rb
class RagGenerator
  def initialize(query, chunks)
    @query = query
    @chunks = chunks
  end

  def call
    client = OpenAI::Client.new(access_token: Rails.application.credentials.openai_api_key)
    context = @chunks.map(&:content).join("\n\n---\n\n")

    response = client.chat(
      parameters: {
        model: "gpt-4o",
        messages: [
          {
            role: "system",
            content: "You are a helpful assistant. Answer the user's question based ONLY on the provided context. If the context doesn't contain the answer, say so. Do not make things up."
          },
          {
            role: "user",
            content: "Context:\n#{context}\n\n---\n\nQuestion: #{@query}"
          }
        ],
        temperature: 0.3
      }
    )

    response.dig("choices", 0, "message", "content")
  end
end
```
Notice temperature: 0.3 — we want factual, grounded responses, not creative ones.
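One refinement worth considering: the retrieved chunks go straight into the prompt, so context size grows without bound as TOP_K grows. A small helper (hypothetical, not part of the service above) that packs chunks greedily under a character budget keeps cost predictable:

```ruby
# Greedily keep chunk texts, in relevance order, until a character budget
# is hit. Chunks arrive sorted by similarity, so the best ones survive.
def build_context(chunk_texts, budget: 6_000)
  kept = []
  used = 0
  chunk_texts.each do |text|
    break if used + text.length > budget
    kept << text
    used += text.length
  end
  kept.join("\n\n---\n\n")
end
```

In `RagGenerator#call`, `@chunks.map(&:content).join(...)` would become `build_context(@chunks.map(&:content))`. A token-based budget would be more precise, but characters are a reasonable proxy to start with.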
Step 5: The Controller
Tie it all together:
```ruby
# app/controllers/questions_controller.rb
class QuestionsController < ApplicationController
  # Needed if CSRF protection is enabled and you're calling this
  # endpoint from curl or another non-browser client.
  skip_forgery_protection

  def create
    query = params[:query]
    chunks = Retriever.new(query).call
    answer = RagGenerator.new(query, chunks).call

    render json: {
      answer: answer,
      sources: chunks.map { |c| { document: c.document.title, excerpt: c.content.truncate(100) } }
    }
  end
end
```
```ruby
# config/routes.rb
resources :questions, only: [:create]
```
Test it:
```bash
curl -X POST http://localhost:3000/questions \
  -H "Content-Type: application/json" \
  -d '{"query": "How do migrations work in Rails?"}'
```
You get an answer grounded in your actual documents, plus source references so users can verify.
Key Takeaways
Chunk size matters. Too small and you lose context. Too big and you waste token budget. 300-500 characters is a good starting point. Experiment.
Overlap prevents lost context. Without overlap, a sentence split across two chunks might never get retrieved. 50-100 character overlap fixes this.
The system prompt is your guardrail. "Answer ONLY based on the provided context" prevents hallucination. Without it, the LLM will happily make things up.
This is the production architecture in miniature. The same pattern powers most RAG products you've used. The difference is in the details — better chunking strategies, re-ranking, hybrid search, error handling. But the bones are exactly this.
Next up: we'll build AI agents in Rails with tool use and function calling. The LLM stops just answering questions and starts doing things.