There is a common assumption in the industry that if you want to build serious AI-powered applications, you need to abandon your current stack and go Python. Spin up FastAPI, learn LangChain, containerise everything in separate services, and suddenly your Rails monolith starts to feel like something you should be ashamed of.
Ruby — and Rails — already gives you what you need to build LLM-powered features in production: background jobs, database persistence, a mature ORM, solid HTTP clients, and many years of patterns for building systems that actually work. You don't need to rewrite everything. You need to think differently about how you use what you already have.
This series of articles is about that.
Part 1: Why your tests are letting you down
Most Rails projects I've worked on reached for the same tools when it was time to write tests: RSpec, FactoryBot, some mocks, and the comfortable certainty that if you put the same input in, you get the same output out.
This worked well for a long time. But when you put an LLM inside your stack, this foundation starts to break in ways that are not immediately obvious — and your test suite will stay green while lying to your face.
The issue, as you probably already know, is non-determinism.
An LLM given the same input won't always return the same output. Temperature settings, model updates, small changes in context — all of this can produce responses that mean exactly the same thing but are worded completely differently. A test like:
expect(response).to eq("I cannot help with that")
will fail when the model decides to say "That's outside what I can assist with" — even though both are correct answers.
This is more than a small annoyance. It reveals a fundamental incompatibility between how we verify traditional software and how we need to verify LLM behaviour.
What breaks first
The first thing that dies is integration tests. Any test that makes assertions directly on LLM output content will become flaky by nature.
And the situation gets even worse when you start fine-tuning models.
How do you write a regression test to confirm that your fine-tuned model is actually better than the base model? You cannot use assert_equal for this. You need a completely different approach!
A better mental model: confidence!
Instead of asking "did the model return this exact string?", we focus on scores, thresholds, and confidence over a sample of responses. Measuring these attributes makes it possible to assert whether a result falls within an acceptable range.
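To make that concrete, here is a minimal sketch of asserting on a pass rate over a sample of responses rather than on any single output. The scores and thresholds are made up for illustration; real scores would come from one of the evaluators described below.

```ruby
# Illustrative only: one similarity score per sampled response to the
# same prompt. In practice these come from an evaluator, not a literal.
SIMILARITY_THRESHOLD = 0.80
MIN_PASS_RATE = 0.90

scores = [0.91, 0.84, 0.79, 0.88, 0.93]

# Assert on the fraction of responses clearing the threshold,
# not on any individual response.
pass_rate = scores.count { |s| s >= SIMILARITY_THRESHOLD }.fdiv(scores.size)

puts pass_rate                   # 0.8
puts pass_rate >= MIN_PASS_RATE  # false: this sample would fail the suite
```

The point is that a single off-threshold response no longer fails the build; only a drop in the overall pass rate does.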
Some available options are:
- Semantic similarity — you use embeddings to compare the response against a reference answer, measuring meaning rather than string matching.
- LLM-as-judge — you use a second LLM call to evaluate the response against a rubric. This is how a lot of serious production eval pipelines work today.
- Ragas-style metrics — Ragas is a Python library that defines a set of reference-free evaluation metrics specifically designed for LLM applications. We can't use it directly in Ruby, but we can borrow the ideas and implement the same logic using the building blocks we already have — LLM calls and embeddings.
Before exploring them, one tip worth adopting early: keep track of your past test results, so you can compare them over time.
class CreateEvalResults < ActiveRecord::Migration[8.0]
  def change
    enable_extension "vector" # only needed if you plan to use pgvector later

    create_table :eval_results do |t|
      t.text :prompt, null: false
      t.text :response, null: false
      t.text :reference
      t.string :evaluator_type, null: false
      t.float :score, null: false
      t.boolean :passed, null: false
      t.jsonb :details, default: {}
      t.vector :embedding, limit: 1536 # matches text-embedding-3-small, optional
      t.string :model_version
      t.timestamps
    end

    add_index :eval_results, :evaluator_type
    add_index :eval_results, :passed
    add_index :eval_results, :model_version
  end
end
The details column is jsonb because each evaluator returns different metadata. The model_version column is how you compare your fine-tuned model against the base model over time. The embedding column is optional for now — I will explain when it becomes useful.
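Once results are persisted, comparing a fine-tuned model against the base model becomes a grouping query. In ActiveRecord that is roughly `EvalResult.group(:model_version).average(:score)`. Here is the same aggregation sketched in plain Ruby over made-up in-memory rows, so the logic is visible without a database:

```ruby
# Made-up rows standing in for EvalResult records.
rows = [
  { model_version: "base",       score: 0.78, passed: false },
  { model_version: "base",       score: 0.85, passed: true  },
  { model_version: "fine-tuned", score: 0.91, passed: true  },
  { model_version: "fine-tuned", score: 0.88, passed: true  }
]

# Equivalent of EvalResult.group(:model_version).average(:score)
averages = rows.group_by { |r| r[:model_version] }
               .transform_values { |rs| rs.sum { |r| r[:score] } / rs.size }

# Pass rate per model version: fraction of evals that cleared the threshold
pass_rates = rows.group_by { |r| r[:model_version] }
                 .transform_values { |rs| rs.count { |r| r[:passed] }.fdiv(rs.size) }

p averages   # average score per model version
p pass_rates # pass rate per model version
```

With real data, a fine-tuned model is "better" when both its average score and its pass rate beat the base model's on the same prompts.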
Semantic similarity
The idea is simple — embed both your reference answer and the model response, then measure the cosine distance between the two vectors. If they are close enough, the response is good.
# Gemfile
gem "neighbor"

# model
class EvalResult < ApplicationRecord
  has_neighbors :embedding
end

# helper
module EmbeddingHelper
  private

  def embed(text)
    result = @client.embeddings(
      parameters: { model: "text-embedding-3-small", input: text }
    )
    result.dig("data", 0, "embedding")
  end

  def cosine_similarity(a, b)
    dot = a.zip(b).sum { |x, y| x * y }
    dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
  end
end

# evaluator
class SemanticSimilarityEvaluator
  include EmbeddingHelper

  # text-embedding-3-small (trained with Matryoshka Representation Learning)
  # systematically produces lower absolute cosine scores than older models
  # like ada-002, so don't reuse ada-era thresholds
  SIMILARITY_THRESHOLD = 0.80

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, reference:, model_version:)
    response_embedding = embed(response)
    reference_embedding = embed(reference)
    similarity = cosine_similarity(response_embedding, reference_embedding)
    passed = similarity >= SIMILARITY_THRESHOLD

    EvalResult.create!(
      prompt: prompt,
      response: response,
      reference: reference,
      evaluator_type: "semantic_similarity",
      score: similarity,
      passed: passed,
      embedding: response_embedding,
      model_version: model_version,
      details: { similarity: similarity }
    )

    { score: similarity, passed: passed }
  end
end
The embedding column becomes useful later, when you have thousands of eval results and want to query for semantically similar past responses — for example, to debug why your model is regressing on certain inputs. For that you would use the neighbor gem (already in the Gemfile above) with pgvector's nearest neighbour search. But do not worry about that until you actually need it.
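Since cosine_similarity is plain Ruby, it is easy to sanity-check in isolation before trusting it with real embeddings. The helper is copied here so the snippet runs standalone:

```ruby
# Same implementation as in EmbeddingHelper above, copied for a standalone run.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
end

# Identical vectors score ~1.0, orthogonal vectors score 0.0.
puts cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) # ~1.0 (within float error)
puts cosine_similarity([1.0, 0.0], [0.0, 1.0])           # 0.0
```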
LLM-as-judge
This one feels recursive at first — using an LLM to evaluate another LLM — but it works surprisingly well. You define a rubric and let a stronger model score the response:
class LlmJudgeEvaluator
  JUDGE_PROMPT = <<~PROMPT
    You are evaluating the quality of an AI assistant response.
    Original question: %{prompt}
    Response to evaluate: %{response}
    Score the response from 1 to 5 on each of these criteria:
    - Relevance: does it actually answer the question?
    - Accuracy: is the information correct?
    - Tone: is it appropriate and professional?
    Respond in JSON only: {"relevance": X, "accuracy": X, "tone": X, "reasoning": "..."}
  PROMPT

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, model_version:)
    result = @client.chat(
      parameters: {
        model: model_version,
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(JUDGE_PROMPT, prompt: prompt, response: response)
        }]
      }
    )

    scores = JSON.parse(result.dig("choices", 0, "message", "content"))
    average = scores.slice("relevance", "accuracy", "tone").values.sum / 3.0
    passed = average >= 3.5

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "llm_judge",
      score: average,
      passed: passed,
      model_version: model_version,
      details: scores
    )

    scores.merge("average" => average, "passed" => passed)
  end
end
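The scoring step is worth seeing in isolation. With a canned judge reply standing in for the API response content (the JSON below is made up), the parsing and averaging work like this:

```ruby
require "json"

# A canned judge reply, standing in for result.dig("choices", 0, "message", "content")
judge_reply = '{"relevance": 5, "accuracy": 4, "tone": 4, "reasoning": "Clear and correct."}'

scores  = JSON.parse(judge_reply)
# Average only the three numeric criteria; "reasoning" is kept as metadata
average = scores.slice("relevance", "accuracy", "tone").values.sum / 3.0
passed  = average >= 3.5

p average # ~4.33
p passed  # true
```

Keeping the judge's "reasoning" field in the details column pays off later: when a score looks wrong, you can read why the judge scored it that way.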
Ragas-style metrics
As covered above, Ragas defines reference-free evaluation metrics for LLM applications. Here are Ruby implementations of three of them, built from the building blocks we already have — LLM calls and embeddings.
Faithfulness — does the response stick to what the source context says, or is it hallucinating? It works by decomposing the response into individual claims and checking each one against the context:
class FaithfulnessEvaluator
  include EmbeddingHelper

  PROMPT = <<~PROMPT
    Given the following context and response, extract each factual claim
    from the response and determine if it can be inferred from the context.
    Context: %{context}
    Response: %{response}
    Respond in JSON only:
    {
      "claims": [
        {"claim": "...", "supported": true/false}
      ]
    }
  PROMPT

  # FaithfulnessEvaluator.new(client).evaluate(
  #   prompt: "What causes rain?",
  #   context: "Rain forms through condensation, where atmospheric water vapor cools and clings to tiny particles—such as dust, smoke, or sea salt—known as cloud condensation nuclei. These droplets accumulate to form clouds, merging into larger droplets until they are heavy enough to fall as rain, a key part of the water cycle.",
  #   response: "Rain is caused by water vapor condensing around dust particles to form droplets that fall from clouds.",
  #   model_version: "openai/gpt-4o"
  # )
  #
  # {score: 1.0,
  #  claims:
  #   [{"claim" => "Rain is caused by water vapor condensing around dust particles.", "supported" => true},
  #    {"claim" => "Droplets fall from clouds.", "supported" => true}],
  #  passed: true}

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, context:, model_version:)
    result = @client.chat(
      parameters: {
        model: model_version,
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(PROMPT, context: context, response: response)
        }]
      }
    )

    claims = JSON.parse(result.dig("choices", 0, "message", "content"))["claims"]
    supported = claims.count { |c| c["supported"] }
    # Guard against an empty claims list: 0 / 0.0 is NaN in Ruby, not an error
    score = claims.empty? ? 0.0 : supported.to_f / claims.size
    passed = score >= 0.8

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "faithfulness",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { claims: claims }
    )

    { score: score, claims: claims, passed: passed }
  end
end
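The faithfulness score is just the fraction of extracted claims that the judge marked as supported. One edge case worth guarding: if the model extracts no claims, dividing zero by zero yields NaN in Ruby rather than raising an error, silently poisoning any averages computed over it. A standalone sketch of the scoring with a guard:

```ruby
# Score = fraction of claims supported by the context.
# The empty? guard avoids NaN (0 / 0.0 in Ruby is NaN, not an exception).
def faithfulness_score(claims)
  return 0.0 if claims.empty?
  claims.count { |c| c["supported"] }.fdiv(claims.size)
end

claims = [
  { "claim" => "Rain forms by condensation.", "supported" => true },
  { "claim" => "Droplets fall from clouds.",  "supported" => true },
  { "claim" => "Rain only falls at night.",   "supported" => false }
]

puts faithfulness_score(claims) # ~0.67 (2 of 3 claims supported)
puts faithfulness_score([])     # 0.0
```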
Answer relevancy — instead of directly scoring relevance, it generates questions the response would answer well, then measures semantic similarity against the original question. A relevant response should produce questions that look like the original one:
class AnswerRelevancyEvaluator
  include EmbeddingHelper

  PROMPT = <<~PROMPT
    Given the following response, generate 3 questions that this
    response would be a good answer to.
    Response: %{response}
    Respond in JSON only: {"questions": ["...", "...", "..."]}
  PROMPT

  # AnswerRelevancyEvaluator.new(client).evaluate(
  #   prompt: "Why did the Byzantine Empire last longer than the Western Roman Empire?",
  #   response: "The Byzantine (Eastern Roman) Empire lasted roughly 1,000 years longer than the Western Roman Empire (ending in 1453 vs. 476 AD) due to its superior strategic location, stronger economy, and a more robust bureaucratic government.",
  #   model_version: "openai/gpt-4o")

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, model_version:)
    result = @client.chat(
      parameters: {
        model: model_version,
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(PROMPT, response: response)
        }]
      }
    )

    generated_questions = JSON.parse(
      result.dig("choices", 0, "message", "content")
    )["questions"]

    original_embedding = embed(prompt)
    similarities = generated_questions.map { |q| cosine_similarity(original_embedding, embed(q)) }
    score = similarities.sum / similarities.size
    passed = score >= 0.8

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "answer_relevancy",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { generated_questions: generated_questions, similarities: similarities }
    )

    { score: score, passed: passed }
  end
end
Context precision — checks whether the retrieved context chunks actually contributed useful information to the answer. Ragas's full metric is rank-aware, rewarding retrieval that puts useful chunks first; the simplified version below scores the plain proportion of useful chunks. Useful when you are using RAG and want to know if your retrieval is doing its job:
class ContextPrecisionEvaluator
  PROMPT = <<~PROMPT
    Given the question and the following context chunk, did this chunk
    contribute useful information to answer the question?
    Question: %{question}
    Context chunk: %{chunk}
    Respond in JSON only: {"useful": true/false, "reasoning": "..."}
  PROMPT

  # ContextPrecisionEvaluator.new(client).evaluate(
  #   prompt: "What caused the 2008 financial crisis?",
  #   response: "The 2008 financial crisis was caused by the collapse of the housing bubble, risky mortgage-backed securities, and excessive leverage in the banking system.",
  #   context_chunks: [
  #     "The 2008 financial crisis was triggered by a collapse in US housing prices and widespread defaults on subprime mortgages.",
  #     "Mortgage-backed securities bundled risky loans and were sold to investors globally, spreading the risk throughout the financial system.",
  #     "Banks were highly leveraged, meaning small losses in assets could wipe out their capital and trigger insolvency.",
  #     "The collapse of Lehman Brothers in September 2008 accelerated the global financial panic."
  #   ],
  #   model_version: "openai/gpt-4o"
  # )

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, context_chunks:, model_version:)
    verdicts = context_chunks.each_with_index.map do |chunk, index|
      result = @client.chat(
        parameters: {
          model: model_version,
          response_format: { type: "json_object" },
          messages: [{
            role: "user",
            content: format(PROMPT, question: prompt, chunk: chunk)
          }]
        }
      )
      parsed = JSON.parse(result.dig("choices", 0, "message", "content"))
      { index: index, chunk: chunk }.merge(parsed)
    end

    useful_count = verdicts.count { |v| v["useful"] }
    score = useful_count.to_f / context_chunks.size
    passed = score >= 0.7

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "context_precision",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { verdicts: verdicts }
    )

    { score: score, verdicts: verdicts, passed: passed }
  end
end
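Ragas's full context precision metric is rank-aware: it averages precision@k over the ranks where a useful chunk appears, so retrieval that puts useful chunks first scores higher than retrieval that buries them at the bottom. A plain-Ruby sketch of that weighting, which could replace the simple ratio in the evaluator above:

```ruby
# Rank-weighted context precision: for each rank k holding a useful chunk,
# compute precision@k (useful chunks in the top k, divided by k), then
# average those values. Order-sensitive, unlike a plain useful/total ratio.
def rank_weighted_precision(useful_flags)
  useful_seen = 0
  precisions = useful_flags.each_with_index.filter_map do |useful, i|
    useful_seen += 1 if useful
    useful_seen.fdiv(i + 1) if useful
  end
  return 0.0 if precisions.empty?
  precisions.sum / precisions.size
end

# Same two useful chunks, different ranking: useful-first scores 1.0,
# useful-last scores noticeably lower.
puts rank_weighted_precision([true, true, false, false]) # 1.0
puts rank_weighted_precision([false, false, true, true]) # ~0.42
```

The input would come from the verdicts array the evaluator already builds, e.g. `verdicts.map { |v| v["useful"] }`.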
The next article goes deeper into how you wire all of this together into a proper eval pipeline — persisting results, tracking scores across model versions, and making confident decisions about when a fine-tuned model is actually ready for production.