DEV Community: Paulo Henrique Castro

Ruby is all you need (Part II)

Paulo Henrique Castro — Thu, 16 Apr 2026 10:35:31 +0000

From Eval to Production: A Ruby and Rails Approach

If you read the first article, you now have a set of evaluators that can score your LLM responses — semantic similarity, LLM-as-judge, faithfulness, answer relevancy, context precision. You have a model_version column in your eval_results table. You are storing scores over time.

Now what? How do you actually use all of this to make shipping decisions?

The problem with shipping model changes

Deploying a new model version is not like deploying a code change. With code, you can read a diff. You can write a test that covers the change. You can be reasonably confident that if your test suite passes, you did not break anything.

With models it is different. You fine-tune, you run some manual tests, things look good, you ship, and then three days later a user reports that the assistant is giving wrong answers on a specific type of question you never thought to test. Sound familiar?

The eval pipeline exists to fix exactly this. But only if you wire it into your deployment process.

Tracking model deployments

First, you need a way to track which model version is active and what percentage of traffic it is serving:

class CreateModelDeployments < ActiveRecord::Migration[8.0]
  def change
    create_table :model_deployments do |t|
      t.string :model_version, null: false
      t.string :status, null: false, default: "pending"
      t.integer :traffic_percentage, null: false, default: 0
      t.float :min_eval_score, null: false, default: 0.8
      t.jsonb :eval_summary, default: {}
      t.timestamps
    end

    add_index :model_deployments, :status
    add_index :model_deployments, :model_version, unique: true
  end
end

And the model:

class ModelDeployment < ApplicationRecord
  STATUSES = %w[pending evaluating approved rejected active rolled_back].freeze

  validates :status, inclusion: { in: STATUSES }
  validates :traffic_percentage, inclusion: { in: 0..100 }

  def self.active
    find_by(status: "active")
  end

  def self.candidate
    find_by(status: "approved")
  end
end

Routing traffic between model versions

When you make an LLM call, you need to decide which model version to use. You route based on the current deployment state:

class ModelRouter
  FALLBACK_MODEL = "openai/gpt-4o".freeze

  def self.resolve(session_id:)
    active = ModelDeployment.active
    candidate = ModelDeployment.candidate

    # no active deployment yet — use fallback
    unless active
      Rails.logger.warn("No active model deployment found. Using fallback: #{FALLBACK_MODEL}")
      return FALLBACK_MODEL
    end

    # no candidate being tested — use active
    return active.model_version unless candidate

    # deterministic routing based on session_id
    # same session always gets the same model version
    # to_s because Rails' session.id may be a Rack::Session::SessionId rather than a String
    bucket = Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
    bucket < candidate.traffic_percentage ? candidate.model_version : active.model_version
  end
end

The deterministic routing is important — hashing the session ID means the same user always gets the same model version during the experiment, avoiding jarring inconsistencies mid-conversation.
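
To see the bucketing in isolation, here is a minimal sketch outside Rails. The helper names bucket_for and pick_model are hypothetical (they are not part of ModelRouter), and the session ids are made up:

```ruby
require "digest"

# Hypothetical standalone version of the bucketing line in ModelRouter.
# Maps any session id to a stable integer between 0 and 99.
def bucket_for(session_id)
  Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
end

# Returns the candidate when the session's bucket falls below the
# candidate's traffic percentage, otherwise the active model.
def pick_model(session_id, active:, candidate:, traffic_percentage:)
  bucket_for(session_id) < traffic_percentage ? candidate : active
end

# Made-up session ids, only to show how traffic splits.
sessions = (1..1_000).map { |i| "session-#{i}" }
on_candidate = sessions.count do |id|
  pick_model(id, active: "v1", candidate: "v2", traffic_percentage: 10) == "v2"
end

# The share lands near 10%, but not exactly on it.
puts "candidate share: #{on_candidate / 10.0}%"
```

Because a session's bucket never changes, raising traffic_percentage only moves sessions from the active model to the candidate, never the other way around.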

Running evals asynchronously with Active Job

The eval pipeline job ties everything together. Because each evaluator has a different signature depending on what inputs it needs, we use a dispatcher pattern rather than calling them all with a generic argument list:

class EvalPipelineJob < ApplicationJob
  queue_as :evals

  def perform(prompt:, response:, model_version:, reference: nil, context: nil, context_chunks: nil)
    client = OpenAI::Client.new(
      access_token: Rails.application.credentials.openai.api_key
    )

    run_evaluator(LlmJudgeEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if reference.present?
      run_evaluator(SemanticSimilarityEvaluator, client,
        prompt: prompt, response: response, reference: reference, model_version: model_version)
    end

    if context.present?
      run_evaluator(FaithfulnessEvaluator, client,
        prompt: prompt, response: response, context: context, model_version: model_version)
    end

    run_evaluator(AnswerRelevancyEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if context_chunks.present?
      run_evaluator(ContextPrecisionEvaluator, client,
        prompt: prompt, response: response, context_chunks: context_chunks, model_version: model_version)
    end

    ModelPromotionJob.perform_later(model_version: model_version)
  end

  private

  def run_evaluator(klass, client, **args)
    klass.new(client).evaluate(**args)
  rescue => e
    Rails.logger.error("#{klass} failed: #{e.message}")
  end
end

Each evaluator is only called when the inputs it needs are actually present. LlmJudgeEvaluator and AnswerRelevancyEvaluator run on every request since they only need the prompt and response. The others run when context or reference data is available.
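
The rescue inside run_evaluator is what keeps one failing evaluator from taking the rest of the pipeline down. A rough sketch of that behaviour, with two stub evaluators (AlwaysPassEvaluator and FlakyEvaluator are hypothetical stand-ins) and warn in place of Rails.logger:

```ruby
# Stub evaluators standing in for the real ones (hypothetical names).
class AlwaysPassEvaluator
  def initialize(_client); end

  def evaluate(**_args)
    { score: 1.0, passed: true }
  end
end

class FlakyEvaluator
  def initialize(_client); end

  def evaluate(**_args)
    raise "judge timed out"
  end
end

# Same shape as EvalPipelineJob#run_evaluator, with warn instead of Rails.logger.
# Returns nil when the evaluator raises, so the caller can carry on.
def run_evaluator(klass, client, **args)
  klass.new(client).evaluate(**args)
rescue StandardError => e
  warn("#{klass} failed: #{e.message}")
  nil
end

results = [FlakyEvaluator, AlwaysPassEvaluator].map do |klass|
  run_evaluator(klass, nil, prompt: "hi", response: "hello", model_version: "v2")
end

# The flaky evaluator yields nil; the second one still runs and returns its score.
p results
```

The trade-off is that a failing evaluator simply produces no eval_results row, so it is worth watching the error log for evaluators that fail silently for long periods.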

In your controller, you never block the response waiting for evals:

class AssistantController < ApplicationController
  def chat
    model_version = ModelRouter.resolve(session_id: session.id)
    context_chunks = KnowledgeBase.retrieve(params[:message])
    context = context_chunks.join("\n")

    response_text = llm_client.chat(
      model: model_version,
      messages: [{ role: "user", content: params[:message] }]
    ).dig("choices", 0, "message", "content")

    EvalPipelineJob.perform_later(
      prompt: params[:message],
      response: response_text,
      model_version: model_version,
      context: context,
      context_chunks: context_chunks
    )

    render json: { response: response_text }
  end
end

Automated promotion

Once enough eval results have accumulated for a candidate model, the promotion job decides whether to promote or reject it. Critically, we only score results collected after the deployment was created — otherwise we risk mixing in results from a previous deployment that happened to use the same model identifier:

class ModelPromotionJob < ApplicationJob
  queue_as :evals

  MIN_SAMPLE_SIZE = 50

  def perform(model_version:)
    deployment = ModelDeployment.find_by(
      model_version: model_version,
      status: "approved"
    )
    return unless deployment

    # only count results collected since this deployment was created
    results = EvalResult.where(
      model_version: model_version
    ).where("created_at >= ?", deployment.created_at)

    return if results.count < MIN_SAMPLE_SIZE

    summary = compute_summary(results)
    deployment.update!(eval_summary: summary)

    if summary[:average_score] >= deployment.min_eval_score &&
       summary[:pass_rate] >= 0.85
      promote!(deployment)
    elsif summary[:average_score] < deployment.min_eval_score * 0.9
      # significantly below threshold — reject early
      deployment.update!(status: "rejected")
      Rails.logger.warn(
        "Model #{model_version} rejected. Score: #{summary[:average_score]}"
      )
    end
  end

  private

  def compute_summary(results)
    {
      average_score: results.average(:score).to_f.round(3),
      pass_rate: (results.where(passed: true).count.to_f / results.count).round(3),
      sample_size: results.count,
      by_evaluator: results.group(:evaluator_type)
                           .average(:score)
                           .transform_values { |v| v.round(3) }
    }
  end

  def promote!(deployment)
    ActiveRecord::Base.transaction do
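      # update_all skips timestamps, so set updated_at explicitly: ModelRollbackJob orders by it
      ModelDeployment.where(status: "active")
                     .update_all(status: "rolled_back", traffic_percentage: 0, updated_at: Time.current)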
      deployment.update!(status: "active", traffic_percentage: 100)
    end

    Rails.logger.info("Model #{deployment.model_version} promoted to active.")
  end
end

The by_evaluator breakdown in the summary matters. A model can have a good average score overall but fail badly on faithfulness specifically — which is a red flag you would miss if you only looked at the aggregate.
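
Stripped of Active Record, the decision rule is small enough to test on its own. The sketch below mirrors the thresholds from ModelPromotionJob; promotion_decision and weak_evaluators are hypothetical helper names, the per-evaluator floor is an assumed extension that the job above does not implement, and the numbers are made up:

```ruby
# Thresholds copied from ModelPromotionJob#perform.
MIN_PASS_RATE = 0.85
EARLY_REJECT_FACTOR = 0.9

# Returns :promote, :reject, or :wait for a given eval summary.
def promotion_decision(summary, min_eval_score:)
  if summary[:average_score] >= min_eval_score && summary[:pass_rate] >= MIN_PASS_RATE
    :promote
  elsif summary[:average_score] < min_eval_score * EARLY_REJECT_FACTOR
    :reject
  else
    :wait
  end
end

# Assumed extension: list evaluators whose own average is below a floor,
# so a good aggregate cannot hide a weak faithfulness score.
def weak_evaluators(by_evaluator, floor:)
  by_evaluator.select { |_type, average| average < floor }.keys
end

# Made-up summary for illustration.
summary = {
  average_score: 0.84,
  pass_rate: 0.88,
  by_evaluator: { "answer_relevancy" => 0.93, "faithfulness" => 0.61 }
}

promotion_decision(summary, min_eval_score: 0.8)     # => :promote
weak_evaluators(summary[:by_evaluator], floor: 0.7)  # => ["faithfulness"]
```

In this example the aggregate would promote the model even though faithfulness is well below the floor, which is exactly the case the breakdown is meant to catch.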

Rolling back

Sometimes things go wrong in production even after a model passes evals. You want to be able to roll back fast:

class ModelRollbackJob < ApplicationJob
  queue_as :evals

  def perform(model_version:)
    ActiveRecord::Base.transaction do
      current = ModelDeployment.find_by(model_version: model_version)

      unless current
        Rails.logger.error("Rollback failed: no deployment found for #{model_version}")
        return
      end

      current.update!(status: "rolled_back", traffic_percentage: 0)

      previous = ModelDeployment.where(status: "rolled_back")
                                .where.not(model_version: model_version)
                                .order(updated_at: :desc)
                                .first

      unless previous
        Rails.logger.error(
          "Rollback failed: no previous deployment to restore. System has no active model."
        )
        raise "No previous deployment available for rollback"
      end

      previous.update!(status: "active", traffic_percentage: 100)
      Rails.logger.warn(
        "Model #{model_version} rolled back. Restored: #{previous.model_version}"
      )
    end
  end
end

The transaction ensures you never end up in a state where the current model is rolled back but nothing is restored. If there is no previous deployment to fall back to, we raise explicitly rather than silently leaving the system without an active model.
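
The part of the rollback that is easiest to get wrong is picking which deployment to restore. Here is a minimal in-memory sketch of that selection, using a hypothetical Deployment struct and made-up timestamps in place of the Active Record model:

```ruby
# Hypothetical stand-in for the ModelDeployment record.
Deployment = Struct.new(:model_version, :status, :updated_at, keyword_init: true)

# Picks the most recently updated rolled_back deployment, excluding the
# version that is being rolled back right now. Returns nil if there is none.
def previous_deployment(deployments, rolled_back_version)
  deployments
    .select { |d| d.status == "rolled_back" && d.model_version != rolled_back_version }
    .max_by(&:updated_at)
end

# Made-up history: v3 was just rolled back, v2 was the one active before it.
deployments = [
  Deployment.new(model_version: "v1", status: "rolled_back", updated_at: Time.utc(2026, 1, 10)),
  Deployment.new(model_version: "v2", status: "rolled_back", updated_at: Time.utc(2026, 3, 2)),
  Deployment.new(model_version: "v3", status: "rolled_back", updated_at: Time.utc(2026, 4, 1))
]

restored = previous_deployment(deployments, "v3")
restored.model_version # => "v2"
```

This only picks the right version if updated_at reflects the moment a deployment left the active state. Active Record's update_all does not touch timestamps on its own, so any bulk status change needs to set updated_at explicitly.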

Putting it all together

The flow looks like this: you register a new model version as a candidate with status: "approved" and a traffic percentage — say ten percent. The router starts sending a portion of real traffic to it.

The eval pipeline accumulates scores in the background, scoped to the deployment window. Once you hit the minimum sample size, the promotion job decides automatically whether to promote to one hundred percent or reject.

If something goes wrong after promotion, rollback restores the previous version in a single transaction.

None of this required Python. No LangChain, no separate microservice, no new infrastructure.

Just Rails, Active Job, Postgres, and the evaluators we built in part one.

Shipping model changes is hard. But making it safe, measurable, and reversible — that is an engineering problem.

And engineering problems are what Rails is good at.

Ruby is All You Need

Paulo Henrique Castro — Sat, 28 Mar 2026 08:33:58 +0000

There is a common assumption in the industry that if you want to build serious AI-powered applications, you need to abandon your current stack and go Python. Spin up FastAPI, learn LangChain, containerise everything in separate services, and suddenly your Rails monolith starts to feel like something you should be ashamed of.

Ruby, and Rails with it, already gives you what you need to build LLM-powered features in production: background jobs, database persistence, a mature ORM, solid HTTP clients, and many years of patterns for building systems that actually work! You don't need to rewrite everything. You need to think differently about how you use what you already have.

This series of articles is about that.

Part 1: Why your tests are letting you down

Most Rails projects I have worked on reached for the same tools when it was time to write tests: RSpec, FactoryBot, some mocks, and the comfortable certainty that if you put the same input in, you get the same output out.

This worked well for a long time. But when you put an LLM inside your stack, this foundation starts to break in ways that are not immediately obvious — and your test suite will stay green while lying to your face.

The issue, as you probably already know, is non-determinism.

An LLM with the same input won't always return the same output. Temperature settings, model updates, small changes in context: all of these can produce responses that mean exactly the same thing but are worded completely differently. A test like:

expect(response).to eq("I cannot help with that")

will fail when the model decides to say "That's outside what I can assist with" — even though both are correct answers.

This is more than a small annoyance. It reveals a fundamental incompatibility between how we verify traditional software and how we need to verify LLM behaviour.

What breaks first

The first thing that dies is integration tests. Any test that makes assertions directly on LLM output content will become flaky by nature.

And the situation gets even worse when you start fine-tuning models.

How do you write a regression test to confirm that your fine-tuned model is actually better than the base model? You cannot use assert_equal for this. You need a completely different approach!

A better mental model: confidence!

Instead of asking "did the model return this exact string?", we now focus on scores, thresholds, and confidence over a sample of responses. Measuring these makes it possible to assert whether the result falls within an acceptable range.
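
In practice that means asserting on an aggregate rather than on a single string. A minimal sketch, assuming you already have one score per sampled response (pass_rate is a hypothetical helper and the scores are made up):

```ruby
# Hypothetical helper: share of scores at or above a threshold.
def pass_rate(scores, threshold:)
  return 0.0 if scores.empty?

  scores.count { |score| score >= threshold }.fdiv(scores.size)
end

# Made-up scores for five sampled responses.
scores = [0.91, 0.84, 0.78, 0.88, 0.95]

rate = pass_rate(scores, threshold: 0.8) # => 0.8 (4 of 5 responses pass)

# The assertion is on the sample, not on any single response.
raise "model regressed: pass rate #{rate}" if rate < 0.75
```

One weak response no longer fails the build, but a drop in the overall pass rate does.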

Some available options are:

Semantic similarity — you use embeddings to compare the response against a reference answer, measuring meaning rather than string matching.
LLM-as-judge — you use a second LLM call to evaluate the response against a rubric. This is how a lot of serious production eval pipelines work today.
Ragas-style metrics — Ragas is a Python library that defines a set of reference-free evaluation metrics specifically designed for LLM applications. We can't use it directly in Ruby, but we can borrow the ideas and implement the same logic using the building blocks we already have — LLM calls and embeddings.

Before exploring them, here is a nice tip to consider: keep track of your past test results, so you can compare them over time!

class CreateEvalResults < ActiveRecord::Migration[8.0]
  def change
    enable_extension "vector" # required by the vector column below; drop both if you do not plan to use pgvector

    create_table :eval_results do |t|
      t.text :prompt, null: false
      t.text :response, null: false
      t.text :reference
      t.string :evaluator_type, null: false
      t.float :score, null: false
      t.boolean :passed, null: false
      t.jsonb :details, default: {}
      t.vector :embedding, limit: 1536 # matches text-embedding-3-small, optional
      t.string :model_version
      t.timestamps
    end

    add_index :eval_results, :evaluator_type
    add_index :eval_results, :passed
    add_index :eval_results, :model_version
  end
end

The details column is jsonb because each evaluator returns different metadata. The model_version column is how you compare your fine-tuned model against the base model over time. The embedding column is optional for now — I will explain when it becomes useful.

Semantic similarity

The idea is simple: embed both your reference answer and the model response, then measure the cosine similarity between the two vectors. If the similarity is high enough, the response is good.

# Gemfile
gem "neighbor"

# model
class EvalResult < ApplicationRecord
  has_neighbors :embedding
end

# helper
module EmbeddingHelper
  private

  def embed(text)
    result = @client.embeddings(
      parameters: { model: "text-embedding-3-small", input: text }
    )
    result.dig("data", 0, "embedding")
  end

  def cosine_similarity(a, b)
    dot = a.zip(b).sum { |x, y| x * y }
    dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
  end
end

# evaluator
class SemanticSimilarityEvaluator
  include EmbeddingHelper
  # text-embedding-3 models (trained with Matryoshka Representation Learning, MRL) systematically
  # produce lower absolute cosine similarity scores than older models like ada-002
  SIMILARITY_THRESHOLD = 0.80

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, reference:, model_version:)
    response_embedding = embed(response)
    reference_embedding = embed(reference)
    similarity = cosine_similarity(response_embedding, reference_embedding)
    passed = similarity >= SIMILARITY_THRESHOLD

    EvalResult.create!(
      prompt: prompt,
      response: response,
      reference: reference,
      evaluator_type: "semantic_similarity",
      score: similarity,
      passed: passed,
      embedding: response_embedding,
      model_version: model_version,
      details: { similarity: similarity }
    )

    { score: similarity, passed: passed }
  end
end
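
To get a feel for what cosine_similarity returns, here it is on a few toy two-dimensional vectors. Real embeddings have 1536 dimensions, but the arithmetic is the same:

```ruby
# Same formula as EmbeddingHelper#cosine_similarity, as a standalone method.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
end

# Same direction, different length: similarity ignores magnitude.
cosine_similarity([1.0, 0.0], [2.0, 0.0]) # => 1.0

# Orthogonal vectors: nothing in common.
cosine_similarity([1.0, 0.0], [0.0, 3.0]) # => 0.0

# 45 degrees apart: roughly 0.707 (1 / sqrt(2)).
cosine_similarity([1.0, 1.0], [1.0, 0.0])
```

Note that a zero vector would cause a division by zero here. Embedding APIs should not return one, but it is worth keeping in mind if you feed this helper anything else.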

The embedding column becomes useful later when you have thousands of eval results and want to query for semantically similar past responses, for example to debug why your model is regressing on certain inputs. That is what the neighbor gem and has_neighbors in the snippet above are for: they give you access to pgvector's nearest neighbour search. But do not worry about that until you actually need it.

LLM-as-judge

This one feels recursive at first (using an LLM to evaluate another LLM), but it works surprisingly well. You define a rubric and let a stronger model score the response:

class LlmJudgeEvaluator
  JUDGE_PROMPT = <<~PROMPT
    You are evaluating the quality of an AI assistant response.

    Original question: %{prompt}
    Response to evaluate: %{response}

    Score the response from 1 to 5 on each of these criteria:
    - Relevance: does it actually answer the question?
    - Accuracy: is the information correct?
    - Tone: is it appropriate and professional?

    Respond in JSON only: {"relevance": X, "accuracy": X, "tone": X, "reasoning": "..."}
  PROMPT

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, model_version:)
    result = @client.chat(
      parameters: {
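        # this uses model_version as the judge as well; pass a separate, stronger model here for an independent grader
        model: model_version,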
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(JUDGE_PROMPT, prompt: prompt, response: response)
        }]
      }
    )

    scores = JSON.parse(result.dig("choices", 0, "message", "content"))
    average = scores.slice("relevance", "accuracy", "tone").values.sum / 3.0
    passed = average >= 3.5

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "llm_judge",
      score: average / 5.0, # normalised to 0..1 so it can be averaged with the other evaluators' scores
      passed: passed,
      model_version: model_version,
      details: scores
    )

    scores.merge("average" => average, "passed" => passed)
  end
end
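
Even with response_format set to JSON, the judge can return a missing key or a score outside the rubric. A small validation step in front of the averaging catches that early. This is a sketch only: parse_judge_scores is a hypothetical helper, not part of the evaluator above, and the sample reply is made up:

```ruby
require "json"

CRITERIA = %w[relevance accuracy tone].freeze

# Hypothetical validator for the judge's JSON reply.
# Raises ArgumentError if a criterion is missing or outside the 1..5 rubric.
def parse_judge_scores(raw_json)
  scores = JSON.parse(raw_json)

  values = CRITERIA.map do |key|
    value = scores[key]
    unless value.is_a?(Numeric) && value.between?(1, 5)
      raise ArgumentError, "invalid #{key} score: #{value.inspect}"
    end
    value
  end

  average = values.sum / CRITERIA.size.to_f
  { average: average, passed: average >= 3.5 }
end

# Made-up judge reply for illustration.
reply = '{"relevance": 5, "accuracy": 4, "tone": 4, "reasoning": "Clear and correct."}'
verdict = parse_judge_scores(reply)
# average is 13 / 3.0 (about 4.33), so passed is true
```

Raising on a malformed reply fits the pipeline from part two, where run_evaluator rescues and logs the failure instead of persisting a misleading score.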

Ragas-style metrics

Ragas is a Python library that defines reference-free evaluation metrics for LLM applications. We can borrow its ideas and implement the same logic with what we already have.

Faithfulness — does the response stick to what the source context says, or is it hallucinating? It works by decomposing the response into individual claims and checking each one against the context:

class FaithfulnessEvaluator
  include EmbeddingHelper

  PROMPT = <<~PROMPT
    Given the following context and response, extract each factual claim
    from the response and determine if it can be inferred from the context.

    Context: %{context}
    Response: %{response}

    Respond in JSON only:
    {
      "claims": [
        {"claim": "...", "supported": true/false}
      ]
    }
  PROMPT

  # FaithfulnessEvaluator.new(client).evaluate(
  #   prompt: "What causes rain?",
  #   context: "Rain forms through condensation, where atmospheric water vapor cools and clings to tiny particles—such as dust, smoke, or sea salt—known as cloud condensation nuclei. These droplets accumulate to form clouds, merging into larger droplets until they are heavy enough to fall as rain, a key part of the water cycle.",
  #   response: "Rain is caused by water vapor condensing around dust particles to form droplets that fall from clouds.",
  #   model_version: "openai/gpt-4o"
  # )
  #
  # {score: 1.0,
  #  claims:
  #   [{"claim" => "Rain is caused by water vapor condensing around dust particles.", "supported" => true},
  #    {"claim" => "Droplets fall from clouds.", "supported" => true}],
  #  passed: true}

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, context:, model_version:)
    result = @client.chat(
      parameters: {
        model: model_version,
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(PROMPT, context: context, response: response)
        }]
      }
    )

    claims = JSON.parse(result.dig("choices", 0, "message", "content"))["claims"] || []
    supported = claims.count { |c| c["supported"] }
    # guard against an empty claim list, which would otherwise produce NaN
    score = claims.empty? ? 0.0 : supported.to_f / claims.size
    passed = score >= 0.8

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "faithfulness",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { claims: claims }
    )

    { score: score, claims: claims, passed: passed }
  end
end
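
The scoring step is just a ratio, which makes it easy to check by hand. A sketch of a failing case, with made-up claims in the shape the judge returns (faithfulness_score is a hypothetical helper, and scoring an empty claim list as 0.0 is a judgment call, not something Ragas prescribes):

```ruby
# Hypothetical helper: share of claims the judge marked as supported.
# An empty claim list is scored 0.0 here to avoid dividing by zero.
def faithfulness_score(claims)
  return 0.0 if claims.empty?

  claims.count { |claim| claim["supported"] }.fdiv(claims.size)
end

# Made-up judge output: the last claim is not backed by the context.
claims = [
  { "claim" => "Rain forms through condensation.", "supported" => true },
  { "claim" => "Droplets fall from clouds.", "supported" => true },
  { "claim" => "Rain only falls at night.", "supported" => false }
]

score = faithfulness_score(claims) # 2 of 3 supported, about 0.667
passed = score >= 0.8              # => false
```

A single unsupported claim out of three is enough to fail the 0.8 threshold, so short responses are judged more harshly than long ones.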

Answer relevancy — instead of directly scoring relevance, it generates questions the response would answer well, then measures semantic similarity against the original question. A relevant response should produce questions that look like the original one:

class AnswerRelevancyEvaluator
  include EmbeddingHelper

  PROMPT = <<~PROMPT
    Given the following response, generate 3 questions that this
    response would be a good answer to.

    Response: %{response}

    Respond in JSON only: {"questions": ["...", "...", "..."]}
  PROMPT

  # AnswerRelevancyEvaluator.new(client).evaluate(
  #  prompt: "Why did the Byzantine Empire last longer than the Western Roman Empire?",
  #  response: "The Byzantine (Eastern Roman) Empire lasted roughly 1,000 years longer than the Western Roman Empire (ending in 1453 vs. 476 AD) due to its superior strategic location, stronger economy, and a more robust bureaucratic government.",
  #  model_version: "openai/gpt-4o")

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, model_version:)
    result = @client.chat(
      parameters: {
        model: model_version,
        response_format: { type: "json_object" },
        messages: [{
          role: "user",
          content: format(PROMPT, response: response)
        }]
      }
    )

    generated_questions = JSON.parse(
      result.dig("choices", 0, "message", "content")
    )["questions"]

    original_embedding = embed(prompt)
    similarities = generated_questions.map { |q| cosine_similarity(original_embedding, embed(q)) }
    score = similarities.sum / similarities.size
    passed = score >= 0.8

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "answer_relevancy",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { generated_questions: generated_questions, similarities: similarities }
    )

    { score: score, passed: passed }
  end
end
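
If you pass the embedding step in as a callable, the scoring logic can be exercised without any API call. A sketch under that assumption: answer_relevancy and the embedder parameter are hypothetical, and the two-dimensional vectors are hand-picked, not real embeddings:

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x**2 }) * Math.sqrt(b.sum { |x| x**2 }))
end

# embedder is any callable that maps a text to a vector.
# Returns the mean similarity between the prompt and the generated questions.
def answer_relevancy(prompt, generated_questions, embedder:)
  original = embedder.call(prompt)
  similarities = generated_questions.map do |question|
    cosine_similarity(original, embedder.call(question))
  end
  similarities.sum / similarities.size
end

# Hand-picked fake vectors standing in for real embeddings.
FAKE_VECTORS = {
  "Why is the sky blue?" => [1.0, 0.0],
  "What makes the sky look blue?" => [0.9, 0.1],
  "How do I bake bread?" => [0.0, 1.0]
}.freeze
fake_embedder = ->(text) { FAKE_VECTORS.fetch(text) }

# A generated question close to the prompt scores high (about 0.99 here).
on_topic = answer_relevancy(
  "Why is the sky blue?", ["What makes the sky look blue?"], embedder: fake_embedder
)

# An unrelated generated question scores 0.0 with these orthogonal vectors.
off_topic = answer_relevancy(
  "Why is the sky blue?", ["How do I bake bread?"], embedder: fake_embedder
)
```

The same trick works in RSpec: inject a fake embedder and you can test the evaluator's arithmetic deterministically, leaving the real embedding call for the eval pipeline itself.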

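Context precision: in Ragas, this checks whether the context chunks that actually contributed to the answer are ranked higher than the ones that did not. The simplified version below ignores rank and measures the fraction of retrieved chunks that the judge marks as useful. Useful when you are using RAG and want to know if your retrieval is doing its job: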

class ContextPrecisionEvaluator
  PROMPT = <<~PROMPT
    Given the question and the following context chunk, did this chunk
    contribute useful information to answer the question?

    Question: %{question}
    Context chunk: %{chunk}

    Respond in JSON only: {"useful": true/false, "reasoning": "..."}
  PROMPT

  # ContextPrecisionEvaluator.new(client).evaluate(
  #   prompt: "What caused the 2008 financial crisis?",
  #   response: "The 2008 financial crisis was caused by the collapse of the housing bubble, risky mortgage-backed securities, and excessive leverage in the banking system.",
  #   context_chunks: [
  #     "The 2008 financial crisis was triggered by a collapse in US housing prices and widespread defaults on subprime mortgages.",
  #     "Mortgage-backed securities bundled risky loans and were sold to investors globally, spreading the risk throughout the financial system.",
  #     "Banks were highly leveraged, meaning small losses in assets could wipe out their capital and trigger insolvency.",
  #     "The collapse of Lehman Brothers in September 2008 accelerated the global financial panic."
  #   ],
  #   model_version: "openai/gpt-4o"
  # )

  def initialize(openai_client)
    @client = openai_client
  end

  def evaluate(prompt:, response:, context_chunks:, model_version:)
    verdicts = context_chunks.each_with_index.map do |chunk, index|
      result = @client.chat(
        parameters: {
          model: model_version,
          response_format: { type: "json_object" },
          messages: [{
            role: "user",
            content: format(PROMPT, question: prompt, chunk: chunk)
          }]
        }
      )

      parsed = JSON.parse(result.dig("choices", 0, "message", "content"))
      { index: index, chunk: chunk }.merge(parsed)
    end

    useful_count = verdicts.count { |v| v["useful"] }
    score = useful_count.to_f / context_chunks.size
    passed = score >= 0.7

    EvalResult.create!(
      prompt: prompt,
      response: response,
      evaluator_type: "context_precision",
      score: score,
      passed: passed,
      model_version: model_version,
      details: { verdicts: verdicts }
    )

    { score: score, verdicts: verdicts, passed: passed }
  end
end
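
The evaluator above scores the plain fraction of useful chunks. If you also care about ranking, one option in the spirit of the Ragas metric is to average precision@k over the positions that hold a useful chunk. The sketch below is an assumption about how you might do that, not code from the evaluator; both helper names are hypothetical and the verdict lists are made up:

```ruby
# Fraction of chunks judged useful: what the evaluator above computes.
def useful_fraction(verdicts)
  return 0.0 if verdicts.empty?

  verdicts.count(true).fdiv(verdicts.size)
end

# Rank-aware variant: average of precision@k over the positions k
# that hold a useful chunk. Rewards useful chunks that are ranked early.
def rank_aware_precision(verdicts)
  hits = 0
  precisions = verdicts.each_with_index.filter_map do |useful, index|
    next unless useful

    hits += 1
    hits.fdiv(index + 1) # precision@k at this position
  end
  precisions.empty? ? 0.0 : precisions.sum / precisions.size
end

# Two retrievals with the same number of useful chunks, ranked differently.
useful_first = [true, true, false, false]
useful_last  = [false, false, true, true]

useful_fraction(useful_first)      # => 0.5
useful_fraction(useful_last)       # => 0.5
rank_aware_precision(useful_first) # => 1.0
rank_aware_precision(useful_last)  # (1/3 + 2/4) / 2, about 0.417
```

Both retrievals look identical to the plain fraction, while the rank-aware score separates the one that put the useful chunks first.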

The next article goes deeper into how you wire all of this together into a proper eval pipeline — persisting results, tracking scores across model versions, and making confident decisions about when a fine-tuned model is actually ready for production.