From Eval to Production: A Ruby and Rails Approach
If you read the first article, you now have a set of evaluators that can score your LLM responses — semantic similarity, LLM-as-judge, faithfulness, answer relevancy, context precision. You have a model_version column in your eval_results table. You are storing scores over time.
Now what? How do you actually use all of this to make shipping decisions?
The problem with shipping model changes
Deploying a new model version is not like deploying a code change. With code, you can read a diff. You can write a test that covers the change. You can be reasonably confident that if your test suite passes, you did not break anything.
With models it is different. You fine-tune, you run some manual tests, things look good, you ship — and then three days later a user reports that the assistant is giving wrong answers on a specific type of question you never thought to test.
Sound familiar?
The eval pipeline exists to fix exactly this. But only if you wire it into your deployment process.
Tracking model deployments
First, you need a way to track which model version is active and what percentage of traffic it is serving:
class CreateModelDeployments < ActiveRecord::Migration[8.0]
  def change
    create_table :model_deployments do |t|
      t.string :model_version, null: false
      t.string :status, null: false, default: "pending"
      t.integer :traffic_percentage, null: false, default: 0
      t.float :min_eval_score, null: false, default: 0.8
      t.jsonb :eval_summary, default: {}
      t.timestamps
    end

    add_index :model_deployments, :status
    add_index :model_deployments, :model_version, unique: true
  end
end
And the model:
class ModelDeployment < ApplicationRecord
  STATUSES = %w[pending evaluating approved rejected active rolled_back].freeze

  validates :status, inclusion: { in: STATUSES }
  validates :traffic_percentage, inclusion: { in: 0..100 }

  def self.active
    find_by(status: "active")
  end

  def self.candidate
    find_by(status: "approved")
  end
end
Routing traffic between model versions
When you make an LLM call, you need to decide which model version to use. You route based on the current deployment state:
class ModelRouter
  FALLBACK_MODEL = "openai/gpt-4o".freeze

  def self.resolve(session_id:)
    active = ModelDeployment.active
    candidate = ModelDeployment.candidate

    # no active deployment yet — use fallback
    unless active
      Rails.logger.warn("No active model deployment found. Using fallback: #{FALLBACK_MODEL}")
      return FALLBACK_MODEL
    end

    # no candidate being tested — use active
    return active.model_version unless candidate

    # deterministic routing based on session_id
    # same session always gets the same model version
    # (to_s handles Rack::Session::SessionId objects, which are not Strings)
    bucket = Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
    bucket < candidate.traffic_percentage ? candidate.model_version : active.model_version
  end
end
The deterministic routing is important — hashing the session ID means the same user always gets the same model version during the experiment, avoiding jarring inconsistencies mid-conversation.
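You can see the determinism in plain Ruby, outside Rails entirely. The bucket is a pure function of the session ID (the session ID below is made up for illustration):

```ruby
require "digest"

# Map a session ID to a stable bucket in 0...100, the same way ModelRouter does.
def bucket_for(session_id)
  Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
end

# Repeated calls for the same session always return the same bucket,
# so a user never flips between models mid-experiment.
buckets = Array.new(5) { bucket_for("session-abc") }
puts buckets.uniq.size  # => 1
```

Because MD5 spreads session IDs roughly uniformly over the 100 buckets, a traffic_percentage of 10 routes close to ten percent of sessions to the candidate without any stored per-user assignment.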
Running evals asynchronously with Active Job
The eval pipeline job ties everything together. Each evaluator has a different signature depending on the inputs it needs, so the job calls each one explicitly, guarded by presence checks, rather than looping over them with a generic argument list:
class EvalPipelineJob < ApplicationJob
  queue_as :evals

  def perform(prompt:, response:, model_version:, reference: nil, context: nil, context_chunks: nil)
    client = OpenAI::Client.new(
      access_token: Rails.application.credentials.dig(:openai, :api_key)
    )

    run_evaluator(LlmJudgeEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if reference.present?
      run_evaluator(SemanticSimilarityEvaluator, client,
        prompt: prompt, response: response, reference: reference, model_version: model_version)
    end

    if context.present?
      run_evaluator(FaithfulnessEvaluator, client,
        prompt: prompt, response: response, context: context, model_version: model_version)
    end

    run_evaluator(AnswerRelevancyEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if context_chunks.present?
      run_evaluator(ContextPrecisionEvaluator, client,
        prompt: prompt, response: response, context_chunks: context_chunks, model_version: model_version)
    end

    ModelPromotionJob.perform_later(model_version: model_version)
  end

  private

  def run_evaluator(klass, client, **args)
    klass.new(client).evaluate(**args)
  rescue => e
    Rails.logger.error("#{klass} failed: #{e.message}")
  end
end
Each evaluator is only called when the inputs it needs are actually present. LlmJudgeEvaluator and AnswerRelevancyEvaluator run on every request since they only need the prompt and response. The others run when context or reference data is available.
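The gating logic boils down to a small lookup table. Here is a plain-Ruby sketch of the same decision (the helper and constant are mine, for illustration; the evaluator names come from the pipeline above):

```ruby
# Which extra input, if any, each evaluator requires before it can run.
# nil means the evaluator runs on every request.
REQUIRED_EXTRAS = {
  "LlmJudgeEvaluator"           => nil,
  "AnswerRelevancyEvaluator"    => nil,
  "SemanticSimilarityEvaluator" => :reference,
  "FaithfulnessEvaluator"       => :context,
  "ContextPrecisionEvaluator"   => :context_chunks
}.freeze

# Returns the evaluators that should run, given the optional inputs available.
def evaluators_to_run(reference: nil, context: nil, context_chunks: nil)
  inputs = { reference: reference, context: context, context_chunks: context_chunks }
  REQUIRED_EXTRAS.select { |_name, key| key.nil? || !inputs[key].nil? }.keys
end

evaluators_to_run(context: "retrieved docs")
# => ["LlmJudgeEvaluator", "AnswerRelevancyEvaluator", "FaithfulnessEvaluator"]
```

A RAG request with retrieved context but no gold reference runs three of the five evaluators; a request scored against a curated dataset with reference answers runs all five.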
In your controller, you never block the response waiting for evals:
class AssistantController < ApplicationController
  def chat
    model_version = ModelRouter.resolve(session_id: session.id.to_s)
    context_chunks = KnowledgeBase.retrieve(params[:message])
    context = context_chunks.join("\n")

    response_text = llm_client.chat(
      model: model_version,
      messages: [{ role: "user", content: params[:message] }]
    ).dig("choices", 0, "message", "content")

    EvalPipelineJob.perform_later(
      prompt: params[:message],
      response: response_text,
      model_version: model_version,
      context: context,
      context_chunks: context_chunks
    )

    render json: { response: response_text }
  end
end
Automated promotion
Once enough eval results have accumulated for a candidate model, the promotion job decides whether to promote or reject it. Critically, we only score results collected after the deployment was created — otherwise we risk mixing in results from a previous deployment that happened to use the same model identifier:
class ModelPromotionJob < ApplicationJob
  queue_as :evals

  MIN_SAMPLE_SIZE = 50

  def perform(model_version:)
    deployment = ModelDeployment.find_by(
      model_version: model_version,
      status: "approved"
    )
    return unless deployment

    # only count results collected since this deployment was created
    results = EvalResult.where(
      model_version: model_version
    ).where("created_at >= ?", deployment.created_at)

    return if results.count < MIN_SAMPLE_SIZE

    summary = compute_summary(results)
    deployment.update!(eval_summary: summary)

    if summary[:average_score] >= deployment.min_eval_score &&
       summary[:pass_rate] >= 0.85
      promote!(deployment)
    elsif summary[:average_score] < deployment.min_eval_score * 0.9
      # significantly below threshold — reject early
      deployment.update!(status: "rejected")
      Rails.logger.warn(
        "Model #{model_version} rejected. Score: #{summary[:average_score]}"
      )
    end
  end

  private

  def compute_summary(results)
    {
      average_score: results.average(:score).to_f.round(3),
      pass_rate: (results.where(passed: true).count.to_f / results.count).round(3),
      sample_size: results.count,
      by_evaluator: results.group(:evaluator_type)
                           .average(:score)
                           .transform_values { |v| v.round(3) }
    }
  end

  def promote!(deployment)
    ActiveRecord::Base.transaction do
      ModelDeployment.where(status: "active").update_all(status: "rolled_back")
      deployment.update!(status: "active", traffic_percentage: 100)
    end
    Rails.logger.info("Model #{deployment.model_version} promoted to active.")
  end
end
The by_evaluator breakdown in the summary matters. A model can have a good average score overall but fail badly on faithfulness specifically — which is a red flag you would miss if you only looked at the aggregate.
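A toy example with made-up scores shows the effect. The overall average clears a 0.8 bar, but the per-evaluator breakdown exposes faithfulness as the weak spot:

```ruby
# Hypothetical eval results: four strong scores mask two weak faithfulness ones.
results = [
  { evaluator: "llm_judge",    score: 0.95 },
  { evaluator: "llm_judge",    score: 0.90 },
  { evaluator: "relevancy",    score: 0.92 },
  { evaluator: "relevancy",    score: 0.94 },
  { evaluator: "faithfulness", score: 0.55 },
  { evaluator: "faithfulness", score: 0.60 }
]

average = results.sum { |r| r[:score] } / results.size

by_evaluator = results.group_by { |r| r[:evaluator] }
                      .transform_values { |rs| (rs.sum { |r| r[:score] } / rs.size).round(3) }

puts format("overall average: %.2f", average)  # 0.81 — passes a 0.8 threshold
puts by_evaluator.inspect                      # faithfulness averages 0.575
```

In a RAG system, a faithfulness average like that usually means the model is answering fluently while ignoring or contradicting the retrieved context, which is exactly the failure an aggregate score hides.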
Rolling back
Sometimes things go wrong in production even after a model passes evals. You want to be able to roll back fast:
class ModelRollbackJob < ApplicationJob
  queue_as :evals

  def perform(model_version:)
    ActiveRecord::Base.transaction do
      current = ModelDeployment.find_by(model_version: model_version)

      unless current
        Rails.logger.error("Rollback failed: no deployment found for #{model_version}")
        return
      end

      current.update!(status: "rolled_back", traffic_percentage: 0)

      previous = ModelDeployment.where(status: "rolled_back")
                                .where.not(model_version: model_version)
                                .order(updated_at: :desc)
                                .first

      unless previous
        Rails.logger.error(
          "Rollback failed: no previous deployment to restore. System has no active model."
        )
        raise "No previous deployment available for rollback"
      end

      previous.update!(status: "active", traffic_percentage: 100)

      Rails.logger.warn(
        "Model #{model_version} rolled back. Restored: #{previous.model_version}"
      )
    end
  end
end
The transaction ensures you never end up in a state where the current model is rolled back but nothing is restored. If there is no previous deployment to fall back to, we raise explicitly rather than silently leaving the system without an active model.
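The restore-target selection is worth seeing in isolation. A plain-Ruby sketch with in-memory structs standing in for the table (the version names are hypothetical):

```ruby
require "time"

Deployment = Struct.new(:model_version, :status, :updated_at, keyword_init: true)

# Pick the deployment to restore: the most recently rolled-back one
# that is not the version being rolled back — the same query ModelRollbackJob runs.
def restore_target(deployments, rolling_back:)
  deployments
    .select { |d| d.status == "rolled_back" && d.model_version != rolling_back }
    .max_by(&:updated_at)
end

history = [
  Deployment.new(model_version: "ft-v1", status: "rolled_back", updated_at: Time.parse("2025-01-01")),
  Deployment.new(model_version: "ft-v2", status: "rolled_back", updated_at: Time.parse("2025-02-01")),
  Deployment.new(model_version: "ft-v3", status: "rolled_back", updated_at: Time.parse("2025-03-01"))
]

restore_target(history, rolling_back: "ft-v3").model_version  # => "ft-v2"
```

Ordering by updated_at rather than created_at matters: the right restore target is the version that was most recently demoted, not the one that was registered most recently.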
Putting it all together
The flow looks like this: you register a new model version as a candidate with status: "approved" and a traffic percentage — say ten percent. The router starts sending a portion of real traffic to it.
The eval pipeline accumulates scores in the background, scoped to the deployment window. Once you hit the minimum sample size, the promotion job decides automatically whether to promote to one hundred percent or reject.
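The decision itself is a pure function of the summary, which makes it easy to reason about and test. A standalone sketch of the same thresholds (the helper name is mine, not the job's):

```ruby
# Mirrors ModelPromotionJob's logic: promote when both the average score
# and the pass rate clear the bar, reject early when the average is more
# than 10% below the minimum, otherwise keep collecting samples.
def promotion_decision(average_score:, pass_rate:, min_eval_score: 0.8)
  if average_score >= min_eval_score && pass_rate >= 0.85
    :promote
  elsif average_score < min_eval_score * 0.9
    :reject
  else
    :wait
  end
end

promotion_decision(average_score: 0.86, pass_rate: 0.90)  # => :promote
promotion_decision(average_score: 0.70, pass_rate: 0.90)  # => :reject (0.70 < 0.72)
promotion_decision(average_score: 0.78, pass_rate: 0.90)  # => :wait
```

Note the deliberate gap between the promote and reject thresholds: a candidate scoring between 0.72 and 0.80 is neither promoted nor rejected, it just keeps accumulating samples until the signal is clearer.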
If something goes wrong after promotion, rollback restores the previous version in a single transaction.
None of this required Python. No LangChain, no separate microservice, no new infrastructure.
Just Rails, Active Job, Postgres, and the evaluators we built in part one.
Shipping model changes is hard. But making it safe, measurable, and reversible — that is an engineering problem.
And engineering problems are what Rails is good at.