From Eval to Production: A Ruby and Rails Approach
If you read the first article, you now have a set of evaluators that can score your LLM responses — semantic similarity, LLM-as-judge, faithfulness, answer relevancy, context precision. You have a model_version column in your eval_results table. You are storing scores over time.
Now what? How do you actually use all of this to make shipping decisions?
The problem with shipping model changes
Deploying a new model version is not like deploying a code change. With code, you can read a diff. You can write a test that covers the change. You can be reasonably confident that if your test suite passes, you did not break anything.
With models it is different. You fine-tune, you run some manual tests, things look good, you ship — and then three days later a user reports that the assistant is giving wrong answers on a specific type of question you never thought to test.
Sound familiar?
The eval pipeline exists to fix exactly this. But only if you wire it into your deployment process.
Tracking model deployments
First, you need a way to track which model version is active and what percentage of traffic it is serving:
class CreateModelDeployments < ActiveRecord::Migration[8.0]
  def change
    create_table :model_deployments do |t|
      t.string :model_version, null: false
      t.string :status, null: false, default: "pending"
      t.integer :traffic_percentage, null: false, default: 0
      t.float :min_eval_score, null: false, default: 0.8
      t.jsonb :eval_summary, default: {}
      t.timestamps
    end

    add_index :model_deployments, :status
    add_index :model_deployments, :model_version, unique: true
  end
end
And the model:
class ModelDeployment < ApplicationRecord
  STATUSES = %w[pending evaluating approved rejected active rolled_back].freeze

  validates :status, inclusion: { in: STATUSES }
  validates :traffic_percentage, inclusion: { in: 0..100 }

  def self.active
    find_by(status: "active")
  end

  def self.candidate
    find_by(status: "approved")
  end
end
Routing traffic between model versions
When you make an LLM call, you need to decide which model version to use. You route based on the current deployment state:
class ModelRouter
  FALLBACK_MODEL = "openai/gpt-4o".freeze

  def self.resolve(session_id:)
    active = ModelDeployment.active
    candidate = ModelDeployment.candidate

    # no active deployment yet — use fallback
    unless active
      Rails.logger.warn("No active model deployment found. Using fallback: #{FALLBACK_MODEL}")
      return FALLBACK_MODEL
    end

    # no candidate being tested — use active
    return active.model_version unless candidate

    # deterministic routing based on session_id
    # same session always gets the same model version
    # (to_s handles Rack::Session::SessionId objects, which are not Strings)
    bucket = Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
    bucket < candidate.traffic_percentage ? candidate.model_version : active.model_version
  end
end
The deterministic routing is important — hashing the session ID means the same user always gets the same model version during the experiment, avoiding jarring inconsistencies mid-conversation.
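You can see the determinism in plain Ruby, outside Rails entirely. The bucket is a pure function of the session ID (the session ID below is made up for illustration):

```ruby
require "digest"

# Map a session ID to a stable bucket in 0...100, the same way ModelRouter does.
def bucket_for(session_id)
  Digest::MD5.hexdigest(session_id.to_s).to_i(16) % 100
end

# Repeated calls for the same session always return the same bucket,
# so a user never flips between models mid-experiment.
buckets = Array.new(5) { bucket_for("session-abc") }
puts buckets.uniq.size  # => 1
```

Because MD5 spreads session IDs roughly uniformly over the 100 buckets, a traffic_percentage of 10 routes close to ten percent of sessions to the candidate without any stored per-user assignment.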
Running evals asynchronously with Active Job
The eval pipeline job ties everything together. Each evaluator has a different signature depending on the inputs it needs, so the job calls each one explicitly, guarded by presence checks, rather than looping over them with a generic argument list:
class EvalPipelineJob < ApplicationJob
  queue_as :evals

  def perform(prompt:, response:, model_version:, reference: nil, context: nil, context_chunks: nil)
    client = OpenAI::Client.new(
      access_token: Rails.application.credentials.dig(:openai, :api_key)
    )

    run_evaluator(LlmJudgeEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if reference.present?
      run_evaluator(SemanticSimilarityEvaluator, client,
        prompt: prompt, response: response, reference: reference, model_version: model_version)
    end

    if context.present?
      run_evaluator(FaithfulnessEvaluator, client,
        prompt: prompt, response: response, context: context, model_version: model_version)
    end

    run_evaluator(AnswerRelevancyEvaluator, client,
      prompt: prompt, response: response, model_version: model_version)

    if context_chunks.present?
      run_evaluator(ContextPrecisionEvaluator, client,
        prompt: prompt, response: response, context_chunks: context_chunks, model_version: model_version)
    end

    ModelPromotionJob.perform_later(model_version: model_version)
  end

  private

  def run_evaluator(klass, client, **args)
    klass.new(client).evaluate(**args)
  rescue => e
    Rails.logger.error("#{klass} failed: #{e.message}")
  end
end
Each evaluator is only called when the inputs it needs are actually present. LlmJudgeEvaluator and AnswerRelevancyEvaluator run on every request since they only need the prompt and response. The others run when context or reference data is available.
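The gating logic boils down to a small lookup table. Here is a plain-Ruby sketch of the same decision (the helper and constant are mine, for illustration; the evaluator names come from the pipeline above):

```ruby
# Which extra input, if any, each evaluator requires before it can run.
# nil means the evaluator runs on every request.
REQUIRED_EXTRAS = {
  "LlmJudgeEvaluator"           => nil,
  "AnswerRelevancyEvaluator"    => nil,
  "SemanticSimilarityEvaluator" => :reference,
  "FaithfulnessEvaluator"       => :context,
  "ContextPrecisionEvaluator"   => :context_chunks
}.freeze

# Returns the evaluators that should run, given the optional inputs available.
def evaluators_to_run(reference: nil, context: nil, context_chunks: nil)
  inputs = { reference: reference, context: context, context_chunks: context_chunks }
  REQUIRED_EXTRAS.select { |_name, key| key.nil? || !inputs[key].nil? }.keys
end

evaluators_to_run(context: "retrieved docs")
# => ["LlmJudgeEvaluator", "AnswerRelevancyEvaluator", "FaithfulnessEvaluator"]
```

A RAG request with retrieved context but no gold reference runs three of the five evaluators; a request scored against a curated dataset with reference answers runs all five.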
In your controller, you never block the response waiting for evals:
class AssistantController < ApplicationController
  def chat
    model_version = ModelRouter.resolve(session_id: session.id.to_s)
    context_chunks = KnowledgeBase.retrieve(params[:message])
    context = context_chunks.join("\n")

    response_text = llm_client.chat(
      model: model_version,
      messages: [{ role: "user", content: params[:message] }]
    ).dig("choices", 0, "message", "content")

    EvalPipelineJob.perform_later(
      prompt: params[:message],
      response: response_text,
      model_version: model_version,
      context: context,
      context_chunks: context_chunks
    )

    render json: { response: response_text }
  end
end
Automated promotion
Once enough eval results have accumulated for a candidate model, the promotion job decides whether to promote or reject it. Critically, we only score results collected after the deployment was created — otherwise we risk mixing in results from a previous deployment that happened to use the same model identifier:
class ModelPromotionJob < ApplicationJob
  queue_as :evals

  MIN_SAMPLE_SIZE = 50

  def perform(model_version:)
    deployment = ModelDeployment.find_by(
      model_version: model_version,
      status: "approved"
    )
    return unless deployment

    # only count results collected since this deployment was created
    results = EvalResult.where(
      model_version: model_version
    ).where("created_at >= ?", deployment.created_at)

    return if results.count < MIN_SAMPLE_SIZE

    summary = compute_summary(results)
    deployment.update!(eval_summary: summary)

    if summary[:average_score] >= deployment.min_eval_score &&
       summary[:pass_rate] >= 0.85
      promote!(deployment)
    elsif summary[:average_score] < deployment.min_eval_score * 0.9
      # significantly below threshold — reject early
      deployment.update!(status: "rejected")
      Rails.logger.warn(
        "Model #{model_version} rejected. Score: #{summary[:average_score]}"
      )
    end
  end

  private

  def compute_summary(results)
    {
      average_score: results.average(:score).to_f.round(3),
      pass_rate: (results.where(passed: true).count.to_f / results.count).round(3),
      sample_size: results.count,
      by_evaluator: results.group(:evaluator_type)
                           .average(:score)
                           .transform_values { |v| v.round(3) }
    }
  end

  def promote!(deployment)
    ActiveRecord::Base.transaction do
      ModelDeployment.where(status: "active").update_all(status: "rolled_back")
      deployment.update!(status: "active", traffic_percentage: 100)
    end
    Rails.logger.info("Model #{deployment.model_version} promoted to active.")
  end
end
The by_evaluator breakdown in the summary matters. A model can have a good average score overall but fail badly on faithfulness specifically — which is a red flag you would miss if you only looked at the aggregate.
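A toy example with made-up scores shows the effect. The overall average clears a 0.8 bar, but the per-evaluator breakdown exposes faithfulness as the weak spot:

```ruby
# Hypothetical eval results: four strong scores mask two weak faithfulness ones.
results = [
  { evaluator: "llm_judge",    score: 0.95 },
  { evaluator: "llm_judge",    score: 0.90 },
  { evaluator: "relevancy",    score: 0.92 },
  { evaluator: "relevancy",    score: 0.94 },
  { evaluator: "faithfulness", score: 0.55 },
  { evaluator: "faithfulness", score: 0.60 }
]

average = results.sum { |r| r[:score] } / results.size

by_evaluator = results.group_by { |r| r[:evaluator] }
                      .transform_values { |rs| (rs.sum { |r| r[:score] } / rs.size).round(3) }

puts format("overall average: %.2f", average)  # 0.81 — passes a 0.8 threshold
puts by_evaluator.inspect                      # faithfulness averages 0.575
```

In a RAG system, a faithfulness average like that usually means the model is answering fluently while ignoring or contradicting the retrieved context, which is exactly the failure an aggregate score hides.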
Rolling back
Sometimes things go wrong in production even after a model passes evals. You want to be able to roll back fast:
class ModelRollbackJob < ApplicationJob
  queue_as :evals

  def perform(model_version:)
    ActiveRecord::Base.transaction do
      current = ModelDeployment.find_by(model_version: model_version)

      unless current
        Rails.logger.error("Rollback failed: no deployment found for #{model_version}")
        return
      end

      current.update!(status: "rolled_back", traffic_percentage: 0)

      previous = ModelDeployment.where(status: "rolled_back")
                                .where.not(model_version: model_version)
                                .order(updated_at: :desc)
                                .first

      unless previous
        Rails.logger.error(
          "Rollback failed: no previous deployment to restore. System has no active model."
        )
        raise "No previous deployment available for rollback"
      end

      previous.update!(status: "active", traffic_percentage: 100)

      Rails.logger.warn(
        "Model #{model_version} rolled back. Restored: #{previous.model_version}"
      )
    end
  end
end
The transaction ensures you never end up in a state where the current model is rolled back but nothing is restored. If there is no previous deployment to fall back to, we raise explicitly rather than silently leaving the system without an active model.
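The restore-target selection is worth seeing in isolation. A plain-Ruby sketch with in-memory structs standing in for the table (the version names are hypothetical):

```ruby
require "time"

Deployment = Struct.new(:model_version, :status, :updated_at, keyword_init: true)

# Pick the deployment to restore: the most recently rolled-back one
# that is not the version being rolled back — the same query ModelRollbackJob runs.
def restore_target(deployments, rolling_back:)
  deployments
    .select { |d| d.status == "rolled_back" && d.model_version != rolling_back }
    .max_by(&:updated_at)
end

history = [
  Deployment.new(model_version: "ft-v1", status: "rolled_back", updated_at: Time.parse("2025-01-01")),
  Deployment.new(model_version: "ft-v2", status: "rolled_back", updated_at: Time.parse("2025-02-01")),
  Deployment.new(model_version: "ft-v3", status: "rolled_back", updated_at: Time.parse("2025-03-01"))
]

restore_target(history, rolling_back: "ft-v3").model_version  # => "ft-v2"
```

Ordering by updated_at rather than created_at matters: the right restore target is the version that was most recently demoted, not the one that was registered most recently.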
Putting it all together
The flow looks like this: you register a new model version as a candidate with status: "approved" and a traffic percentage — say ten percent. The router starts sending a portion of real traffic to it.
The eval pipeline accumulates scores in the background, scoped to the deployment window. Once you hit the minimum sample size, the promotion job decides automatically whether to promote to one hundred percent or reject.
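The decision itself is a pure function of the summary, which makes it easy to reason about and test. A standalone sketch of the same thresholds (the helper name is mine, not the job's):

```ruby
# Mirrors ModelPromotionJob's logic: promote when both the average score
# and the pass rate clear the bar, reject early when the average is more
# than 10% below the minimum, otherwise keep collecting samples.
def promotion_decision(average_score:, pass_rate:, min_eval_score: 0.8)
  if average_score >= min_eval_score && pass_rate >= 0.85
    :promote
  elsif average_score < min_eval_score * 0.9
    :reject
  else
    :wait
  end
end

promotion_decision(average_score: 0.86, pass_rate: 0.90)  # => :promote
promotion_decision(average_score: 0.70, pass_rate: 0.90)  # => :reject (0.70 < 0.72)
promotion_decision(average_score: 0.78, pass_rate: 0.90)  # => :wait
```

Note the deliberate gap between the promote and reject thresholds: a candidate scoring between 0.72 and 0.80 is neither promoted nor rejected, it just keeps accumulating samples until the signal is clearer.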
If something goes wrong after promotion, rollback restores the previous version in a single transaction.
None of this required Python. No LangChain, no separate microservice, no new infrastructure.
Just Rails, Active Job, Postgres, and the evaluators we built in part one.
Shipping model changes is hard. But making it safe, measurable, and reversible — that is an engineering problem.
And engineering problems are what Rails is good at.