Welcome back to the Ruby for AI series. You've built AI features — chat interfaces, RAG pipelines, image generation, voice transcription. Now comes the part most tutorials skip: how do you test code that returns different results every time you call it?
AI outputs are non-deterministic. Ask the same question twice, get two different answers. Traditional "assert equals" testing breaks down. But that doesn't mean you skip tests. It means you test smarter.
Let's set up a proper testing strategy for AI-powered Rails apps.
Setting Up RSpec
If you haven't already:
bundle add rspec-rails --group "development, test"
bundle add webmock --group test
bundle add vcr --group test
rails generate rspec:install
Add to spec/rails_helper.rb:
require 'webmock/rspec'
require 'vcr'
VCR.configure do |config|
  config.cassette_library_dir = 'spec/cassettes'
  config.hook_into :webmock
  config.filter_sensitive_data('<OPENAI_KEY>') { ENV['OPENAI_API_KEY'] }
  config.default_cassette_options = { record: :once }
end
Strategy 1: Record Real API Calls with VCR
VCR records HTTP interactions and replays them. One real API call, deterministic tests forever after:
# spec/services/chat_service_spec.rb
RSpec.describe ChatService do
  describe '#ask' do
    it 'returns a response about Ruby' do
      VCR.use_cassette('chat_about_ruby') do
        service = ChatService.new
        response = service.ask("What is Ruby's main strength?")

        expect(response).to be_a(String)
        expect(response.length).to be > 20
        expect(response.downcase).to match(/ruby|programming|language/)
      end
    end
  end
end
The first run hits the real API and saves the response. Every run after that uses the recording. Deterministic, fast, free.
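Cassettes go stale when a prompt or model changes. Delete the YAML file under spec/cassettes to re-record, or override the recording mode for a single spec (these are VCR's built-in recording modes; the cassette name reuses the example above):

```ruby
# Force a fresh recording on the next run, ignoring any existing cassette.
VCR.use_cassette('chat_about_ruby', record: :all) do
  # ... same expectations as before ...
end

# Recording modes:
#   :once         - record if no cassette exists, then always replay (the config default above)
#   :new_episodes - replay known requests, record any new ones
#   :none         - never record; unrecorded requests raise an error
```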
Strategy 2: Stub API Responses
For unit tests, skip the API entirely:
# spec/services/embedding_service_spec.rb
RSpec.describe EmbeddingService do
  let(:fake_embedding) { Array.new(1536) { rand(-1.0..1.0) } }

  before do
    stub_request(:post, 'https://api.openai.com/v1/embeddings')
      .to_return(
        status: 200,
        body: {
          data: [{ embedding: fake_embedding, index: 0 }],
          usage: { prompt_tokens: 8, total_tokens: 8 }
        }.to_json,
        headers: { 'Content-Type' => 'application/json' }
      )
  end

  it 'returns a 1536-dimension vector' do
    result = EmbeddingService.new.embed("test input")

    expect(result.length).to eq(1536)
    expect(result).to all(be_a(Float))
  end
end
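One caveat with rand: if a spec fails in a way that depends on the random vector, the failure is hard to reproduce. Seeding the generator keeps the embedding fake but identical on every run (a small sketch; the helper name is made up):

```ruby
# Deterministic fake embedding: same seed in, same 1536 floats out, every run.
def deterministic_embedding(seed: 42, dimensions: 1536)
  rng = Random.new(seed)
  Array.new(dimensions) { rng.rand(-1.0..1.0) }
end
```

Drop it in spec/support and use `let(:fake_embedding) { deterministic_embedding }` in place of the rand version.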
Strategy 3: Test the Shape, Not the Content
AI responses vary in wording but follow patterns. Test structure:
# spec/services/ai_agent_spec.rb
RSpec.describe AiAgent do
  describe '#execute' do
    it 'returns a properly structured response' do
      VCR.use_cassette('agent_tool_call') do
        result = AiAgent.new.execute("Look up the weather in Tokyo")

        expect(result).to include(:answer, :tools_used, :token_count)
        expect(result[:tools_used]).to be_an(Array)
        expect(result[:token_count]).to be_positive
      end
    end
  end
end
For JSON responses from AI, validate the schema:
it 'returns valid JSON with required fields' do
  VCR.use_cassette('structured_output') do
    result = service.analyze("Summarize this document")
    parsed = JSON.parse(result)

    expect(parsed).to have_key('summary')
    expect(parsed).to have_key('key_points')
    expect(parsed['key_points']).to be_an(Array)
  end
end
Strategy 4: Boundary and Error Tests
Test what happens when the API fails — this is where most AI apps break in production:
describe 'error handling' do
  it 'handles rate limiting gracefully' do
    stub_request(:post, 'https://api.openai.com/v1/chat/completions')
      .to_return(
        status: 429,
        body: { error: { message: 'Rate limit exceeded' } }.to_json,
        headers: { 'Content-Type' => 'application/json' }
      )

    expect { service.ask("anything") }.not_to raise_error
    expect(service.ask("anything")).to eq("Service temporarily unavailable. Please try again.")
  end

  it 'handles timeout' do
    stub_request(:post, 'https://api.openai.com/v1/chat/completions')
      .to_timeout

    result = service.ask("anything")
    expect(result).to include("unavailable")
  end

  it 'handles malformed API response' do
    stub_request(:post, 'https://api.openai.com/v1/chat/completions')
      .to_return(status: 200, body: 'not json at all')

    expect { service.ask("anything") }.not_to raise_error
  end
end
Strategy 5: Contract Tests for Prompts
When you change a prompt, you want to know if it still produces the right kind of output:
# spec/prompts/summarizer_prompt_spec.rb
RSpec.describe 'Summarizer prompt' do
  # Stand-in long input; in a real suite this would likely load a fixture file
  let(:long_document) { "Ruby favors developer happiness and readable code. " * 100 }

  it 'produces output under 200 words' do
    VCR.use_cassette('summarizer_long_doc') do
      result = Summarizer.new.summarize(long_document)
      word_count = result.split.size

      expect(word_count).to be <= 200
    end
  end

  it 'preserves key entities from input' do
    VCR.use_cassette('summarizer_entity_check') do
      input = "The Ruby programming language was created by Yukihiro Matsumoto in 1995."
      result = Summarizer.new.summarize(input)

      expect(result.downcase).to include('ruby')
      expect(result.downcase).to include('matsumoto')
    end
  end
end
Running Your Test Suite
# Run all specs
bundle exec rspec
# Run only AI-related specs
bundle exec rspec spec/services/
# Run with documentation format
bundle exec rspec --format documentation
The Testing Pyramid for AI Apps
Here's how I think about it:
- Unit tests (stubbed): 70% — Fast, test your code logic, error handling, data transformations
- Integration tests (VCR): 25% — Record real API calls once, replay forever
- Live API tests (tagged): 5% — Run occasionally against real API to catch drift
Tag your live tests so they don't run in CI:
it 'gets a real response', :live_api do
  result = service.ask("Say hello")
  expect(result).to be_present
end
# Skip live API tests in CI
bundle exec rspec --tag ~live_api
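To make the exclusion the default instead of a flag you have to remember, spec_helper.rb can filter the tag out unless an environment variable opts in (the LIVE_API variable name is my own convention):

```ruby
RSpec.configure do |config|
  # :live_api examples only run when explicitly requested:
  #   LIVE_API=1 bundle exec rspec
  config.filter_run_excluding :live_api unless ENV['LIVE_API']
end
```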
What's Next
Your AI features are tested. Next up: making them fast. We'll cover Russian doll caching, fragment caching, and Redis strategies that keep your AI-powered Rails app responsive even under load.