*This is Part 2 of a 3-part series* on building production-ready vector search for enterprise SaaS.
Part 1: Architecture & Implementation - Multi-tenant document processing
*Part 2: Production Resilience & Monitoring* 👈 You are here
Part 3: Cost Optimization & Lessons Learned (Coming Friday)
*TL;DR*: Production-ready means more than just working code. This part covers circuit breakers, rate limiting, health checks, and monitoring that keep the system running through Black Friday traffic spikes and Qdrant outages.
The Black Friday Incident 🚨
*Date:* November 24, 2023, 2:47 PM EST
Our monitoring dashboard lit up red:
⚠️ Qdrant CPU: 98%
⚠️ Memory: 95%
⚠️ Query latency: 12,000ms (P95)
🔥 Error rate: 45%
*What happened:* A major client launched their compliance platform to 5,000 users simultaneously. Search traffic spiked from 800 req/min → 4,200 req/min. Our Qdrant cluster couldn't handle it.
*Without circuit breakers:* We would have:
- Hammered the failing Qdrant cluster
- Exhausted connection pools
- Brought down the entire Rails app
- Caused a 100% outage for all 150 clients
*With circuit breakers:*
- Detected 5 failures in 30 seconds
- Opened the circuit
- Served cached results (42% hit rate)
- Fell back to PostgreSQL full-text search
- *99.2% of searches still worked*
- MTTR: 12 minutes
Let me show you how we built this resilience.
Circuit Breaker Pattern
The Implementation
# lib/circuit_breaker_registry.rb
require 'concurrent' # Concurrent::Map comes from the concurrent-ruby gem
class CircuitBreakerRegistry
@breakers = Concurrent::Map.new
def self.for_tenant(tenant_id)
@breakers.compute_if_absent("tenant_#{tenant_id}") do
Vectra::CircuitBreaker.new(
name: "tenant_#{tenant_id}",
failure_threshold: 5, # Open after 5 failures
recovery_timeout: 60, # Try again after 60s
success_threshold: 3 # Close after 3 successes
)
end
end
def self.stats
@breakers.each_pair.map do |name, breaker|
[name, breaker.stats]
end.to_h
end
def self.reset_all!
@breakers.each_value(&:reset!)
end
end
Using Circuit Breakers in Vector Search
# app/services/vector_indexing_service.rb
require 'ostruct' # OpenStruct (used by the fallback below) needs an explicit require
class VectorIndexingService
def search(tenant_id:, query:, filters: {}, limit: 20)
query_embedding = EmbeddingService.new.generate(query)
namespace = namespace_for_tenant(tenant_id)
# Get circuit breaker for this tenant
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
# Execute with fallback
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
@client.query(
index: 'compliance_documents',
vector: query_embedding,
top_k: limit,
namespace: namespace,
filter: filters,
include_metadata: true
)
end
end
private
def fallback_search(tenant_id, query, limit)
# Fallback to PostgreSQL full-text search
Document.where(tenant_id: tenant_id)
.where("to_tsvector('english', title || ' ' || content) @@ plainto_tsquery(?)", query)
.limit(limit)
.map { |doc| convert_to_vector_result(doc) }
end
def convert_to_vector_result(doc)
# Convert Document to QueryResult format
OpenStruct.new(
id: doc.id,
score: 0.5, # Fallback score
metadata: {
'document_id' => doc.id,
'tenant_id' => doc.tenant_id,
'title' => doc.title,
'chunk_text' => doc.content[0..500]
}
)
end
end
Circuit Breaker States
┌───────────┐
│  CLOSED   │ ← Normal operation (requests pass through)
└─────┬─────┘
      │ (5 failures detected)
      ↓
┌───────────┐
│   OPEN    │ ← Requests fail immediately, use fallback
└─────┬─────┘
      │ (60 seconds elapsed)
      ↓
┌───────────┐
│ HALF_OPEN │ ← Limited requests allowed to test recovery
└─────┬─────┘
      ├─ (3 successes) → back to CLOSED
      └─ (1 failure)   → back to OPEN
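To make the diagram concrete, here's a stripped-down sketch of the state machine. It's illustrative only, not the Vectra::CircuitBreaker source, but it follows the same thresholds: open after 5 consecutive failures, probe again after 60 seconds, close after 3 successes.

# Illustrative only -- a minimal circuit breaker, not the Vectra internals
class TinyBreaker
  CircuitOpenError = Class.new(StandardError)

  def initialize(failure_threshold: 5, recovery_timeout: 60, success_threshold: 3)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @success_threshold = success_threshold
    @state = :closed
    @failures = 0
    @successes = 0
    @opened_at = nil
  end

  def call(fallback: nil)
    if @state == :open
      if Time.now - @opened_at >= @recovery_timeout
        @state = :half_open    # timeout elapsed: let a probe request through
        @successes = 0
      else
        # Short-circuit: don't touch the failing dependency at all
        return fallback ? fallback.call : raise(CircuitOpenError)
      end
    end

    begin
      result = yield
      record_success
      result
    rescue StandardError
      record_failure
      fallback ? fallback.call : raise
    end
  end

  private

  def record_success
    if @state == :half_open
      @successes += 1
      if @successes >= @success_threshold
        @state = :closed       # enough successful probes: resume normal traffic
        @failures = 0
      end
    else
      @failures = 0            # a success in CLOSED resets the consecutive-failure count
    end
  end

  def record_failure
    @failures += 1
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open           # trip (or re-trip) the breaker
      @opened_at = Time.now
    end
  end
end

The behavior that matters most is the half-open probe: a single failure there sends the breaker straight back to open, so a still-struggling Qdrant cluster never gets re-flooded with traffic.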
Monitoring Circuit Breaker States
# app/controllers/admin/circuit_breakers_controller.rb
module Admin
class CircuitBreakersController < AdminController
def index
@breakers = CircuitBreakerRegistry.stats
render json: {
breakers: @breakers.map do |name, stats|
{
name: name,
state: stats[:state],
failures: stats[:failures],
successes: stats[:successes],
last_failure_at: stats[:last_failure_at],
opened_at: stats[:opened_at]
}
end,
summary: {
total: @breakers.size,
open: @breakers.count { |_, s| s[:state] == :open },
half_open: @breakers.count { |_, s| s[:state] == :half_open }
}
}
end
def reset
CircuitBreakerRegistry.reset_all!
render json: { message: 'All circuit breakers reset' }
end
end
end
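For completeness, the routes backing this controller aren't shown in the post; a plausible wiring (the paths here are an assumption) would be:

# config/routes.rb (assumed routes for the admin endpoints above)
namespace :admin do
  resources :circuit_breakers, only: [:index] do
    collection do
      post :reset
    end
  end
end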
Rate Limiting (Per-Tenant)
Why Per-Tenant Rate Limiting?
*Problem:* A single tenant running a poorly configured script could DoS the entire platform.
*Solution:* Isolate rate limits per tenant.
# lib/rate_limiter_registry.rb
require 'concurrent' # Concurrent::Map comes from the concurrent-ruby gem
class RateLimiterRegistry
@limiters = Concurrent::Map.new
def self.for_tenant(tenant_id)
@limiters.compute_if_absent("tenant_#{tenant_id}") do
# 500 requests per minute per tenant
Vectra::RateLimiter.new(
requests_per_second: 8.33, # 500/60
burst_size: 20
)
end
end
def self.stats
@limiters.each_pair.map do |name, limiter|
[name, limiter.stats]
end.to_h
end
end
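The post doesn't show Vectra::RateLimiter's internals, but requests_per_second plus burst_size reads like a classic token bucket: tokens refill at 8.33 per second, the bucket holds at most 20, and each request spends one token. Here's a rough, illustrative sketch of that idea (not the gem's actual implementation):

# Illustrative token bucket -- roughly how requests_per_second + burst_size behave
class TinyTokenBucket
  def initialize(requests_per_second:, burst_size:)
    @rate = requests_per_second
    @capacity = burst_size
    @tokens = burst_size.to_f
    @last_refill = Time.now
    @mutex = Mutex.new
  end

  # Returns true and consumes a token if the request is allowed, false otherwise
  def allow?
    @mutex.synchronize do
      now = Time.now
      # Refill based on elapsed time, capped at the burst size
      @tokens = [@capacity, @tokens + (now - @last_refill) * @rate].min
      @last_refill = now

      if @tokens >= 1
        @tokens -= 1
        true
      else
        false
      end
    end
  end
end

bucket = TinyTokenBucket.new(requests_per_second: 8.33, burst_size: 20)
bucket.allow? # => true while tokens remain; false once the burst is spent faster than it refills

The burst size of 20 is what lets a tenant fire a short spike of requests without being throttled, while the refill rate enforces the 500 req/min average.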
Using Rate Limiting
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
# Get rate limiter for this tenant
limiter = RateLimiterRegistry.for_tenant(tenant_id)
# Acquire rate limit token
limiter.acquire do
# Get circuit breaker
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
# Execute search with resilience
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
perform_vector_search(query, filters, limit, tenant_id)
end
end
rescue Vectra::RateLimitError => e
# Return 429 Too Many Requests
raise SearchRateLimitExceeded.new(
tenant_id: tenant_id,
retry_after: limiter.retry_after
)
end
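SearchRateLimitExceeded is raised above but never defined in the post. A minimal definition that supports the keyword arguments and the retry_after reader used by the controller below might look like this (an assumption about the app's error class):

# app/errors/search_rate_limit_exceeded.rb (assumed definition)
class SearchRateLimitExceeded < StandardError
  attr_reader :tenant_id, :retry_after

  def initialize(tenant_id:, retry_after:)
    @tenant_id = tenant_id
    @retry_after = retry_after
    super("Rate limit exceeded for tenant #{tenant_id}, retry after #{retry_after}s")
  end
end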
Rate Limit Headers
# app/controllers/api/v1/search_controller.rb
def create
limiter = RateLimiterRegistry.for_tenant(current_tenant.id)
# Add rate limit headers
response.headers['X-RateLimit-Limit'] = '500'
response.headers['X-RateLimit-Remaining'] = limiter.remaining.to_s
response.headers['X-RateLimit-Reset'] = limiter.reset_at.to_i.to_s
# ... search logic ...
rescue SearchRateLimitExceeded => e
response.headers['Retry-After'] = e.retry_after.to_s
render json: {
error: 'Rate limit exceeded',
retry_after: e.retry_after
}, status: :too_many_requests
end
Health Checks
Comprehensive Health Check Endpoint
# app/controllers/health_controller.rb
class HealthController < ApplicationController
skip_before_action :authenticate_user!
def show
health = {
status: 'healthy',
timestamp: Time.current.iso8601,
version: Vectra::VERSION,
services: {}
}
# Check Vector DB
begin
vectra_client = Vectra.qdrant(host: ENV['QDRANT_HOST'])
vectra_health = vectra_client.health_check
health[:services][:vector_db] = {
status: vectra_health.healthy? ? 'up' : 'down',
latency_ms: vectra_health.latency_ms,
provider: vectra_health.provider,
indexes: vectra_health.indexes_available
}
rescue StandardError => e
health[:services][:vector_db] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded'
end
# Check Database
begin
start = Time.current
ActiveRecord::Base.connection.execute('SELECT 1')
db_latency = ((Time.current - start) * 1000).round(2)
health[:services][:database] = {
status: 'up',
latency_ms: db_latency
}
rescue StandardError => e
health[:services][:database] = {
status: 'down',
error: e.message
}
health[:status] = 'unhealthy'
end
# Check Redis
begin
start = Time.current
Rails.cache.redis.ping
redis_latency = ((Time.current - start) * 1000).round(2)
health[:services][:redis] = {
status: 'up',
latency_ms: redis_latency
}
rescue StandardError => e
health[:services][:redis] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded' unless health[:status] == 'unhealthy' # don't mask a harder failure
end
# Check Embedding Service
begin
start = Time.current
EmbeddingService.new.generate('test')
embedding_latency = ((Time.current - start) * 1000).round(2)
health[:services][:embeddings] = {
status: 'up',
latency_ms: embedding_latency
}
rescue StandardError => e
health[:services][:embeddings] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded' unless health[:status] == 'unhealthy' # don't mask a harder failure
end
# Check Sidekiq
begin
queue_size = Sidekiq::Queue.new('document_processing').size
health[:services][:sidekiq] = {
status: queue_size < 1000 ? 'up' : 'degraded',
queue_size: queue_size
}
rescue StandardError => e
health[:services][:sidekiq] = {
status: 'down',
error: e.message
}
end
status_code = case health[:status]
when 'healthy' then 200
when 'degraded' then 200
when 'unhealthy' then 503
end
render json: health, status: status_code
end
end
Kubernetes Liveness & Readiness Probes
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-search-api
spec:
  template:
    spec:
      containers:
        - name: rails
          image: vector-search-api:latest
          ports:
            - containerPort: 3000
          # Liveness probe (restart the container if unhealthy)
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe (remove the pod from the load balancer if not ready)
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
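One caveat on probe targets: /health returns 503 when PostgreSQL is unreachable, so with the config above Kubernetes will restart Rails pods during a database outage, which a restart can't fix. A common alternative is to keep readiness on the deep check and point liveness at a trivial endpoint that only confirms the process can serve requests. A minimal sketch, assuming a /livez route wired to this controller (neither exists in the original setup):

# app/controllers/livez_controller.rb (hypothetical liveness-only endpoint)
class LivezController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    # No external dependencies -- only proves the Rails process is responsive
    render plain: 'ok', status: :ok
  end
end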
Prometheus Metrics & Monitoring
Setting Up Prometheus
# config/initializers/prometheus.rb
require 'prometheus/client'
require 'prometheus/client/formats/text'
# Global prometheus registry
PROMETHEUS = Prometheus::Client.registry
# Vector search metrics
VECTOR_SEARCH_DURATION = PROMETHEUS.histogram(
:vector_search_duration_seconds,
docstring: 'Vector search duration in seconds',
labels: [:tenant_id, :status],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)
VECTOR_SEARCH_TOTAL = PROMETHEUS.counter(
:vector_search_total,
docstring: 'Total vector searches',
labels: [:tenant_id, :status]
)
VECTOR_SEARCH_CACHE_HIT = PROMETHEUS.counter(
:vector_search_cache_hit_total,
docstring: 'Cache hits for vector searches',
labels: [:tenant_id]
)
# Circuit breaker metrics
CIRCUIT_BREAKER_STATE = PROMETHEUS.gauge(
:circuit_breaker_state,
docstring: 'Circuit breaker state (0=closed, 1=open, 2=half_open)',
labels: [:tenant_id]
)
CIRCUIT_BREAKER_FAILURES = PROMETHEUS.counter(
:circuit_breaker_failures_total,
docstring: 'Total circuit breaker failures',
labels: [:tenant_id]
)
# Document processing metrics
DOCUMENT_PROCESSING_DURATION = PROMETHEUS.histogram(
:document_processing_duration_seconds,
docstring: 'Document processing duration in seconds',
labels: [:status],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)
DOCUMENT_PROCESSING_TOTAL = PROMETHEUS.counter(
:document_processing_total,
docstring: 'Total documents processed',
labels: [:status]
)
# Sidekiq queue metrics
SIDEKIQ_QUEUE_SIZE = PROMETHEUS.gauge(
:sidekiq_queue_size,
docstring: 'Current Sidekiq queue size',
labels: [:queue]
)
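The document-processing histogram and counter are declared above but never shown being recorded. Hooking them into the Sidekiq job from Part 1 could look roughly like this; the job class name and its perform signature are assumptions, not code from the series:

# app/jobs/document_processing_job.rb (sketch -- job name and arguments assumed)
class DocumentProcessingJob
  include Sidekiq::Job

  sidekiq_options queue: 'document_processing'

  def perform(document_id)
    started = Time.current
    # ... chunk, embed, and upsert the document (covered in Part 1) ...
    DOCUMENT_PROCESSING_DURATION.observe(Time.current - started, labels: { status: 'success' })
    DOCUMENT_PROCESSING_TOTAL.increment(labels: { status: 'success' })
  rescue StandardError
    DOCUMENT_PROCESSING_DURATION.observe(Time.current - started, labels: { status: 'error' })
    DOCUMENT_PROCESSING_TOTAL.increment(labels: { status: 'error' })
    raise
  end
end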
Instrumenting Vector Search
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
start_time = Time.current
begin
# Check cache first
cache_key = generate_cache_key(tenant_id, query, filters, limit) # include the tenant so cached results never leak across tenants
cached_result = Rails.cache.read(cache_key)
if cached_result
VECTOR_SEARCH_CACHE_HIT.increment(labels: { tenant_id: tenant_id })
return cached_result
end
# Perform search with resilience
limiter = RateLimiterRegistry.for_tenant(tenant_id)
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
result = limiter.acquire do
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
perform_vector_search(query, filters, limit, tenant_id)
end
end
# Cache result
Rails.cache.write(cache_key, result, expires_in: 1.hour)
# Record success
duration = Time.current - start_time
VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'success' })
VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'success' })
result
rescue StandardError => e
# Record failure
duration = Time.current - start_time
VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'error' })
VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'error' })
raise
end
end
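generate_cache_key isn't shown in the post. Whatever the real helper looks like, it should include the tenant so cached results stay isolated per tenant (the call above passes tenant_id for exactly that reason). A minimal sketch, not the actual implementation:

# A possible generate_cache_key -- the tenant prefix keeps cache entries tenant-scoped
def generate_cache_key(tenant_id, query, filters, limit)
  digest = Digest::SHA256.hexdigest([query, filters.sort, limit].to_json)
  "vector_search/#{tenant_id}/#{digest}"
end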
Metrics Endpoint
# config/routes.rb
Rails.application.routes.draw do
get '/metrics', to: 'metrics#show'
end
# app/controllers/metrics_controller.rb
class MetricsController < ApplicationController
skip_before_action :authenticate_user!
def show
# Update circuit breaker metrics
CircuitBreakerRegistry.stats.each do |name, stats|
state_value = case stats[:state]
when :closed then 0
when :open then 1
when :half_open then 2
end
CIRCUIT_BREAKER_STATE.set(state_value, labels: { tenant_id: name })
end
# Update Sidekiq queue metrics
Sidekiq::Queue.all.each do |queue|
SIDEKIQ_QUEUE_SIZE.set(queue.size, labels: { queue: queue.name })
end
# Return Prometheus format
render plain: Prometheus::Client::Formats::Text.marshal(PROMETHEUS),
content_type: Prometheus::Client::Formats::Text::CONTENT_TYPE
end
end
Grafana Dashboards
Key Panels
*1. Search Performance*
# P50 latency (aggregated across tenants via sum by (le))
histogram_quantile(0.50,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
# P95 latency
histogram_quantile(0.95,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
# P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
*2. Throughput*
# Requests per second
rate(vector_search_total[1m])
# By tenant
sum(rate(vector_search_total[5m])) by (tenant_id)
*3. Cache Hit Rate*
# Cache hit rate percentage (sum() so the label sets on both sides match)
100 * (
  sum(rate(vector_search_cache_hit_total[5m])) /
  sum(rate(vector_search_total[5m]))
)
*4. Circuit Breaker Status*
# Count of open circuits
count(circuit_breaker_state == 1)
# Circuit state by tenant
circuit_breaker_state
*5. Error Rate*
# Error percentage (sum() so the label sets on both sides match)
100 * (
  sum(rate(vector_search_total{status="error"}[5m])) /
  sum(rate(vector_search_total[5m]))
)
Dashboard JSON
See full dashboard configuration: examples/grafana-dashboard.json in the Vectra repo.
Production Metrics (Real Numbers)
After implementing resilience patterns:
Search Latency:
- P50: 45ms (cached: 3ms)
- P95: 120ms (cached: 8ms)
- P99: 250ms
Availability:
- Uptime: 99.94% (exceeding 99.9% SLA)
- MTTR: 8 minutes average
- Longest incident: 45 minutes (Qdrant upgrade)
Circuit Breaker Stats:
- Opened: 12 times in 2024
- Average recovery time: 2.5 minutes
- Prevented outages: 8 incidents
Cache Performance:
- Hit rate: 42%
- Memory usage: 2.1GB
- Evictions: 145/day
Rate Limiting:
- Triggered: 234 times/month
- Top abuser: 1,847 requests in 1 minute
- No successful DoS attacks
What We've Covered
✅ *Circuit Breakers* - Prevented 8 outages in 2024
✅ *Rate Limiting* - Per-tenant isolation, no DoS attacks
✅ *Health Checks* - Kubernetes-ready monitoring
✅ *Prometheus Metrics* - Real-time observability
✅ *Grafana Dashboards* - Beautiful visualizations
Coming in Part 3: Cost Optimization 💰
We've built a resilient system. Now let's talk money.
In *Part 3* (Friday), we'll cover:
- *Infrastructure Cost Breakdown* - $600/month vs $3,200/month (81% savings)
- *Cache Optimization* - 42% hit rate = 70% latency reduction
- *Chunking Strategy* - Why 512 tokens with 128 overlap beats 1000 tokens
- *Self-Hosting Embeddings* - $200/month savings vs OpenAI
- *5 Key Lessons Learned* - What worked, what didn't
*Spoiler:* The biggest savings came from caching, not from switching providers.
Resources
*Vectra Gem*: github.com/stokry/vectra
*Grafana Dashboard*: examples/grafana-dashboard.json
*Prometheus Exporter*: examples/prometheus-exporter.rb
Questions about production resilience? Share your own war stories in the comments!