*This is Part 2 of a 3-part series* on building production-ready vector search for enterprise SaaS.
Part 1: Architecture & Implementation - Multi-tenant document processing
*Part 2: Production Resilience & Monitoring* 👈 You are here
Part 3: Cost Optimization & Lessons Learned (Coming Friday)
*TL;DR*: Production-ready means more than just working code. This part covers circuit breakers, rate limiting, health checks, and monitoring that keep the system running through Black Friday traffic spikes and Qdrant outages.
The Black Friday Incident 🚨
*Date:* November 24, 2023, 2:47 PM EST
Our monitoring dashboard lit up red:
⚠️ Qdrant CPU: 98%
⚠️ Memory: 95%
⚠️ Query latency: 12,000ms (P95)
🔥 Error rate: 45%
*What happened:* A major client launched their compliance platform to 5,000 users simultaneously. Search traffic spiked from 800 req/min → 4,200 req/min. Our Qdrant cluster couldn't handle it.
*Without circuit breakers:* We would have:
- Hammered the failing Qdrant cluster
- Exhausted connection pools
- Brought down the entire Rails app
- Caused a 100% outage for all 150 clients
*With circuit breakers:*
- Detected 5 failures in 30 seconds
- Opened the circuit
- Served cached results (42% hit rate)
- Fell back to PostgreSQL full-text search
- *99.2% of searches still worked*
- MTTR: 12 minutes
Let me show you how we built this resilience.
Circuit Breaker Pattern
The Implementation
# lib/circuit_breaker_registry.rb
require 'concurrent' # Concurrent::Map comes from the concurrent-ruby gem
class CircuitBreakerRegistry
@breakers = Concurrent::Map.new
def self.for_tenant(tenant_id)
@breakers.compute_if_absent("tenant_#{tenant_id}") do
Vectra::CircuitBreaker.new(
name: "tenant_#{tenant_id}",
failure_threshold: 5, # Open after 5 failures
recovery_timeout: 60, # Try again after 60s
success_threshold: 3 # Close after 3 successes
)
end
end
def self.stats
@breakers.each_pair.map do |name, breaker|
[name, breaker.stats]
end.to_h
end
def self.reset_all!
@breakers.each_value(&:reset!)
end
end
Using Circuit Breakers in Vector Search
# app/services/vector_indexing_service.rb
require 'ostruct' # OpenStruct (used by the fallback below) needs an explicit require
class VectorIndexingService
def search(tenant_id:, query:, filters: {}, limit: 20)
query_embedding = EmbeddingService.new.generate(query)
namespace = namespace_for_tenant(tenant_id)
# Get circuit breaker for this tenant
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
# Execute with fallback
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
@client.query(
index: 'compliance_documents',
vector: query_embedding,
top_k: limit,
namespace: namespace,
filter: filters,
include_metadata: true
)
end
end
private
def fallback_search(tenant_id, query, limit)
# Fallback to PostgreSQL full-text search
Document.where(tenant_id: tenant_id)
.where("to_tsvector('english', title || ' ' || content) @@ plainto_tsquery(?)", query)
.limit(limit)
.map { |doc| convert_to_vector_result(doc) }
end
def convert_to_vector_result(doc)
# Convert Document to QueryResult format
OpenStruct.new(
id: doc.id,
score: 0.5, # Fallback score
metadata: {
'document_id' => doc.id,
'tenant_id' => doc.tenant_id,
'title' => doc.title,
'chunk_text' => doc.content[0..500]
}
)
end
end
Circuit Breaker States
┌───────────┐
│  CLOSED   │ ← Normal operation (requests pass through)
└─────┬─────┘
      │ (5 failures detected)
      ↓
┌───────────┐
│   OPEN    │ ← Requests fail immediately, use fallback
└─────┬─────┘
      │ (60 seconds elapsed)
      ↓
┌───────────┐
│ HALF_OPEN │ ← Limited requests allowed to test recovery
└─────┬─────┘
      ├─ (3 successes) → back to CLOSED
      └─ (1 failure)   → back to OPEN
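To make the diagram concrete, here's a stripped-down sketch of the state machine. It's illustrative only, not the Vectra::CircuitBreaker source, but it follows the same thresholds: open after 5 consecutive failures, probe again after 60 seconds, close after 3 successes.

# Illustrative only -- a minimal circuit breaker, not the Vectra internals
class TinyBreaker
  CircuitOpenError = Class.new(StandardError)

  def initialize(failure_threshold: 5, recovery_timeout: 60, success_threshold: 3)
    @failure_threshold = failure_threshold
    @recovery_timeout = recovery_timeout
    @success_threshold = success_threshold
    @state = :closed
    @failures = 0
    @successes = 0
    @opened_at = nil
  end

  def call(fallback: nil)
    if @state == :open
      if Time.now - @opened_at >= @recovery_timeout
        @state = :half_open    # timeout elapsed: let a probe request through
        @successes = 0
      else
        # Short-circuit: don't touch the failing dependency at all
        return fallback ? fallback.call : raise(CircuitOpenError)
      end
    end

    begin
      result = yield
      record_success
      result
    rescue StandardError
      record_failure
      fallback ? fallback.call : raise
    end
  end

  private

  def record_success
    if @state == :half_open
      @successes += 1
      if @successes >= @success_threshold
        @state = :closed       # enough successful probes: resume normal traffic
        @failures = 0
      end
    else
      @failures = 0            # a success in CLOSED resets the consecutive-failure count
    end
  end

  def record_failure
    @failures += 1
    if @state == :half_open || @failures >= @failure_threshold
      @state = :open           # trip (or re-trip) the breaker
      @opened_at = Time.now
    end
  end
end

The behavior that matters most is the half-open probe: a single failure there sends the breaker straight back to open, so a still-struggling Qdrant cluster never gets re-flooded with traffic.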
Monitoring Circuit Breaker States
# app/controllers/admin/circuit_breakers_controller.rb
module Admin
class CircuitBreakersController < AdminController
def index
@breakers = CircuitBreakerRegistry.stats
render json: {
breakers: @breakers.map do |name, stats|
{
name: name,
state: stats[:state],
failures: stats[:failures],
successes: stats[:successes],
last_failure_at: stats[:last_failure_at],
opened_at: stats[:opened_at]
}
end,
summary: {
total: @breakers.size,
open: @breakers.count { |_, s| s[:state] == :open },
half_open: @breakers.count { |_, s| s[:state] == :half_open }
}
}
end
def reset
CircuitBreakerRegistry.reset_all!
render json: { message: 'All circuit breakers reset' }
end
end
end
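For completeness, the routes backing this controller aren't shown in the post; a plausible wiring (the paths here are an assumption) would be:

# config/routes.rb (assumed routes for the admin endpoints above)
namespace :admin do
  resources :circuit_breakers, only: [:index] do
    collection do
      post :reset
    end
  end
end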
Rate Limiting (Per-Tenant)
Why Per-Tenant Rate Limiting?
*Problem:* A single tenant running a poorly configured script could DoS the entire platform.
*Solution:* Isolate rate limits per tenant.
# lib/rate_limiter_registry.rb
require 'concurrent' # Concurrent::Map comes from the concurrent-ruby gem
class RateLimiterRegistry
@limiters = Concurrent::Map.new
def self.for_tenant(tenant_id)
@limiters.compute_if_absent("tenant_#{tenant_id}") do
# 500 requests per minute per tenant
Vectra::RateLimiter.new(
requests_per_second: 8.33, # 500/60
burst_size: 20
)
end
end
def self.stats
@limiters.each_pair.map do |name, limiter|
[name, limiter.stats]
end.to_h
end
end
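The post doesn't show Vectra::RateLimiter's internals, but requests_per_second plus burst_size reads like a classic token bucket: tokens refill at 8.33 per second, the bucket holds at most 20, and each request spends one token. Here's a rough, illustrative sketch of that idea (not the gem's actual implementation):

# Illustrative token bucket -- roughly how requests_per_second + burst_size behave
class TinyTokenBucket
  def initialize(requests_per_second:, burst_size:)
    @rate = requests_per_second
    @capacity = burst_size
    @tokens = burst_size.to_f
    @last_refill = Time.now
    @mutex = Mutex.new
  end

  # Returns true and consumes a token if the request is allowed, false otherwise
  def allow?
    @mutex.synchronize do
      now = Time.now
      # Refill based on elapsed time, capped at the burst size
      @tokens = [@capacity, @tokens + (now - @last_refill) * @rate].min
      @last_refill = now

      if @tokens >= 1
        @tokens -= 1
        true
      else
        false
      end
    end
  end
end

bucket = TinyTokenBucket.new(requests_per_second: 8.33, burst_size: 20)
bucket.allow? # => true while tokens remain; false once the burst is spent faster than it refills

The burst size of 20 is what lets a tenant fire a short spike of requests without being throttled, while the refill rate enforces the 500 req/min average.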
Using Rate Limiting
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
# Get rate limiter for this tenant
limiter = RateLimiterRegistry.for_tenant(tenant_id)
# Acquire rate limit token
limiter.acquire do
# Get circuit breaker
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
# Execute search with resilience
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
perform_vector_search(query, filters, limit, tenant_id)
end
end
rescue Vectra::RateLimitError => e
# Return 429 Too Many Requests
raise SearchRateLimitExceeded.new(
tenant_id: tenant_id,
retry_after: limiter.retry_after
)
end
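SearchRateLimitExceeded is raised above but never defined in the post. A minimal definition that supports the keyword arguments and the retry_after reader used by the controller below might look like this (an assumption about the app's error class):

# app/errors/search_rate_limit_exceeded.rb (assumed definition)
class SearchRateLimitExceeded < StandardError
  attr_reader :tenant_id, :retry_after

  def initialize(tenant_id:, retry_after:)
    @tenant_id = tenant_id
    @retry_after = retry_after
    super("Rate limit exceeded for tenant #{tenant_id}, retry after #{retry_after}s")
  end
end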
Rate Limit Headers
# app/controllers/api/v1/search_controller.rb
def create
limiter = RateLimiterRegistry.for_tenant(current_tenant.id)
# Add rate limit headers
response.headers['X-RateLimit-Limit'] = '500'
response.headers['X-RateLimit-Remaining'] = limiter.remaining.to_s
response.headers['X-RateLimit-Reset'] = limiter.reset_at.to_i.to_s
# ... search logic ...
rescue SearchRateLimitExceeded => e
response.headers['Retry-After'] = e.retry_after.to_s
render json: {
error: 'Rate limit exceeded',
retry_after: e.retry_after
}, status: :too_many_requests
end
Health Checks
Comprehensive Health Check Endpoint
# app/controllers/health_controller.rb
class HealthController < ApplicationController
skip_before_action :authenticate_user!
def show
health = {
status: 'healthy',
timestamp: Time.current.iso8601,
version: Vectra::VERSION,
services: {}
}
# Check Vector DB
begin
vectra_client = Vectra.qdrant(host: ENV['QDRANT_HOST'])
vectra_health = vectra_client.health_check
health[:services][:vector_db] = {
status: vectra_health.healthy? ? 'up' : 'down',
latency_ms: vectra_health.latency_ms,
provider: vectra_health.provider,
indexes: vectra_health.indexes_available
}
rescue StandardError => e
health[:services][:vector_db] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded'
end
# Check Database
begin
start = Time.current
ActiveRecord::Base.connection.execute('SELECT 1')
db_latency = ((Time.current - start) * 1000).round(2)
health[:services][:database] = {
status: 'up',
latency_ms: db_latency
}
rescue StandardError => e
health[:services][:database] = {
status: 'down',
error: e.message
}
health[:status] = 'unhealthy'
end
# Check Redis
begin
start = Time.current
Rails.cache.redis.ping
redis_latency = ((Time.current - start) * 1000).round(2)
health[:services][:redis] = {
status: 'up',
latency_ms: redis_latency
}
rescue StandardError => e
health[:services][:redis] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded' unless health[:status] == 'unhealthy' # don't mask a harder failure
end
# Check Embedding Service
begin
start = Time.current
EmbeddingService.new.generate('test')
embedding_latency = ((Time.current - start) * 1000).round(2)
health[:services][:embeddings] = {
status: 'up',
latency_ms: embedding_latency
}
rescue StandardError => e
health[:services][:embeddings] = {
status: 'down',
error: e.message
}
health[:status] = 'degraded' unless health[:status] == 'unhealthy' # don't mask a harder failure
end
# Check Sidekiq
begin
queue_size = Sidekiq::Queue.new('document_processing').size
health[:services][:sidekiq] = {
status: queue_size < 1000 ? 'up' : 'degraded',
queue_size: queue_size
}
rescue StandardError => e
health[:services][:sidekiq] = {
status: 'down',
error: e.message
}
end
status_code = case health[:status]
when 'healthy' then 200
when 'degraded' then 200
when 'unhealthy' then 503
end
render json: health, status: status_code
end
end
Kubernetes Liveness & Readiness Probes
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-search-api
spec:
  template:
    spec:
      containers:
        - name: rails
          image: vector-search-api:latest
          ports:
            - containerPort: 3000
          # Liveness probe (restart the container if unhealthy)
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Readiness probe (remove the pod from the load balancer if not ready)
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
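One caveat on probe targets: /health returns 503 when PostgreSQL is unreachable, so with the config above Kubernetes will restart Rails pods during a database outage, which a restart can't fix. A common alternative is to keep readiness on the deep check and point liveness at a trivial endpoint that only confirms the process can serve requests. A minimal sketch, assuming a /livez route wired to this controller (neither exists in the original setup):

# app/controllers/livez_controller.rb (hypothetical liveness-only endpoint)
class LivezController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    # No external dependencies -- only proves the Rails process is responsive
    render plain: 'ok', status: :ok
  end
end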
Prometheus Metrics & Monitoring
Setting Up Prometheus
# config/initializers/prometheus.rb
require 'prometheus/client'
require 'prometheus/client/formats/text'
# Global prometheus registry
PROMETHEUS = Prometheus::Client.registry
# Vector search metrics
VECTOR_SEARCH_DURATION = PROMETHEUS.histogram(
:vector_search_duration_seconds,
docstring: 'Vector search duration in seconds',
labels: [:tenant_id, :status],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)
VECTOR_SEARCH_TOTAL = PROMETHEUS.counter(
:vector_search_total,
docstring: 'Total vector searches',
labels: [:tenant_id, :status]
)
VECTOR_SEARCH_CACHE_HIT = PROMETHEUS.counter(
:vector_search_cache_hit_total,
docstring: 'Cache hits for vector searches',
labels: [:tenant_id]
)
# Circuit breaker metrics
CIRCUIT_BREAKER_STATE = PROMETHEUS.gauge(
:circuit_breaker_state,
docstring: 'Circuit breaker state (0=closed, 1=open, 2=half_open)',
labels: [:tenant_id]
)
CIRCUIT_BREAKER_FAILURES = PROMETHEUS.counter(
:circuit_breaker_failures_total,
docstring: 'Total circuit breaker failures',
labels: [:tenant_id]
)
# Document processing metrics
DOCUMENT_PROCESSING_DURATION = PROMETHEUS.histogram(
:document_processing_duration_seconds,
docstring: 'Document processing duration in seconds',
labels: [:status],
buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)
DOCUMENT_PROCESSING_TOTAL = PROMETHEUS.counter(
:document_processing_total,
docstring: 'Total documents processed',
labels: [:status]
)
# Sidekiq queue metrics
SIDEKIQ_QUEUE_SIZE = PROMETHEUS.gauge(
:sidekiq_queue_size,
docstring: 'Current Sidekiq queue size',
labels: [:queue]
)
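The document-processing histogram and counter are declared above but never shown being recorded. Hooking them into the Sidekiq job from Part 1 could look roughly like this; the job class name and its perform signature are assumptions, not code from the series:

# app/jobs/document_processing_job.rb (sketch -- job name and arguments assumed)
class DocumentProcessingJob
  include Sidekiq::Job

  sidekiq_options queue: 'document_processing'

  def perform(document_id)
    started = Time.current
    # ... chunk, embed, and upsert the document (covered in Part 1) ...
    DOCUMENT_PROCESSING_DURATION.observe(Time.current - started, labels: { status: 'success' })
    DOCUMENT_PROCESSING_TOTAL.increment(labels: { status: 'success' })
  rescue StandardError
    DOCUMENT_PROCESSING_DURATION.observe(Time.current - started, labels: { status: 'error' })
    DOCUMENT_PROCESSING_TOTAL.increment(labels: { status: 'error' })
    raise
  end
end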
Instrumenting Vector Search
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
start_time = Time.current
begin
# Check cache first
cache_key = generate_cache_key(tenant_id, query, filters, limit) # include the tenant so cached results never leak across tenants
cached_result = Rails.cache.read(cache_key)
if cached_result
VECTOR_SEARCH_CACHE_HIT.increment(labels: { tenant_id: tenant_id })
return cached_result
end
# Perform search with resilience
limiter = RateLimiterRegistry.for_tenant(tenant_id)
breaker = CircuitBreakerRegistry.for_tenant(tenant_id)
result = limiter.acquire do
breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
perform_vector_search(query, filters, limit, tenant_id)
end
end
# Cache result
Rails.cache.write(cache_key, result, expires_in: 1.hour)
# Record success
duration = Time.current - start_time
VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'success' })
VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'success' })
result
rescue StandardError => e
# Record failure
duration = Time.current - start_time
VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'error' })
VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'error' })
raise
end
end
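generate_cache_key isn't shown in the post. Whatever the real helper looks like, it should include the tenant so cached results stay isolated per tenant (the call above passes tenant_id for exactly that reason). A minimal sketch, not the actual implementation:

# A possible generate_cache_key -- the tenant prefix keeps cache entries tenant-scoped
def generate_cache_key(tenant_id, query, filters, limit)
  digest = Digest::SHA256.hexdigest([query, filters.sort, limit].to_json)
  "vector_search/#{tenant_id}/#{digest}"
end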
Metrics Endpoint
# config/routes.rb
Rails.application.routes.draw do
get '/metrics', to: 'metrics#show'
end
# app/controllers/metrics_controller.rb
class MetricsController < ApplicationController
skip_before_action :authenticate_user!
def show
# Update circuit breaker metrics
CircuitBreakerRegistry.stats.each do |name, stats|
state_value = case stats[:state]
when :closed then 0
when :open then 1
when :half_open then 2
end
CIRCUIT_BREAKER_STATE.set(state_value, labels: { tenant_id: name })
end
# Update Sidekiq queue metrics
Sidekiq::Queue.all.each do |queue|
SIDEKIQ_QUEUE_SIZE.set(queue.size, labels: { queue: queue.name })
end
# Return Prometheus format
render plain: Prometheus::Client::Formats::Text.marshal(PROMETHEUS),
content_type: Prometheus::Client::Formats::Text::CONTENT_TYPE
end
end
Grafana Dashboards
Key Panels
*1. Search Performance*
# P50 latency (aggregated across tenants via sum by (le))
histogram_quantile(0.50,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
# P95 latency
histogram_quantile(0.95,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
# P99 latency
histogram_quantile(0.99,
  sum by (le) (rate(vector_search_duration_seconds_bucket[5m])))
*2. Throughput*
# Requests per second
rate(vector_search_total[1m])
# By tenant
sum(rate(vector_search_total[5m])) by (tenant_id)
*3. Cache Hit Rate*
# Cache hit rate percentage (sum() so the label sets on both sides match)
100 * (
  sum(rate(vector_search_cache_hit_total[5m])) /
  sum(rate(vector_search_total[5m]))
)
*4. Circuit Breaker Status*
# Count of open circuits
count(circuit_breaker_state == 1)
# Circuit state by tenant
circuit_breaker_state
*5. Error Rate*
# Error percentage (sum() so the label sets on both sides match)
100 * (
  sum(rate(vector_search_total{status="error"}[5m])) /
  sum(rate(vector_search_total[5m]))
)
Dashboard JSON
See full dashboard configuration: examples/grafana-dashboard.json in the Vectra repo.
Production Metrics (Real Numbers)
After implementing resilience patterns:
Search Latency:
- P50: 45ms (cached: 3ms)
- P95: 120ms (cached: 8ms)
- P99: 250ms
Availability:
- Uptime: 99.94% (exceeding 99.9% SLA)
- MTTR: 8 minutes average
- Longest incident: 45 minutes (Qdrant upgrade)
Circuit Breaker Stats:
- Opened: 12 times in 2024
- Average recovery time: 2.5 minutes
- Prevented outages: 8 incidents
Cache Performance:
- Hit rate: 42%
- Memory usage: 2.1GB
- Evictions: 145/day
Rate Limiting:
- Triggered: 234 times/month
- Top abuser: 1,847 requests in 1 minute
- No successful DoS attacks
What We've Covered
✅ *Circuit Breakers* - Prevented 8 outages in 2024
✅ *Rate Limiting* - Per-tenant isolation, no DoS attacks
✅ *Health Checks* - Kubernetes-ready monitoring
✅ *Prometheus Metrics* - Real-time observability
✅ *Grafana Dashboards* - Beautiful visualizations
Coming in Part 3: Cost Optimization 💰
We've built a resilient system. Now let's talk money.
In *Part 3* (Friday), we'll cover:
- *Infrastructure Cost Breakdown* - $600/month vs $3,200/month (81% savings)
- *Cache Optimization* - 42% hit rate = 70% latency reduction
- *Chunking Strategy* - Why 512 tokens with 128 overlap beats 1000 tokens
- *Self-Hosting Embeddings* - $200/month savings vs OpenAI
- *5 Key Lessons Learned* - What worked, what didn't
*Spoiler:* The biggest savings came from caching, not from switching providers.
Resources
*Vectra Gem*: github.com/stokry/vectra
*Grafana Dashboard*: examples/grafana-dashboard.json
*Prometheus Exporter*: examples/prometheus-exporter.rb
Questions about production resilience? Share your own war stories in the comments!