Building Enterprise Vector Search in Rails (Part 2/3): Production Resilience & Monitoring

*This is Part 2 of a 3-part series* on building production-ready vector search for enterprise SaaS.

  • Part 1: Architecture & Implementation - Multi-tenant document processing

  • *Part 2: Production Resilience & Monitoring* 👈 You are here

  • Part 3: Cost Optimization & Lessons Learned (Coming Friday)

*TL;DR*: Production-ready means more than just working code. This part covers circuit breakers, rate limiting, health checks, and monitoring that keep the system running through Black Friday traffic spikes and Qdrant outages.


The Black Friday Incident 🚨

*Date:* November 24, 2023, 2:47 PM EST

Our monitoring dashboard lit up red:


```
⚠️  Qdrant CPU: 98%
⚠️  Memory: 95%
⚠️  Query latency: 12,000ms (P95)
🔥 Error rate: 45%
```

*What happened:* A major client launched their compliance platform to 5,000 users simultaneously. Search traffic spiked from 800 req/min to 4,200 req/min, and our Qdrant cluster couldn't handle it.

*Without circuit breakers*, we would have:

  • Hammered the failing Qdrant cluster

  • Exhausted connection pools

  • Brought down the entire Rails app

  • 100% outage for all 150 clients

*With circuit breakers:*

  • Detected 5 failures in 30 seconds

  • Opened the circuit

  • Served cached results (42% hit rate)

  • Fell back to PostgreSQL full-text search

  • *99.2% of searches still worked*

  • MTTR: 12 minutes

Let me show you how we built this resilience.


Circuit Breaker Pattern

The Implementation


```ruby
# lib/circuit_breaker_registry.rb
class CircuitBreakerRegistry
  @breakers = Concurrent::Map.new

  def self.for_tenant(tenant_id)
    @breakers.compute_if_absent("tenant_#{tenant_id}") do
      Vectra::CircuitBreaker.new(
        name: "tenant_#{tenant_id}",
        failure_threshold: 5,   # Open after 5 failures
        recovery_timeout: 60,   # Try again after 60s
        success_threshold: 3    # Close after 3 successes
      )
    end
  end

  def self.stats
    @breakers.each_pair.map do |name, breaker|
      [name, breaker.stats]
    end.to_h
  end

  def self.reset_all!
    @breakers.each_value(&:reset!)
  end
end
```

Using Circuit Breakers in Vector Search


```ruby
# app/services/vector_indexing_service.rb
require 'ostruct' # OpenStruct is used for the fallback result shim below

class VectorIndexingService
  def search(tenant_id:, query:, filters: {}, limit: 20)
    query_embedding = EmbeddingService.new.generate(query)
    namespace = namespace_for_tenant(tenant_id)

    # Get the circuit breaker for this tenant
    breaker = CircuitBreakerRegistry.for_tenant(tenant_id)

    # Execute with fallback
    breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
      @client.query(
        index: 'compliance_documents',
        vector: query_embedding,
        top_k: limit,
        namespace: namespace,
        filter: filters,
        include_metadata: true
      )
    end
  end

  private

  def fallback_search(tenant_id, query, limit)
    # Fall back to PostgreSQL full-text search
    Document.where(tenant_id: tenant_id)
            .where("to_tsvector('english', title || ' ' || content) @@ plainto_tsquery(?)", query)
            .limit(limit)
            .map { |doc| convert_to_vector_result(doc) }
  end

  def convert_to_vector_result(doc)
    # Convert a Document to the QueryResult shape callers expect
    OpenStruct.new(
      id: doc.id,
      score: 0.5, # Neutral fallback score
      metadata: {
        'document_id' => doc.id,
        'tenant_id' => doc.tenant_id,
        'title' => doc.title,
        'chunk_text' => doc.content[0..500]
      }
    )
  end
end
```

Circuit Breaker States


```
┌──────────┐
│  CLOSED  │  ← Normal operation (requests pass through)
└──────────┘
      │
      │ (5 failures detected)
      ↓
┌──────────┐
│   OPEN   │  ← Requests fail immediately; the fallback is used
└──────────┘
      │
      │ (60 seconds elapsed)
      ↓
┌──────────┐
│ HALF_OPEN│  ← Limited requests allowed to test recovery
└──────────┘
      │
      ├─ (3 successes) → back to CLOSED
      └─ (1 failure)   → back to OPEN
```
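Vectra's breaker internals aren't shown in this post, but a minimal sketch of the state machine above, using the same thresholds, might look like this (names and details are illustrative, not the gem's actual source):

```ruby
# A minimal, thread-safe circuit breaker sketch matching the diagram above.
# Illustrative only: Vectra::CircuitBreaker's real implementation may differ.
class SimpleCircuitBreaker
  class OpenCircuitError < StandardError; end

  def initialize(failure_threshold: 5, recovery_timeout: 60, success_threshold: 3)
    @failure_threshold = failure_threshold
    @recovery_timeout  = recovery_timeout
    @success_threshold = success_threshold
    @mutex = Mutex.new
    reset!
  end

  def call(fallback: nil)
    if open?
      # OPEN: fail fast, or serve the fallback if one was given
      raise OpenCircuitError, 'circuit open' unless fallback
      return fallback.call
    end

    result = yield
    record_success
    result
  rescue OpenCircuitError
    raise
  rescue StandardError
    record_failure
    raise unless fallback
    fallback.call
  end

  def reset!
    @mutex.synchronize do
      @state = :closed
      @failures = 0
      @successes = 0
      @opened_at = nil
    end
  end

  private

  def open?
    @mutex.synchronize do
      # After the recovery timeout, move OPEN -> HALF_OPEN to allow probes
      if @state == :open && Time.now - @opened_at >= @recovery_timeout
        @state = :half_open
        @successes = 0
      end
      @state == :open
    end
  end

  def record_success
    @mutex.synchronize do
      if @state == :half_open
        @successes += 1
        if @successes >= @success_threshold
          @state = :closed # HALF_OPEN -> CLOSED after enough successes
          @failures = 0
        end
      else
        @failures = 0
      end
    end
  end

  def record_failure
    @mutex.synchronize do
      @failures += 1
      if @state == :half_open || @failures >= @failure_threshold
        @state = :open # Any half-open failure, or too many failures, opens it
        @opened_at = Time.now
      end
    end
  end
end
```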

Monitoring Circuit Breaker States


```ruby
# app/controllers/admin/circuit_breakers_controller.rb
module Admin
  class CircuitBreakersController < AdminController
    def index
      @breakers = CircuitBreakerRegistry.stats

      render json: {
        breakers: @breakers.map do |name, stats|
          {
            name: name,
            state: stats[:state],
            failures: stats[:failures],
            successes: stats[:successes],
            last_failure_at: stats[:last_failure_at],
            opened_at: stats[:opened_at]
          }
        end,
        summary: {
          total: @breakers.size,
          open: @breakers.count { |_, s| s[:state] == :open },
          half_open: @breakers.count { |_, s| s[:state] == :half_open }
        }
      }
    end

    def reset
      CircuitBreakerRegistry.reset_all!
      render json: { message: 'All circuit breakers reset' }
    end
  end
end
```
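The routes for these admin endpoints aren't shown in the post; a plausible wiring (hypothetical path names) would be:

```ruby
# config/routes.rb — hypothetical routing for the admin endpoints above
namespace :admin do
  resources :circuit_breakers, only: [:index] do
    collection do
      post :reset
    end
  end
end
```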

Rate Limiting (Per-Tenant)

Why Per-Tenant Rate Limiting?

*Problem:* A single tenant running a poorly configured script could DoS the entire platform.

*Solution:* Isolate rate limits per tenant.


```ruby
# lib/rate_limiter_registry.rb
class RateLimiterRegistry
  @limiters = Concurrent::Map.new

  def self.for_tenant(tenant_id)
    @limiters.compute_if_absent("tenant_#{tenant_id}") do
      # 500 requests per minute per tenant
      Vectra::RateLimiter.new(
        requests_per_second: 8.33, # 500 / 60
        burst_size: 20
      )
    end
  end

  def self.stats
    @limiters.each_pair.map do |name, limiter|
      [name, limiter.stats]
    end.to_h
  end
end
```
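The `requests_per_second` plus `burst_size` pair suggests a token bucket. Vectra's internals aren't shown here, but a minimal sketch of that algorithm looks like this (illustrative, not the gem's source):

```ruby
# A minimal token-bucket sketch of the limiter used above.
class TokenBucket
  def initialize(requests_per_second:, burst_size:)
    @rate = requests_per_second.to_f
    @capacity = burst_size.to_f
    @tokens = @capacity
    @updated_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @mutex = Mutex.new
  end

  # Returns true and spends a token if one is available, false otherwise.
  def try_acquire
    @mutex.synchronize do
      refill
      return false if @tokens < 1.0

      @tokens -= 1.0
      true
    end
  end

  # Seconds until the next token becomes available (for Retry-After headers).
  def retry_after
    @mutex.synchronize do
      refill
      @tokens >= 1.0 ? 0 : ((1.0 - @tokens) / @rate).ceil
    end
  end

  private

  def refill
    # Add tokens for the time elapsed since the last call, up to capacity
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @tokens = [@capacity, @tokens + (now - @updated_at) * @rate].min
    @updated_at = now
  end
end
```

One caveat: an in-memory bucket limits each Rails process independently, so with several Puma workers the effective platform-wide limit is a multiple of 500/min. If the limit must be global, the bucket state needs to live in a shared store such as Redis.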

Using Rate Limiting


```ruby
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
  # Get the rate limiter for this tenant
  limiter = RateLimiterRegistry.for_tenant(tenant_id)

  # Acquire a rate-limit token, then run the search inside the breaker
  limiter.acquire do
    breaker = CircuitBreakerRegistry.for_tenant(tenant_id)

    breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
      perform_vector_search(query, filters, limit, tenant_id)
    end
  end
rescue Vectra::RateLimitError
  # Surface as a domain error; the controller turns it into 429 Too Many Requests
  raise SearchRateLimitExceeded.new(
    tenant_id: tenant_id,
    retry_after: limiter.retry_after
  )
end
```
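`SearchRateLimitExceeded` is referenced but never defined in the post; a minimal version consistent with how it's used above might be:

```ruby
# app/errors/search_rate_limit_exceeded.rb — a sketch (assumed, not shown
# in the original post)
class SearchRateLimitExceeded < StandardError
  attr_reader :tenant_id, :retry_after

  def initialize(tenant_id:, retry_after:)
    @tenant_id = tenant_id
    @retry_after = retry_after
    super("Rate limit exceeded for tenant #{tenant_id}, retry after #{retry_after}s")
  end
end
```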

Rate Limit Headers


```ruby
# app/controllers/api/v1/search_controller.rb
def create
  limiter = RateLimiterRegistry.for_tenant(current_tenant.id)

  # Add rate limit headers to every response
  response.headers['X-RateLimit-Limit'] = '500'
  response.headers['X-RateLimit-Remaining'] = limiter.remaining.to_s
  response.headers['X-RateLimit-Reset'] = limiter.reset_at.to_i.to_s

  # ... search logic ...
rescue SearchRateLimitExceeded => e
  response.headers['Retry-After'] = e.retry_after.to_s
  render json: {
    error: 'Rate limit exceeded',
    retry_after: e.retry_after
  }, status: :too_many_requests
end
```
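With these headers in place, well-behaved clients can back off before hitting the limit by watching `X-RateLimit-Remaining`, and use `Retry-After` on a 429 to schedule the next attempt instead of retrying in a tight loop.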

Health Checks

Comprehensive Health Check Endpoint


```ruby
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    health = {
      status: 'healthy',
      timestamp: Time.current.iso8601,
      version: Vectra::VERSION,
      services: {}
    }

    # Check the vector DB
    begin
      vectra_client = Vectra.qdrant(host: ENV['QDRANT_HOST'])
      vectra_health = vectra_client.health_check

      health[:services][:vector_db] = {
        status: vectra_health.healthy? ? 'up' : 'down',
        latency_ms: vectra_health.latency_ms,
        provider: vectra_health.provider,
        indexes: vectra_health.indexes_available
      }
    rescue StandardError => e
      health[:services][:vector_db] = { status: 'down', error: e.message }
      health[:status] = 'degraded'
    end

    # Check the database
    begin
      start = Time.current
      ActiveRecord::Base.connection.execute('SELECT 1')
      db_latency = ((Time.current - start) * 1000).round(2)

      health[:services][:database] = { status: 'up', latency_ms: db_latency }
    rescue StandardError => e
      health[:services][:database] = { status: 'down', error: e.message }
      health[:status] = 'unhealthy'
    end

    # Check Redis
    begin
      start = Time.current
      Rails.cache.redis.ping
      redis_latency = ((Time.current - start) * 1000).round(2)

      health[:services][:redis] = { status: 'up', latency_ms: redis_latency }
    rescue StandardError => e
      health[:services][:redis] = { status: 'down', error: e.message }
      health[:status] = 'degraded'
    end

    # Check the embedding service
    # (note: this makes a real embedding call on every probe)
    begin
      start = Time.current
      EmbeddingService.new.generate('test')
      embedding_latency = ((Time.current - start) * 1000).round(2)

      health[:services][:embeddings] = { status: 'up', latency_ms: embedding_latency }
    rescue StandardError => e
      health[:services][:embeddings] = { status: 'down', error: e.message }
      health[:status] = 'degraded'
    end

    # Check Sidekiq
    begin
      queue_size = Sidekiq::Queue.new('document_processing').size

      health[:services][:sidekiq] = {
        status: queue_size < 1000 ? 'up' : 'degraded',
        queue_size: queue_size
      }
    rescue StandardError => e
      health[:services][:sidekiq] = { status: 'down', error: e.message }
    end

    status_code = case health[:status]
                  when 'healthy', 'degraded' then 200
                  when 'unhealthy' then 503
                  end

    render json: health, status: status_code
  end
end
```
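Since the check logic is all hand-rolled, a small request spec helps keep it honest as services are added; a minimal sketch (assuming RSpec with `rails_helper` is set up) could be:

```ruby
# spec/requests/health_spec.rb — a minimal smoke test for the endpoint above
require 'rails_helper'

RSpec.describe 'GET /health', type: :request do
  it 'reports an overall status plus per-service checks' do
    get '/health'

    body = JSON.parse(response.body)
    expect(%w[healthy degraded unhealthy]).to include(body['status'])
    expect(body['services'].keys)
      .to include('vector_db', 'database', 'redis', 'embeddings', 'sidekiq')
  end
end
```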

Kubernetes Liveness & Readiness Probes


```yaml
# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vector-search-api
spec:
  template:
    spec:
      containers:
        - name: rails
          image: vector-search-api:latest
          ports:
            - containerPort: 3000

          # Liveness probe (restart if unhealthy)
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3

          # Readiness probe (remove from load balancer if not ready)
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 2
```
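One subtlety worth noting: the health endpoint returns 200 for both `healthy` and `degraded`, so a pod whose Redis or Qdrant check fails stays in rotation and keeps serving requests through the fallbacks. Only `unhealthy` (the primary database down) returns 503, which is what fails the probes above and triggers a restart or removal from the load balancer.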

Prometheus Metrics & Monitoring

Setting Up Prometheus


```ruby
# config/initializers/prometheus.rb
require 'prometheus/client'
require 'prometheus/client/formats/text'

# Global Prometheus registry
PROMETHEUS = Prometheus::Client.registry

# Vector search metrics
VECTOR_SEARCH_DURATION = PROMETHEUS.histogram(
  :vector_search_duration_seconds,
  docstring: 'Vector search duration in seconds',
  labels: [:tenant_id, :status],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)

VECTOR_SEARCH_TOTAL = PROMETHEUS.counter(
  :vector_search_total,
  docstring: 'Total vector searches',
  labels: [:tenant_id, :status]
)

VECTOR_SEARCH_CACHE_HIT = PROMETHEUS.counter(
  :vector_search_cache_hit_total,
  docstring: 'Cache hits for vector searches',
  labels: [:tenant_id]
)

# Circuit breaker metrics
CIRCUIT_BREAKER_STATE = PROMETHEUS.gauge(
  :circuit_breaker_state,
  docstring: 'Circuit breaker state (0=closed, 1=open, 2=half_open)',
  labels: [:tenant_id]
)

CIRCUIT_BREAKER_FAILURES = PROMETHEUS.counter(
  :circuit_breaker_failures_total,
  docstring: 'Total circuit breaker failures',
  labels: [:tenant_id]
)

# Document processing metrics
DOCUMENT_PROCESSING_DURATION = PROMETHEUS.histogram(
  :document_processing_duration_seconds,
  docstring: 'Document processing duration in seconds',
  labels: [:status],
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)

DOCUMENT_PROCESSING_TOTAL = PROMETHEUS.counter(
  :document_processing_total,
  docstring: 'Total documents processed',
  labels: [:status]
)

# Sidekiq queue metrics
SIDEKIQ_QUEUE_SIZE = PROMETHEUS.gauge(
  :sidekiq_queue_size,
  docstring: 'Current Sidekiq queue size',
  labels: [:queue]
)
```
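The document-processing metrics above are declared here, but their call sites aren't shown in this part; a minimal sketch of how a Sidekiq job might record them (job name and internals are illustrative) is:

```ruby
# app/jobs/document_processing_job.rb — illustrative instrumentation only
class DocumentProcessingJob
  include Sidekiq::Job
  sidekiq_options queue: 'document_processing'

  def perform(document_id)
    start_time = Time.current
    process_document(document_id) # actual pipeline not shown here
    record(start_time, 'success')
  rescue StandardError
    record(start_time, 'error')
    raise
  end

  private

  def record(start_time, status)
    DOCUMENT_PROCESSING_DURATION.observe(Time.current - start_time, labels: { status: status })
    DOCUMENT_PROCESSING_TOTAL.increment(labels: { status: status })
  end
end
```

A caveat on labels: tagging every metric with `tenant_id` creates one time series per tenant per label combination. At 150 tenants this is manageable, but at thousands it can strain Prometheus, so consider bucketing tenants or dropping the label on high-cardinality metrics.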

Instrumenting Vector Search


```ruby
# app/services/vector_indexing_service.rb
def search(tenant_id:, query:, filters: {}, limit: 20)
  start_time = Time.current

  begin
    # Check the cache first (the key must include the tenant,
    # or cached results would leak across tenants)
    cache_key = generate_cache_key(tenant_id, query, filters, limit)
    cached_result = Rails.cache.read(cache_key)

    if cached_result
      VECTOR_SEARCH_CACHE_HIT.increment(labels: { tenant_id: tenant_id })
      return cached_result
    end

    # Perform the search with resilience
    limiter = RateLimiterRegistry.for_tenant(tenant_id)
    breaker = CircuitBreakerRegistry.for_tenant(tenant_id)

    result = limiter.acquire do
      breaker.call(fallback: -> { fallback_search(tenant_id, query, limit) }) do
        perform_vector_search(query, filters, limit, tenant_id)
      end
    end

    # Cache the result
    Rails.cache.write(cache_key, result, expires_in: 1.hour)

    # Record success
    duration = Time.current - start_time
    VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'success' })
    VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'success' })

    result
  rescue StandardError
    # Record failure, then re-raise
    duration = Time.current - start_time
    VECTOR_SEARCH_DURATION.observe(duration, labels: { tenant_id: tenant_id, status: 'error' })
    VECTOR_SEARCH_TOTAL.increment(labels: { tenant_id: tenant_id, status: 'error' })

    raise
  end
end
```
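The `generate_cache_key` helper isn't shown in the post. A possible implementation (hypothetical) that normalizes the filters and scopes keys to the tenant:

```ruby
# A possible generate_cache_key helper (hypothetical; not shown in the post).
# Hashing keeps keys short for arbitrary queries, and the tenant prefix
# keeps cached results tenant-scoped.
require 'digest'
require 'json'

def generate_cache_key(tenant_id, query, filters, limit)
  digest = Digest::SHA256.hexdigest(
    { query: query, filters: filters.sort_by { |k, _| k.to_s }.to_h, limit: limit }.to_json
  )
  "vector_search:tenant_#{tenant_id}:#{digest}"
end
```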

Metrics Endpoint


```ruby
# config/routes.rb
Rails.application.routes.draw do
  get '/metrics', to: 'metrics#show'
end

# app/controllers/metrics_controller.rb
class MetricsController < ApplicationController
  skip_before_action :authenticate_user!

  def show
    # Update circuit breaker metrics
    CircuitBreakerRegistry.stats.each do |name, stats|
      state_value = case stats[:state]
                    when :closed then 0
                    when :open then 1
                    when :half_open then 2
                    end

      CIRCUIT_BREAKER_STATE.set(state_value, labels: { tenant_id: name })
    end

    # Update Sidekiq queue metrics
    Sidekiq::Queue.all.each do |queue|
      SIDEKIQ_QUEUE_SIZE.set(queue.size, labels: { queue: queue.name })
    end

    # Return Prometheus text format
    render plain: Prometheus::Client::Formats::Text.marshal(PROMETHEUS),
           content_type: Prometheus::Client::Formats::Text::CONTENT_TYPE
  end
end
```

Grafana Dashboards

Key Panels

*1. Search Performance*


# P50 latency

histogram_quantile(0.50,

rate(vector_search_duration_seconds_bucket[5m]))



# P95 latency

histogram_quantile(0.95,

rate(vector_search_duration_seconds_bucket[5m]))



# P99 latency

histogram_quantile(0.99,

rate(vector_search_duration_seconds_bucket[5m]))

Enter fullscreen mode Exit fullscreen mode

*2. Throughput*


```
# Requests per second (all tenants)
sum(rate(vector_search_total[1m]))

# By tenant
sum(rate(vector_search_total[5m])) by (tenant_id)
```

*3. Cache Hit Rate*


```
# Cache hit rate percentage
# (cache hits return early and are not counted in vector_search_total,
# so the denominator adds the two series together)
100 * (
  sum(rate(vector_search_cache_hit_total[5m])) /
  (sum(rate(vector_search_cache_hit_total[5m])) + sum(rate(vector_search_total[5m])))
)
```

*4. Circuit Breaker Status*


```
# Count of open circuits
count(circuit_breaker_state == 1)

# Circuit state by tenant
circuit_breaker_state
```

*5. Error Rate*


```
# Error percentage
100 * (
  sum(rate(vector_search_total{status="error"}[5m])) /
  sum(rate(vector_search_total[5m]))
)
```

Dashboard JSON

See the full dashboard configuration in the Vectra repo: `examples/grafana-dashboard.json`.


Production Metrics (Real Numbers)

After implementing resilience patterns:


```
Search Latency:
- P50: 45ms (cached: 3ms)
- P95: 120ms (cached: 8ms)
- P99: 250ms

Availability:
- Uptime: 99.94% (exceeding 99.9% SLA)
- MTTR: 8 minutes average
- Longest incident: 45 minutes (Qdrant upgrade)

Circuit Breaker Stats:
- Opened: 12 times in 2024
- Average recovery time: 2.5 minutes
- Prevented outages: 8 incidents

Cache Performance:
- Hit rate: 42%
- Memory usage: 2.1GB
- Evictions: 145/day

Rate Limiting:
- Triggered: 234 times/month
- Top abuser: 1,847 requests in 1 minute
- No successful DoS attacks
```

What We've Covered

✅ *Circuit Breakers* - Prevented 8 outages in 2024

✅ *Rate Limiting* - Per-tenant isolation, no DoS attacks

✅ *Health Checks* - Kubernetes-ready monitoring

✅ *Prometheus Metrics* - Real-time observability

✅ *Grafana Dashboards* - Beautiful visualizations


Coming in Part 3: Cost Optimization 💰

We've built a resilient system. Now let's talk money.

In *Part 3* (Friday), we'll cover:

  • *Infrastructure Cost Breakdown* - $600/month vs $3,200/month (81% savings)

  • *Cache Optimization* - 42% hit rate = 70% latency reduction

  • *Chunking Strategy* - Why 512 tokens with 128 overlap beats 1000 tokens

  • *Self-Hosting Embeddings* - $200/month savings vs OpenAI

  • *5 Key Lessons Learned* - What worked, what didn't

*Spoiler:* The biggest savings came from caching, not from switching providers.


Resources

  • *Vectra Gem*: github.com/stokry/vectra

  • *Grafana Dashboard*: examples/grafana-dashboard.json

  • *Prometheus Exporter*: examples/prometheus-exporter.rb

Questions about production resilience? Share your own war stories in the comments!
