This is Part 1 of a 3-part series on building production-ready vector search for enterprise SaaS.
- Part 1: Architecture & Implementation 👈 You are here
- Part 2: Production Resilience & Monitoring (Coming Wednesday)
- Part 3: Cost Optimization & Lessons Learned (Coming Friday)
TL;DR: Deep dive into building an enterprise SaaS platform that runs vector search across 2M+ compliance documents (growing by ~50k/month). This part covers the architecture, the Rails implementation, and multi-tenant isolation patterns.
The Business Problem
The scenario: An enterprise SaaS platform serving Fortune 500 companies managing regulatory compliance - banks, healthcare, and pharma companies dealing with complex regulatory requirements.
The pain point: Organizations receive thousands of regulatory documents monthly (SEC filings, FDA guidelines, ISO standards, GDPR updates). Compliance teams spend 60+ hours/week manually searching through PDFs to find relevant sections.
The challenge: Build AI-powered semantic search that understands regulatory language and returns precise results across millions of documents.
The constraints:
- 150+ enterprise clients (multi-tenant architecture)
- 2.1M documents total, growing 50k/month
- Average document: 200 pages, 500KB
- SOC2 + GDPR compliant (audit logs, data isolation)
- 99.9% uptime SLA ($10k/hour penalties)
- Budget: $2,500/month for infrastructure
The Architecture: What We Built
High-Level Overview
Key Components:
- Rails API - Multi-tenant document processing
- Vectra - Unified vector DB client (provider-agnostic)
- Qdrant - Self-hosted vector database (cost + GDPR compliance)
- Sidekiq - Background job processing
- Sentence-Transformers - Self-hosted embedding generation
Multi-Tenant Document Processing
The Challenge: Processing PDFs at Scale
Input: Client uploads a 300-page PDF (SEC 10-K filing)
Requirements:
- Extract text from PDF
- Split into searchable chunks (with overlap for context)
- Generate embeddings for each chunk
- Index with tenant isolation
- Track processing status
- Audit trail for compliance
The Implementation
1. Document Model (Rails)
# app/models/document.rb
class Document < ApplicationRecord
belongs_to :tenant
has_many :document_chunks, dependent: :destroy
# Document-level metadata stored as serialized JSON
serialize :metadata, JSON
enum status: {
pending: 0,
processing: 1,
indexed: 2,
failed: 3
}
# Validations
validates :title, :file_url, presence: true
validates :tenant_id, presence: true
# Callbacks
after_create :schedule_processing
def schedule_processing
DocumentProcessingJob.perform_later(id)
end
end
Why this model?
- tenant_id ensures every document belongs to a tenant
- status enum tracks the processing pipeline
- document_chunks stores the split chunks with their embeddings (a sketch of that model follows below)
- A background job keeps the upload request fast (< 300ms)
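The original post never shows the DocumentChunk model that the processing job creates. A minimal sketch of what it could look like, assuming the embedding is stored as a serialized JSON array of floats (column names match the job code below); this is an illustration, not the author's actual schema:
# app/models/document_chunk.rb (hypothetical sketch, not shown in the original post)
class DocumentChunk < ApplicationRecord
  belongs_to :document

  # Embedding is a plain Float array (768 values for all-mpnet-base-v2);
  # serializing it as JSON keeps this sketch database-agnostic.
  serialize :embedding, JSON

  validates :text, :position, presence: true
end

# Matching migration sketch
class CreateDocumentChunks < ActiveRecord::Migration[7.0]
  def change
    create_table :document_chunks do |t|
      t.references :document, null: false, foreign_key: true
      t.text :text, null: false
      t.integer :position, null: false
      t.integer :page_start
      t.integer :page_end
      t.text :embedding # serialized JSON array of floats
      t.timestamps
    end
  end
end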
2. Chunk Splitting Strategy
Problem: You can't embed an entire 300-page document. You need to split it intelligently.
Solution: Sliding window with overlap:
# app/services/document_chunker.rb
class DocumentChunker
CHUNK_SIZE = 512 # tokens (~400 words)
CHUNK_OVERLAP = 128 # tokens for context continuity
def initialize(document)
@document = document
@text = extract_text_from_pdf(document.file_url)
end
def split
# Tokenize text
tokens = tokenize(@text)
chunks = []
# Sliding window
position = 0
while position < tokens.length
chunk_tokens = tokens[position...(position + CHUNK_SIZE)]
# Convert back to text
chunk_text = detokenize(chunk_tokens)
chunks << {
text: chunk_text,
position: position,
page_start: calculate_page(position),
page_end: calculate_page(position + CHUNK_SIZE)
}
# Slide window with overlap
position += (CHUNK_SIZE - CHUNK_OVERLAP)
end
chunks
end
private
def extract_text_from_pdf(file_url)
# Using the pdf-reader gem. PDF::Reader expects a local path or IO,
# so remote files are fetched first (requires open-uri).
require 'open-uri'
pdf = PDF::Reader.new(URI.open(file_url))
pdf.pages.map(&:text).join("\n")
end
def tokenize(text)
# Simple whitespace tokenization (use tiktoken for production)
text.split(/\s+/)
end
def detokenize(tokens)
# Inverse of the whitespace tokenizer above
tokens.join(' ')
end
def calculate_page(token_position)
# Average 300 tokens per page
(token_position / 300.0).ceil
end
end
Why overlap? Without it, search misses context at chunk boundaries. With 128-token overlap, we improved precision from 0.73 → 0.89.
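To make the overlap concrete: with CHUNK_SIZE = 512 and CHUNK_OVERLAP = 128, the window advances 384 tokens per iteration, so consecutive chunks share 128 tokens at their boundary:
# Window positions with CHUNK_SIZE = 512 and CHUNK_OVERLAP = 128
# (step = 512 - 128 = 384 tokens per iteration)
#
#   chunk 1: tokens    0..511
#   chunk 2: tokens  384..895    (tokens 384..511 repeat from chunk 1)
#   chunk 3: tokens  768..1279   (tokens 768..895 repeat from chunk 2)
chunks = DocumentChunker.new(document).split
chunks.first(3).map { |c| c[:position] } # => [0, 384, 768]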
3. Background Processing Job
# app/jobs/document_processing_job.rb
class DocumentProcessingJob < ApplicationJob
queue_as :document_processing
# Retry with exponential backoff
retry_on StandardError, wait: :exponentially_longer, attempts: 5
def perform(document_id)
document = Document.find(document_id)
document.update!(status: :processing)
# Step 1: Split document into chunks
chunker = DocumentChunker.new(document)
chunks = chunker.split
# Step 2: Generate embeddings for each chunk
embedder = EmbeddingService.new
chunk_records = []
chunks.each_with_index do |chunk, idx|
embedding = embedder.generate(chunk[:text])
chunk_record = DocumentChunk.create!(
document: document,
text: chunk[:text],
position: chunk[:position],
page_start: chunk[:page_start],
page_end: chunk[:page_end],
embedding: embedding
)
chunk_records << chunk_record
end
# Step 3: Batch index to vector DB (tenant-isolated)
VectorIndexingService.new.index_document(document, chunk_records)
# Step 4: Mark as indexed
document.update!(
status: :indexed,
chunk_count: chunks.size,
indexed_at: Time.current
)
# Step 5: Audit log
AuditLog.create!(
tenant_id: document.tenant_id,
event_type: 'document_indexed',
resource_type: 'Document',
resource_id: document.id,
metadata: {
chunk_count: chunks.size,
file_size: document.file_size,
duration_ms: (Time.current - document.created_at) * 1000
}
)
rescue StandardError => e
document.update!(status: :failed, error_message: e.message)
Sentry.capture_exception(e, extra: { document_id: document.id })
raise
end
end
Performance:
- Processes 15 documents/minute
- Average processing time: 4 seconds per document
- Failure rate: 0.3% (mostly PDF parsing issues)
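The status enum on Document gives you the usual Rails scopes for free, which is a cheap way to keep an eye on that 0.3% failure rate. A quick console sketch:
# Rails enums generate these scopes from the status keys defined on Document
Document.failed.count                                     # documents that raised during processing
Document.processing.where('updated_at < ?', 1.hour.ago)  # possibly stuck jobs
Document.indexed.count.to_f / Document.count              # overall success ratio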
4. Embedding Service (Self-Hosted)
Why self-hosted? GDPR compliance - we can't send client data to OpenAI.
# app/services/embedding_service.rb
class EmbeddingService
EMBEDDING_ENDPOINT = ENV.fetch('EMBEDDING_SERVICE_URL', 'http://localhost:8080')
MODEL = 'all-mpnet-base-v2' # 768 dimensions
def generate(text)
# Faraday connection with JSON handling and explicit timeouts; failures raised
# here are retried at the job level (retry_on in DocumentProcessingJob)
response = Faraday.new(url: EMBEDDING_ENDPOINT) do |f|
f.request :json
f.response :json
f.adapter Faraday.default_adapter
f.options.timeout = 30
f.options.open_timeout = 10
end.post('/embeddings', {
text: text,
model: MODEL
})
raise "Embedding failed: #{response.status}" unless response.success?
response.body['embedding']
rescue Faraday::Error => e
Rails.logger.error("Embedding service error: #{e.message}")
raise
end
end
Cost savings: Self-hosting saves $200/month vs OpenAI API (2M embeddings/month).
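Usage is a single call per chunk. With all-mpnet-base-v2 the service returns a 768-element float array; the exact shape of the return value is an assumption based on how the client reads response.body['embedding'] above:
embedding = EmbeddingService.new.generate("GDPR data retention requirements for financial records")
embedding.length   # => 768
embedding.first(3) # => e.g. [0.0132, -0.0841, 0.0279]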
Vector Search with Multi-Tenant Isolation
The Challenge: Tenant Data Isolation
Critical requirement: Client A must NEVER see Client B's documents.
Approach: Qdrant namespaces + application-level verification (defense in depth)
# app/services/vector_indexing_service.rb
class VectorIndexingService
def initialize
@client = build_client
end
# Index entire document (batched)
def index_document(document, chunk_records)
vectors = chunk_records.map do |chunk|
{
id: vector_id(chunk),
values: chunk.embedding,
metadata: {
document_id: document.id,
tenant_id: document.tenant_id,
title: document.title,
page_start: chunk.page_start,
page_end: chunk.page_end,
chunk_text: chunk.text[0..500], # Preview only
indexed_at: Time.current.iso8601
}
}
end
# Batch upsert with tenant namespace
batch = Vectra::Batch.new(@client)
result = batch.upsert_async(
index: 'compliance_documents',
vectors: vectors,
namespace: namespace_for_tenant(document.tenant_id),
concurrency: 4
)
Rails.logger.info(
"Indexed document #{document.id} for tenant #{document.tenant_id}: " \
"#{result[:success]} chunks"
)
result
end
# Search within tenant (isolated)
def search(tenant_id:, query:, filters: {}, limit: 20)
# Generate query embedding
query_embedding = EmbeddingService.new.generate(query)
# Ensure tenant isolation
namespace = namespace_for_tenant(tenant_id)
# Query vector DB
results = @client.query(
index: 'compliance_documents',
vector: query_embedding,
top_k: limit,
namespace: namespace, # CRITICAL: tenant isolation
filter: filters,
include_metadata: true
)
# Verify tenant_id in results (defense in depth)
verified_results = results.select do |match|
match.metadata['tenant_id'] == tenant_id
end
# Log potential security issue
if verified_results.size != results.size
SecurityAlert.create!(
severity: 'critical',
message: "Tenant isolation breach detected",
details: {
tenant_id: tenant_id,
expected: results.size,
verified: verified_results.size
}
)
end
verified_results
end
private
def build_client
# Cached client with resilience
base_client = Vectra.qdrant(
host: ENV.fetch('QDRANT_HOST'),
api_key: ENV['QDRANT_API_KEY'],
timeout: 10,
max_retries: 3
)
# Add caching layer
cache = Vectra::Cache.new(
ttl: 3600, # 1 hour for search results
max_size: 5000 # Top 5000 queries cached
)
Vectra::CachedClient.new(base_client, cache: cache)
end
def namespace_for_tenant(tenant_id)
"tenant_#{tenant_id}"
end
def vector_id(chunk)
"chunk_#{chunk.id}"
end
end
Security Best Practice: Never Trust, Always Verify
# NEVER trust the namespace alone
results = client.query(namespace: "tenant_#{tenant_id}", ...)
# ALWAYS verify in application
verified = results.select { |r| r.metadata['tenant_id'] == tenant_id }
if verified.size != results.size
# SECURITY BREACH - alert immediately
SecurityAlert.critical!("Tenant isolation breach detected")
end
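This guarantee is also worth pinning down in a spec. A hedged sketch of how that could look; the factories and the index_and_wait helper are hypothetical, not from the original codebase:
# spec/services/vector_indexing_service_spec.rb (hypothetical sketch)
RSpec.describe VectorIndexingService do
  it "never returns another tenant's chunks" do
    tenant_a = create(:tenant)
    tenant_b = create(:tenant)

    # index_and_wait is a hypothetical helper that runs DocumentProcessingJob inline
    index_and_wait(create(:document, tenant: tenant_a, title: "SEC 10-K"))
    index_and_wait(create(:document, tenant: tenant_b, title: "FDA guidance"))

    results = described_class.new.search(tenant_id: tenant_a.id, query: "regulatory risk")

    expect(results.map { |r| r.metadata['tenant_id'] }.uniq).to eq([tenant_a.id])
  end
end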
Search API with Enterprise Features
The Controller
# app/controllers/api/v1/search_controller.rb
module Api
module V1
class SearchController < ApiController
before_action :authenticate_user!
before_action :rate_limit_check
# POST /api/v1/search
def create
query = params.require(:query)
# Validate query
if query.blank? || query.length < 3
return render json: { error: 'Query too short' }, status: :bad_request
end
# Build filters from params
filters = build_filters(params[:filters])
# Perform search (with timing)
start_time = Time.current
results = VectorIndexingService.new.search(
tenant_id: current_tenant.id,
query: query,
filters: filters,
limit: (params[:limit] || 20).to_i
)
duration_ms = ((Time.current - start_time) * 1000).round(2)
# Hydrate results (load Document records)
documents = hydrate_results(results)
# Log search for analytics
SearchLog.create!(
tenant_id: current_tenant.id,
user_id: current_user.id,
query: query,
result_count: results.size,
duration_ms: duration_ms,
filters: filters
)
# Audit log for compliance
AuditLog.create!(
tenant_id: current_tenant.id,
user_id: current_user.id,
event_type: 'document_search',
metadata: {
query: query,
result_count: results.size,
duration_ms: duration_ms
}
)
render json: {
results: documents.map { |doc| DocumentSerializer.new(doc).as_json },
metadata: {
total: results.size,
duration_ms: duration_ms,
cached: cache_hit?(results)
}
}
rescue Vectra::Error => e
# Handle vector DB errors gracefully
Rails.logger.error("Vector search error: #{e.message}")
Sentry.capture_exception(e)
# Fallback to SQL search
fallback_results = fallback_search(query)
render json: {
results: fallback_results,
metadata: {
fallback: true,
error: 'Vector search unavailable'
}
}, status: :partial_content
end
private
def build_filters(filter_params)
return {} unless filter_params.present?
filters = {}
filters[:document_type] = filter_params[:document_type] if filter_params[:document_type]
filters[:year] = filter_params[:year].to_i if filter_params[:year]
filters[:regulatory_body] = filter_params[:regulatory_body] if filter_params[:regulatory_body]
filters
end
def hydrate_results(results)
# Extract document IDs from vector search results
document_ids = results.map { |r| r.metadata['document_id'] }.uniq
# Load documents from DB
documents = Document.where(id: document_ids, tenant_id: current_tenant.id)
.index_by(&:id)
# Attach scores to documents
results.map do |match|
doc = documents[match.metadata['document_id']]
next unless doc
doc.instance_variable_set(:@search_score, match.score)
doc.instance_variable_set(:@matched_chunk, match.metadata['chunk_text'])
doc.instance_variable_set(:@matched_pages, "#{match.metadata['page_start']}-#{match.metadata['page_end']}")
doc.define_singleton_method(:search_score) { @search_score }
doc.define_singleton_method(:matched_chunk) { @matched_chunk }
doc.define_singleton_method(:matched_pages) { @matched_pages }
doc
end.compact
end
def rate_limit_check
# 100 searches per hour per user
key = "search_rate_limit:#{current_user.id}"
count = Rails.cache.increment(key, 1, expires_in: 1.hour)
if count > 100
render json: { error: 'Rate limit exceeded' }, status: :too_many_requests
end
end
end
end
end
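Two private helpers referenced above, fallback_search and cache_hit?, are not shown in the post. A minimal sketch of what they might look like, assuming a simple tenant-scoped SQL fallback and a cached client that can report cache hits; both are assumptions, not the author's implementation:
# Hypothetical helpers for Api::V1::SearchController (not from the original post)
def fallback_search(query)
  # Plain SQL fallback scoped to the current tenant; far less precise than
  # vector search, but keeps the endpoint usable when Qdrant is down.
  pattern = "%#{Document.sanitize_sql_like(query)}%"
  Document.where(tenant_id: current_tenant.id)
          .where("title ILIKE ?", pattern)
          .limit(20)
          .map { |doc| DocumentSerializer.new(doc).as_json }
end

def cache_hit?(results)
  # Assumes the cached Vectra client can flag results it served from cache;
  # adjust to however your client actually exposes this.
  results.respond_to?(:from_cache?) && results.from_cache?
end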
API Response Example:
{
"results": [
{
"id": 1234,
"title": "SEC 10-K Filing 2024",
"search_score": 0.89,
"matched_chunk": "The company faces significant regulatory risks...",
"matched_pages": "45-46"
}
],
"metadata": {
"total": 10,
"duration_ms": 45,
"cached": false
}
}
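For completeness, here is a request that produces a response like the one above. Parameter names match the controller; the host and token are placeholders:
# Calling POST /api/v1/search with Faraday (host and token are placeholders)
conn = Faraday.new(url: 'https://compliance.example.com') do |f|
  f.request :json
  f.response :json
end

response = conn.post('/api/v1/search') do |req|
  req.headers['Authorization'] = 'Bearer <api-token>'
  req.body = {
    query: 'regulatory risks in 10-K filings',
    filters: { document_type: 'sec_filing', year: 2024 },
    limit: 10
  }
end

response.body['results'].first['title'] # => "SEC 10-K Filing 2024"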
What We've Built So Far
✅ Multi-tenant document processing pipeline
- PDF → Text extraction
- Intelligent chunking with overlap
- Self-hosted embeddings (GDPR compliant)
- Background job processing
✅ Secure vector search
- Qdrant namespace isolation
- Application-level verification
- Audit logging
- Graceful fallback to SQL
✅ Production-ready API
- Fast response times (45ms P50)
- Comprehensive error handling
- Rate limiting
- Search analytics
Coming in Part 2: Production Resilience 🛡️
Building the search feature is only half the battle. In Part 2 (Wednesday), we'll cover:
- Circuit Breakers - How they saved us during Black Friday when Qdrant overloaded
- Rate Limiting - Per-tenant throttling to prevent abuse
- Health Checks - Kubernetes-ready monitoring
- Prometheus Metrics - Real-time observability
- Grafana Dashboards - Visualizing search performance
Real incident story: Our Qdrant cluster hit 98% CPU during a traffic spike. Without circuit breakers, we would have had a complete outage. Instead, 99.2% of searches still worked using cached results and PostgreSQL fallback.
Resources
- Vectra Gem: github.com/stokry/vectra
- Documentation: vectra-docs.netlify.app
- Example Code: examples/comprehensive_demo.rb
Questions about the architecture or implementation? Drop a comment below!
