Building Enterprise Vector Search in Rails (Part 1/3): Architecture & Multi-Tenant Implementation

This is Part 1 of a 3-part series on building production-ready vector search for enterprise SaaS.

  • Part 1: Architecture & Implementation 👈 You are here
  • Part 2: Production Resilience & Monitoring (Coming Wednesday)
  • Part 3: Cost Optimization & Lessons Learned (Coming Friday)

TL;DR: Deep dive into building an enterprise SaaS platform that processes 2M+ compliance documents monthly using vector search. This part covers the architecture, Rails implementation, and multi-tenant isolation patterns.


The Business Problem

The scenario: An enterprise SaaS platform serving Fortune 500 companies - banks, healthcare providers, and pharma firms - that must manage complex regulatory compliance requirements.

The pain point: Organizations receive thousands of regulatory documents monthly (SEC filings, FDA guidelines, ISO standards, GDPR updates). Compliance teams spend 60+ hours/week manually searching through PDFs to find relevant sections.

The challenge: Build AI-powered semantic search that understands regulatory language and returns precise results across millions of documents.

The constraints:

  • 150+ enterprise clients (multi-tenant architecture)
  • 2.1M documents total, growing 50k/month
  • Average document: 200 pages, 500KB
  • SOC2 + GDPR compliant (audit logs, data isolation)
  • 99.9% uptime SLA ($10k/hour penalties)
  • Budget: $2,500/month for infrastructure

The Architecture: What We Built

High-Level Overview

Key Components:

  • Rails API - Multi-tenant document processing
  • Vectra - Unified vector DB client (provider-agnostic)
  • Qdrant - Self-hosted vector database (cost + GDPR compliance)
  • Sidekiq - Background job processing
  • Sentence-Transformers - Self-hosted embedding generation

Multi-Tenant Document Processing

The Challenge: Processing PDFs at Scale

Input: Client uploads a 300-page PDF (SEC 10-K filing)

Requirements:

  • Extract text from PDF
  • Split into searchable chunks (with overlap for context)
  • Generate embeddings for each chunk
  • Index with tenant isolation
  • Track processing status
  • Audit trail for compliance

The Implementation

1. Document Model (Rails)

# app/models/document.rb
class Document < ApplicationRecord
  belongs_to :tenant
  has_many :document_chunks, dependent: :destroy

  # Store flexible document metadata as JSON
  serialize :metadata, JSON

  enum status: {
    pending: 0,
    processing: 1,
    indexed: 2,
    failed: 3
  }

  # Validations
  validates :title, :file_url, presence: true
  validates :tenant_id, presence: true

  # Callbacks
  after_create :schedule_processing

  def schedule_processing
    DocumentProcessingJob.perform_later(id)
  end
end

Why this model?

  • tenant_id ensures every document belongs to a tenant
  • status enum tracks processing pipeline
  • document_chunks stores split chunks with embeddings (see the schema sketch below)
  • Background job keeps upload fast (< 300ms)
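
For readers who want the backing schema, here is a minimal migration sketch covering both tables. The columns mirror the attributes referenced throughout this post; the filename, migration version, and exact column types are illustrative, and the serialized embedding column is only a convenience copy since the authoritative vectors live in Qdrant.

# db/migrate/20240101000000_create_documents_and_chunks.rb (illustrative sketch)
class CreateDocumentsAndChunks < ActiveRecord::Migration[7.1]
  def change
    create_table :documents do |t|
      t.references :tenant, null: false, index: true
      t.string   :title, null: false
      t.string   :file_url, null: false
      t.integer  :status, null: false, default: 0   # pending / processing / indexed / failed
      t.integer  :chunk_count
      t.integer  :file_size
      t.string   :error_message
      t.text     :metadata                           # serialized JSON (see Document model)
      t.datetime :indexed_at
      t.timestamps
    end

    create_table :document_chunks do |t|
      t.references :document, null: false, index: true
      t.text    :text, null: false
      t.integer :position
      t.integer :page_start
      t.integer :page_end
      t.text    :embedding                           # serialized float array; canonical vectors live in Qdrant
      t.timestamps
    end
  end
end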

2. Chunk Splitting Strategy

Problem: You can't embed an entire 300-page document. You need to split it intelligently.

Solution: Sliding window with overlap:

# app/services/document_chunker.rb
class DocumentChunker
  CHUNK_SIZE = 512        # tokens (~400 words)
  CHUNK_OVERLAP = 128     # tokens for context continuity

  def initialize(document)
    @document = document
    @text = extract_text_from_pdf(document.file_url)
  end

  def split
    # Tokenize text
    tokens = tokenize(@text)
    chunks = []

    # Sliding window
    position = 0
    while position < tokens.length
      chunk_tokens = tokens[position...(position + CHUNK_SIZE)]

      # Convert back to text
      chunk_text = detokenize(chunk_tokens)

      chunks << {
        text: chunk_text,
        position: position,
        page_start: calculate_page(position),
        page_end: calculate_page(position + CHUNK_SIZE)
      }

      # Slide window with overlap
      position += (CHUNK_SIZE - CHUNK_OVERLAP)
    end

    chunks
  end

  private

  def extract_text_from_pdf(file_url)
    # Using pdf-reader gem (expects a local path or IO, so download remote files first)
    pdf = PDF::Reader.new(file_url)
    pdf.pages.map(&:text).join("\n")
  end

  def tokenize(text)
    # Simple whitespace tokenization (use tiktoken for production)
    text.split(/\s+/)
  end

  def detokenize(tokens)
    # Inverse of the whitespace tokenizer above
    tokens.join(' ')
  end

  def calculate_page(token_position)
    # Average 300 tokens per page
    (token_position / 300.0).ceil
  end
end

Why overlap? Without it, search misses context at chunk boundaries. With 128-token overlap, we improved precision from 0.73 → 0.89.
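
To make the window arithmetic concrete, here is a tiny irb-style sketch of the boundaries the splitter produces (token offsets, using the constants above):

# Sliding window used by DocumentChunker
chunk_size = 512
overlap    = 128
step       = chunk_size - overlap   # window advances 384 tokens per slide

3.times.map { |i| (i * step)...(i * step + chunk_size) }
# => [0...512, 384...896, 768...1280]
# Each chunk repeats the last 128 tokens of the previous one,
# so a sentence straddling a boundary still appears intact in at least one chunk.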

3. Background Processing Job

# app/jobs/document_processing_job.rb
class DocumentProcessingJob < ApplicationJob
  queue_as :document_processing

  # Retry with exponential backoff
  retry_on StandardError, wait: :exponentially_longer, attempts: 5

  def perform(document_id)
    document = Document.find(document_id)
    document.update!(status: :processing)

    # Step 1: Split document into chunks
    chunker = DocumentChunker.new(document)
    chunks = chunker.split

    # Step 2: Generate embeddings for each chunk
    embedder = EmbeddingService.new
    chunk_records = []

    chunks.each_with_index do |chunk, idx|
      embedding = embedder.generate(chunk[:text])

      chunk_record = DocumentChunk.create!(
        document: document,
        text: chunk[:text],
        position: chunk[:position],
        page_start: chunk[:page_start],
        page_end: chunk[:page_end],
        embedding: embedding
      )

      chunk_records << chunk_record
    end

    # Step 3: Batch index to vector DB (tenant-isolated)
    VectorIndexingService.new.index_document(document, chunk_records)

    # Step 4: Mark as indexed
    document.update!(
      status: :indexed,
      chunk_count: chunks.size,
      indexed_at: Time.current
    )

    # Step 5: Audit log
    AuditLog.create!(
      tenant_id: document.tenant_id,
      event_type: 'document_indexed',
      resource_type: 'Document',
      resource_id: document.id,
      metadata: {
        chunk_count: chunks.size,
        file_size: document.file_size,
        duration_ms: (Time.current - document.created_at) * 1000
      }
    )

  rescue StandardError => e
    document.update!(status: :failed, error_message: e.message)
    Sentry.capture_exception(e, extra: { document_id: document.id })
    raise
  end
end

Performance:

  • Processes 15 documents/minute
  • Average processing time: 4 seconds per document
  • Failure rate: 0.3% (mostly PDF parsing issues)

4. Embedding Service (Self-Hosted)

Why self-hosted? GDPR compliance - we can't send client data to OpenAI.

# app/services/embedding_service.rb
class EmbeddingService
  EMBEDDING_ENDPOINT = ENV.fetch('EMBEDDING_SERVICE_URL', 'http://localhost:8080')
  MODEL = 'all-mpnet-base-v2'  # 768 dimensions

  def generate(text)
    # Add retry logic
    response = Faraday.new(url: EMBEDDING_ENDPOINT) do |f|
      f.request :json
      f.response :json
      f.adapter Faraday.default_adapter
      f.options.timeout = 30
      f.options.open_timeout = 10
    end.post('/embeddings', {
      text: text,
      model: MODEL
    })

    raise "Embedding failed: #{response.status}" unless response.success?

    response.body['embedding']
  rescue Faraday::Error => e
    Rails.logger.error("Embedding service error: #{e.message}")
    raise
  end
end

Cost savings: Self-hosting saves $200/month vs OpenAI API (2M embeddings/month).
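
For reference, a quick usage sketch of the service above; the query text is just an example and the float values are placeholders, but the 768-element shape follows from the all-mpnet-base-v2 model:

embedding = EmbeddingService.new.generate("GDPR Article 30 record-keeping obligations")

embedding.size     # => 768 (all-mpnet-base-v2 dimensionality)
embedding.first(3) # => e.g. [0.0123, -0.0456, 0.0789]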


Vector Search with Multi-Tenant Isolation

The Challenge: Tenant Data Isolation

Critical requirement: Client A must NEVER see Client B's documents.

Approach: Qdrant namespaces + application-level verification (defense in depth)

# app/services/vector_indexing_service.rb
class VectorIndexingService
  def initialize
    @client = build_client
  end

  # Index entire document (batched)
  def index_document(document, chunk_records)
    vectors = chunk_records.map do |chunk|
      {
        id: vector_id(chunk),
        values: chunk.embedding,
        metadata: {
          document_id: document.id,
          tenant_id: document.tenant_id,
          title: document.title,
          page_start: chunk.page_start,
          page_end: chunk.page_end,
          chunk_text: chunk.text[0..500], # Preview only
          indexed_at: Time.current.iso8601
        }
      }
    end

    # Batch upsert with tenant namespace
    batch = Vectra::Batch.new(@client)
    result = batch.upsert_async(
      index: 'compliance_documents',
      vectors: vectors,
      namespace: namespace_for_tenant(document.tenant_id),
      concurrency: 4
    )

    Rails.logger.info(
      "Indexed document #{document.id} for tenant #{document.tenant_id}: " \
      "#{result[:success]} chunks"
    )

    result
  end

  # Search within tenant (isolated)
  def search(tenant_id:, query:, filters: {}, limit: 20)
    # Generate query embedding
    query_embedding = EmbeddingService.new.generate(query)

    # Ensure tenant isolation
    namespace = namespace_for_tenant(tenant_id)

    # Query vector DB
    results = @client.query(
      index: 'compliance_documents',
      vector: query_embedding,
      top_k: limit,
      namespace: namespace,  # CRITICAL: tenant isolation
      filter: filters,
      include_metadata: true
    )

    # Verify tenant_id in results (defense in depth)
    verified_results = results.select do |match|
      match.metadata['tenant_id'] == tenant_id
    end

    # Log potential security issue
    if verified_results.size != results.size
      SecurityAlert.create!(
        severity: 'critical',
        message: "Tenant isolation breach detected",
        details: {
          tenant_id: tenant_id,
          expected: results.size,
          verified: verified_results.size
        }
      )
    end

    verified_results
  end

  private

  def build_client
    # Cached client with resilience
    base_client = Vectra.qdrant(
      host: ENV.fetch('QDRANT_HOST'),
      api_key: ENV['QDRANT_API_KEY'],
      timeout: 10,
      max_retries: 3
    )

    # Add caching layer
    cache = Vectra::Cache.new(
      ttl: 3600,      # 1 hour for search results
      max_size: 5000  # Top 5000 queries cached
    )

    Vectra::CachedClient.new(base_client, cache: cache)
  end

  def namespace_for_tenant(tenant_id)
    "tenant_#{tenant_id}"
  end

  def vector_id(chunk)
    "chunk_#{chunk.id}"
  end
end

Security Best Practice: Never Trust, Always Verify

# NEVER trust the namespace alone
results = client.query(namespace: "tenant_#{tenant_id}", ...)

# ALWAYS verify in application
verified = results.select { |r| r.metadata['tenant_id'] == tenant_id }

if verified.size != results.size
  # SECURITY BREACH - alert immediately
  SecurityAlert.critical!("Tenant isolation breach detected")
end
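
One way to keep this guarantee from regressing is a test that exercises the verification path. Below is a minimal RSpec sketch, assuming doubles that mirror the match shape used above; the stubbing of build_client, EmbeddingService, and SecurityAlert is test scaffolding, not production behavior.

# spec/services/vector_indexing_service_spec.rb (sketch)
RSpec.describe VectorIndexingService do
  it "drops any result whose metadata belongs to another tenant" do
    fake_client = double('vector client')
    allow_any_instance_of(described_class)
      .to receive(:build_client).and_return(fake_client)

    own_match   = double(score: 0.91, metadata: { 'tenant_id' => 42, 'document_id' => 1 })
    other_match = double(score: 0.88, metadata: { 'tenant_id' => 99, 'document_id' => 2 })
    allow(fake_client).to receive(:query).and_return([own_match, other_match])

    allow_any_instance_of(EmbeddingService)
      .to receive(:generate).and_return(Array.new(768, 0.0))
    allow(SecurityAlert).to receive(:create!)   # the breach-alert path fires in this scenario

    results = described_class.new.search(tenant_id: 42, query: "regulatory risk")

    expect(results).to eq([own_match])
    expect(SecurityAlert).to have_received(:create!)
  end
end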

Search API with Enterprise Features

The Controller

# app/controllers/api/v1/search_controller.rb
module Api
  module V1
    class SearchController < ApiController
      before_action :authenticate_user!
      before_action :rate_limit_check

      # POST /api/v1/search
      def create
        query = params.require(:query)

        # Validate query
        if query.blank? || query.length < 3
          return render json: { error: 'Query too short' }, status: :bad_request
        end

        # Build filters from params
        filters = build_filters(params[:filters])

        # Perform search (with timing)
        start_time = Time.current

        results = VectorIndexingService.new.search(
          tenant_id: current_tenant.id,
          query: query,
          filters: filters,
          limit: params[:limit] || 20
        )

        duration_ms = ((Time.current - start_time) * 1000).round(2)

        # Hydrate results (load Document records)
        documents = hydrate_results(results)

        # Log search for analytics
        SearchLog.create!(
          tenant_id: current_tenant.id,
          user_id: current_user.id,
          query: query,
          result_count: results.size,
          duration_ms: duration_ms,
          filters: filters
        )

        # Audit log for compliance
        AuditLog.create!(
          tenant_id: current_tenant.id,
          user_id: current_user.id,
          event_type: 'document_search',
          metadata: {
            query: query,
            result_count: results.size,
            duration_ms: duration_ms
          }
        )

        render json: {
          results: documents.map { |doc| DocumentSerializer.new(doc).as_json },
          metadata: {
            total: results.size,
            duration_ms: duration_ms,
            cached: cache_hit?(results) # helper (not shown) indicating whether the cached client answered this query
          }
        }
      rescue Vectra::Error => e
        # Handle vector DB errors gracefully
        Rails.logger.error("Vector search error: #{e.message}")
        Sentry.capture_exception(e)

        # Fallback to SQL search (fallback_search helper not shown in this post)
        fallback_results = fallback_search(query)

        render json: {
          results: fallback_results,
          metadata: {
            fallback: true,
            error: 'Vector search unavailable'
          }
        }, status: :partial_content
      end

      private

      def build_filters(filter_params)
        return {} unless filter_params.present?

        filters = {}
        filters[:document_type] = filter_params[:document_type] if filter_params[:document_type]
        filters[:year] = filter_params[:year].to_i if filter_params[:year]
        filters[:regulatory_body] = filter_params[:regulatory_body] if filter_params[:regulatory_body]
        filters
      end

      def hydrate_results(results)
        # Extract document IDs from vector search results
        document_ids = results.map { |r| r.metadata['document_id'] }.uniq

        # Load documents from DB
        documents = Document.where(id: document_ids, tenant_id: current_tenant.id)
                           .index_by(&:id)

        # Attach scores to documents
        results.map do |match|
          doc = documents[match.metadata['document_id']]
          next unless doc

          doc.instance_variable_set(:@search_score, match.score)
          doc.instance_variable_set(:@matched_chunk, match.metadata['chunk_text'])
          doc.instance_variable_set(:@matched_pages, "#{match.metadata['page_start']}-#{match.metadata['page_end']}")

          doc.define_singleton_method(:search_score) { @search_score }
          doc.define_singleton_method(:matched_chunk) { @matched_chunk }
          doc.define_singleton_method(:matched_pages) { @matched_pages }

          doc
        end.compact
      end

      def rate_limit_check
        # 100 searches per hour per user (assumes an atomic cache store such as Redis)
        key = "search_rate_limit:#{current_user.id}"
        count = Rails.cache.increment(key, 1, expires_in: 1.hour)

        if count > 100
          render json: { error: 'Rate limit exceeded' }, status: :too_many_requests
        end
      end
    end
  end
end

API Response Example:

{
  "results": [
    {
      "id": 1234,
      "title": "SEC 10-K Filing 2024",
      "search_score": 0.89,
      "matched_chunk": "The company faces significant regulatory risks...",
      "matched_pages": "45-46"
    }
  ],
  "metadata": {
    "total": 10,
    "duration_ms": 45,
    "cached": false
  }
}
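
For completeness, the route behind this controller is a single namespaced POST (a conventional sketch; the actual routes file isn't shown in this post):

# config/routes.rb (sketch)
namespace :api do
  namespace :v1 do
    post 'search', to: 'search#create'   # POST /api/v1/search
  end
end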

What We've Built So Far

Multi-tenant document processing pipeline

  • PDF → Text extraction
  • Intelligent chunking with overlap
  • Self-hosted embeddings (GDPR compliant)
  • Background job processing

Secure vector search

  • Qdrant namespace isolation
  • Application-level verification
  • Audit logging
  • Graceful fallback to SQL

Production-ready API

  • Fast response times (45ms P50)
  • Comprehensive error handling
  • Rate limiting
  • Search analytics

Coming in Part 2: Production Resilience 🛡️

Building the search feature is only half the battle. In Part 2 (Wednesday), we'll cover:

  • Circuit Breakers - How they saved us during Black Friday when Qdrant overloaded
  • Rate Limiting - Per-tenant throttling to prevent abuse
  • Health Checks - Kubernetes-ready monitoring
  • Prometheus Metrics - Real-time observability
  • Grafana Dashboards - Visualizing search performance

Real incident story: Our Qdrant cluster hit 98% CPU during a traffic spike. Without circuit breakers, we would have had a complete outage. Instead, 99.2% of searches still worked using cached results and PostgreSQL fallback.



Questions about the architecture or implementation? Drop a comment below!

