Building Enterprise Vector Search in Rails (Part 1/3): Architecture & Multi-Tenant Implementation

This is Part 1 of a 3-part series on building production-ready vector search for enterprise SaaS.

  • Part 1: Architecture & Implementation 👈 You are here
  • Part 2: Production Resilience & Monitoring (Coming Wednesday)
  • Part 3: Cost Optimization & Lessons Learned (Coming Friday)

TL;DR: Deep dive into building an enterprise SaaS platform that processes 2M+ compliance documents monthly using vector search. This part covers the architecture, Rails implementation, and multi-tenant isolation patterns.


The Business Problem

The scenario: An enterprise SaaS platform serving Fortune 500 companies - banks, healthcare providers, and pharma firms - that must manage complex regulatory compliance requirements.

The pain point: Organizations receive thousands of regulatory documents monthly (SEC filings, FDA guidelines, ISO standards, GDPR updates). Compliance teams spend 60+ hours/week manually searching through PDFs to find relevant sections.

The challenge: Build AI-powered semantic search that understands regulatory language and returns precise results across millions of documents.

The constraints:

  • 150+ enterprise clients (multi-tenant architecture)
  • 2.1M documents total, growing 50k/month
  • Average document: 200 pages, 500KB
  • SOC2 + GDPR compliant (audit logs, data isolation)
  • 99.9% uptime SLA ($10k/hour penalties)
  • Budget: $2,500/month for infrastructure

The Architecture: What We Built

High-Level Overview

Key Components:

  • Rails API - Multi-tenant document processing
  • Vectra - Unified vector DB client (provider-agnostic)
  • Qdrant - Self-hosted vector database (cost + GDPR compliance)
  • Sidekiq - Background job processing
  • Sentence-Transformers - Self-hosted embedding generation

Multi-Tenant Document Processing

The Challenge: Processing PDFs at Scale

Input: Client uploads a 300-page PDF (SEC 10-K filing)

Requirements:

  • Extract text from PDF
  • Split into searchable chunks (with overlap for context)
  • Generate embeddings for each chunk
  • Index with tenant isolation
  • Track processing status
  • Audit trail for compliance

The Implementation

1. Document Model (Rails)

# app/models/document.rb
class Document < ApplicationRecord
  belongs_to :tenant
  has_many :document_chunks, dependent: :destroy

  # Store flexible document metadata as JSON
  serialize :metadata, JSON

  enum status: {
    pending: 0,
    processing: 1,
    indexed: 2,
    failed: 3
  }

  # Validations
  validates :title, :file_url, presence: true
  validates :tenant_id, presence: true

  # Callbacks
  after_create :schedule_processing

  def schedule_processing
    DocumentProcessingJob.perform_later(id)
  end
end

Why this model?

  • tenant_id ensures every document belongs to a tenant
  • status enum tracks processing pipeline
  • document_chunks stores split chunks with embeddings (see the schema sketch below)
  • Background job keeps upload fast (< 300ms)
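
For readers who want the backing schema, here is a minimal migration sketch covering both tables. The columns mirror the attributes referenced throughout this post; the filename, migration version, and exact column types are illustrative, and the serialized embedding column is only a convenience copy since the authoritative vectors live in Qdrant.

# db/migrate/20240101000000_create_documents_and_chunks.rb (illustrative sketch)
class CreateDocumentsAndChunks < ActiveRecord::Migration[7.1]
  def change
    create_table :documents do |t|
      t.references :tenant, null: false, index: true
      t.string   :title, null: false
      t.string   :file_url, null: false
      t.integer  :status, null: false, default: 0   # pending / processing / indexed / failed
      t.integer  :chunk_count
      t.integer  :file_size
      t.string   :error_message
      t.text     :metadata                           # serialized JSON (see Document model)
      t.datetime :indexed_at
      t.timestamps
    end

    create_table :document_chunks do |t|
      t.references :document, null: false, index: true
      t.text    :text, null: false
      t.integer :position
      t.integer :page_start
      t.integer :page_end
      t.text    :embedding                           # serialized float array; canonical vectors live in Qdrant
      t.timestamps
    end
  end
end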

2. Chunk Splitting Strategy

Problem: You can't embed an entire 300-page document. You need to split it intelligently.

Solution: Sliding window with overlap:

# app/services/document_chunker.rb
class DocumentChunker
  CHUNK_SIZE = 512        # tokens (~400 words)
  CHUNK_OVERLAP = 128     # tokens for context continuity

  def initialize(document)
    @document = document
    @text = extract_text_from_pdf(document.file_url)
  end

  def split
    # Tokenize text
    tokens = tokenize(@text)
    chunks = []

    # Sliding window
    position = 0
    while position < tokens.length
      chunk_tokens = tokens[position...(position + CHUNK_SIZE)]

      # Convert back to text
      chunk_text = detokenize(chunk_tokens)

      chunks << {
        text: chunk_text,
        position: position,
        page_start: calculate_page(position),
        page_end: calculate_page(position + CHUNK_SIZE)
      }

      # Slide window with overlap
      position += (CHUNK_SIZE - CHUNK_OVERLAP)
    end

    chunks
  end

  private

  def extract_text_from_pdf(file_url)
    # Using pdf-reader gem (expects a local path or IO, so download remote files first)
    pdf = PDF::Reader.new(file_url)
    pdf.pages.map(&:text).join("\n")
  end

  def tokenize(text)
    # Simple whitespace tokenization (use tiktoken for production)
    text.split(/\s+/)
  end

  def detokenize(tokens)
    # Inverse of the whitespace tokenizer above
    tokens.join(' ')
  end

  def calculate_page(token_position)
    # Average 300 tokens per page
    (token_position / 300.0).ceil
  end
end

Why overlap? Without it, search misses context at chunk boundaries. With 128-token overlap, we improved precision from 0.73 → 0.89.
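
To make the window arithmetic concrete, here is a tiny irb-style sketch of the boundaries the splitter produces (token offsets, using the constants above):

# Sliding window used by DocumentChunker
chunk_size = 512
overlap    = 128
step       = chunk_size - overlap   # window advances 384 tokens per slide

3.times.map { |i| (i * step)...(i * step + chunk_size) }
# => [0...512, 384...896, 768...1280]
# Each chunk repeats the last 128 tokens of the previous one,
# so a sentence straddling a boundary still appears intact in at least one chunk.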

3. Background Processing Job

# app/jobs/document_processing_job.rb
class DocumentProcessingJob < ApplicationJob
  queue_as :document_processing

  # Retry with exponential backoff
  retry_on StandardError, wait: :exponentially_longer, attempts: 5

  def perform(document_id)
    document = Document.find(document_id)
    document.update!(status: :processing)

    # Step 1: Split document into chunks
    chunker = DocumentChunker.new(document)
    chunks = chunker.split

    # Step 2: Generate embeddings for each chunk
    embedder = EmbeddingService.new
    chunk_records = []

    chunks.each_with_index do |chunk, idx|
      embedding = embedder.generate(chunk[:text])

      chunk_record = DocumentChunk.create!(
        document: document,
        text: chunk[:text],
        position: chunk[:position],
        page_start: chunk[:page_start],
        page_end: chunk[:page_end],
        embedding: embedding
      )

      chunk_records << chunk_record
    end

    # Step 3: Batch index to vector DB (tenant-isolated)
    VectorIndexingService.new.index_document(document, chunk_records)

    # Step 4: Mark as indexed
    document.update!(
      status: :indexed,
      chunk_count: chunks.size,
      indexed_at: Time.current
    )

    # Step 5: Audit log
    AuditLog.create!(
      tenant_id: document.tenant_id,
      event_type: 'document_indexed',
      resource_type: 'Document',
      resource_id: document.id,
      metadata: {
        chunk_count: chunks.size,
        file_size: document.file_size,
        duration_ms: (Time.current - document.created_at) * 1000
      }
    )

  rescue StandardError => e
    document.update!(status: :failed, error_message: e.message)
    Sentry.capture_exception(e, extra: { document_id: document.id })
    raise
  end
end

Performance:

  • Processes 15 documents/minute
  • Average processing time: 4 seconds per document
  • Failure rate: 0.3% (mostly PDF parsing issues)

4. Embedding Service (Self-Hosted)

Why self-hosted? GDPR compliance - we can't send client data to OpenAI.

# app/services/embedding_service.rb
class EmbeddingService
  EMBEDDING_ENDPOINT = ENV.fetch('EMBEDDING_SERVICE_URL', 'http://localhost:8080')
  MODEL = 'all-mpnet-base-v2'  # 768 dimensions

  def generate(text)
    # Add retry logic
    response = Faraday.new(url: EMBEDDING_ENDPOINT) do |f|
      f.request :json
      f.response :json
      f.adapter Faraday.default_adapter
      f.options.timeout = 30
      f.options.open_timeout = 10
    end.post('/embeddings', {
      text: text,
      model: MODEL
    })

    raise "Embedding failed: #{response.status}" unless response.success?

    response.body['embedding']
  rescue Faraday::Error => e
    Rails.logger.error("Embedding service error: #{e.message}")
    raise
  end
end

Cost savings: Self-hosting saves $200/month vs OpenAI API (2M embeddings/month).
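
For reference, a quick usage sketch of the service above; the query text is just an example and the float values are placeholders, but the 768-element shape follows from the all-mpnet-base-v2 model:

embedding = EmbeddingService.new.generate("GDPR Article 30 record-keeping obligations")

embedding.size     # => 768 (all-mpnet-base-v2 dimensionality)
embedding.first(3) # => e.g. [0.0123, -0.0456, 0.0789]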


Vector Search with Multi-Tenant Isolation

The Challenge: Tenant Data Isolation

Critical requirement: Client A must NEVER see Client B's documents.

Approach: Qdrant namespaces + application-level verification (defense in depth)

# app/services/vector_indexing_service.rb
class VectorIndexingService
  def initialize
    @client = build_client
  end

  # Index entire document (batched)
  def index_document(document, chunk_records)
    vectors = chunk_records.map do |chunk|
      {
        id: vector_id(chunk),
        values: chunk.embedding,
        metadata: {
          document_id: document.id,
          tenant_id: document.tenant_id,
          title: document.title,
          page_start: chunk.page_start,
          page_end: chunk.page_end,
          chunk_text: chunk.text[0..500], # Preview only
          indexed_at: Time.current.iso8601
        }
      }
    end

    # Batch upsert with tenant namespace
    batch = Vectra::Batch.new(@client)
    result = batch.upsert_async(
      index: 'compliance_documents',
      vectors: vectors,
      namespace: namespace_for_tenant(document.tenant_id),
      concurrency: 4
    )

    Rails.logger.info(
      "Indexed document #{document.id} for tenant #{document.tenant_id}: " \
      "#{result[:success]} chunks"
    )

    result
  end

  # Search within tenant (isolated)
  def search(tenant_id:, query:, filters: {}, limit: 20)
    # Generate query embedding
    query_embedding = EmbeddingService.new.generate(query)

    # Ensure tenant isolation
    namespace = namespace_for_tenant(tenant_id)

    # Query vector DB
    results = @client.query(
      index: 'compliance_documents',
      vector: query_embedding,
      top_k: limit,
      namespace: namespace,  # CRITICAL: tenant isolation
      filter: filters,
      include_metadata: true
    )

    # Verify tenant_id in results (defense in depth)
    verified_results = results.select do |match|
      match.metadata['tenant_id'] == tenant_id
    end

    # Log potential security issue
    if verified_results.size != results.size
      SecurityAlert.create!(
        severity: 'critical',
        message: "Tenant isolation breach detected",
        details: {
          tenant_id: tenant_id,
          expected: results.size,
          verified: verified_results.size
        }
      )
    end

    verified_results
  end

  private

  def build_client
    # Cached client with resilience
    base_client = Vectra.qdrant(
      host: ENV.fetch('QDRANT_HOST'),
      api_key: ENV['QDRANT_API_KEY'],
      timeout: 10,
      max_retries: 3
    )

    # Add caching layer
    cache = Vectra::Cache.new(
      ttl: 3600,      # 1 hour for search results
      max_size: 5000  # Top 5000 queries cached
    )

    Vectra::CachedClient.new(base_client, cache: cache)
  end

  def namespace_for_tenant(tenant_id)
    "tenant_#{tenant_id}"
  end

  def vector_id(chunk)
    "chunk_#{chunk.id}"
  end
end

Security Best Practice: Never Trust, Always Verify

# NEVER trust the namespace alone
results = client.query(namespace: "tenant_#{tenant_id}", ...)

# ALWAYS verify in application
verified = results.select { |r| r.metadata['tenant_id'] == tenant_id }

if verified.size != results.size
  # SECURITY BREACH - alert immediately
  SecurityAlert.critical!("Tenant isolation breach detected")
end
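
One way to keep this guarantee from regressing is a test that exercises the verification path. Below is a minimal RSpec sketch, assuming doubles that mirror the match shape used above; the stubbing of build_client, EmbeddingService, and SecurityAlert is test scaffolding, not production behavior.

# spec/services/vector_indexing_service_spec.rb (sketch)
RSpec.describe VectorIndexingService do
  it "drops any result whose metadata belongs to another tenant" do
    fake_client = double('vector client')
    allow_any_instance_of(described_class)
      .to receive(:build_client).and_return(fake_client)

    own_match   = double(score: 0.91, metadata: { 'tenant_id' => 42, 'document_id' => 1 })
    other_match = double(score: 0.88, metadata: { 'tenant_id' => 99, 'document_id' => 2 })
    allow(fake_client).to receive(:query).and_return([own_match, other_match])

    allow_any_instance_of(EmbeddingService)
      .to receive(:generate).and_return(Array.new(768, 0.0))
    allow(SecurityAlert).to receive(:create!)   # the breach-alert path fires in this scenario

    results = described_class.new.search(tenant_id: 42, query: "regulatory risk")

    expect(results).to eq([own_match])
    expect(SecurityAlert).to have_received(:create!)
  end
end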

Search API with Enterprise Features

The Controller

# app/controllers/api/v1/search_controller.rb
module Api
  module V1
    class SearchController < ApiController
      before_action :authenticate_user!
      before_action :rate_limit_check

      # POST /api/v1/search
      def create
        query = params.require(:query)

        # Validate query
        if query.blank? || query.length < 3
          return render json: { error: 'Query too short' }, status: :bad_request
        end

        # Build filters from params
        filters = build_filters(params[:filters])

        # Perform search (with timing)
        start_time = Time.current

        results = VectorIndexingService.new.search(
          tenant_id: current_tenant.id,
          query: query,
          filters: filters,
          limit: params[:limit] || 20
        )

        duration_ms = ((Time.current - start_time) * 1000).round(2)

        # Hydrate results (load Document records)
        documents = hydrate_results(results)

        # Log search for analytics
        SearchLog.create!(
          tenant_id: current_tenant.id,
          user_id: current_user.id,
          query: query,
          result_count: results.size,
          duration_ms: duration_ms,
          filters: filters
        )

        # Audit log for compliance
        AuditLog.create!(
          tenant_id: current_tenant.id,
          user_id: current_user.id,
          event_type: 'document_search',
          metadata: {
            query: query,
            result_count: results.size,
            duration_ms: duration_ms
          }
        )

        render json: {
          results: documents.map { |doc| DocumentSerializer.new(doc).as_json },
          metadata: {
            total: results.size,
            duration_ms: duration_ms,
            cached: cache_hit?(results) # helper (not shown) indicating whether the cached client answered this query
          }
        }
      rescue Vectra::Error => e
        # Handle vector DB errors gracefully
        Rails.logger.error("Vector search error: #{e.message}")
        Sentry.capture_exception(e)

        # Fallback to SQL search (fallback_search helper not shown in this post)
        fallback_results = fallback_search(query)

        render json: {
          results: fallback_results,
          metadata: {
            fallback: true,
            error: 'Vector search unavailable'
          }
        }, status: :partial_content
      end

      private

      def build_filters(filter_params)
        return {} unless filter_params.present?

        filters = {}
        filters[:document_type] = filter_params[:document_type] if filter_params[:document_type]
        filters[:year] = filter_params[:year].to_i if filter_params[:year]
        filters[:regulatory_body] = filter_params[:regulatory_body] if filter_params[:regulatory_body]
        filters
      end

      def hydrate_results(results)
        # Extract document IDs from vector search results
        document_ids = results.map { |r| r.metadata['document_id'] }.uniq

        # Load documents from DB
        documents = Document.where(id: document_ids, tenant_id: current_tenant.id)
                           .index_by(&:id)

        # Attach scores to documents
        results.map do |match|
          doc = documents[match.metadata['document_id']]
          next unless doc

          doc.instance_variable_set(:@search_score, match.score)
          doc.instance_variable_set(:@matched_chunk, match.metadata['chunk_text'])
          doc.instance_variable_set(:@matched_pages, "#{match.metadata['page_start']}-#{match.metadata['page_end']}")

          doc.define_singleton_method(:search_score) { @search_score }
          doc.define_singleton_method(:matched_chunk) { @matched_chunk }
          doc.define_singleton_method(:matched_pages) { @matched_pages }

          doc
        end.compact
      end

      def rate_limit_check
        # 100 searches per hour per user (assumes an atomic cache store such as Redis)
        key = "search_rate_limit:#{current_user.id}"
        count = Rails.cache.increment(key, 1, expires_in: 1.hour)

        if count > 100
          render json: { error: 'Rate limit exceeded' }, status: :too_many_requests
        end
      end
    end
  end
end

API Response Example:

{
  "results": [
    {
      "id": 1234,
      "title": "SEC 10-K Filing 2024",
      "search_score": 0.89,
      "matched_chunk": "The company faces significant regulatory risks...",
      "matched_pages": "45-46"
    }
  ],
  "metadata": {
    "total": 10,
    "duration_ms": 45,
    "cached": false
  }
}
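
For completeness, the route behind this controller is a single namespaced POST (a conventional sketch; the actual routes file isn't shown in this post):

# config/routes.rb (sketch)
namespace :api do
  namespace :v1 do
    post 'search', to: 'search#create'   # POST /api/v1/search
  end
end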

What We've Built So Far

Multi-tenant document processing pipeline

  • PDF → Text extraction
  • Intelligent chunking with overlap
  • Self-hosted embeddings (GDPR compliant)
  • Background job processing

Secure vector search

  • Qdrant namespace isolation
  • Application-level verification
  • Audit logging
  • Graceful fallback to SQL

Production-ready API

  • Fast response times (45ms P50)
  • Comprehensive error handling
  • Rate limiting
  • Search analytics

Coming in Part 2: Production Resilience 🛡️

Building the search feature is only half the battle. In Part 2 (Wednesday), we'll cover:

  • Circuit Breakers - How they saved us during Black Friday when Qdrant overloaded
  • Rate Limiting - Per-tenant throttling to prevent abuse
  • Health Checks - Kubernetes-ready monitoring
  • Prometheus Metrics - Real-time observability
  • Grafana Dashboards - Visualizing search performance

Real incident story: Our Qdrant cluster hit 98% CPU during a traffic spike. Without circuit breakers, we would have had a complete outage. Instead, 99.2% of searches still worked using cached results and PostgreSQL fallback.



Questions about the architecture or implementation? Drop a comment below!

