Garrett Yan

Multi-Tenant vs Multi-Instance: How We Cut SaaS Infrastructure Costs by 78% Without Sacrificing Security

Running a SaaS platform with 200+ enterprise customers, we were bleeding $45,000/month on AWS infrastructure. Each customer had their own isolated instance - separate EC2s, RDS databases, and Redis clusters. While this gave us perfect isolation and simplified billing, our AWS bill was becoming unsustainable.

After a 6-month migration to a hybrid multi-tenant architecture, we've reduced costs to $10,000/month (78% reduction) while actually improving performance and maintaining enterprise-grade security. Here's exactly how we did it.

The Numbers That Matter

Before: Multi-Instance Architecture

Monthly Costs (200 customers):
- EC2 instances (200 × t3.medium):        $6,000
- RDS PostgreSQL (200 × db.t3.small):    $14,000  
- Redis clusters (200 × cache.t3.micro):  $5,000
- ALB + Target Groups (200 × $20):        $4,000
- NAT Gateways (200 × $45):              $9,000
- CloudWatch + Logs:                      $3,000
- Data Transfer:                           $4,000

Total: $45,000/month ($540,000/year)

After: Multi-Tenant Architecture

Monthly Costs (200 customers):
- EC2 instances (6 × m5.2xlarge):         $1,400
- Aurora PostgreSQL Cluster:              $2,500
- Redis Cluster (3 nodes):                $800
- Single ALB:                             $25
- NAT Gateway (1):                        $45
- CloudWatch + Enhanced Monitoring:       $800
- Data Transfer (optimized):              $500
- Security & Compliance Tools:            $930
- Backup & DR:                            $2,000

Total: $10,000/month ($120,000/year)
Savings: $35,000/month (78% reduction)

Architecture Comparison

Multi-Instance (Original)

Customer_A:
  - Dedicated EC2 instance
  - Dedicated RDS database
  - Dedicated Redis cache
  - Isolated VPC subnet
  - Separate CloudWatch namespace

Customer_B:
  - [Same isolated stack]

Problem: 200× infrastructure overhead

Multi-Tenant (New)

Shared_Infrastructure:
  Application_Tier:
    - 6 × m5.2xlarge EC2 (Auto Scaling 4-8)
    - Shared across all tenants
    - Tenant isolation via JWT tokens

  Database_Tier:
    - Aurora PostgreSQL cluster
    - Row-Level Security (RLS)
    - Tenant-specific schemas
    - Connection pooling (PgBouncer)

  Cache_Tier:
    - Redis Cluster (3 nodes)
    - Keyspace separation (tenant:{id}:*)
    - LRU eviction per tenant limits

Implementation: The Technical Details

1. Database Multi-Tenancy with PostgreSQL RLS

-- Enable Row Level Security
ALTER TABLE products ENABLE ROW LEVEL SECURITY;

-- Add tenant_id to all tables
ALTER TABLE products ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_products_tenant ON products(tenant_id);

-- Create security policy
CREATE POLICY tenant_isolation ON products
    FOR ALL
    TO application_role
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Set tenant context in application
SET LOCAL app.current_tenant = '123e4567-e89b-12d3-a456-426614174000';
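One caveat worth adding: by default, RLS policies do not apply to the table owner or to superusers. If the application role happens to own the tables, the policy above is silently bypassed. Forcing row security closes that gap:

```sql
-- RLS is skipped for table owners unless explicitly forced
ALTER TABLE products FORCE ROW LEVEL SECURITY;
```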

2. Application Layer Tenant Isolation

# middleware/tenant_isolation.py
import os
import jwt
import psycopg2.pool
from functools import wraps
from flask import request, g, jsonify

SECRET_KEY = os.getenv('JWT_SECRET_KEY')

class TenantIsolation:
    def __init__(self):
        self.pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=20,
            maxconn=100,
            host=os.getenv('DB_HOST'),
            database=os.getenv('DB_NAME')
        )

    def get_tenant_connection(self, tenant_id):
        conn = self.pool.getconn()
        cursor = conn.cursor()

        # Set tenant context for RLS; set_config(..., true) is
        # transaction-scoped, equivalent to SET LOCAL, and safe to
        # parameterize
        cursor.execute(
            "SELECT set_config('app.current_tenant', %s, true)",
            (tenant_id,)
        )

        # Set statement timeout per tenant tier
        timeout = self.get_tenant_timeout(tenant_id)
        cursor.execute(
            "SELECT set_config('statement_timeout', %s, true)",
            (str(timeout),)
        )

        return conn

tenant_isolation = TenantIsolation()

def require_tenant(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract tenant from JWT
        token = request.headers.get('Authorization', '').replace('Bearer ', '')

        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            tenant_id = payload['tenant_id']
            g.tenant_id = tenant_id
            g.db_conn = tenant_isolation.get_tenant_connection(tenant_id)
        except jwt.InvalidTokenError:
            return {'error': 'Invalid token'}, 401

        return f(*args, **kwargs)
    return decorated_function

# Usage in routes
@app.route('/api/products')
@require_tenant
def get_products():
    # Queries are automatically filtered by tenant via RLS
    # (the connection goes back to the pool in a teardown handler,
    # omitted here)
    cursor = g.db_conn.cursor()
    cursor.execute("SELECT * FROM products")
    return jsonify(cursor.fetchall())

3. Redis Multi-Tenant Caching Strategy

# cache/tenant_cache.py
import redis
from typing import Optional, Any
import json

class TenantCache:
    def __init__(self):
self.redis = redis.Redis(
            host='redis-cluster.aws.internal',
            port=6379,
            decode_responses=True,
            # redis-py forwards these straight to its ConnectionPool
            max_connections=100,
            socket_keepalive=True
        )

    def get_tenant_key(self, tenant_id: str, key: str) -> str:
        """Generate tenant-specific key"""
        return f"tenant:{tenant_id}:{key}"

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        full_key = self.get_tenant_key(tenant_id, key)
        value = self.redis.get(full_key)
        return json.loads(value) if value else None

    def set(self, tenant_id: str, key: str, value: Any, 
            ttl: int = 3600) -> None:
        full_key = self.get_tenant_key(tenant_id, key)

        # Check tenant's cache quota
        if not self.check_tenant_quota(tenant_id):
            # Evict oldest tenant keys if quota exceeded
            self.evict_tenant_keys(tenant_id)

        self.redis.setex(
            full_key,
            ttl,
            json.dumps(value)
        )

    def check_tenant_quota(self, tenant_id: str) -> bool:
        """Check if tenant is within cache quota"""
pattern = f"tenant:{tenant_id}:*"
        # SCAN-based counting is O(total keys); fine at this scale,
        # but a per-tenant counter would be cheaper on hot paths
        count = sum(1 for _ in self.redis.scan_iter(match=pattern, count=100))

        # Different quotas per tier
        tier = self.get_tenant_tier(tenant_id)
        quotas = {
            'free': 1000,
            'pro': 10000,
            'enterprise': 100000
        }

        return count < quotas.get(tier, 1000)
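The `evict_tenant_keys` helper referenced above isn't shown. Its selection logic might look like this, as a sketch: it assumes the caller has already paired each key with its Redis `OBJECT IDLETIME` (seconds since last access), and it returns the least-recently-used keys to delete.

```python
def keys_to_evict(key_idle_pairs, current_count, quota):
    """Return the keys to delete, least-recently-used first.

    key_idle_pairs: iterable of (key, idle_seconds) tuples
    current_count:  tenant's current key count
    quota:          tenant's key quota for its tier
    """
    excess = current_count - quota
    if excess <= 0:
        return []
    # Highest idle time = least recently used
    ranked = sorted(key_idle_pairs, key=lambda kv: kv[1], reverse=True)
    return [k for k, _ in ranked[:excess]]
```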

4. Performance Isolation & Resource Limits

# resource_management/tenant_limits.py
from dataclasses import dataclass
from typing import Dict
import asyncio
from asyncio import Semaphore

@dataclass
class TenantLimits:
    max_concurrent_requests: int
    max_db_connections: int
    max_cpu_seconds: int
    max_memory_mb: int
    api_rate_limit: int  # requests per minute

class ResourceManager:
    def __init__(self):
        self.tenant_limits: Dict[str, TenantLimits] = {}
        self.semaphores: Dict[str, Semaphore] = {}

    def get_tenant_limits(self, tenant_id: str) -> TenantLimits:
        tier = self.get_tenant_tier(tenant_id)

        limits = {
            'free': TenantLimits(
                max_concurrent_requests=10,
                max_db_connections=5,
                max_cpu_seconds=60,
                max_memory_mb=512,
                api_rate_limit=100
            ),
            'pro': TenantLimits(
                max_concurrent_requests=50,
                max_db_connections=20,
                max_cpu_seconds=300,
                max_memory_mb=2048,
                api_rate_limit=1000
            ),
            'enterprise': TenantLimits(
                max_concurrent_requests=200,
                max_db_connections=100,
                max_cpu_seconds=3600,
                max_memory_mb=8192,
                api_rate_limit=10000
            )
        }

        return limits.get(tier, limits['free'])

    async def acquire_resource(self, tenant_id: str):
        """Acquire semaphore for tenant request"""
        if tenant_id not in self.semaphores:
            limits = self.get_tenant_limits(tenant_id)
            self.semaphores[tenant_id] = Semaphore(
                limits.max_concurrent_requests
            )

        return await self.semaphores[tenant_id].acquire()

# API Rate Limiting
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

def get_tenant_from_request():
    """Extract tenant ID for rate limiting"""
    token = request.headers.get('Authorization', '')
    if token:
        try:
            payload = jwt.decode(token.replace('Bearer ', ''),
                                 SECRET_KEY, algorithms=['HS256'])
            return payload.get('tenant_id', get_remote_address())
        except jwt.InvalidTokenError:
            pass
    return get_remote_address()

limiter = Limiter(
    key_func=get_tenant_from_request,
    app=app,
    default_limits=["100 per minute"],  # Free tier
    storage_uri="redis://redis-cluster.aws.internal:6379"
)

# Flask-Limiter accepts a callable as the limit value, so a single
# decorator can vary the limit by tier
TIER_LIMITS = {
    'free': "100 per minute",
    'pro': "1000 per minute",
    'enterprise': "10000 per minute"
}

@app.route('/api/data')
@limiter.limit(
    # get_current_tenant_tier() (not shown) resolves the tier
    # for the authenticated tenant
    lambda: TIER_LIMITS.get(get_current_tenant_tier(), "100 per minute")
)
def get_data():
    pass
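One gap worth flagging in `ResourceManager` above: `acquire_resource` acquires the semaphore, but nothing releases it, so a tenant would eventually deadlock at its concurrency cap. Wrapping acquire/release in an async context manager keeps them paired. `MiniResourceManager` here is a self-contained stand-in for illustration, not the production class:

```python
import asyncio
from contextlib import asynccontextmanager

class MiniResourceManager:
    """Minimal stand-in for the ResourceManager above (illustration only)."""
    def __init__(self, limit: int):
        self.semaphores = {}
        self.limit = limit

    def _sem(self, tenant_id: str) -> asyncio.Semaphore:
        return self.semaphores.setdefault(
            tenant_id, asyncio.Semaphore(self.limit))

    @asynccontextmanager
    async def tenant_slot(self, tenant_id: str):
        sem = self._sem(tenant_id)
        await sem.acquire()
        try:
            yield
        finally:
            sem.release()  # always released, even if the handler raises

async def demo() -> int:
    mgr = MiniResourceManager(limit=2)
    active = peak = 0

    async def work():
        nonlocal active, peak
        async with mgr.tenant_slot('tenant-a'):
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)
            active -= 1

    await asyncio.gather(*(work() for _ in range(6)))
    return peak  # never exceeds the limit of 2
```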

5. Security & Compliance Implementation

# security/tenant_security.py
import base64
import json
import time
from datetime import datetime
from typing import Dict

import boto3
from cryptography.fernet import Fernet
from flask import request

class TenantSecurity:
    def __init__(self):
        self.kms = boto3.client('kms')
        self.tenant_keys: Dict[str, bytes] = {}

    def get_tenant_encryption_key(self, tenant_id: str) -> bytes:
        """Get or create tenant-specific encryption key"""
        if tenant_id not in self.tenant_keys:
            # Generate tenant-specific key using KMS
            response = self.kms.generate_data_key(
                KeyId='arn:aws:kms:us-east-1:xxx:key/xxx',
                KeySpec='AES_256',
                EncryptionContext={
                    'tenant_id': tenant_id,
                    'purpose': 'tenant_data_encryption'
                }
            )

            # Fernet expects a urlsafe base64-encoded 32-byte key,
            # not the raw bytes KMS returns
            self.tenant_keys[tenant_id] = base64.urlsafe_b64encode(
                response['Plaintext']
            )

            # Store encrypted key in database
            self.store_encrypted_key(
                tenant_id,
                response['CiphertextBlob']
            )

        return self.tenant_keys[tenant_id]

    def encrypt_tenant_data(self, tenant_id: str, data: str) -> str:
        """Encrypt data with tenant-specific key"""
        key = self.get_tenant_encryption_key(tenant_id)
        f = Fernet(key)
        return f.encrypt(data.encode()).decode()

    def audit_log(self, tenant_id: str, action: str, 
                  resource: str, user_id: str):
        """Tenant-specific audit logging"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'tenant_id': tenant_id,
            'user_id': user_id,
            'action': action,
            'resource': resource,
            'ip_address': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        }

        # Write to tenant-specific CloudWatch log stream
        cloudwatch = boto3.client('logs')
        cloudwatch.put_log_events(
            logGroupName='/aws/saas/audit',
            logStreamName=f'tenant-{tenant_id}',
            logEvents=[{
                'timestamp': int(time.time() * 1000),
                'message': json.dumps(log_entry)
            }]
        )

6. Monitoring & Alerting Per Tenant

# monitoring/tenant_metrics.py
import boto3
from dataclasses import dataclass
from typing import List
import time

@dataclass
class TenantMetrics:
    tenant_id: str
    api_requests: int
    api_errors: int
    db_queries: int
    cache_hits: int
    cache_misses: int
    response_time_ms: float
    cpu_usage: float
    memory_usage_mb: int

class TenantMonitoring:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.metrics_buffer: List[TenantMetrics] = []

    def record_metric(self, tenant_id: str, metric_name: str, 
                     value: float, unit: str = 'Count'):
        """Record tenant-specific metric"""
        self.cloudwatch.put_metric_data(
            Namespace='SaaS/Tenant',
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Dimensions': [
                        {
                            'Name': 'TenantId',
                            'Value': tenant_id
                        },
                        {
                            'Name': 'Tier',
                            'Value': self.get_tenant_tier(tenant_id)
                        }
                    ],
                    'Value': value,
                    'Unit': unit,
                    'Timestamp': time.time()
                }
            ]
        )

    def create_tenant_alarms(self, tenant_id: str):
        """Create CloudWatch alarms per tenant"""
        tier = self.get_tenant_tier(tenant_id)

        # Different thresholds per tier
        thresholds = {
            'free': {'error_rate': 0.05, 'response_time': 1000},
            'pro': {'error_rate': 0.02, 'response_time': 500},
            'enterprise': {'error_rate': 0.01, 'response_time': 200}
        }

        threshold = thresholds.get(tier, thresholds['free'])

        # Error rate alarm
        self.cloudwatch.put_metric_alarm(
            AlarmName=f'tenant-{tenant_id}-high-error-rate',
            ComparisonOperator='GreaterThanThreshold',
            EvaluationPeriods=2,
            MetricName='ErrorRate',
            Namespace='SaaS/Tenant',
            Period=300,
            Statistic='Average',
            Threshold=threshold['error_rate'],
            ActionsEnabled=True,
            AlarmActions=[SNS_TOPIC_ARN],
            AlarmDescription=f'High error rate for tenant {tenant_id}'
        )
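The unused `metrics_buffer` above hints at batching, which matters at 200 tenants: `PutMetricData` accepts multiple datums per call, so flushing the buffer in batches cuts API-call volume and cost. A small batching helper (the names here are ours, not from the codebase):

```python
def chunk(items, size):
    """Split a list into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Sketch of a flush method on TenantMonitoring:
# def flush(self):
#     for batch in chunk(self.metrics_buffer, 20):
#         self.cloudwatch.put_metric_data(
#             Namespace='SaaS/Tenant',
#             MetricData=[self.to_datum(m) for m in batch]
#         )
#     self.metrics_buffer.clear()
```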

Migration Strategy: From Multi-Instance to Multi-Tenant

Phase 1: Data Migration (Month 1-2)

# migration/tenant_migration.py
import uuid

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

def migrate_customer_to_multitenant(customer_id: str,
                                    instance_db_url: str):
    """Migrate single customer from instance to multi-tenant"""

    # 1. Generate tenant UUID
    tenant_id = str(uuid.uuid4())

    # 2. Connect to customer's instance database
    source_conn = psycopg2.connect(instance_db_url)

    # 3. Connect to multi-tenant database
    #    (pandas' to_sql needs a SQLAlchemy engine, not a raw
    #    psycopg2 connection)
    target_engine = create_engine(MULTITENANT_DB_URL)

    # 4. Migrate schema with tenant_id
    tables = ['users', 'products', 'orders', 'payments']

    for table in tables:
        # Read from instance
        df = pd.read_sql(f"SELECT * FROM {table}", source_conn)

        # Add tenant_id column
        df['tenant_id'] = tenant_id

        # Write to multi-tenant
        df.to_sql(table, target_engine, if_exists='append',
                  index=False, method='multi')

    # 5. Migrate Redis data
    migrate_redis_data(customer_id, tenant_id)

    # 6. Update customer record
    update_customer_tenant_mapping(customer_id, tenant_id)

    return tenant_id

# Parallel migration script
from concurrent.futures import ThreadPoolExecutor, as_completed

def batch_migration():
    customers = get_all_customers()

    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for customer in customers:
            future = executor.submit(
                migrate_customer_to_multitenant,
                customer['id'],
                customer['instance_db_url']
            )
            futures.append(future)

        # Track progress
        for future in as_completed(futures):
            result = future.result()
            print(f"Migrated customer to tenant {result}")

Phase 2: Application Cutover (Month 3-4)

Cutover Strategy:
  1. Deploy multi-tenant application
  2. Configure load balancer for gradual migration
  3. Route customers by tier:
     - Start with free tier (low risk)
     - Move to pro tier
     - Finally migrate enterprise
  4. Monitor performance and errors
  5. Rollback capability per tenant
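The tier-ordered cutover above reduces to a routing predicate at the load balancer: send a request to the shared pool only if the tenant's tier has entered the rollout and that specific tenant has been migrated. A sketch of that decision (the tier names come from this article; the function itself is hypothetical):

```python
ROLLOUT_ORDER = ['free', 'pro', 'enterprise']  # low risk first

def route_backend(tenant_tier: str, rollout_stage: int,
                  migrated: set, tenant_id: str) -> str:
    """Pick the backend pool for a request during the gradual cutover.

    rollout_stage: how many tiers have entered the rollout (1 = free only)
    migrated:      set of tenant IDs whose data migration completed
    """
    tier_active = ROLLOUT_ORDER.index(tenant_tier) < rollout_stage
    if tier_active and tenant_id in migrated:
        return 'multi-tenant'
    # Not yet migrated (or tier not in rollout): keep the dedicated
    # instance serving, which is also the per-tenant rollback path
    return 'dedicated'
```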

Phase 3: Infrastructure Cleanup (Month 5-6)

#!/bin/bash
# cleanup_instances.sh

# List all customer instances
INSTANCES=$(aws ec2 describe-instances \
  --filters "Name=tag:Type,Values=customer-instance" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

# Terminate instances after verification
for INSTANCE_ID in $INSTANCES; do
  CUSTOMER_ID=$(aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$INSTANCE_ID" \
              "Name=key,Values=CustomerId" \
    --query "Tags[0].Value" --output text)

  # Verify customer is migrated (verify_customer_migrated is a shell
  # function assumed to be defined earlier in this script)
  if verify_customer_migrated "$CUSTOMER_ID"; then
    echo "Terminating instance $INSTANCE_ID for customer $CUSTOMER_ID"
    aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
  fi
done

# Clean up RDS instances
# Clean up Redis clusters
# Remove unused VPC resources

Performance Improvements

Despite consolidation, performance actually improved:

Response Time Comparison

Multi-Instance (p50/p95/p99):
- API Latency: 250ms / 800ms / 1500ms
- Database Query: 50ms / 200ms / 500ms

Multi-Tenant (p50/p95/p99):
- API Latency: 120ms / 350ms / 600ms  (52% faster p50)
- Database Query: 20ms / 80ms / 150ms  (60% faster p50)

Why Performance Improved:

  1. Better Resource Utilization: Larger instances with consistent CPU performance
  2. Optimized Connection Pooling: PgBouncer reduced connection overhead
  3. Shared Cache Benefits: Higher cache hit rates with shared Redis
  4. Aurora Performance: Better than individual RDS instances
  5. Reduced Network Hops: Single VPC, less inter-AZ traffic
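Point 2 did much of the heavy lifting. For reference, a minimal pgbouncer.ini sketch with illustrative values (the endpoint is a placeholder, not our exact production config):

```ini
; pgbouncer.ini - illustrative values only
[databases]
saas = host=aurora-cluster.cluster-xxx.us-east-1.rds.amazonaws.com dbname=saas

[pgbouncer]
pool_mode = transaction        ; transaction-scoped settings (SET LOCAL /
                               ; set_config(..., true)) stay safe here
max_client_conn = 2000         ; all app servers fan in through PgBouncer
default_pool_size = 50         ; actual Postgres connections per db/user
server_idle_timeout = 60
```

Transaction pooling is the reason the RLS middleware sets the tenant context per transaction rather than per session: session-level `SET` would leak across tenants when connections are multiplexed.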

Lessons Learned

What Worked Well:

  1. Row-Level Security: PostgreSQL RLS provided bulletproof isolation
  2. Gradual Migration: Customer-by-customer approach minimized risk
  3. Tier-Based Limits: Different resource limits per customer tier
  4. Shared Caching: 85% cache hit rate vs 40% in isolated instances

Challenges We Faced:

  1. Noisy Neighbor: One customer running heavy queries affected others
    • Solution: Statement timeouts and resource governors
  2. Compliance Concerns: Some enterprise customers required isolation
    • Solution: Hybrid model - keep top 10 enterprises on dedicated
  3. Migration Complexity: Data migration took longer than expected
    • Solution: Built automated migration tools and parallel processing
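The statement-timeout fix for noisy neighbors can also be applied declaratively at the role level, so it holds even if a code path forgets to set it. The tier role names here are illustrative:

```sql
-- Per-tier timeouts as role defaults; a runaway analytical query from
-- one tenant gets cancelled instead of starving everyone else
ALTER ROLE free_tier_role SET statement_timeout = '5s';
ALTER ROLE pro_tier_role SET statement_timeout = '30s';
ALTER ROLE enterprise_role SET statement_timeout = '120s';
```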

What We'd Do Differently:

  1. Start with multi-tenant from day one for new products
  2. Implement better resource isolation earlier
  3. Build migration tools before starting
  4. Keep better metrics on per-customer resource usage

Security & Compliance Maintained

Despite shared infrastructure, we maintained security certifications:

  • SOC 2 Type II: Passed with zero findings
  • HIPAA Compliant: With proper BAA and encryption
  • GDPR Ready: Data isolation and right-to-deletion implemented
  • PCI DSS: Tokenization and proper segmentation

Key security features:

  • Tenant data encryption with unique keys
  • Complete audit logging per tenant
  • Data residency controls
  • Automated compliance reporting
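Right-to-deletion falls out of the same tenant_id discipline: with every row tagged and every cache key prefixed, deleting a tenant is a bounded, auditable operation. A hedged sketch (the table list and helper shape are illustrative, not our exact tooling):

```python
# Tables listed children-first so foreign keys don't block deletes;
# names come from a fixed allowlist, so f-string interpolation is safe
TENANT_TABLES = ['payments', 'orders', 'products', 'users']

def delete_tenant(pg_conn, redis_client, tenant_id: str):
    """Erase all of a tenant's data from Postgres and Redis (GDPR)."""
    cur = pg_conn.cursor()
    for table in TENANT_TABLES:
        cur.execute(f"DELETE FROM {table} WHERE tenant_id = %s",
                    (tenant_id,))
    pg_conn.commit()

    # Cache keys are namespaced tenant:{id}:*, so a SCAN sweep is enough
    for key in redis_client.scan_iter(match=f"tenant:{tenant_id}:*"):
        redis_client.delete(key)
```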

The Hybrid Approach: Best of Both Worlds

We kept 10 enterprise customers on dedicated instances:

Hybrid Architecture:
  Multi-Tenant (190 customers):
    - Shared infrastructure
    - $8,000/month cost
    - Automated management

  Multi-Instance (10 enterprises):
    - Dedicated resources
    - $2,000/month cost
    - Premium pricing justified
    - Compliance requirements met

ROI Analysis

Investment:
- Engineering time: 3 engineers × 6 months = $180,000
- Migration tools & automation: $20,000
- Monitoring & security tools: $30,000
Total Investment: $230,000

Returns:
- Monthly savings: $35,000
- Annual savings: $420,000
- Payback period: 6.6 months
- 3-year savings: $1,260,000

Additional Benefits:
- Reduced operational overhead (5 hours/week saved)
- Faster feature deployment (1 deployment vs 200)
- Better resource utilization (70% vs 20%)
- Improved system reliability (99.99% vs 99.9%)

Action Items: Your Migration Checklist

If you're considering multi-tenant architecture:

  1. Analyze Current Costs

    • Document per-customer infrastructure costs
    • Identify resource utilization patterns
    • Calculate potential savings
  2. Design Tenant Isolation

    • Choose isolation strategy (database, schema, row-level)
    • Implement proper authentication/authorization
    • Plan resource limits per tenant
  3. Build Migration Tools

    • Automated data migration scripts
    • Rollback procedures
    • Performance testing framework
  4. Implement Gradually

    • Start with non-critical customers
    • Monitor performance closely
    • Keep rollback options ready
  5. Maintain Security

    • Implement encryption per tenant
    • Audit logging and monitoring
    • Regular security assessments
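For step 1, a back-of-envelope helper makes the savings calculation concrete (illustrative only, shown here with this article's rough numbers):

```python
def consolidation_savings(per_customer_cost: float, customers: int,
                          shared_cost: float):
    """Return (monthly savings, fractional reduction) from moving
    per-customer stacks onto shared infrastructure."""
    before = per_customer_cost * customers
    savings = before - shared_cost
    return savings, savings / before

# e.g. ~$225/customer/month across 200 customers vs a $10k shared bill
# gives $35,000/month saved, roughly a 78% reduction
```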

Conclusion

Multi-tenant architecture isn't just about cost savings - it's about building a sustainable, scalable SaaS platform. Our 78% cost reduction enabled us to:

  • Invest more in product development
  • Offer competitive pricing
  • Improve system performance
  • Scale to 500+ customers without linear cost growth

The key is maintaining security and performance while consolidating resources. With proper planning and implementation, multi-tenant architecture can transform your SaaS economics without compromising on quality.

