DEV Community

Garrett Yan

Multi-Tenant vs Multi-Instance: How We Cut SaaS Infrastructure Costs by 78% Without Sacrificing Security

Running a SaaS platform with 200+ enterprise customers, we were bleeding $45,000/month on AWS infrastructure. Each customer had their own isolated instance: separate EC2 instances, RDS databases, and Redis clusters. While this gave us perfect isolation and simplified billing, our AWS bill was becoming unsustainable.

After a 6-month migration to a hybrid multi-tenant architecture, we've reduced costs to $10,000/month (78% reduction) while actually improving performance and maintaining enterprise-grade security. Here's exactly how we did it.

The Numbers That Matter

Before: Multi-Instance Architecture

Monthly Costs (200 customers):
- EC2 instances (200 × t3.medium):        $6,000
- RDS PostgreSQL (200 × db.t3.small):    $14,000
- Redis clusters (200 × cache.t3.micro):  $5,000
- ALB + Target Groups (200 × $20):        $4,000
- NAT Gateways (200 × $45):               $9,000
- CloudWatch + Logs:                      $3,000
- Data Transfer:                          $4,000

Total: $45,000/month ($540,000/year)

After: Multi-Tenant Architecture

Monthly Costs (200 customers):
- EC2 instances (6 × m5.2xlarge):         $1,400
- Aurora PostgreSQL Cluster:              $2,500
- Redis Cluster (3 nodes):                $800
- Single ALB:                             $25
- NAT Gateway (1):                        $45
- CloudWatch + Enhanced Monitoring:       $800
- Data Transfer (optimized):              $500
- Security & Compliance Tools:            $930
- Backup & DR:                            $2,000

Total: $10,000/month ($120,000/year)
Savings: $35,000/month (78% reduction)
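Normalized per customer, the same numbers make the gap even clearer. A quick sanity check of the arithmetic:

```python
# Per-customer cost, before and after the migration
customers = 200

before_total = 45_000   # $/month, multi-instance
after_total = 10_000    # $/month, multi-tenant

before_per_customer = before_total / customers
after_per_customer = after_total / customers
savings_pct = (before_total - after_total) / before_total * 100

print(f"Before: ${before_per_customer:.0f}/customer/month")  # $225
print(f"After:  ${after_per_customer:.0f}/customer/month")   # $50
print(f"Savings: {savings_pct:.0f}%")                        # 78%
```

At $50 per customer per month, infrastructure stops dominating unit economics.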

Architecture Comparison

Multi-Instance (Original)

Customer_A:
  - Dedicated EC2 instance
  - Dedicated RDS database
  - Dedicated Redis cache
  - Isolated VPC subnet
  - Separate CloudWatch namespace

Customer_B:
  - [Same isolated stack]

Problem: 200× infrastructure overhead

Multi-Tenant (New)

Shared_Infrastructure:
  Application_Tier:
    - 6 × m5.2xlarge EC2 (Auto Scaling 4-8)
    - Shared across all tenants
    - Tenant isolation via JWT tokens

  Database_Tier:
    - Aurora PostgreSQL cluster
    - Row-Level Security (RLS)
    - Tenant-specific schemas
    - Connection pooling (PgBouncer)

  Cache_Tier:
    - Redis Cluster (3 nodes)
    - Keyspace separation (tenant:{id}:*)
    - LRU eviction per tenant limits

Implementation: The Technical Details

1. Database Multi-Tenancy with PostgreSQL RLS

-- Enable Row Level Security
ALTER TABLE products ENABLE ROW LEVEL SECURITY;
-- Apply the policy to the table owner too (owners bypass RLS by default)
ALTER TABLE products FORCE ROW LEVEL SECURITY;

-- Add tenant_id to all tables
ALTER TABLE products ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_products_tenant ON products(tenant_id);

-- Create security policy
CREATE POLICY tenant_isolation ON products
    FOR ALL
    TO application_role
    USING (tenant_id = current_setting('app.current_tenant')::UUID);

-- Set tenant context in application. SET LOCAL only lasts for the
-- current transaction, so run it inside the transaction that queries.
SET LOCAL app.current_tenant = '123e4567-e89b-12d3-a456-426614174000';

2. Application Layer Tenant Isolation

# middleware/tenant_isolation.py
import os
import jwt
import psycopg2.pool
from functools import wraps
from flask import request, g

SECRET_KEY = os.environ['SECRET_KEY']  # JWT signing secret

class TenantIsolation:
    def __init__(self):
        self.pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=20,
            maxconn=100,
            host=os.getenv('DB_HOST'),
            database=os.getenv('DB_NAME')
        )

    def get_tenant_connection(self, tenant_id):
        conn = self.pool.getconn()
        cursor = conn.cursor()

        # SET doesn't accept bind parameters, so use set_config().
        # is_local=true scopes the setting to the current transaction,
        # which psycopg2 opens implicitly on the first execute().
        cursor.execute(
            "SELECT set_config('app.current_tenant', %s, true)",
            (tenant_id,)
        )

        # Set statement timeout per tenant tier
        timeout = self.get_tenant_timeout(tenant_id)
        cursor.execute(
            "SELECT set_config('statement_timeout', %s, true)",
            (str(timeout),)
        )

        return conn

    def release(self, conn):
        # Return the connection to the pool when the request ends;
        # leaking pooled connections exhausts maxconn quickly
        self.pool.putconn(conn)

tenant_isolation = TenantIsolation()

def require_tenant(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract tenant from JWT
        token = request.headers.get('Authorization', '').replace('Bearer ', '')

        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            tenant_id = payload['tenant_id']
            g.tenant_id = tenant_id
            g.db_conn = tenant_isolation.get_tenant_connection(tenant_id)
        except jwt.InvalidTokenError:
            return {'error': 'Invalid token'}, 401

        return f(*args, **kwargs)
    return decorated_function

# Usage in routes
@app.route('/api/products')
@require_tenant
def get_products():
    # Queries automatically filtered by tenant
    cursor = g.db_conn.cursor()
    cursor.execute("SELECT * FROM products")  # RLS handles filtering
    return cursor.fetchall()

3. Redis Multi-Tenant Caching Strategy

# cache/tenant_cache.py
import json
from typing import Optional, Any

import redis

class TenantCache:
    def __init__(self):
        # redis-py has no connection_pool_kwargs argument; extra
        # keyword arguments are forwarded to the ConnectionPool
        self.redis = redis.Redis(
            host='redis-cluster.aws.internal',
            port=6379,
            decode_responses=True,
            max_connections=100,
            socket_keepalive=True
        )

    def get_tenant_key(self, tenant_id: str, key: str) -> str:
        """Generate tenant-specific key"""
        return f"tenant:{tenant_id}:{key}"

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        full_key = self.get_tenant_key(tenant_id, key)
        value = self.redis.get(full_key)
        return json.loads(value) if value else None

    def set(self, tenant_id: str, key: str, value: Any,
            ttl: int = 3600) -> None:
        full_key = self.get_tenant_key(tenant_id, key)

        # Check tenant's cache quota
        if not self.check_tenant_quota(tenant_id):
            # Evict oldest tenant keys if quota exceeded
            self.evict_tenant_keys(tenant_id)

        self.redis.setex(
            full_key,
            ttl,
            json.dumps(value)
        )

    def check_tenant_quota(self, tenant_id: str) -> bool:
        """Check if tenant is within cache quota"""
        # SCAN walks the keyspace, so this is O(total keys);
        # keep a per-tenant counter instead if set() is hot
        pattern = f"tenant:{tenant_id}:*"
        count = sum(1 for _ in self.redis.scan_iter(match=pattern, count=100))

        # Different quotas per tier
        tier = self.get_tenant_tier(tenant_id)
        quotas = {
            'free': 1000,
            'pro': 10000,
            'enterprise': 100000
        }

        return count < quotas.get(tier, 1000)
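`set()` calls `evict_tenant_keys`, which isn't shown above. A minimal sketch of the selection half of that logic, kept pure so the Redis round-trips stay at the call site: rank the tenant's keys by idle time (as you would gather with `OBJECT IDLETIME` while iterating `scan_iter`) and evict the longest-idle ones. The function name and inputs here are illustrative, not from our production code.

```python
from typing import Dict, List

def select_keys_to_evict(key_idle_seconds: Dict[str, int],
                         over_quota_by: int) -> List[str]:
    """Pick the keys that have been idle longest (approximate LRU).

    key_idle_seconds maps each of the tenant's keys to its idle time
    in seconds; over_quota_by is how many keys must go.
    """
    ranked = sorted(key_idle_seconds, key=key_idle_seconds.get, reverse=True)
    return ranked[:over_quota_by]

# Example: tenant is 2 keys over quota
idle = {'tenant:42:a': 30, 'tenant:42:b': 900, 'tenant:42:c': 300}
print(select_keys_to_evict(idle, 2))  # ['tenant:42:b', 'tenant:42:c']
```

The caller then issues a single `DELETE`/`UNLINK` for the returned keys.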

4. Performance Isolation & Resource Limits

# resource_management/tenant_limits.py
import asyncio
from asyncio import Semaphore
from dataclasses import dataclass
from typing import Dict

import jwt
from flask import request

@dataclass
class TenantLimits:
    max_concurrent_requests: int
    max_db_connections: int
    max_cpu_seconds: int
    max_memory_mb: int
    api_rate_limit: int  # requests per minute

class ResourceManager:
    def __init__(self):
        self.tenant_limits: Dict[str, TenantLimits] = {}
        self.semaphores: Dict[str, Semaphore] = {}

    def get_tenant_limits(self, tenant_id: str) -> TenantLimits:
        tier = self.get_tenant_tier(tenant_id)

        limits = {
            'free': TenantLimits(
                max_concurrent_requests=10,
                max_db_connections=5,
                max_cpu_seconds=60,
                max_memory_mb=512,
                api_rate_limit=100
            ),
            'pro': TenantLimits(
                max_concurrent_requests=50,
                max_db_connections=20,
                max_cpu_seconds=300,
                max_memory_mb=2048,
                api_rate_limit=1000
            ),
            'enterprise': TenantLimits(
                max_concurrent_requests=200,
                max_db_connections=100,
                max_cpu_seconds=3600,
                max_memory_mb=8192,
                api_rate_limit=10000
            )
        }

        return limits.get(tier, limits['free'])

    async def acquire_resource(self, tenant_id: str):
        """Acquire semaphore for tenant request"""
        if tenant_id not in self.semaphores:
            limits = self.get_tenant_limits(tenant_id)
            self.semaphores[tenant_id] = Semaphore(
                limits.max_concurrent_requests
            )

        return await self.semaphores[tenant_id].acquire()

    def release_resource(self, tenant_id: str):
        """Release the permit once the request completes; without
        this, failed requests leak concurrency budget"""
        self.semaphores[tenant_id].release()

# API Rate Limiting
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

def get_tenant_from_request():
    """Extract tenant ID for rate limiting"""
    token = request.headers.get('Authorization', '')
    if token:
        try:
            payload = jwt.decode(token.replace('Bearer ', ''),
                                 SECRET_KEY, algorithms=['HS256'])
            return payload.get('tenant_id', get_remote_address())
        except jwt.InvalidTokenError:
            pass
    return get_remote_address()

limiter = Limiter(
    get_tenant_from_request,  # key_func is positional in flask-limiter 3.x
    app=app,
    default_limits=["100 per minute"],  # Free tier
    storage_uri="redis://redis-cluster.aws.internal:6379"
)

# flask-limiter accepts a callable limit, so the tier lookup runs per
# request; simpler and correct versus stacking conditional decorators
TIER_LIMITS = {
    'free': "100 per minute",
    'pro': "1000 per minute",
    'enterprise': "10000 per minute",
}

@app.route('/api/data')
@limiter.limit(lambda: TIER_LIMITS.get(get_current_tenant_tier(), "100 per minute"))
def get_data():
    pass
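One caveat with calling `acquire()` directly: if a request raises before releasing, the permit leaks and that tenant's concurrency budget shrinks until restart. A self-contained sketch of the same per-tenant pattern using `async with`, which releases on every exit path (the class and fixed limit here are simplified stand-ins for `ResourceManager` and its tier lookup):

```python
import asyncio
from asyncio import Semaphore
from typing import Dict

class MiniResourceManager:
    """Simplified stand-in: one fixed concurrency limit per tenant."""
    def __init__(self, max_concurrent: int = 2):
        self.max_concurrent = max_concurrent
        self.semaphores: Dict[str, Semaphore] = {}

    def semaphore_for(self, tenant_id: str) -> Semaphore:
        if tenant_id not in self.semaphores:
            self.semaphores[tenant_id] = Semaphore(self.max_concurrent)
        return self.semaphores[tenant_id]

    async def run(self, tenant_id: str, coro):
        # async with releases the permit even if coro raises
        async with self.semaphore_for(tenant_id):
            return await coro

async def main():
    rm = MiniResourceManager(max_concurrent=2)
    # 5 concurrent requests for one tenant, at most 2 in flight at once
    return await asyncio.gather(
        *[rm.run("tenant-a", asyncio.sleep(0, result=i)) for i in range(5)]
    )

print(asyncio.run(main()))  # [0, 1, 2, 3, 4]
```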

5. Security & Compliance Implementation

# security/tenant_security.py
import base64
import json
import time
from datetime import datetime
from typing import Dict

import boto3
from cryptography.fernet import Fernet
from flask import request

class TenantSecurity:
    def __init__(self):
        self.kms = boto3.client('kms')
        self.tenant_keys: Dict[str, bytes] = {}

    def get_tenant_encryption_key(self, tenant_id: str) -> bytes:
        """Get or create tenant-specific encryption key"""
        if tenant_id not in self.tenant_keys:
            # Generate tenant-specific key using KMS
            response = self.kms.generate_data_key(
                KeyId='arn:aws:kms:us-east-1:xxx:key/xxx',
                KeySpec='AES_256',
                EncryptionContext={
                    'tenant_id': tenant_id,
                    'purpose': 'tenant_data_encryption'
                }
            )

            # Fernet needs a urlsafe-base64-encoded 32-byte key;
            # passing the raw KMS plaintext raises a ValueError
            self.tenant_keys[tenant_id] = base64.urlsafe_b64encode(
                response['Plaintext']
            )

            # Store encrypted key in database
            self.store_encrypted_key(
                tenant_id,
                response['CiphertextBlob']
            )

        return self.tenant_keys[tenant_id]

    def encrypt_tenant_data(self, tenant_id: str, data: str) -> str:
        """Encrypt data with tenant-specific key"""
        key = self.get_tenant_encryption_key(tenant_id)
        f = Fernet(key)
        return f.encrypt(data.encode()).decode()

    def audit_log(self, tenant_id: str, action: str,
                  resource: str, user_id: str):
        """Tenant-specific audit logging"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'tenant_id': tenant_id,
            'user_id': user_id,
            'action': action,
            'resource': resource,
            'ip_address': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        }

        # Write to tenant-specific CloudWatch log stream
        # (create_log_stream must have run once for this tenant)
        cloudwatch = boto3.client('logs')
        cloudwatch.put_log_events(
            logGroupName='/aws/saas/audit',
            logStreamName=f'tenant-{tenant_id}',
            logEvents=[{
                'timestamp': int(time.time() * 1000),
                'message': json.dumps(log_entry)
            }]
        )

6. Monitoring & Alerting Per Tenant

# monitoring/tenant_metrics.py
import boto3
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TenantMetrics:
    tenant_id: str
    api_requests: int
    api_errors: int
    db_queries: int
    cache_hits: int
    cache_misses: int
    response_time_ms: float
    cpu_usage: float
    memory_usage_mb: int

class TenantMonitoring:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.metrics_buffer: List[TenantMetrics] = []

    def record_metric(self, tenant_id: str, metric_name: str,
                     value: float, unit: str = 'Count'):
        """Record tenant-specific metric"""
        self.cloudwatch.put_metric_data(
            Namespace='SaaS/Tenant',
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Dimensions': [
                        {
                            'Name': 'TenantId',
                            'Value': tenant_id
                        },
                        {
                            'Name': 'Tier',
                            'Value': self.get_tenant_tier(tenant_id)
                        }
                    ],
                    'Value': value,
                    'Unit': unit,
                    # boto3 expects a datetime here, not a Unix float
                    'Timestamp': datetime.utcnow()
                }
            ]
        )

    def create_tenant_alarms(self, tenant_id: str):
        """Create CloudWatch alarms per tenant"""
        tier = self.get_tenant_tier(tenant_id)

        # Different thresholds per tier
        thresholds = {
            'free': {'error_rate': 0.05, 'response_time': 1000},
            'pro': {'error_rate': 0.02, 'response_time': 500},
            'enterprise': {'error_rate': 0.01, 'response_time': 200}
        }

        threshold = thresholds.get(tier, thresholds['free'])

        # Error rate alarm
        self.cloudwatch.put_metric_alarm(
            AlarmName=f'tenant-{tenant_id}-high-error-rate',
            ComparisonOperator='GreaterThanThreshold',
            EvaluationPeriods=2,
            MetricName='ErrorRate',
            Namespace='SaaS/Tenant',
            Period=300,
            Statistic='Average',
            Threshold=threshold['error_rate'],
            ActionsEnabled=True,
            AlarmActions=[SNS_TOPIC_ARN],
            AlarmDescription=f'High error rate for tenant {tenant_id}'
        )
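`TenantMonitoring` declares `metrics_buffer` but never flushes it, and calling PutMetricData once per metric gets expensive across 200 tenants. A sketch of batched flushing: the chunking is the real logic, `flush_metrics` mirrors the `put_metric_data` call above, and the batch size of 20 is a conservative choice rather than a hard API limit.

```python
from typing import Iterator, List

def chunked(items: List, size: int) -> Iterator[List]:
    """Split a list into batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def flush_metrics(cloudwatch, buffer: List[dict], batch_size: int = 20):
    """Send buffered metric dicts (already in PutMetricData shape)
    in batches, then clear the buffer."""
    for batch in chunked(buffer, batch_size):
        cloudwatch.put_metric_data(Namespace='SaaS/Tenant', MetricData=batch)
    buffer.clear()

# Chunking behaviour for a 45-metric buffer:
print([len(b) for b in chunked(list(range(45)), 20)])  # [20, 20, 5]
```

Flushing on a timer (or at request teardown) turns hundreds of API calls per minute into a handful.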

Migration Strategy: From Multi-Instance to Multi-Tenant

Phase 1: Data Migration (Month 1-2)

# migration/tenant_migration.py
import uuid
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from sqlalchemy import create_engine

def migrate_customer_to_multitenant(customer_id: str,
                                    instance_db_url: str):
    """Migrate single customer from instance to multi-tenant"""

    # 1. Generate tenant UUID
    tenant_id = str(uuid.uuid4())

    # 2. Connect to customer's instance database
    source_engine = create_engine(instance_db_url)

    # 3. Connect to multi-tenant database
    # (pandas.to_sql needs an SQLAlchemy engine, not a raw
    # psycopg2 connection)
    target_engine = create_engine(MULTITENANT_DB_URL)

    # 4. Migrate schema with tenant_id
    tables = ['users', 'products', 'orders', 'payments']

    for table in tables:
        # Read from instance
        df = pd.read_sql(f"SELECT * FROM {table}", source_engine)

        # Add tenant_id column
        df['tenant_id'] = tenant_id

        # Write to multi-tenant
        df.to_sql(table, target_engine, if_exists='append',
                  index=False, method='multi')

    # 5. Migrate Redis data
    migrate_redis_data(customer_id, tenant_id)

    # 6. Update customer record
    update_customer_tenant_mapping(customer_id, tenant_id)

    return tenant_id

# Parallel migration script

def batch_migration():
    customers = get_all_customers()

    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for customer in customers:
            future = executor.submit(
                migrate_customer_to_multitenant,
                customer['id'],
                customer['instance_db_url']
            )
            futures.append(future)

        # Track progress
        for future in as_completed(futures):
            result = future.result()
            print(f"Migrated customer to tenant {result}")

Phase 2: Application Cutover (Month 3-4)

Cutover Strategy:
  1. Deploy multi-tenant application
  2. Configure load balancer for gradual migration
  3. Route customers by tier:
     - Start with free tier (low risk)
     - Move to pro tier
     - Finally migrate enterprise
  4. Monitor performance and errors
  5. Rollback capability per tenant
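The gradual routing in step 3 amounts to a weight schedule for the load balancer: ALB weighted target groups take integer weights, and rollback is just re-weighting back to the legacy stack. The percentages below are an illustrative ramp, not the exact one we used.

```python
from typing import List, Tuple

def cutover_schedule(steps: List[int]) -> List[Tuple[int, int]]:
    """Return (multi_tenant_weight, legacy_weight) pairs per step.

    Each step is the percentage of a tier's traffic routed to the
    new multi-tenant stack; the remainder stays on the dedicated
    per-customer instances.
    """
    return [(pct, 100 - pct) for pct in steps]

# e.g. a conservative ramp, applied tier by tier (free, then pro,
# then enterprise); each pair feeds the ALB's target-group weights
print(cutover_schedule([5, 25, 50, 100]))
# [(5, 95), (25, 75), (50, 50), (100, 0)]
```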

Phase 3: Infrastructure Cleanup (Month 5-6)

#!/bin/bash
# cleanup_instances.sh

# List all customer instances
INSTANCES=$(aws ec2 describe-instances \
  --filters "Name=tag:Type,Values=customer-instance" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text)

# Terminate instances after verification
for INSTANCE_ID in $INSTANCES; do
  CUSTOMER_ID=$(aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$INSTANCE_ID" \
              "Name=key,Values=CustomerId" \
    --query "Tags[0].Value" --output text)

  # Verify customer is migrated (verify_customer_migrated is a helper
  # sourced from our migration tooling, not shown here)
  if verify_customer_migrated "$CUSTOMER_ID"; then
    echo "Terminating instance $INSTANCE_ID for customer $CUSTOMER_ID"
    aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
  fi
done

# Clean up RDS instances
# Clean up Redis clusters
# Remove unused VPC resources

Performance Improvements

Despite consolidation, performance actually improved:

Response Time Comparison

Multi-Instance (p50/p95/p99):
- API Latency: 250ms / 800ms / 1500ms
- Database Query: 50ms / 200ms / 500ms

Multi-Tenant (p50/p95/p99):
- API Latency: 120ms / 350ms / 600ms  (52% faster p50)
- Database Query: 20ms / 80ms / 150ms  (60% faster p50)

Why Performance Improved:

  1. Better Resource Utilization: Larger instances with consistent CPU performance
  2. Optimized Connection Pooling: PgBouncer reduced connection overhead
  3. Shared Cache Benefits: Higher cache hit rates with shared Redis
  4. Aurora Performance: Better than individual RDS instances
  5. Reduced Network Hops: Single VPC, less inter-AZ traffic

Lessons Learned

What Worked Well:

  1. Row-Level Security: PostgreSQL RLS provided bulletproof isolation
  2. Gradual Migration: Customer-by-customer approach minimized risk
  3. Tier-Based Limits: Different resource limits per customer tier
  4. Shared Caching: 85% cache hit rate vs 40% in isolated instances

Challenges We Faced:

  1. Noisy Neighbor: One customer running heavy queries affected others
    • Solution: Statement timeouts and resource governors
  2. Compliance Concerns: Some enterprise customers required isolation
    • Solution: Hybrid model - keep top 10 enterprises on dedicated
  3. Migration Complexity: Data migration took longer than expected
    • Solution: Built automated migration tools and parallel processing

What We'd Do Differently:

  1. Start with multi-tenant from day one for new products
  2. Implement better resource isolation earlier
  3. Build migration tools before starting
  4. Keep better metrics on per-customer resource usage

Security & Compliance Maintained

Despite shared infrastructure, we maintained security certifications:

  • SOC 2 Type II: Passed with zero findings
  • HIPAA Compliant: With proper BAA and encryption
  • GDPR Ready: Data isolation and right-to-deletion implemented
  • PCI DSS: Tokenization and proper segmentation

Key security features:

  • Tenant data encryption with unique keys
  • Complete audit logging per tenant
  • Data residency controls
  • Automated compliance reporting

The Hybrid Approach: Best of Both Worlds

We kept 10 enterprise customers on dedicated instances:

Hybrid Architecture:
  Multi-Tenant (190 customers):
    - Shared infrastructure
    - $8,000/month cost
    - Automated management

  Multi-Instance (10 enterprises):
    - Dedicated resources
    - $2,000/month cost
    - Premium pricing justified
    - Compliance requirements met

ROI Analysis

Investment:
- Engineering time: 3 engineers × 6 months = $180,000
- Migration tools & automation: $20,000
- Monitoring & security tools: $30,000
Total Investment: $230,000

Returns:
- Monthly savings: $35,000
- Annual savings: $420,000
- Payback period: 6.6 months
- 3-year savings: $1,260,000

Additional Benefits:
- Reduced operational overhead (5 hours/week saved)
- Faster feature deployment (1 deployment vs 200)
- Better resource utilization (70% vs 20%)
- Improved system reliability (99.99% vs 99.9%)
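The payback arithmetic checks out against the figures above:

```python
investment = 230_000      # engineering + tooling + security
monthly_savings = 35_000

payback_months = investment / monthly_savings
three_year_savings = monthly_savings * 36

print(f"Payback: {payback_months:.1f} months")     # 6.6 months
print(f"3-year savings: ${three_year_savings:,}")  # $1,260,000
```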

Action Items: Your Migration Checklist

If you're considering multi-tenant architecture:

  1. Analyze Current Costs

    • Document per-customer infrastructure costs
    • Identify resource utilization patterns
    • Calculate potential savings
  2. Design Tenant Isolation

    • Choose isolation strategy (database, schema, row-level)
    • Implement proper authentication/authorization
    • Plan resource limits per tenant
  3. Build Migration Tools

    • Automated data migration scripts
    • Rollback procedures
    • Performance testing framework
  4. Implement Gradually

    • Start with non-critical customers
    • Monitor performance closely
    • Keep rollback options ready
  5. Maintain Security

    • Implement encryption per tenant
    • Audit logging and monitoring
    • Regular security assessments

Conclusion

Multi-tenant architecture isn't just about cost savings; it's about building a sustainable, scalable SaaS platform. Our 78% cost reduction enabled us to:

  • Invest more in product development
  • Offer competitive pricing
  • Improve system performance
  • Scale to 500+ customers without linear cost growth

The key is maintaining security and performance while consolidating resources. With proper planning and implementation, multi-tenant architecture can transform your SaaS economics without compromising on quality.


Top comments (1)

Emil

Great read and well written. How do you handle noisy neighbors? So if one tenant consumes way more resources than others. Is rate limiting enough?
