Running a SaaS platform with 200+ enterprise customers, we were bleeding $45,000/month on AWS infrastructure. Each customer ran on its own isolated stack: separate EC2 instances, RDS databases, and Redis clusters. While this gave us perfect isolation and simplified billing, the AWS bill was becoming unsustainable.
After a 6-month migration to a hybrid multi-tenant architecture, we've reduced costs to $10,000/month (78% reduction) while actually improving performance and maintaining enterprise-grade security. Here's exactly how we did it.
The Numbers That Matter
Before: Multi-Instance Architecture
Monthly Costs (200 customers):
- EC2 instances (200 × t3.medium): $6,000
- RDS PostgreSQL (200 × db.t3.small): $14,000
- Redis clusters (200 × cache.t3.micro): $5,000
- ALB + Target Groups (200 × $20): $4,000
- NAT Gateways (200 × $45): $9,000
- CloudWatch + Logs: $3,000
- Data Transfer: $4,000
Total: $45,000/month ($540,000/year)
After: Multi-Tenant Architecture
Monthly Costs (200 customers):
- EC2 instances (6 × m5.2xlarge): $1,400
- Aurora PostgreSQL Cluster: $2,500
- Redis Cluster (3 nodes): $800
- Single ALB: $25
- NAT Gateway (1): $45
- CloudWatch + Enhanced Monitoring: $800
- Data Transfer (optimized): $500
- Security & Compliance Tools: $930
- Backup & DR: $2,000
Total: $10,000/month ($120,000/year)
Savings: $35,000/month (78% reduction)
Architecture Comparison
Multi-Instance (Original)
Customer_A:
- Dedicated EC2 instance
- Dedicated RDS database
- Dedicated Redis cache
- Isolated VPC subnet
- Separate CloudWatch namespace
Customer_B:
- [Same isolated stack]
Problem: 200× infrastructure overhead
Multi-Tenant (New)
Shared_Infrastructure:
Application_Tier:
- 6 × m5.2xlarge EC2 (Auto Scaling 4-8)
- Shared across all tenants
- Tenant isolation via JWT tokens
Database_Tier:
- Aurora PostgreSQL cluster
- Row-Level Security (RLS)
- Tenant-specific schemas
- Connection pooling (PgBouncer)
Cache_Tier:
- Redis Cluster (3 nodes)
- Keyspace separation (tenant:{id}:*)
- LRU eviction per tenant limits
Implementation: The Technical Details
1. Database Multi-Tenancy with PostgreSQL RLS
-- Enable Row Level Security
ALTER TABLE products ENABLE ROW LEVEL SECURITY;
-- Add tenant_id to all tables
ALTER TABLE products ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_products_tenant ON products(tenant_id);
-- Create security policy
CREATE POLICY tenant_isolation ON products
FOR ALL
TO application_role
USING (tenant_id = current_setting('app.current_tenant')::UUID);
-- Set tenant context in application
SET LOCAL app.current_tenant = '123e4567-e89b-12d3-a456-426614174000';
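Before trusting RLS in production it's worth confirming that the policy really blocks cross-tenant reads. Here's a minimal smoke-test sketch using psycopg2, run as application_role (RLS does not apply to superusers or the table owner unless FORCE ROW LEVEL SECURITY is set); the connection string and the second tenant UUID are placeholders, not values from our environment.

# rls_smoke_test.py - minimal isolation check (illustrative values)
import psycopg2

conn = psycopg2.connect("dbname=saas user=application_role host=localhost")

with conn:
    with conn.cursor() as cur:
        # Act as tenant A and count visible rows
        cur.execute("SET LOCAL app.current_tenant = %s",
                    ('123e4567-e89b-12d3-a456-426614174000',))
        cur.execute("SELECT count(*) FROM products")
        rows_a = cur.fetchone()[0]

with conn:
    with conn.cursor() as cur:
        # Switch to a different tenant; the same query should only see its rows
        cur.execute("SET LOCAL app.current_tenant = %s",
                    ('00000000-0000-0000-0000-000000000001',))
        cur.execute("SELECT count(*) FROM products")
        rows_b = cur.fetchone()[0]

print(f"tenant A sees {rows_a} rows, tenant B sees {rows_b} rows")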
2. Application Layer Tenant Isolation
# middleware/tenant_isolation.py
import os
import jwt
from functools import wraps
from flask import request, g
import psycopg2.pool

class TenantIsolation:
    def __init__(self):
        self.pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=20,
            maxconn=100,
            host=os.getenv('DB_HOST'),
            database=os.getenv('DB_NAME'),
            user=os.getenv('DB_USER'),
            password=os.getenv('DB_PASSWORD')
        )

    def get_tenant_connection(self, tenant_id):
        conn = self.pool.getconn()
        cursor = conn.cursor()
        # Set tenant context for RLS
        cursor.execute(
            "SET LOCAL app.current_tenant = %s",
            (tenant_id,)
        )
        # Set statement timeout per tenant tier (helper looks up the plan)
        timeout = self.get_tenant_timeout(tenant_id)
        cursor.execute(
            "SET LOCAL statement_timeout = %s",
            (timeout,)
        )
        return conn

tenant_isolation = TenantIsolation()

def require_tenant(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract tenant from JWT
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            tenant_id = payload['tenant_id']
            g.tenant_id = tenant_id
            g.db_conn = tenant_isolation.get_tenant_connection(tenant_id)
        except jwt.InvalidTokenError:
            return {'error': 'Invalid token'}, 401
        return f(*args, **kwargs)
    return decorated_function

# Usage in routes
@app.route('/api/products')
@require_tenant
def get_products():
    # Queries automatically filtered by tenant
    cursor = g.db_conn.cursor()
    cursor.execute("SELECT * FROM products")  # RLS handles filtering
    return cursor.fetchall()
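One caveat with this pool-per-request pattern: the connection checked out in require_tenant has to go back to the pool when the request finishes, or the pool drains under load. A minimal sketch of handing it back with Flask's teardown hook; the rollback also ends the transaction and discards the SET LOCAL tenant context.

@app.teardown_appcontext
def release_tenant_connection(exception=None):
    # Return the per-request connection to the pool, if one was checked out
    conn = g.pop('db_conn', None)
    if conn is not None:
        conn.rollback()  # end the transaction, clearing SET LOCAL state
        tenant_isolation.pool.putconn(conn)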
3. Redis Multi-Tenant Caching Strategy
# cache/tenant_cache.py
import redis
from typing import Optional, Any
import json

class TenantCache:
    def __init__(self):
        self.redis = redis.Redis(
            host='redis-cluster.aws.internal',
            port=6379,
            decode_responses=True,
            max_connections=100,
            socket_keepalive=True
        )

    def get_tenant_key(self, tenant_id: str, key: str) -> str:
        """Generate tenant-specific key"""
        return f"tenant:{tenant_id}:{key}"

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        full_key = self.get_tenant_key(tenant_id, key)
        value = self.redis.get(full_key)
        return json.loads(value) if value else None

    def set(self, tenant_id: str, key: str, value: Any,
            ttl: int = 3600) -> None:
        full_key = self.get_tenant_key(tenant_id, key)
        # Check tenant's cache quota
        if not self.check_tenant_quota(tenant_id):
            # Evict the tenant's oldest keys if quota exceeded (see sketch below)
            self.evict_tenant_keys(tenant_id)
        self.redis.setex(
            full_key,
            ttl,
            json.dumps(value)
        )

    def check_tenant_quota(self, tenant_id: str) -> bool:
        """Check if tenant is within cache quota"""
        pattern = f"tenant:{tenant_id}:*"
        keys = self.redis.scan_iter(match=pattern, count=100)
        count = sum(1 for _ in keys)
        # Different quotas per tier (get_tenant_tier looks up the plan)
        tier = self.get_tenant_tier(tenant_id)
        quotas = {
            'free': 1000,
            'pro': 10000,
            'enterprise': 100000
        }
        return count < quotas.get(tier, 1000)
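The evict_tenant_keys call above is where the per-tenant LRU behaviour actually lives. One way it could look as a TenantCache method, using Redis OBJECT IDLETIME to drop the tenant's least-recently-touched keys first; the batch size of 100 is an arbitrary choice for the sketch, not something tuned in our setup.

# Sketch of a TenantCache method
def evict_tenant_keys(self, tenant_id: str, batch: int = 100) -> None:
    """Evict the tenant's least-recently-used keys to get back under quota."""
    pattern = f"tenant:{tenant_id}:*"
    candidates = []
    for key in self.redis.scan_iter(match=pattern, count=100):
        idle = self.redis.object('idletime', key)  # seconds since last access
        candidates.append((idle or 0, key))
    # Highest idle time first = least recently used
    candidates.sort(reverse=True)
    stale_keys = [key for _, key in candidates[:batch]]
    if stale_keys:
        self.redis.delete(*stale_keys)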
4. Performance Isolation & Resource Limits
# resource_management/tenant_limits.py
from dataclasses import dataclass
from typing import Dict
import asyncio
from asyncio import Semaphore

@dataclass
class TenantLimits:
    max_concurrent_requests: int
    max_db_connections: int
    max_cpu_seconds: int
    max_memory_mb: int
    api_rate_limit: int  # requests per minute

class ResourceManager:
    def __init__(self):
        self.tenant_limits: Dict[str, TenantLimits] = {}
        self.semaphores: Dict[str, Semaphore] = {}

    def get_tenant_limits(self, tenant_id: str) -> TenantLimits:
        tier = self.get_tenant_tier(tenant_id)
        limits = {
            'free': TenantLimits(
                max_concurrent_requests=10,
                max_db_connections=5,
                max_cpu_seconds=60,
                max_memory_mb=512,
                api_rate_limit=100
            ),
            'pro': TenantLimits(
                max_concurrent_requests=50,
                max_db_connections=20,
                max_cpu_seconds=300,
                max_memory_mb=2048,
                api_rate_limit=1000
            ),
            'enterprise': TenantLimits(
                max_concurrent_requests=200,
                max_db_connections=100,
                max_cpu_seconds=3600,
                max_memory_mb=8192,
                api_rate_limit=10000
            )
        }
        return limits.get(tier, limits['free'])

    async def acquire_resource(self, tenant_id: str):
        """Acquire semaphore for tenant request"""
        if tenant_id not in self.semaphores:
            limits = self.get_tenant_limits(tenant_id)
            self.semaphores[tenant_id] = Semaphore(
                limits.max_concurrent_requests
            )
        return await self.semaphores[tenant_id].acquire()
# API Rate Limiting
import jwt
from flask import request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

def get_tenant_from_request():
    """Extract tenant ID for rate limiting"""
    token = request.headers.get('Authorization', '')
    if token:
        try:
            payload = jwt.decode(token.replace('Bearer ', ''),
                                 SECRET_KEY, algorithms=['HS256'])
            return payload.get('tenant_id', get_remote_address())
        except jwt.InvalidTokenError:
            pass
    return get_remote_address()

limiter = Limiter(
    key_func=get_tenant_from_request,
    app=app,
    default_limits=["100 per minute"],  # Free tier
    storage_uri="redis://redis-cluster.aws.internal:6379"
)

# Apply higher limits per tier; is_pro_tier/is_enterprise_tier check the
# current tenant's plan, and the stricter default applies to everyone else
@app.route('/api/data')
@limiter.limit("1000 per minute", exempt_when=lambda: not is_pro_tier())
@limiter.limit("10000 per minute", exempt_when=lambda: not is_enterprise_tier())
def get_data():
    pass
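The semaphore in ResourceManager caps in-flight requests per tenant, but it only helps if every code path releases it. A small async context manager is one way to make that hard to forget; this is a sketch assuming an async request handler (for example under an ASGI framework), not the synchronous Flask routes shown earlier.

from contextlib import asynccontextmanager

@asynccontextmanager
async def tenant_slot(manager: ResourceManager, tenant_id: str):
    """Hold one of the tenant's concurrent-request slots for the duration."""
    await manager.acquire_resource(tenant_id)
    try:
        yield
    finally:
        manager.semaphores[tenant_id].release()

# Usage inside an async handler
async def handle_request(manager: ResourceManager, tenant_id: str):
    async with tenant_slot(manager, tenant_id):
        ...  # do the tenant's work while holding the slot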
5. Security & Compliance Implementation
# security/tenant_security.py
import base64
import json
import time
from datetime import datetime
from typing import Dict

import boto3
from cryptography.fernet import Fernet
from flask import request

class TenantSecurity:
    def __init__(self):
        self.kms = boto3.client('kms')
        self.tenant_keys: Dict[str, bytes] = {}

    def get_tenant_encryption_key(self, tenant_id: str) -> bytes:
        """Get or create tenant-specific encryption key"""
        if tenant_id not in self.tenant_keys:
            # Generate tenant-specific data key using KMS
            response = self.kms.generate_data_key(
                KeyId='arn:aws:kms:us-east-1:xxx:key/xxx',
                KeySpec='AES_256',
                EncryptionContext={
                    'tenant_id': tenant_id,
                    'purpose': 'tenant_data_encryption'
                }
            )
            # Fernet expects a 32-byte key encoded as URL-safe base64
            self.tenant_keys[tenant_id] = base64.urlsafe_b64encode(
                response['Plaintext']
            )
            # Store the KMS-encrypted copy of the key in the database
            self.store_encrypted_key(
                tenant_id,
                response['CiphertextBlob']
            )
        return self.tenant_keys[tenant_id]

    def encrypt_tenant_data(self, tenant_id: str, data: str) -> str:
        """Encrypt data with tenant-specific key"""
        key = self.get_tenant_encryption_key(tenant_id)
        f = Fernet(key)
        return f.encrypt(data.encode()).decode()

    def audit_log(self, tenant_id: str, action: str,
                  resource: str, user_id: str):
        """Tenant-specific audit logging"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'tenant_id': tenant_id,
            'user_id': user_id,
            'action': action,
            'resource': resource,
            'ip_address': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        }
        # Write to tenant-specific CloudWatch log stream
        cloudwatch = boto3.client('logs')
        cloudwatch.put_log_events(
            logGroupName='/aws/saas/audit',
            logStreamName=f'tenant-{tenant_id}',
            logEvents=[{
                'timestamp': int(time.time() * 1000),
                'message': json.dumps(log_entry)
            }]
        )
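Encryption is only half the story: after an application restart the in-memory data key is gone and has to be recovered from the stored CiphertextBlob. A sketch of the decrypt path as another TenantSecurity method, assuming a load_encrypted_key helper that reads back what store_encrypted_key saved (that helper name is ours, not an AWS API).

# Sketch of a TenantSecurity method
def decrypt_tenant_data(self, tenant_id: str, token: str) -> str:
    """Decrypt data with the tenant's key, recovering it via KMS if needed."""
    if tenant_id not in self.tenant_keys:
        ciphertext = self.load_encrypted_key(tenant_id)  # hypothetical helper
        response = self.kms.decrypt(
            CiphertextBlob=ciphertext,
            # Must match the context used in generate_data_key
            EncryptionContext={
                'tenant_id': tenant_id,
                'purpose': 'tenant_data_encryption'
            }
        )
        self.tenant_keys[tenant_id] = base64.urlsafe_b64encode(
            response['Plaintext']
        )
    f = Fernet(self.tenant_keys[tenant_id])
    return f.decrypt(token.encode()).decode()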
6. Monitoring & Alerting Per Tenant
# monitoring/tenant_metrics.py
import boto3
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TenantMetrics:
    tenant_id: str
    api_requests: int
    api_errors: int
    db_queries: int
    cache_hits: int
    cache_misses: int
    response_time_ms: float
    cpu_usage: float
    memory_usage_mb: int

class TenantMonitoring:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.metrics_buffer: List[TenantMetrics] = []

    def record_metric(self, tenant_id: str, metric_name: str,
                      value: float, unit: str = 'Count'):
        """Record tenant-specific metric"""
        self.cloudwatch.put_metric_data(
            Namespace='SaaS/Tenant',
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Dimensions': [
                        {
                            'Name': 'TenantId',
                            'Value': tenant_id
                        },
                        {
                            'Name': 'Tier',
                            'Value': self.get_tenant_tier(tenant_id)
                        }
                    ],
                    'Value': value,
                    'Unit': unit,
                    'Timestamp': datetime.utcnow()
                }
            ]
        )

    def create_tenant_alarms(self, tenant_id: str):
        """Create CloudWatch alarms per tenant"""
        tier = self.get_tenant_tier(tenant_id)
        # Different thresholds per tier
        thresholds = {
            'free': {'error_rate': 0.05, 'response_time': 1000},
            'pro': {'error_rate': 0.02, 'response_time': 500},
            'enterprise': {'error_rate': 0.01, 'response_time': 200}
        }
        threshold = thresholds.get(tier, thresholds['free'])
        # Error rate alarm
        self.cloudwatch.put_metric_alarm(
            AlarmName=f'tenant-{tenant_id}-high-error-rate',
            ComparisonOperator='GreaterThanThreshold',
            EvaluationPeriods=2,
            MetricName='ErrorRate',
            Namespace='SaaS/Tenant',
            Period=300,
            Statistic='Average',
            Threshold=threshold['error_rate'],
            ActionsEnabled=True,
            AlarmActions=[SNS_TOPIC_ARN],
            AlarmDescription=f'High error rate for tenant {tenant_id}'
        )
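One CloudWatch API call per metric gets expensive with 200 tenants, which is why the class carries a metrics_buffer. A sketch of how that buffer could be flushed in batches (MetricData accepts multiple entries per call; the batch size of 20 is a conservative choice, and only one field of TenantMetrics is mapped here for brevity):

# Sketch of a TenantMonitoring method
def flush_metrics(self, batch_size: int = 20) -> None:
    """Send buffered TenantMetrics to CloudWatch in batched API calls."""
    while self.metrics_buffer:
        batch = self.metrics_buffer[:batch_size]
        self.metrics_buffer = self.metrics_buffer[batch_size:]
        metric_data = [
            {
                'MetricName': 'ResponseTime',
                'Dimensions': [{'Name': 'TenantId', 'Value': m.tenant_id}],
                'Value': m.response_time_ms,
                'Unit': 'Milliseconds',
                'Timestamp': datetime.utcnow()
            }
            for m in batch
        ]
        self.cloudwatch.put_metric_data(
            Namespace='SaaS/Tenant',
            MetricData=metric_data
        )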
Migration Strategy: From Multi-Instance to Multi-Tenant
Phase 1: Data Migration (Month 1-2)
# migration/tenant_migration.py
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd
import psycopg2
from sqlalchemy import create_engine

def migrate_customer_to_multitenant(customer_id: str,
                                    instance_db_url: str):
    """Migrate single customer from instance to multi-tenant"""
    # 1. Generate tenant UUID
    tenant_id = str(uuid.uuid4())
    # 2. Connect to customer's instance database
    source_conn = psycopg2.connect(instance_db_url)
    # 3. Connect to multi-tenant database (SQLAlchemy engine for to_sql)
    target_engine = create_engine(MULTITENANT_DB_URL)
    # 4. Migrate schema with tenant_id
    tables = ['users', 'products', 'orders', 'payments']
    for table in tables:
        # Read from instance
        df = pd.read_sql(f"SELECT * FROM {table}", source_conn)
        # Add tenant_id column
        df['tenant_id'] = tenant_id
        # Write to multi-tenant
        df.to_sql(table, target_engine, if_exists='append',
                  index=False, method='multi')
    # 5. Migrate Redis data
    migrate_redis_data(customer_id, tenant_id)
    # 6. Update customer record
    update_customer_tenant_mapping(customer_id, tenant_id)
    return tenant_id

# Parallel migration script
def batch_migration():
    customers = get_all_customers()
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for customer in customers:
            future = executor.submit(
                migrate_customer_to_multitenant,
                customer['id'],
                customer['instance_db_url']
            )
            futures.append(future)
        # Track progress
        for future in as_completed(futures):
            result = future.result()
            print(f"Migrated customer to tenant {result}")
Phase 2: Application Cutover (Month 3-4)
Cutover Strategy (a routing sketch follows this list):
1. Deploy multi-tenant application
2. Configure load balancer for gradual migration
3. Route customers by tier:
- Start with free tier (low risk)
- Move to pro tier
- Finally migrate enterprise
4. Monitor performance and errors
5. Rollback capability per tenant
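What "route customers by tier" looked like in practice is easiest to show as a small decision function plus a per-tenant flag. This is a simplified sketch: in reality the flag lived in our customer database and the actual routing happened at the load balancer, and the helper names (tier_of, tenant_is_migrated) are illustrative.

# Sketch: decide which backend serves a tenant during the cutover
MIGRATION_ORDER = ['free', 'pro', 'enterprise']  # lowest risk first

def route_for_tenant(tenant_id: str, current_wave: str) -> str:
    """Return the target group that should serve this tenant right now."""
    tier = tier_of(tenant_id)                      # illustrative helper
    wave_reached = (MIGRATION_ORDER.index(tier)
                    <= MIGRATION_ORDER.index(current_wave))
    if wave_reached and tenant_is_migrated(tenant_id):  # illustrative helper
        return 'multi-tenant-target-group'
    return f'instance-{tenant_id}-target-group'    # rollback path: old stack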
Phase 3: Infrastructure Cleanup (Month 5-6)
#!/bin/bash
# cleanup_instances.sh
# List all customer instances
INSTANCES=$(aws ec2 describe-instances \
--filters "Name=tag:Type,Values=customer-instance" \
--query "Reservations[].Instances[].InstanceId" \
--output text)
# Terminate instances after verification
for INSTANCE_ID in $INSTANCES; do
CUSTOMER_ID=$(aws ec2 describe-tags \
--filters "Name=resource-id,Values=$INSTANCE_ID" \
"Name=key,Values=CustomerId" \
--query "Tags[0].Value" --output text)
# Verify customer is migrated
if verify_customer_migrated "$CUSTOMER_ID"; then
echo "Terminating instance $INSTANCE_ID for customer $CUSTOMER_ID"
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
fi
done
# Clean up RDS instances
# Clean up Redis clusters
# Remove unused VPC resources
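The trailing comments in the script cover the remaining per-customer resources. For the RDS and ElastiCache side we reached for boto3 rather than shell; a sketch of what that could look like, assuming the same per-customer tagging convention and that a final RDS snapshot is always taken (the identifier format is illustrative):

import boto3

rds = boto3.client('rds')
elasticache = boto3.client('elasticache')

def cleanup_customer_databases(customer_id: str, db_instance_id: str,
                               cache_cluster_id: str) -> None:
    """Retire a migrated customer's dedicated RDS instance and Redis cluster."""
    # Delete the RDS instance, keeping a final snapshot as a safety net
    rds.delete_db_instance(
        DBInstanceIdentifier=db_instance_id,
        SkipFinalSnapshot=False,
        FinalDBSnapshotIdentifier=f'final-{customer_id}'
    )
    # ElastiCache clusters hold only cache data, so no snapshot is needed
    elasticache.delete_cache_cluster(CacheClusterId=cache_cluster_id)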
Performance Improvements
Despite consolidation, performance actually improved:
Response Time Comparison
Multi-Instance (p50/p95/p99):
- API Latency: 250ms / 800ms / 1500ms
- Database Query: 50ms / 200ms / 500ms
Multi-Tenant (p50/p95/p99):
- API Latency: 120ms / 350ms / 600ms (52% faster p50)
- Database Query: 20ms / 80ms / 150ms (60% faster p50)
Why Performance Improved:
- Better Resource Utilization: Larger instances with consistent CPU performance
- Optimized Connection Pooling: PgBouncer reduced connection overhead
- Shared Cache Benefits: Higher cache hit rates with shared Redis
- Aurora Performance: Better than individual RDS instances
- Reduced Network Hops: Single VPC, less inter-AZ traffic
Lessons Learned
What Worked Well:
- Row-Level Security: PostgreSQL RLS provided bulletproof isolation
- Gradual Migration: Customer-by-customer approach minimized risk
- Tier-Based Limits: Different resource limits per customer tier
- Shared Caching: 85% cache hit rate vs 40% in isolated instances
Challenges We Faced:
- Noisy Neighbor: One customer running heavy queries affected others
  - Solution: Statement timeouts and resource governors
- Compliance Concerns: Some enterprise customers required isolation
  - Solution: Hybrid model, keeping the top 10 enterprises on dedicated instances
- Migration Complexity: Data migration took longer than expected
  - Solution: Built automated migration tools and parallel processing
What We'd Do Differently:
- Start with multi-tenant from day one for new products
- Implement better resource isolation earlier
- Build migration tools before starting
- Keep better metrics on per-customer resource usage
Security & Compliance Maintained
Despite shared infrastructure, we maintained security certifications:
- ✅ SOC 2 Type II: Passed with zero findings
- ✅ HIPAA Compliant: With proper BAA and encryption
- ✅ GDPR Ready: Data isolation and right-to-deletion implemented
- ✅ PCI DSS: Tokenization and proper segmentation
Key security features:
- Tenant data encryption with unique keys
- Complete audit logging per tenant
- Data residency controls
- Automated compliance reporting
The Hybrid Approach: Best of Both Worlds
We kept 10 enterprise customers on dedicated instances:
Hybrid Architecture:
Multi-Tenant (190 customers):
- Shared infrastructure
- $8,000/month cost
- Automated management
Multi-Instance (10 enterprises):
- Dedicated resources
- $2,000/month cost
- Premium pricing justified
- Compliance requirements met
ROI Analysis
Investment:
- Engineering time: 3 engineers × 6 months = $180,000
- Migration tools & automation: $20,000
- Monitoring & security tools: $30,000
Total Investment: $230,000
Returns:
- Monthly savings: $35,000
- Annual savings: $420,000
- Payback period: 6.6 months
- 3-year savings: $1,260,000
Additional Benefits:
- Reduced operational overhead (5 hours/week saved)
- Faster feature deployment (1 deployment vs 200)
- Better resource utilization (70% vs 20%)
- Improved system reliability (99.99% vs 99.9%)
Action Items: Your Migration Checklist
If you're considering multi-tenant architecture:
1. Analyze Current Costs
   - Document per-customer infrastructure costs
   - Identify resource utilization patterns
   - Calculate potential savings
2. Design Tenant Isolation
   - Choose isolation strategy (database, schema, row-level)
   - Implement proper authentication/authorization
   - Plan resource limits per tenant
3. Build Migration Tools
   - Automated data migration scripts
   - Rollback procedures
   - Performance testing framework
4. Implement Gradually
   - Start with non-critical customers
   - Monitor performance closely
   - Keep rollback options ready
5. Maintain Security
   - Implement encryption per tenant
   - Audit logging and monitoring
   - Regular security assessments
Conclusion
Multi-tenant architecture isn't just about cost savings; it's about building a sustainable, scalable SaaS platform. Our 78% cost reduction enabled us to:
- Invest more in product development
- Offer competitive pricing
- Improve system performance
- Scale to 500+ customers without linear cost growth
The key is maintaining security and performance while consolidating resources. With proper planning and implementation, multi-tenant architecture can transform your SaaS economics without compromising on quality.