Running a SaaS platform with 200+ enterprise customers, we were bleeding $45,000/month on AWS infrastructure. Each customer had their own isolated instance - separate EC2s, RDS databases, and Redis clusters. While this gave us perfect isolation and simplified billing, our AWS bill was becoming unsustainable.
After a 6-month migration to a hybrid multi-tenant architecture, we've reduced costs to $10,000/month (78% reduction) while actually improving performance and maintaining enterprise-grade security. Here's exactly how we did it.
The Numbers That Matter
Before: Multi-Instance Architecture
Monthly Costs (200 customers):
- EC2 instances (200 × t3.medium): $6,000
- RDS PostgreSQL (200 × db.t3.small): $14,000
- Redis clusters (200 × cache.t3.micro): $5,000
- ALB + Target Groups (200 × $20): $4,000
- NAT Gateways (200 × $45): $9,000
- CloudWatch + Logs: $3,000
- Data Transfer: $4,000
Total: $45,000/month ($540,000/year)
After: Multi-Tenant Architecture
Monthly Costs (200 customers):
- EC2 instances (6 × m5.2xlarge): $1,400
- Aurora PostgreSQL Cluster: $2,500
- Redis Cluster (3 nodes): $800
- Single ALB: $25
- NAT Gateway (1): $45
- CloudWatch + Enhanced Monitoring: $800
- Data Transfer (optimized): $500
- Security & Compliance Tools: $930
- Backup & DR: $2,000
Total: $10,000/month ($120,000/year)
Savings: $35,000/month (78% reduction)
Architecture Comparison
Multi-Instance (Original)
Customer_A:
  - Dedicated EC2 instance
  - Dedicated RDS database
  - Dedicated Redis cache
  - Isolated VPC subnet
  - Separate CloudWatch namespace
Customer_B:
  - [Same isolated stack]
Problem: 200× infrastructure overhead
Multi-Tenant (New)
Shared_Infrastructure:
  Application_Tier:
    - 6 × m5.2xlarge EC2 (Auto Scaling 4-8)
    - Shared across all tenants
    - Tenant isolation via JWT tokens
  Database_Tier:
    - Aurora PostgreSQL cluster
    - Row-Level Security (RLS)
    - Tenant-specific schemas
    - Connection pooling (PgBouncer)
  Cache_Tier:
    - Redis Cluster (3 nodes)
    - Keyspace separation (tenant:{id}:*)
    - LRU eviction within per-tenant quotas
Implementation: The Technical Details
1. Database Multi-Tenancy with PostgreSQL RLS
-- Enable Row Level Security
ALTER TABLE products ENABLE ROW LEVEL SECURITY;
-- Add tenant_id to all tables
ALTER TABLE products ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_products_tenant ON products(tenant_id);
-- Create security policy
CREATE POLICY tenant_isolation ON products
  FOR ALL
  TO application_role
  USING (tenant_id = current_setting('app.current_tenant')::UUID);
-- Set tenant context in the application
-- (SET LOCAL lasts only for the current transaction, so run it inside the request's transaction)
SET LOCAL app.current_tenant = '123e4567-e89b-12d3-a456-426614174000';
2. Application Layer Tenant Isolation
# middleware/tenant_isolation.py
import os
import jwt
import psycopg2.pool
from functools import wraps
from flask import request, g

SECRET_KEY = os.getenv('JWT_SECRET_KEY')  # the app's JWT signing secret

class TenantIsolation:
    def __init__(self):
        # The pool connects as application_role, so the RLS policies apply
        self.pool = psycopg2.pool.ThreadedConnectionPool(
            minconn=20,
            maxconn=100,
            host=os.getenv('DB_HOST'),
            database=os.getenv('DB_NAME'),
            user=os.getenv('DB_USER'),
            password=os.getenv('DB_PASSWORD')
        )

    def get_tenant_connection(self, tenant_id):
        conn = self.pool.getconn()
        cursor = conn.cursor()
        # Set tenant context for RLS. SET LOCAL can't take bind parameters,
        # so use set_config(); the third argument makes it transaction-local.
        cursor.execute(
            "SELECT set_config('app.current_tenant', %s, true)",
            (tenant_id,)
        )
        # Set statement timeout per tenant tier (e.g. '5s' for free tier)
        timeout = self.get_tenant_timeout(tenant_id)
        cursor.execute(
            "SELECT set_config('statement_timeout', %s, true)",
            (timeout,)
        )
        # Caller must return the connection to the pool after the request,
        # e.g. in a Flask teardown handler
        return conn

tenant_isolation = TenantIsolation()

def require_tenant(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        # Extract tenant from JWT
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        try:
            payload = jwt.decode(token, SECRET_KEY, algorithms=['HS256'])
            tenant_id = payload['tenant_id']
            g.tenant_id = tenant_id
            g.db_conn = tenant_isolation.get_tenant_connection(tenant_id)
        except jwt.InvalidTokenError:
            return {'error': 'Invalid token'}, 401
        return f(*args, **kwargs)
    return decorated_function

# Usage in routes (`app` is the Flask application object)
@app.route('/api/products')
@require_tenant
def get_products():
    # Queries automatically filtered by tenant
    cursor = g.db_conn.cursor()
    cursor.execute("SELECT * FROM products")  # RLS handles filtering
    return cursor.fetchall()
3. Redis Multi-Tenant Caching Strategy
# cache/tenant_cache.py
import json
import redis
from typing import Optional, Any

class TenantCache:
    def __init__(self):
        self.redis = redis.Redis(
            host='redis-cluster.aws.internal',
            port=6379,
            decode_responses=True,
            max_connections=100,
            socket_keepalive=True
        )

    def get_tenant_key(self, tenant_id: str, key: str) -> str:
        """Generate tenant-specific key"""
        return f"tenant:{tenant_id}:{key}"

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        full_key = self.get_tenant_key(tenant_id, key)
        value = self.redis.get(full_key)
        return json.loads(value) if value else None

    def set(self, tenant_id: str, key: str, value: Any,
            ttl: int = 3600) -> None:
        full_key = self.get_tenant_key(tenant_id, key)
        # Check tenant's cache quota
        if not self.check_tenant_quota(tenant_id):
            # Evict oldest tenant keys if quota exceeded
            self.evict_tenant_keys(tenant_id)
        self.redis.setex(full_key, ttl, json.dumps(value))

    def check_tenant_quota(self, tenant_id: str) -> bool:
        """Check if tenant is within cache quota (key count per tier)"""
        pattern = f"tenant:{tenant_id}:*"
        keys = self.redis.scan_iter(match=pattern, count=100)
        count = sum(1 for _ in keys)
        # Different quotas per tier
        tier = self.get_tenant_tier(tenant_id)
        quotas = {
            'free': 1000,
            'pro': 10000,
            'enterprise': 100000
        }
        return count < quotas.get(tier, 1000)
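The evict_tenant_keys helper referenced above isn't shown in full; here's a minimal sketch of one way it could work, approximating per-tenant LRU with Redis OBJECT IDLETIME. The 10% batch size is an arbitrary illustration, not a value we tuned in production.

# cache/tenant_cache.py (continued) - one possible TenantCache.evict_tenant_keys
def evict_tenant_keys(self, tenant_id: str, batch_fraction: float = 0.1) -> int:
    """Delete the least-recently-used keys in a tenant's keyspace.

    OBJECT IDLETIME returns seconds since last access (available with LRU or
    noeviction maxmemory policies), so sorting on it approximates LRU within
    this tenant's keyspace only.
    """
    pattern = f"tenant:{tenant_id}:*"
    keys = list(self.redis.scan_iter(match=pattern, count=100))
    if not keys:
        return 0
    # Pair each key with its idle time; recently used keys have low idle time
    idle_times = [(key, self.redis.object('idletime', key) or 0) for key in keys]
    idle_times.sort(key=lambda kv: kv[1], reverse=True)  # idlest first
    batch = [key for key, _ in idle_times[:max(1, int(len(keys) * batch_fraction))]]
    return self.redis.delete(*batch)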
4. Performance Isolation & Resource Limits
# resource_management/tenant_limits.py
import asyncio
from asyncio import Semaphore
from dataclasses import dataclass
from typing import Dict

@dataclass
class TenantLimits:
    max_concurrent_requests: int
    max_db_connections: int
    max_cpu_seconds: int
    max_memory_mb: int
    api_rate_limit: int  # requests per minute

class ResourceManager:
    def __init__(self):
        self.tenant_limits: Dict[str, TenantLimits] = {}
        self.semaphores: Dict[str, Semaphore] = {}

    def get_tenant_limits(self, tenant_id: str) -> TenantLimits:
        tier = self.get_tenant_tier(tenant_id)  # tier lookup (free/pro/enterprise)
        limits = {
            'free': TenantLimits(
                max_concurrent_requests=10,
                max_db_connections=5,
                max_cpu_seconds=60,
                max_memory_mb=512,
                api_rate_limit=100
            ),
            'pro': TenantLimits(
                max_concurrent_requests=50,
                max_db_connections=20,
                max_cpu_seconds=300,
                max_memory_mb=2048,
                api_rate_limit=1000
            ),
            'enterprise': TenantLimits(
                max_concurrent_requests=200,
                max_db_connections=100,
                max_cpu_seconds=3600,
                max_memory_mb=8192,
                api_rate_limit=10000
            )
        }
        return limits.get(tier, limits['free'])

    async def acquire_resource(self, tenant_id: str):
        """Acquire semaphore for tenant request (release it when the request finishes)"""
        if tenant_id not in self.semaphores:
            limits = self.get_tenant_limits(tenant_id)
            self.semaphores[tenant_id] = Semaphore(
                limits.max_concurrent_requests
            )
        return await self.semaphores[tenant_id].acquire()
# API Rate Limiting
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

def get_tenant_from_request():
    """Extract tenant ID for rate limiting (falls back to client IP)"""
    token = request.headers.get('Authorization', '')
    if token:
        try:
            payload = jwt.decode(token.replace('Bearer ', ''),
                                 SECRET_KEY, algorithms=['HS256'])
            return payload.get('tenant_id', get_remote_address())
        except jwt.InvalidTokenError:
            pass
    return get_remote_address()

def get_tenant_rate_limit():
    """Return the limit string for the current tenant's tier"""
    limits = {
        'free': '100 per minute',
        'pro': '1000 per minute',
        'enterprise': '10000 per minute'
    }
    # Unknown tenants (or plain IPs) fall back to the free-tier limit
    return limits.get(get_tenant_tier(get_tenant_from_request()), '100 per minute')

limiter = Limiter(
    key_func=get_tenant_from_request,
    app=app,
    default_limits=["100 per minute"],  # Free tier
    storage_uri="redis://redis-cluster.aws.internal:6379"
)

# Apply different limits per tier: flask-limiter accepts a callable that is
# evaluated per request (a single dynamic limit instead of stacked per-tier decorators)
@app.route('/api/data')
@limiter.limit(get_tenant_rate_limit)
def get_data():
    pass
5. Security & Compliance Implementation
# security/tenant_security.py
import base64
import json
import time
from datetime import datetime
from typing import Dict

import boto3
from cryptography.fernet import Fernet
from flask import request

class TenantSecurity:
    def __init__(self):
        self.kms = boto3.client('kms')
        self.tenant_keys: Dict[str, bytes] = {}

    def get_tenant_encryption_key(self, tenant_id: str) -> bytes:
        """Get or create tenant-specific encryption key"""
        if tenant_id not in self.tenant_keys:
            # Generate tenant-specific key using KMS
            response = self.kms.generate_data_key(
                KeyId='arn:aws:kms:us-east-1:xxx:key/xxx',
                KeySpec='AES_256',
                EncryptionContext={
                    'tenant_id': tenant_id,
                    'purpose': 'tenant_data_encryption'
                }
            )
            # Fernet expects a urlsafe base64-encoded 32-byte key
            self.tenant_keys[tenant_id] = base64.urlsafe_b64encode(
                response['Plaintext']
            )
            # Store encrypted key in database
            self.store_encrypted_key(
                tenant_id,
                response['CiphertextBlob']
            )
        return self.tenant_keys[tenant_id]

    def encrypt_tenant_data(self, tenant_id: str, data: str) -> str:
        """Encrypt data with tenant-specific key"""
        key = self.get_tenant_encryption_key(tenant_id)
        f = Fernet(key)
        return f.encrypt(data.encode()).decode()

    def audit_log(self, tenant_id: str, action: str,
                  resource: str, user_id: str):
        """Tenant-specific audit logging"""
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'tenant_id': tenant_id,
            'user_id': user_id,
            'action': action,
            'resource': resource,
            'ip_address': request.remote_addr,
            'user_agent': request.headers.get('User-Agent')
        }
        # Write to tenant-specific CloudWatch log stream
        # (log streams are created ahead of time, one per tenant)
        cloudwatch = boto3.client('logs')
        cloudwatch.put_log_events(
            logGroupName='/aws/saas/audit',
            logStreamName=f'tenant-{tenant_id}',
            logEvents=[{
                'timestamp': int(time.time() * 1000),
                'message': json.dumps(log_entry)
            }]
        )
6. Monitoring & Alerting Per Tenant
# monitoring/tenant_metrics.py
import time
from dataclasses import dataclass
from typing import List

import boto3

@dataclass
class TenantMetrics:
    tenant_id: str
    api_requests: int
    api_errors: int
    db_queries: int
    cache_hits: int
    cache_misses: int
    response_time_ms: float
    cpu_usage: float
    memory_usage_mb: int

class TenantMonitoring:
    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')
        self.metrics_buffer: List[TenantMetrics] = []  # for batching put_metric_data calls

    def record_metric(self, tenant_id: str, metric_name: str,
                      value: float, unit: str = 'Count'):
        """Record tenant-specific metric"""
        self.cloudwatch.put_metric_data(
            Namespace='SaaS/Tenant',
            MetricData=[
                {
                    'MetricName': metric_name,
                    'Dimensions': [
                        {'Name': 'TenantId', 'Value': tenant_id},
                        {'Name': 'Tier', 'Value': self.get_tenant_tier(tenant_id)}
                    ],
                    'Value': value,
                    'Unit': unit,
                    'Timestamp': time.time()
                }
            ]
        )

    def create_tenant_alarms(self, tenant_id: str):
        """Create CloudWatch alarms per tenant"""
        tier = self.get_tenant_tier(tenant_id)
        # Different thresholds per tier
        thresholds = {
            'free': {'error_rate': 0.05, 'response_time': 1000},
            'pro': {'error_rate': 0.02, 'response_time': 500},
            'enterprise': {'error_rate': 0.01, 'response_time': 200}
        }
        threshold = thresholds.get(tier, thresholds['free'])
        # Error rate alarm (dimensions must match those used in record_metric)
        self.cloudwatch.put_metric_alarm(
            AlarmName=f'tenant-{tenant_id}-high-error-rate',
            ComparisonOperator='GreaterThanThreshold',
            EvaluationPeriods=2,
            MetricName='ErrorRate',
            Namespace='SaaS/Tenant',
            Dimensions=[
                {'Name': 'TenantId', 'Value': tenant_id},
                {'Name': 'Tier', 'Value': tier}
            ],
            Period=300,
            Statistic='Average',
            Threshold=threshold['error_rate'],
            ActionsEnabled=True,
            AlarmActions=[SNS_TOPIC_ARN],  # on-call alerting topic
            AlarmDescription=f'High error rate for tenant {tenant_id}'
        )
Migration Strategy: From Multi-Instance to Multi-Tenant
Phase 1: Data Migration (Month 1-2)
# migration/tenant_migration.py
import uuid
import pandas as pd
from sqlalchemy import create_engine
from concurrent.futures import ThreadPoolExecutor, as_completed

def migrate_customer_to_multitenant(customer_id: str,
                                    instance_db_url: str):
    """Migrate single customer from instance to multi-tenant"""
    # 1. Generate tenant UUID
    tenant_id = str(uuid.uuid4())
    # 2. Connect to customer's instance database
    source_engine = create_engine(instance_db_url)
    # 3. Connect to multi-tenant database
    # (pandas.to_sql needs a SQLAlchemy engine, not a raw psycopg2 connection)
    target_engine = create_engine(MULTITENANT_DB_URL)
    # 4. Migrate data with tenant_id
    tables = ['users', 'products', 'orders', 'payments']
    for table in tables:
        # Read from instance
        df = pd.read_sql(f"SELECT * FROM {table}", source_engine)
        # Add tenant_id column
        df['tenant_id'] = tenant_id
        # Write to multi-tenant
        df.to_sql(table, target_engine, if_exists='append',
                  index=False, method='multi')
    # 5. Migrate Redis data
    migrate_redis_data(customer_id, tenant_id)
    # 6. Update customer record
    update_customer_tenant_mapping(customer_id, tenant_id)
    return tenant_id

# Parallel migration script
def batch_migration():
    customers = get_all_customers()
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = []
        for customer in customers:
            future = executor.submit(
                migrate_customer_to_multitenant,
                customer['id'],
                customer['instance_db_url']
            )
            futures.append(future)
        # Track progress
        for future in as_completed(futures):
            result = future.result()
            print(f"Migrated customer to tenant {result}")
Phase 2: Application Cutover (Month 3-4)
Cutover Strategy:
1. Deploy multi-tenant application
2. Configure load balancer for gradual migration (see the routing sketch below)
3. Route customers by tier:
- Start with free tier (low risk)
- Move to pro tier
- Finally migrate enterprise
4. Monitor performance and errors
5. Rollback capability per tenant
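To make step 2 concrete, here's a minimal sketch of gradual traffic shifting with ALB weighted target groups via boto3. The ARNs and the 10/90 split are placeholders, and this isn't our exact cutover tooling; per-tier routing (step 3) can be layered on top with listener rules keyed on a header or hostname.

# cutover/weighted_routing.py - illustrative sketch; ARNs and weights are placeholders
import boto3

elbv2 = boto3.client('elbv2')

def shift_traffic(listener_arn: str, legacy_tg_arn: str,
                  multitenant_tg_arn: str, multitenant_weight: int) -> None:
    """Send multitenant_weight% of traffic to the multi-tenant stack,
    the remainder to the legacy per-customer stack."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            'Type': 'forward',
            'ForwardConfig': {
                'TargetGroups': [
                    {'TargetGroupArn': legacy_tg_arn,
                     'Weight': 100 - multitenant_weight},
                    {'TargetGroupArn': multitenant_tg_arn,
                     'Weight': multitenant_weight}
                ]
            }
        }]
    )

# Ramp up gradually (10% -> 50% -> 100%); setting the weight back to 0 is the rollback.
shift_traffic('arn:aws:elasticloadbalancing:us-east-1:xxx:listener/app/xxx',
              'arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/legacy/xxx',
              'arn:aws:elasticloadbalancing:us-east-1:xxx:targetgroup/multitenant/xxx',
              multitenant_weight=10)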
Phase 3: Infrastructure Cleanup (Month 5-6)
#!/bin/bash
# cleanup_instances.sh
# List all customer instances
INSTANCES=$(aws ec2 describe-instances \
--filters "Name=tag:Type,Values=customer-instance" \
--query "Reservations[].Instances[].InstanceId" \
--output text)
# Terminate instances after verification
for INSTANCE_ID in $INSTANCES; do
  CUSTOMER_ID=$(aws ec2 describe-tags \
    --filters "Name=resource-id,Values=$INSTANCE_ID" \
              "Name=key,Values=CustomerId" \
    --query "Tags[0].Value" --output text)
  # Verify customer is migrated
  if verify_customer_migrated "$CUSTOMER_ID"; then
    echo "Terminating instance $INSTANCE_ID for customer $CUSTOMER_ID"
    aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
  fi
done
# Clean up RDS instances
# Clean up Redis clusters
# Remove unused VPC resources
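The RDS and Redis cleanup followed the same tag-and-verify pattern. As a sketch, assuming the per-customer RDS instances carry the same CustomerId tag as the EC2 instances and that verify_customer_migrated has a Python equivalent of the shell check above:

# cleanup/rds_cleanup.py - illustrative sketch; tag names mirror the EC2 script above
import boto3

rds = boto3.client('rds')

def cleanup_migrated_rds_instances() -> None:
    """Delete per-customer RDS instances whose customers are confirmed migrated."""
    paginator = rds.get_paginator('describe_db_instances')
    for page in paginator.paginate():
        for db in page['DBInstances']:
            tags = rds.list_tags_for_resource(ResourceName=db['DBInstanceArn'])['TagList']
            tag_map = {t['Key']: t['Value'] for t in tags}
            customer_id = tag_map.get('CustomerId')
            if customer_id and verify_customer_migrated(customer_id):
                print(f"Deleting {db['DBInstanceIdentifier']} for customer {customer_id}")
                rds.delete_db_instance(
                    DBInstanceIdentifier=db['DBInstanceIdentifier'],
                    FinalDBSnapshotIdentifier=f"final-{db['DBInstanceIdentifier']}",
                    DeleteAutomatedBackups=False  # keep backups through the retention window
                )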
Performance Improvements
Despite consolidation, performance actually improved:
Response Time Comparison
Multi-Instance (p50/p95/p99):
- API Latency: 250ms / 800ms / 1500ms
- Database Query: 50ms / 200ms / 500ms
Multi-Tenant (p50/p95/p99):
- API Latency: 120ms / 350ms / 600ms (52% faster p50)
- Database Query: 20ms / 80ms / 150ms (60% faster p50)
Why Performance Improved:
- Better Resource Utilization: Larger instances with consistent CPU performance
- Optimized Connection Pooling: PgBouncer reduced connection overhead
- Shared Cache Benefits: Higher cache hit rates with shared Redis
- Aurora Performance: Better than individual RDS instances
- Reduced Network Hops: Single VPC, less inter-AZ traffic
Lessons Learned
What Worked Well:
- Row-Level Security: PostgreSQL RLS provided bulletproof isolation
- Gradual Migration: Customer-by-customer approach minimized risk
- Tier-Based Limits: Different resource limits per customer tier
- Shared Caching: 85% cache hit rate vs 40% in isolated instances
Challenges We Faced:
- Noisy Neighbor: One customer running heavy queries affected others
  - Solution: Statement timeouts and resource governors
- Compliance Concerns: Some enterprise customers required isolation
  - Solution: Hybrid model - keep the top 10 enterprises on dedicated instances
- Migration Complexity: Data migration took longer than expected
  - Solution: Built automated migration tools and parallel processing
What We'd Do Differently:
- Start with multi-tenant from day one for new products
- Implement better resource isolation earlier
- Build migration tools before starting
- Keep better metrics on per-customer resource usage
Security & Compliance Maintained
Despite shared infrastructure, we maintained security certifications:
- ✅ SOC 2 Type II: Passed with zero findings
- ✅ HIPAA Compliant: With proper BAA and encryption
- ✅ GDPR Ready: Data isolation and right-to-deletion implemented (see the sketch after this list)
- ✅ PCI DSS: Tokenization and proper segmentation
Key security features:
- Tenant data encryption with unique keys
- Complete audit logging per tenant
- Data residency controls
- Automated compliance reporting
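For the GDPR right-to-deletion piece specifically, here's a minimal sketch of what per-tenant erasure can look like in this architecture: delete the tenant's rows, purge its Redis keyspace, and rely on destroying its data key for anything left in backups. The table list and connection settings are illustrative assumptions, not our exact tooling.

# compliance/tenant_deletion.py - illustrative sketch; table list and config are assumptions
import os
import psycopg2
from psycopg2 import sql
import redis

MULTITENANT_DB_URL = os.getenv('MULTITENANT_DB_URL')

def delete_tenant_data(tenant_id: str) -> None:
    """Erase a tenant's data across the shared stack (GDPR right-to-deletion)."""
    # 1. Relational data: delete the tenant's rows, children before parents
    conn = psycopg2.connect(MULTITENANT_DB_URL)
    with conn, conn.cursor() as cur:
        # Set the RLS tenant context so the deletes can only touch this tenant
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        for table in ['payments', 'orders', 'products', 'users']:
            cur.execute(
                sql.SQL("DELETE FROM {} WHERE tenant_id = %s").format(sql.Identifier(table)),
                (tenant_id,)
            )
    conn.close()
    # 2. Cache: purge the tenant's Redis keyspace
    r = redis.Redis(host='redis-cluster.aws.internal', port=6379)
    keys = list(r.scan_iter(match=f"tenant:{tenant_id}:*", count=1000))
    if keys:
        r.delete(*keys)
    # 3. Encryption: deleting the tenant's stored data key ("crypto-shredding")
    #    makes any ciphertext that survives in old backups unrecoverable once
    #    the backup retention window passes.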
The Hybrid Approach: Best of Both Worlds
We kept 10 enterprise customers on dedicated instances:
Hybrid Architecture:
  Multi-Tenant (190 customers):
    - Shared infrastructure
    - $8,000/month cost
    - Automated management
  Multi-Instance (10 enterprises):
    - Dedicated resources
    - $2,000/month cost
    - Premium pricing justified
    - Compliance requirements met
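One detail this glosses over is how a request finds the right deployment. Conceptually it's just a per-tenant registry consulted at the edge; a minimal sketch follows, where the registry shape and endpoint values are made up for illustration.

# routing/tenant_registry.py - illustrative sketch; field names and endpoints are placeholders
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TenantDeployment:
    tenant_id: str
    mode: str          # 'shared' or 'dedicated'
    api_endpoint: str  # shared ALB, or the tenant's dedicated ALB
    db_dsn: str        # Aurora cluster, or the tenant's dedicated RDS

SHARED = TenantDeployment('*', 'shared',
                          'https://api.shared.example.com',
                          'postgresql://aurora-cluster.internal/saas')

def resolve_deployment(tenant_id: str,
                       registry: Dict[str, TenantDeployment]) -> TenantDeployment:
    """Route the ~10 dedicated enterprise tenants to their own stack; everyone else shares."""
    dedicated: Optional[TenantDeployment] = registry.get(tenant_id)
    return dedicated if dedicated and dedicated.mode == 'dedicated' else SHARED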
ROI Analysis
Investment:
- Engineering time: 3 engineers × 6 months = $180,000
- Migration tools & automation: $20,000
- Monitoring & security tools: $30,000
Total Investment: $230,000
Returns:
- Monthly savings: $35,000
- Annual savings: $420,000
- Payback period: 6.6 months
- 3-year savings: $1,260,000
Additional Benefits:
- Reduced operational overhead (5 hours/week saved)
- Faster feature deployment (1 deployment vs 200)
- Better resource utilization (70% vs 20%)
- Improved system reliability (99.99% vs 99.9%)
Action Items: Your Migration Checklist
If you're considering multi-tenant architecture:
- Analyze Current Costs
  - Document per-customer infrastructure costs
  - Identify resource utilization patterns
  - Calculate potential savings
- Design Tenant Isolation
  - Choose isolation strategy (database, schema, row-level)
  - Implement proper authentication/authorization
  - Plan resource limits per tenant
- Build Migration Tools
  - Automated data migration scripts
  - Rollback procedures
  - Performance testing framework
- Implement Gradually
  - Start with non-critical customers
  - Monitor performance closely
  - Keep rollback options ready
- Maintain Security
  - Implement encryption per tenant
  - Audit logging and monitoring
  - Regular security assessments
Conclusion
Multi-tenant architecture isn't just about cost savings - it's about building a sustainable, scalable SaaS platform. Our 78% cost reduction enabled us to:
- Invest more in product development
- Offer competitive pricing
- Improve system performance
- Scale to 500+ customers without linear cost growth
The key is maintaining security and performance while consolidating resources. With proper planning and implementation, multi-tenant architecture can transform your SaaS economics without compromising on quality.