Introduction
After successfully reducing our database costs by 40% (as covered in my previous post on Aurora Serverless v2 migration), our next target was the compute layer. Our EC2 costs were spiraling, with our Auto Scaling Groups (ASGs) running at peak capacity 24/7 "just to be safe."
This post details how we achieved a 70% cost reduction in our ASG infrastructure while actually improving our availability from 99.9% to 99.99%. The secret? A carefully orchestrated mix of On-Demand and Spot instances, combined with intelligent scaling strategies.
Table of Contents
- The Problem: Over-Provisioning for Peace of Mind
- The Solution: Mixed Instance Strategy
- Implementation Guide
- Advanced Optimization Techniques
- Monitoring and Alerting
- Results and Metrics
- Lessons Learned
- Conclusion
The Problem: Over-Provisioning for Peace of Mind
Our initial setup was typical of many AWS deployments:
- 20 On-Demand instances running constantly
- Scaling up to 50 instances during peak hours
- Monthly cost: ~$7,200
- Actual average utilization: 35%
We were essentially paying for insurance we rarely needed. Sound familiar?
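To see how much of the bill was pure headroom, a back-of-the-envelope calculation like the one below is enough. This is a minimal sketch; the hourly rate and counts are placeholders, not our exact pricing:

```python
# Rough estimate of spend vs. what utilization actually requires.
# The hourly rate below is a placeholder -- plug in your own blended rate.
HOURLY_RATE = 0.10          # $/instance-hour (placeholder)
HOURS_PER_MONTH = 730
BASELINE_INSTANCES = 20     # running 24/7
AVG_UTILIZATION = 0.35      # from CloudWatch CPU/request metrics

monthly_cost = BASELINE_INSTANCES * HOURLY_RATE * HOURS_PER_MONTH
needed_capacity = BASELINE_INSTANCES * AVG_UTILIZATION
wasted_cost = (BASELINE_INSTANCES - needed_capacity) * HOURLY_RATE * HOURS_PER_MONTH

print(f"Monthly cost:           ${monthly_cost:,.0f}")
print(f"Capacity actually used: ~{needed_capacity:.0f} instances on average")
print(f"Spend on idle headroom: ~${wasted_cost:,.0f} ({1 - AVG_UTILIZATION:.0%} of the bill)")
```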
The Solution: Mixed Instance Strategy + Intelligent Scaling
1. The Foundation: Fixed On-Demand + Flexible Spot Instances
The core strategy is simple but powerful:
- Fixed base capacity: On-Demand instances for guaranteed availability
- Variable capacity: Spot instances for cost-effective scaling
- Intelligent distribution: roughly 30% On-Demand, 70% Spot at normal operating capacity
Here's our optimized ASG configuration:
```yaml
# auto-scaling-group.yaml
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  OptimizedASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: production-web-asg
      MinSize: 6
      MaxSize: 50
      DesiredCapacity: 10
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      # Mixed Instances Policy - the key to the cost savings
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandAllocationStrategy: prioritized
          OnDemandBaseCapacity: 3                   # Fixed On-Demand instances
          OnDemandPercentageAboveBaseCapacity: 20   # 20% of capacity above the base
          SpotAllocationStrategy: capacity-optimized-prioritized
          # Note: SpotInstancePools only applies to the lowest-price strategy;
          # with capacity-optimized-prioritized, diversification comes from the
          # Overrides list below.
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref LaunchTemplate
            Version: !GetAtt LaunchTemplate.LatestVersionNumber
          Overrides:
            # Diversified instance types for better Spot availability
            - InstanceType: t3.medium
              WeightedCapacity: 1
            - InstanceType: t3a.medium
              WeightedCapacity: 1
            - InstanceType: t2.medium
              WeightedCapacity: 1
            - InstanceType: m5.large
              WeightedCapacity: 2
```
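Once the stack is deployed, it's worth verifying that the On-Demand/Spot split actually matches the policy. A quick sketch using boto3 (the group name matches the template above) lists the ASG's instances and counts how many report a Spot lifecycle:

```python
import boto3

autoscaling = boto3.client('autoscaling')
ec2 = boto3.client('ec2')

def spot_vs_on_demand(asg_name):
    """Count Spot vs On-Demand instances currently in the ASG."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )['AutoScalingGroups'][0]
    instance_ids = [i['InstanceId'] for i in group['Instances']]
    if not instance_ids:
        return 0, 0

    spot = on_demand = 0
    reservations = ec2.describe_instances(InstanceIds=instance_ids)['Reservations']
    for reservation in reservations:
        for instance in reservation['Instances']:
            # InstanceLifecycle is only present for Spot instances
            if instance.get('InstanceLifecycle') == 'spot':
                spot += 1
            else:
                on_demand += 1
    return on_demand, spot

on_demand, spot = spot_vs_on_demand('production-web-asg')
print(f"On-Demand: {on_demand}, Spot: {spot}")
```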
2. Implementing Predictive Scaling
Instead of reactive scaling, we implemented predictive scaling based on historical patterns:
```python
# predictive_scaling_config.py
import boto3

autoscaling = boto3.client('autoscaling')

def configure_predictive_scaling(asg_name):
    """Configure a predictive scaling policy for the ASG."""
    response = autoscaling.put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName='predictive-scaling-policy',
        PolicyType='PredictiveScaling',
        PredictiveScalingConfiguration={
            'MetricSpecifications': [
                {
                    'TargetValue': 50.0,
                    'PredefinedMetricPairSpecification': {
                        # Predictive scaling uses the metric-pair names,
                        # not the target-tracking name ASGAverageCPUUtilization
                        'PredefinedMetricType': 'ASGCPUUtilization'
                    }
                }
            ],
            'Mode': 'ForecastAndScale',
            'SchedulingBufferTime': 600,  # Launch instances 10 minutes ahead of the forecast
            'MaxCapacityBreachBehavior': 'IncreaseMaxCapacity',
            'MaxCapacityBuffer': 10  # Allow 10% above max for unexpected spikes
        }
    )
    return response

# Enable predictive scaling (forecasts are built from up to 2 weeks of history)
configure_predictive_scaling('production-web-asg')
```
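Before trusting ForecastAndScale, it helps to run the policy in ForecastOnly mode for a few days and compare the forecast against reality. Something like this sketch (policy and group names match the code above) pulls the capacity forecast that predictive scaling has generated:

```python
from datetime import datetime, timedelta, timezone

import boto3

autoscaling = boto3.client('autoscaling')

def print_capacity_forecast(asg_name, policy_name):
    """Print the next 24 hours of predicted capacity for the ASG."""
    now = datetime.now(timezone.utc)
    forecast = autoscaling.get_predictive_scaling_forecast(
        AutoScalingGroupName=asg_name,
        PolicyName=policy_name,
        StartTime=now,
        EndTime=now + timedelta(hours=24)
    )
    capacity = forecast['CapacityForecast']
    for ts, value in zip(capacity['Timestamps'], capacity['Values']):
        print(f"{ts:%Y-%m-%d %H:%M} -> {value:.0f} instances")

print_capacity_forecast('production-web-asg', 'predictive-scaling-policy')
```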
3. Warm Pool Strategy for Instant Scaling
To maintain high availability while using Spot instances, we implemented a Warm Pool:
```yaml
WarmPool:
  Type: AWS::AutoScaling::WarmPool
  Properties:
    AutoScalingGroupName: !Ref OptimizedASG
    MinSize: 5
    MaxGroupPreparedCapacity: 10
    PoolState: Stopped  # Save costs by keeping pre-initialized instances stopped
    InstanceReusePolicy:
      ReuseOnScaleIn: true  # Reuse instances on scale-in to save provisioning time
```
This reduced our scale-up time from 5 minutes to 30 seconds!
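You can confirm the pool is actually holding pre-initialized instances with `describe_warm_pool`. A quick boto3 sketch (group name as above):

```python
import boto3

autoscaling = boto3.client('autoscaling')

response = autoscaling.describe_warm_pool(AutoScalingGroupName='production-web-asg')

config = response['WarmPoolConfiguration']
print(f"Pool state: {config.get('PoolState')}, "
      f"min size: {config.get('MinSize')}, "
      f"max prepared: {config.get('MaxGroupPreparedCapacity')}")

for instance in response['Instances']:
    print(f"{instance['InstanceId']}: {instance['LifecycleState']}")
```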
Advanced Cost Optimization Techniques
A. Scheduled Scaling for Predictable Patterns
```python
# scheduled_scaling.py
import boto3

autoscaling = boto3.client('autoscaling')

def create_scheduled_actions(asg_name):
    """Create scheduled scaling actions for predictable traffic patterns."""
    # Scale up for business hours (Mon-Fri, 8 AM - 6 PM Eastern)
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=asg_name,
        ScheduledActionName='business-hours-scale-up',
        MinSize=10,
        DesiredCapacity=15,
        Recurrence='0 8 * * MON-FRI',  # 8 AM Mon-Fri
        TimeZone='America/New_York'
    )

    # Scale down for nights and weekends
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=asg_name,
        ScheduledActionName='off-hours-scale-down',
        MinSize=3,
        DesiredCapacity=5,
        Recurrence='0 18 * * MON-FRI',  # 6 PM Mon-Fri
        TimeZone='America/New_York'
    )

    # Weekend minimum capacity
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName=asg_name,
        ScheduledActionName='weekend-minimum',
        MinSize=2,
        DesiredCapacity=3,
        Recurrence='0 0 * * SAT',  # Saturday midnight
        TimeZone='America/New_York'
    )
```
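A small companion helper makes it easy to audit what is actually scheduled after deployment. This is just a sketch using the same boto3 client:

```python
import boto3

autoscaling = boto3.client('autoscaling')

def list_scheduled_actions(asg_name):
    """Print the scheduled scaling actions configured on the ASG."""
    response = autoscaling.describe_scheduled_actions(AutoScalingGroupName=asg_name)
    for action in response['ScheduledUpdateGroupActions']:
        print(f"{action['ScheduledActionName']}: "
              f"recurrence={action.get('Recurrence')}, "
              f"min={action.get('MinSize')}, desired={action.get('DesiredCapacity')}")

list_scheduled_actions('production-web-asg')
```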
B. Spot Instance Interruption Handling
Here's our battle-tested Spot interruption handler:
```python
# spot_interruption_handler.py
import logging
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
import requests

logger = logging.getLogger(__name__)

ec2 = boto3.client('ec2')
elbv2 = boto3.client('elbv2')


class SpotInterruptionHandler:
    def __init__(self, instance_id, target_group_arn):
        self.instance_id = instance_id
        self.target_group_arn = target_group_arn
        # IMDSv1 endpoint; if IMDSv2 is enforced, fetch a session token first
        self.metadata_url = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def check_spot_interruption(self):
        """Check for a Spot instance interruption notice."""
        try:
            response = requests.get(self.metadata_url, timeout=1)
            if response.status_code == 200:
                interruption_data = response.json()
                logger.warning(f"Spot interruption notice: {interruption_data}")
                # We have 2 minutes to act
                self.handle_interruption(interruption_data['time'])
                return True
        except requests.exceptions.RequestException:
            pass  # No interruption notice (normal operation)
        return False

    def handle_interruption(self, interruption_time):
        """Gracefully handle a Spot interruption."""
        with ThreadPoolExecutor(max_workers=3) as executor:
            # Parallel execution for speed
            executor.submit(self.drain_connections)
            executor.submit(self.save_state)
            executor.submit(self.notify_monitoring)
        # Deregister from the target group once draining has finished
        self.deregister_from_alb()

    def drain_connections(self):
        """Stop accepting new connections and drain existing ones."""
        # Application-specific implementation
        logger.info("Draining connections...")
        # Make the ALB health check fail so no new traffic is routed here
        with open('/var/www/health-check', 'w') as f:
            f.write('draining')
        # Wait for in-flight connections to drain (max 90 seconds)
        time.sleep(90)

    def save_state(self):
        """Persist any in-memory state (application-specific)."""
        logger.info("Saving state...")

    def notify_monitoring(self):
        """Notify the monitoring system of the interruption (application-specific)."""
        logger.info("Notifying monitoring...")

    def deregister_from_alb(self):
        """Deregister this instance from the ALB target group."""
        try:
            elbv2.deregister_targets(
                TargetGroupArn=self.target_group_arn,
                Targets=[{'Id': self.instance_id}]
            )
            logger.info(f"Deregistered {self.instance_id} from target group")
        except Exception as e:
            logger.error(f"Failed to deregister: {e}")


# Run this as a daemon on each instance
if __name__ == "__main__":
    handler = SpotInterruptionHandler(
        instance_id=requests.get('http://169.254.169.254/latest/meta-data/instance-id').text,
        target_group_arn='arn:aws:elasticloadbalancing:region:account:targetgroup/name/id'
    )
    while True:
        if handler.check_spot_interruption():
            break
        time.sleep(5)
```
C. Multi-AZ Spot Diversification
Spread your risk across availability zones and instance types:
```yaml
# spot-diversification-template.yaml
LaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateName: diverse-spot-template
    LaunchTemplateData:
      IamInstanceProfile:
        Arn: !GetAtt InstanceProfile.Arn
      SecurityGroupIds:
        - !Ref WebSecurityGroup
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          # Install the Spot interruption handler
          curl -o /usr/local/bin/spot-handler.py ${SpotHandlerUrl}
          chmod +x /usr/local/bin/spot-handler.py

          # Run it as a systemd service
          cat > /etc/systemd/system/spot-handler.service << EOF
          [Unit]
          Description=Spot Instance Interruption Handler
          After=network.target

          [Service]
          Type=simple
          ExecStart=/usr/bin/python3 /usr/local/bin/spot-handler.py
          Restart=always

          [Install]
          WantedBy=multi-user.target
          EOF

          systemctl daemon-reload
          systemctl enable spot-handler
          systemctl start spot-handler

          # Your application startup script here
          /opt/app/start.sh
```
Monitoring and Alerting
Comprehensive monitoring is crucial for maintaining high availability:
```python
# monitoring_setup.py
import boto3

cloudwatch = boto3.client('cloudwatch')

def create_comprehensive_monitoring(asg_name, sns_topic_arn):
    """Set up comprehensive monitoring for a mixed-instances ASG."""
    alarms = []

    # 1. On-Demand instance health check
    alarms.append(cloudwatch.put_metric_alarm(
        AlarmName=f'{asg_name}-on-demand-minimum',
        ComparisonOperator='LessThanThreshold',
        EvaluationPeriods=2,
        MetricName='GroupInServiceInstances',
        Namespace='AWS/AutoScaling',
        Period=60,
        Statistic='Minimum',
        Threshold=3.0,  # Minimum On-Demand instances
        ActionsEnabled=True,
        AlarmActions=[sns_topic_arn],
        AlarmDescription='On-Demand instances below minimum threshold',
        Dimensions=[
            {
                'Name': 'AutoScalingGroupName',
                'Value': asg_name
            }
        ]
    ))

    # 2. Spot interruption rate monitoring
    # Note: this is a custom metric we publish from EventBridge Spot interruption
    # events (see the publisher sketch below); it is not a built-in AWS/EC2 metric.
    alarms.append(cloudwatch.put_metric_alarm(
        AlarmName=f'{asg_name}-high-spot-interruption-rate',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=1,
        MetricName='SpotInstanceInterruptionWarnings',
        Namespace='Custom/SpotInterruptions',
        Period=300,
        Statistic='Sum',
        Threshold=5.0,
        ActionsEnabled=True,
        AlarmActions=[sns_topic_arn],
        AlarmDescription='High Spot instance interruption rate detected'
    ))

    # 3. Cost anomaly detection (billing metrics live in us-east-1 and require
    # billing alerts to be enabled)
    alarms.append(cloudwatch.put_metric_alarm(
        AlarmName=f'{asg_name}-cost-anomaly',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=4,
        MetricName='EstimatedCharges',
        Namespace='AWS/Billing',
        Period=3600,  # 1 hour
        Statistic='Maximum',
        Threshold=100.0,  # Adjust based on your expected month-to-date charges
        ActionsEnabled=True,
        AlarmActions=[sns_topic_arn],
        AlarmDescription='Unexpected cost spike detected',
        Dimensions=[
            {
                'Name': 'Currency',
                'Value': 'USD'
            }
        ]
    ))

    # 4. Application-level availability
    # HealthyHostCount is an absolute count (not a percentage) and requires the
    # TargetGroup and LoadBalancer dimensions.
    alarms.append(cloudwatch.put_metric_alarm(
        AlarmName=f'{asg_name}-application-availability',
        ComparisonOperator='LessThanThreshold',
        EvaluationPeriods=3,
        MetricName='HealthyHostCount',
        Namespace='AWS/ApplicationELB',
        Period=60,
        Statistic='Average',
        Threshold=3.0,  # Alert when fewer than 3 healthy hosts remain
        ActionsEnabled=True,
        AlarmActions=[sns_topic_arn],
        TreatMissingData='breaching',
        AlarmDescription='Application availability below threshold',
        Dimensions=[
            {'Name': 'TargetGroup', 'Value': 'targetgroup/name/id'},
            {'Name': 'LoadBalancer', 'Value': 'app/name/id'}
        ]
    ))

    return alarms

# Create all monitoring alarms
create_comprehensive_monitoring(
    'production-web-asg',
    'arn:aws:sns:us-east-1:123456789012:asg-alerts'
)
```
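The Spot interruption alarm above relies on a custom metric we publish ourselves, since AWS doesn't expose interruption warnings as a CloudWatch metric. A minimal way to feed it is a small Lambda function behind an EventBridge rule for the "EC2 Spot Instance Interruption Warning" event. This is a sketch; the namespace and metric name just need to match the alarm:

```python
# spot_interruption_metric.py -- Lambda handler behind an EventBridge rule that
# matches source "aws.ec2" / detail-type "EC2 Spot Instance Interruption Warning".
import boto3

cloudwatch = boto3.client('cloudwatch')

def handler(event, context):
    """Publish one data point per interruption warning received."""
    cloudwatch.put_metric_data(
        Namespace='Custom/SpotInterruptions',  # must match the alarm's namespace
        MetricData=[
            {
                'MetricName': 'SpotInstanceInterruptionWarnings',
                'Value': 1,
                'Unit': 'Count'
            }
        ]
    )
    return {'instance_id': event.get('detail', {}).get('instance-id')}
```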
Results: 70% Cost Reduction, Better Availability
After implementing these strategies, here are our results:
💰 Cost Breakdown:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly Cost | $7,200 | $2,160 | 70% reduction |
| Cost per Million Requests | $24 | $7.20 | 70% reduction |
| On-Demand Instances | 20-50 | 3-10 | 85% reduction |
| Spot Instance Usage | 0% | 70% | New savings source |
📈 Availability Improvements:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Availability | 99.9% | 99.99% | 10x less downtime |
| Monthly Downtime | 43 minutes | 4.3 minutes | 90% reduction |
| Scale-up Time | 5 minutes | 30 seconds | 10x faster |
| Recovery Time | 10 minutes | < 2 minutes | 5x faster |
🚀 Performance Metrics:
- Request latency: No change (same instance types)
- Spot interruption impact: < 0.01% of requests affected
- Warm Pool efficiency: 95% hit rate during scale events
- Predictive scaling accuracy: 92% (reduced reactive scaling by 80%)
Key Lessons Learned
1. Start Conservative
Begin with 50% Spot instances and gradually increase as you gain confidence. We started at 50/50 and moved to 30/70 after 3 months.
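Because the split lives in the Mixed Instances Policy, moving from 50/50 to 30/70 is a single ASG update. Here's roughly how that looks with boto3; this is a sketch, and depending on your setup the API may require resending the launch template part of the policy as well:

```python
import boto3

autoscaling = boto3.client('autoscaling')

# Shift the split: keep 3 On-Demand instances as a base, then 20% On-Demand
# above it (roughly 30/70 overall at our normal running capacity).
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='production-web-asg',
    MixedInstancesPolicy={
        'InstancesDistribution': {
            'OnDemandBaseCapacity': 3,
            'OnDemandPercentageAboveBaseCapacity': 20,
            'SpotAllocationStrategy': 'capacity-optimized-prioritized'
        }
    }
)
```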
2. Diversification is Critical
- Use at least 4 different instance types
- Spread across multiple AZs
- Consider cross-region failover for critical apps
3. Warm Pools are Game-Changers
The slight additional cost (~$50/month) pays for itself in:
- Improved user experience during scaling
- Reduced Spot interruption impact
- Better handling of traffic spikes
4. Monitor Everything
Set up alerts for:
- Minimum On-Demand capacity
- Spot interruption rates
- Cost anomalies
- Application-level metrics
5. Test, Test, Test
Regular chaos engineering exercises:
```bash
# Simulate Spot interruptions in staging (pick 3 random Spot instances)
aws ec2 terminate-instances --instance-ids $(aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=staging" \
            "Name=instance-lifecycle,Values=spot" \
  --query 'Reservations[*].Instances[*].InstanceId' \
  --output text | tr '\t' '\n' | shuf -n 3)
```
Implementation Checklist
Ready to implement this in your environment? Here's your checklist:
- [ ] Analyze current ASG utilization patterns (use CloudWatch metrics — see the sketch after this checklist)
- [ ] Calculate minimum On-Demand capacity needed (peak traffic / instance capacity)
- [ ] Set up Mixed Instances Policy with 4+ instance types
- [ ] Implement Warm Pool (start with 20% of peak capacity)
- [ ] Configure predictive scaling based on 2 weeks of data
- [ ] Set up scheduled scaling for known patterns
- [ ] Deploy Spot interruption handlers on all instances
- [ ] Create comprehensive CloudWatch alarms
- [ ] Test Spot interruption handling in staging
- [ ] Document runbooks for various failure scenarios
- [ ] Set up cost allocation tags for tracking savings
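For the first item in the checklist, the utilization numbers that drove our sizing came from plain CloudWatch queries. A sketch of the kind of query we ran (two weeks of hourly CPU averages for the ASG):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client('cloudwatch')

def average_cpu_utilization(asg_name, days=14):
    """Return hourly average CPU utilization for an ASG over the last N days."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'AutoScalingGroupName', 'Value': asg_name}],
        StartTime=now - timedelta(days=days),
        EndTime=now,
        Period=3600,          # hourly buckets
        Statistics=['Average']
    )
    datapoints = sorted(response['Datapoints'], key=lambda d: d['Timestamp'])
    return [(d['Timestamp'], d['Average']) for d in datapoints]

for timestamp, cpu in average_cpu_utilization('production-web-asg'):
    print(f"{timestamp:%Y-%m-%d %H:%M}  {cpu:5.1f}%")
```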
Conclusion
Optimizing Auto Scaling Groups doesn't require sacrificing availability for cost savings. By intelligently combining On-Demand and Spot instances with modern scaling strategies, we achieved dramatic cost reductions while actually improving our service reliability.
The key is starting with a solid foundation of On-Demand instances and gradually optimizing with Spot instances, predictive scaling, and warm pools. With proper monitoring and automation, you can maintain enterprise-grade availability at startup-friendly costs.
Next Steps
- Start with the Mixed Instances Policy - it's the quickest win
- Add Warm Pools once you're comfortable with Spot instances
- Implement predictive scaling after gathering 2 weeks of metrics
- Continuously monitor and optimize based on your specific patterns
What's your experience with ASG cost optimization? Have you tried mixing instance types? I'd love to hear about your strategies in the comments!
This is part 2 of my AWS Cost Optimization series. Check out Part 1: Zero-Downtime RDS to Aurora Serverless v2 Migration
Next in this series: "Zero-Downtime Blue-Green Deployments with 90% Less Infrastructure Cost"
Found this helpful? Follow me for more AWS cost optimization tips and real-world DevOps experiences!