In my previous article, *Fargate + Lambda are better together*, I introduced a hybrid architecture that routes traffic between ECS Fargate and Lambda based on real-time conditions.
The most common counterargument I hear is:
- Why not just use containers for everything?
In this article, I explain why "containers are cheaper" is an oversimplification.
## The Hidden Cost Equation
Before I go into details, let's address the costs (hopefully I did the calculation correctly):
Assuming typical production capacity:
- Daily capacity: ~2M requests/day per task
- Monthly capacity: ~60M requests/month per task
Monthly Cost Comparison:
Monthly Requests | Tasks Needed | Lambda Cost | Fargate Cost | Winner |
---|---|---|---|---|
1M | 1 | $2.17 | $56.92 | Lambda 96% cheaper |
10M | 1 | $21.67 | $56.92 | Lambda 62% cheaper |
25M | 1 | $54.17 | $56.92 | Lambda 5% cheaper |
50M | 1 | $108.33 | $56.92 | Fargate 47% cheaper |
65M | 2 | $140.83 | $113.84 | Fargate 19% cheaper |
100M | 2 | $216.67 | $113.84 | Fargate 47% cheaper |
130M | 3 | $281.67 | $170.76 | Fargate 39% cheaper |
Daily Cost Comparison:
Daily Requests | Tasks Needed | Lambda Cost | Fargate Cost | Winner |
---|---|---|---|---|
1M | 1 | $2.17 | $1.90 | Fargate 12% cheaper |
10M | 5 | $21.67 | $9.50 | Fargate 56% cheaper |
25M | 12 | $54.17 | $22.80 | Fargate 58% cheaper |
50M | 24 | $108.33 | $45.60 | Fargate 58% cheaper |
65M | 31 | $140.83 | $58.90 | Fargate 58% cheaper |
100M | 47 | $216.67 | $89.30 | Fargate 59% cheaper |
130M | 61 | $281.67 | $115.90 | Fargate 59% cheaper |
The tables show that containers become cheaper around 50M requests/month or over 1M requests daily.
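The monthly numbers can be reproduced with a small model. Here is a sketch; note that the per-unit rates below are back-derived from my own tables rather than quoted from AWS list prices, so treat them as assumptions:

```python
# Sketch of the model behind the monthly table. Rates are back-derived
# from the tables, NOT official AWS list prices:
#   - Lambda: ~$2.1667 per million requests (requests + compute for my
#     function's memory/duration profile)
#   - Fargate: $56.92 per task/month (2 vCPU / 4GB, 40/60 on-demand/spot)
#   - One task handles ~60M requests/month (~2M/day)
import math

LAMBDA_PER_MILLION = 2.1667     # USD, assumed blended rate
FARGATE_PER_TASK_MONTH = 56.92  # USD, assumed blended rate
TASK_CAPACITY = 60_000_000      # requests per task per month

def lambda_cost(requests: int) -> float:
    """Monthly Lambda cost in USD for a given monthly request volume."""
    return requests / 1_000_000 * LAMBDA_PER_MILLION

def fargate_cost(requests: int) -> float:
    """Monthly Fargate cost in USD: whole tasks, each capped at TASK_CAPACITY."""
    tasks = max(1, math.ceil(requests / TASK_CAPACITY))
    return tasks * FARGATE_PER_TASK_MONTH

for millions in (1, 10, 25, 50, 65, 100, 130):
    requests = millions * 1_000_000
    print(f"{millions:>4}M/month  Lambda ${lambda_cost(requests):8.2f}"
          f"  Fargate ${fargate_cost(requests):8.2f}")
```

Because Lambda scales linearly with requests while Fargate costs jump in whole-task steps, the crossover sits just above 25M requests/month with these assumed rates.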
If I consider the human cost, based on European senior developer rates (Stack Overflow 2024 Survey - €75K average = €36/hour):
Engineering Effort | Time Investment | Cost |
---|---|---|
Load testing and optimization | 2 weeks (80 hours) | €2,880 |
Traffic controller development | 2 weeks (80 hours) | €2,880 |
Production fixes and fine-tuning | 1 week (40 hours) | €1,440 |
Total Fargate human cost: €7,200
With Lambda, I skip most of this complexity:
- No failure scenario testing: AWS handles it all for me
- No traffic controller: no need to build one
- No production fine-tuning: it just works out of the box
With 14 operational scenarios, I need to simulate and prepare for each failure mode, while Lambda eliminates all of them through abstraction.
Even where Fargate appears cheaper on infrastructure costs (around 50M+ requests/month, or 1M+ requests/day), the €7,200 engineering investment only becomes cost-effective after many months.
Payback Calculation Formula (treating EUR and USD at parity for simplicity):
Monthly Savings = (Lambda Daily Cost - Fargate Daily Cost) × 30 days
Payback Period (months) = €7,200 ÷ Monthly Savings
Examples:
• 10M requests/day: ($21.67 - $9.50) × 30 = $365.10/month
Payback = €7,200 ÷ $365.10 = 19.7 months
• 25M requests/day: ($54.17 - $22.80) × 30 = $941.10/month
Payback = €7,200 ÷ $941.10 = 7.6 months
Human Cost Payback:
Traffic Volume | Daily Savings | Monthly Savings | Payback Period |
---|---|---|---|
1M requests/month | N/A | -$54.75 | Never (Lambda cheaper) |
10M requests/day | $12.17 | $365.10 | 19.7 months |
25M requests/day | $31.37 | $941.10 | 7.6 months |
50M requests/day | $62.73 | $1,881.90 | 3.8 months |
Based on the table, the €7,200 human cost only pays for itself after 7-8 months at extremely high traffic (25M+ requests/day), and after multiple years at moderate traffic levels, assuming:
- No major architectural changes
- No scaling issues requiring rework
- No additional failure modes discovered
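The payback arithmetic above can be wrapped in a small helper. This is a sketch that mixes EUR cost with USD savings exactly as the original calculation does:

```python
# Payback sketch: how long the (assumed) €7,200 engineering investment
# takes to pay for itself, given daily Lambda vs Fargate costs.
ENGINEERING_COST = 7200.0  # EUR, treated at parity with USD

def payback_months(lambda_daily: float, fargate_daily: float) -> float:
    """Months until Fargate's daily savings cover the engineering cost."""
    monthly_savings = (lambda_daily - fargate_daily) * 30
    if monthly_savings <= 0:
        return float("inf")  # Lambda is cheaper: Fargate never pays back
    return ENGINEERING_COST / monthly_savings

print(payback_months(21.67, 9.50))   # 10M requests/day
print(payback_months(54.17, 22.80))  # 25M requests/day
```

The `inf` branch captures the first row of the table: below the crossover point, the savings are negative and the investment never pays back.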
## What Makes Fargate "Complex"
Let's see what it takes to run containers in production:
```yaml
ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Sub "${AWS::StackName}-cluster"
    CapacityProviders:
      - FARGATE
      - FARGATE_SPOT
    DefaultCapacityProviderStrategy:
      - CapacityProvider: FARGATE
        Weight: 2 # 40% on-demand for stability
      - CapacityProvider: FARGATE_SPOT
        Weight: 3 # 60% spot for cost savings
    ClusterSettings:
      - Name: containerInsights
        Value: enhanced
```
Without Amazon ECS Container Insights enabled, I would have no visibility into the metrics that warn me about impending downtime, like:
- CPU
- Memory
- RunningTaskCount
Most teams start with what I call the 'monolithic service trap' - one big ECS service that seems simple until it isn't.
```yaml
MyService:
  Type: AWS::ECS::Service
  Properties:
    ServiceName: !Sub "${AWS::StackName}-my-service"
    Cluster: !Ref ECSCluster
    TaskDefinition: !Ref MyTaskDefinition
    LaunchType: FARGATE
    PlatformVersion: LATEST
    DeploymentConfiguration:
      Strategy: ROLLING
      MaximumPercent: 200 # Allow 100% extra capacity during deployment
      MinimumHealthyPercent: 100 # Keep all current tasks running until new ones are healthy
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true
    DeploymentController:
      Type: ECS
    NetworkConfiguration:
      AwsvpcConfiguration:
        AssignPublicIp: DISABLED
        SecurityGroups:
          - !Ref ECSSecurityGroupId
        Subnets: !FindInMap [!Ref StageName, !Ref "AWS::Region", PrivateSubnets]
    LoadBalancers:
      - ContainerName: my-container
        ContainerPort: 3000
        TargetGroupArn:
          Fn::ImportValue:
            !Sub "${ALBStackName}-My-ECS-TargetGroup-Arn"
    HealthCheckGracePeriodSeconds: 60

MyTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Sub "${AWS::StackName}-my-task-definition"
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    Cpu: 2048
    Memory: 4096
    ExecutionRoleArn: !GetAtt ECSExecutionRole.Arn
    TaskRoleArn: !GetAtt ECSTaskRole.Arn
    ContainerDefinitions:
      - Name: my-container
        Image: !Ref MyImageUri
        Essential: true
        PortMappings:
          - ContainerPort: 3000
            Protocol: tcp
        Environment:
          - Name: AWS_REGION
            Value: !Ref "AWS::Region"
          - Name: LOG_LEVEL
            Value: !FindInMap [LogLevel, !Ref StageName, level]
          - Name: NODE_OPTIONS
            Value: "--max-old-space-size=3072" # Use 75% of 4GB memory for heap
        # ...
```
The problems with this approach are:
- Single Point of Failure: one bad deployment kills everything
- All-or-Nothing Updates: I can't update one endpoint without redeploying the rest
- Blast Radius: a failure in one endpoint path can bring the whole service down
I like to organise my architecture around specific parameters, which vary with the application. I am working on an international streaming application, so I could potentially split my services by:
- country (it may be better to have one ECS service per country)
- type of users
- type of subscriptions
- type of platform
The above are examples, but they have some obvious advantages:
- Fault Isolation
- Independent Deployments
- Smaller Blast Radius
- Resource Isolation
- Independent Capacity
- Observability
- Scalability
A single monolithic service creates availability and scalability bottlenecks, where the failure of one component can impact the entire system.
On the opposite extreme, there is excessive fragmentation, which brings its own operational overhead. The sweet spot is correct isolation: when services are properly isolated, a bug or outage in one service cannot directly impact the others. This isolation is particularly valuable when services have different reliability requirements or face varying load patterns, as failures remain localised instead of bringing everything down.
A lesson I have learned is that service decomposition becomes economically justified when different components have significantly different resource usage and traffic patterns. Each component then scales independently according to its actual needs. Without this separation, the entire system must be provisioned for peak load across all components, often resulting in overprovisioning and wasted resources.
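To put a rough number on that waste, here is a toy calculation. The per-resource prices are approximate Fargate on-demand rates, and the two component profiles are invented purely for illustration:

```python
# Toy illustration of the overprovisioning argument. Prices are
# approximate Fargate on-demand rates; the component profiles are invented.
VCPU_HOUR = 0.04048  # USD per vCPU-hour (approximate)
GB_HOUR = 0.004445   # USD per GB-hour (approximate)
HOURS = 730          # hours in a month

def monthly_cost(vcpu: float, gb: float, replicas: int) -> float:
    """Monthly cost for `replicas` tasks of the given size."""
    return (vcpu * VCPU_HOUR + gb * GB_HOUR) * HOURS * replicas

# A CPU-heavy component (2 vCPU / 1 GB) needs 4 replicas at peak;
# a memory-heavy component (0.5 vCPU / 4 GB) needs only 1.
separate = monthly_cost(2, 1, 4) + monthly_cost(0.5, 4, 1)

# A monolithic task must carry both footprints (2.5 vCPU / 5 GB), and
# scaling is driven by the CPU-heavy component, so all 4 replicas drag
# the memory-heavy footprint along with them.
monolith = monthly_cost(2.5, 5, 4)

print(f"separate: ${separate:.2f}/mo  monolith: ${monolith:.2f}/mo  "
      f"waste: {100 * (monolith / separate - 1):.0f}%")
```

Even in this tiny example, the monolith pays roughly 30% more, purely because scaling one component forces replicas of the other.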
I want to highlight some important configurations for `AWS::ECS::Service` and `AWS::ECS::TaskDefinition`.
### Deployment Strategy
While AWS ECS supports blue/green deployments through CodeDeploy integration, this feature does not work when ALB target groups dynamically route traffic between ECS and Lambda functions.
Blue/green deployments require dedicated target groups for each deployment stage, which conflicts with my setup.
The rolling deployment strategy provides zero-downtime deployments while maintaining compatibility with my hybrid traffic management system.
### Resource Allocation: CPU and Memory Sizing
```yaml
Cpu: 2048    # 2 vCPU - increased from 1024 after load-testing findings
Memory: 4096 # 4GB - increased from 2048 due to memory pressure and GC issues
```
The resource allocation reflects the results of the load testing. Initially, I ran with 1 vCPU and 2GB RAM, but that allocation caused cascading performance issues during traffic spikes:
- CPU Saturation
- Memory Pressure
### Fargate Cold Start
As Vlad Ionescu pointed out in his post, Fargate takes its time to scale up; from what I can see, it can take up to 5 minutes. Also thanks to the load testing, I found that newly launched tasks were receiving full traffic as soon as they registered with the ALB target group, before completing their initialisation routines, because they were already considered 'healthy'. This contributed to the 502/503 errors.
To mitigate this, I added an extra configuration:
```yaml
MyServiceSTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetGroupAttributes:
      - Key: slow_start.duration_seconds
        Value: "60"
```
In the end, by increasing resource allocation to 2 vCPU/4GB and implementing a 60-second ALB slow start period, I achieved:
- Reduced Error Rates: The combination eliminated the 502/503 error spikes during scaling events
- Better Cost Efficiency: While individual tasks cost more, I need fewer of them and experience less scaling thrash
## The 14 potential pitfalls
Lambda is a full delegation to AWS. It is magic.
ECS Fargate has its complexity, but this is the price of control.
Here are all the cases I found while testing:
Failure Category | What Can Go Wrong | Real-World Impact |
---|---|---|
🔥 Emergency Failures | Zero running tasks available | 🚨 Complete service outage - immediate Lambda failover required |
⚠️ Capacity Issues | Down to minimum viable tasks | 📉 Gradual failure prevention - start shifting to Lambda |
💀 Critical Capacity | Only one task left running | 💣 Last chance before outage - emergency traffic routing |
🚀 Traffic Overload | Requests exceed Lambda capacity | ⬆️ Lambda can't handle load - need ECS scaling |
💥 Traffic Spike | Sudden burst traffic overwhelms system | ⚡ Viral content or DDoS - hybrid ECS+Lambda mode |
🌊 Sustained Load | High traffic persists for extended period | 💰 Cost optimization - shift to ECS-heavy hybrid mode |
😴 Low Traffic | Traffic drops below cost-effective threshold | 💸 ECS waste - switch back to Lambda |
💻 CPU Exhaustion | CPU utilization reaches warning threshold | 🥵 Resource competition - enable Lambda overflow |
🔴 CPU Emergency | CPU utilization hits critical levels | 🚒 Cascade failure risk - emergency Lambda routing |
💾 Memory Pressure | Memory usage approaches warning limits | 🤯 Memory leaks detected - Lambda overflow |
‼️ Memory Emergency | Memory usage hits critical threshold | 🆘 OOM kills imminent - immediate failover |
⬇️ Scale-In Problems | Auto-scaling fails to reduce capacity efficiently | 💵 Idle tasks burning money - gradual scale-down |
💀 Spot Interruptions | AWS reclaims spot instances | 🌩️ Multiple tasks lost suddenly |
🎯 Path-Specific Issues | Individual service failures | 🚧 Blast radius contained to one service |
Notice how many scenarios call for "Lambda failover" or "Lambda overflow" as the solution. This is because Lambda has a unique strength: it can spin up thousands of execution environments within seconds.
This remarkable ability comes at a cost, as I pay more because AWS manages everything for me. But when Fargate faces any of these 14 challenges, Lambda's instant scalability becomes priceless.
- Zero capacity planning becomes an asset when you need emergency capacity instantly
- Automatic scaling becomes crucial when Fargate scaling can't keep up with traffic spikes
- No infrastructure management means no infrastructure bottlenecks during a crisis
- Isolation per request ensures consistent performance regardless of load
## Performance trade-offs
- Fargate Performance Degrades Under Load: Fargate latency increases as more resources are consumed within the same task. A task handling 1 request performs very differently from one handling thousands simultaneously
- Lambda Maintains Consistent Performance: Lambda stays consistent because it's 1 request per execution environment
- Different Performance Profiles: Fargate excels at sustained moderate load (p95: 30ms vs 70ms), while Lambda excels at consistent performance regardless of system load
- Cost vs Performance Trade-off: You pay Lambda's higher price for guaranteed isolation and instant scalability
- Failure Recovery: Lambda's instant provisioning makes it ideal for emergency scenarios when Fargate fails
Fargate performs better in stable conditions but degrades under stress, and its total cost of ownership (infrastructure plus engineering effort) is high enough that the complexity pays off only in the long term.
Lambda performs consistently but at a higher cost per request; that premium buys operational simplicity and faster delivery cycles.
## It's Not About Cost, It's About Risk
After running this hybrid architecture in production, here's what I've learned:
- Lambda is not inherently more expensive
- Containers are not inherently cheaper
- Hybrid architectures give you the best of both worlds
## What's Next: Traffic Controller
I'll share the Traffic Controller implementation that makes this hybrid magic possible: a component that dynamically routes traffic between Lambda and ECS based on real-time traffic patterns, monitors CloudWatch alarms, understands task-aware load distribution, and makes routing decisions.
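To give a flavour of what is coming, here is a deliberately simplified sketch of the kind of decision such a controller might make. The class, thresholds, and percentages below are invented for illustration, not the real implementation:

```python
# Illustrative only: a toy routing decision based on a few of the signals
# from the scenario table. Names and thresholds are invented, not the
# real Traffic Controller implementation.
from dataclasses import dataclass

@dataclass
class EcsSignals:
    running_tasks: int
    cpu_utilization: float     # percent, e.g. from Container Insights
    memory_utilization: float  # percent, e.g. from Container Insights

def route_to_lambda_pct(s: EcsSignals) -> int:
    """Return the percentage of traffic to send to Lambda."""
    if s.running_tasks == 0:
        return 100  # emergency failover: no ECS capacity at all
    if s.running_tasks == 1 or s.cpu_utilization >= 90 or s.memory_utilization >= 90:
        return 75   # critical capacity: shed most load to Lambda
    if s.cpu_utilization >= 70 or s.memory_utilization >= 70:
        return 25   # warning level: enable Lambda overflow
    return 0        # healthy: keep traffic on the cheaper ECS path

print(route_to_lambda_pct(EcsSignals(0, 0, 0)))    # 100
print(route_to_lambda_pct(EcsSignals(4, 75, 40)))  # 25
```

The real controller has far more states (all 14 scenarios), but the shape is the same: turn alarm signals into a Lambda/ECS traffic split.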