It is well known that containerizing an application can help reduce server costs. But if the setup is not designed properly, it can increase other costs, such as bandwidth. In this post, I’ll tell you how we racked up an $800 bandwidth charge when we first used ECS-Fargate.
When you first deploy an ECS Service, the ECS agent fetches your Docker image from an image registry like Docker Hub or ECR. With this downloaded image, the agent spawns a Docker container via the command docker run your-docker-image. ECS then runs a health check to see if your application is running. If the container passes the health check, the load balancer routes traffic to it. If it fails several health checks, the container is killed. The ECS agent then attempts to start another container from that same image.
ECS has two types of services, and they differ on how they handle restart attempts.
In ECS-EC2, you manage the fleet of EC2 instances that runs your containers. The number of containers you can run is limited by the CPU and memory capacity of your fleet. If an instance doesn't have the image, it downloads it once and stores it locally. Hence, after the first download, the image is already on the instance. When the ECS agent does docker run, it uses the local copy of the image instead of downloading it again.
In ECS-Fargate, the underlying EC2 instance on which your container runs is abstracted from you. You don't have access to the EC2 instances running your containers. AWS manages these instances for you; hence, the service becomes serverless. For a bit of a premium, you are freed from the operational burden of managing a fleet of EC2 instances.
It is quite likely that the EC2 instance your container runs on the first time isn't the same as the one it runs on the second time. Hence, the agent has to fetch your image from ECR every time ECS attempts to spawn another container.
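To make the difference between the two service types concrete, here is a rough simulation of how many times the image gets pulled from the registry across repeated container starts. It is only a sketch: the function name, the host pool sizes, and the idea of picking a host at random are assumptions for illustration, not how ECS actually schedules tasks.

```python
import random


def count_image_pulls(restarts, host_pool_size, cache_images, seed=0):
    """Count registry pulls over a series of container starts.

    host_pool_size: number of hosts a container may land on (hypothetical).
    cache_images:   True models ECS-EC2, where a host keeps the image on disk;
                    False models ECS-Fargate as described above, where every
                    start pulls the image again.
    """
    rng = random.Random(seed)
    hosts_with_image = set()
    pulls = 0
    for _ in range(restarts):
        host = rng.randrange(host_pool_size)
        if cache_images and host in hosts_with_image:
            continue  # image is already on this host, nothing to download
        pulls += 1
        if cache_images:
            hosts_with_image.add(host)
    return pulls


# One self-managed instance that caches the image vs. a large pool with no cache.
ec2_pulls = count_image_pulls(restarts=1000, host_pool_size=1, cache_images=True)
fargate_pulls = count_image_pulls(restarts=1000, host_pool_size=1000, cache_images=False)
print(ec2_pulls, fargate_pulls)
```

With caching, the number of pulls is capped by the number of hosts. Without it, every restart costs one full image download.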
I was migrating our services from ECS-EC2 to ECS-Fargate. However, I was not able to properly set up one service, and I left it in a misconfigured state. Since the containers the service created were misconfigured, the application inside them never ran, so they kept failing health checks. After a few failures, the container was destroyed and the service attempted to create another one. Since the instances underneath ECS-Fargate containers keep changing, my 500MB Docker image had to be downloaded again on every restart. Imagine how 500MB every 2-3 minutes easily got to 16TB in one month.
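As a back-of-the-envelope check on those numbers, the sketch below estimates the monthly download volume from the image size and the restart interval. The function name, the 30-day month, the 1 TB = 1000 GB conversion, and the tasks parameter are assumptions; the post does not say how many tasks the service was trying to run.

```python
def monthly_pull_volume_gb(image_size_gb, minutes_between_restarts, tasks=1, days=30):
    """Estimate GB downloaded per month when every restart re-pulls the image."""
    restarts_per_task = days * 24 * 60 / minutes_between_restarts
    return image_size_gb * restarts_per_task * tasks


# 500MB image, one restart every 2 or 3 minutes, a single task.
for minutes in (2, 3):
    gb = monthly_pull_volume_gb(0.5, minutes)
    print(f"every {minutes} min: {gb / 1000:.1f} TB per task per month")
```

Under these assumptions a single crash-looping task accounts for several terabytes a month on its own, and the volume scales linearly with the number of tasks the service keeps trying to start.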
One of my pastimes at work is examining our AWS bill. I was expecting significant savings because we moved from having 10 m5.large EC2 instances to just 1 m5.large instance and several ECS-Fargate containers. But the savings were less than half of what I expected. So I dug deeper and found out that our NAT Gateway charges had increased sixfold. Our bandwidth consumption went from 38GB/mo to 16TB/mo!
The NAT Gateway is the component through which resources in the private subnet access the internet. It charges $0.045 for every GB that flows through it.
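For a sense of scale, here is the per-GB arithmetic applied to the two bandwidth figures above. This is a sketch only: it treats 1 TB as 1000 GB, uses the $0.045/GB rate quoted in this post, and covers just the data processing charge, not any hourly charge for the gateway itself.

```python
NAT_PRICE_PER_GB = 0.045  # USD per GB, the rate quoted in this post


def nat_data_processing_cost(gb_processed, price_per_gb=NAT_PRICE_PER_GB):
    """Return the per-GB NAT Gateway charge in USD for a month's traffic."""
    return gb_processed * price_per_gb


before = nat_data_processing_cost(38)      # 38GB/mo before the migration
after = nat_data_processing_cost(16_000)   # 16TB/mo after, with 1 TB = 1000 GB
print(f"38 GB/mo: ${before:.2f}")
print(f"16 TB/mo: ${after:.2f}")
```

The second figure comes out to roughly $720, which is in the same ballpark as the $800 charge mentioned at the start.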
Since the cost increase coincided with the move to ECS-Fargate, I decided to turn off all our containers for a few minutes. The bandwidth consumption suddenly went down.
To narrow it down to a particular service, I turned everything back on except the one service that I had left misconfigured, then turned that service on as well. The costs suddenly went back up again!! That’s when I discovered that leaving a service in a misconfigured state in ECS-Fargate increases costs.
Never leave your ECS services in a misconfigured state. If you can't finish the setup, at least set the desired container count to zero so the service does not keep spawning containers.
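One way to park an unfinished service is to set its desired count to zero through the ECS API. Below is a minimal sketch that assumes you use boto3, the AWS SDK for Python. The helper name stop_service and the cluster and service names in the usage comment are placeholders, not anything from our setup.

```python
def stop_service(ecs_client, cluster, service):
    """Set the service's desired task count to zero so ECS stops launching tasks.

    ecs_client is expected to be a boto3 ECS client, i.e. boto3.client("ecs").
    Returns the desired count reported back by ECS.
    """
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=0,
    )
    return response["service"]["desiredCount"]


# Usage (needs boto3 installed and AWS credentials; the names are placeholders):
#   import boto3
#   stop_service(boto3.client("ecs"), "my-cluster", "my-unfinished-service")
```

The service definition stays in place, so you can come back, fix the configuration, and raise the count again when you are ready.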
Also, use your AWS bill as a feedback mechanism. A sudden, unintended cost in one aspect of your bill can mean something has gone wrong.
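If you'd rather not rely on eyeballing the bill, a check like this can be automated. The sketch below compares two months of usage and flags anything that grew by more than a chosen factor. The function name, the 2x threshold, and the line-item key are made up for illustration; the two values are the bandwidth figures from this post, with 16TB written as 16,000 GB.

```python
def find_spikes(previous, current, threshold=2.0):
    """Return line items whose value grew by at least `threshold` times."""
    spikes = {}
    for item, now in current.items():
        before = previous.get(item)
        if before and now / before >= threshold:
            spikes[item] = now / before
    return spikes


# NAT Gateway traffic in GB per month, before and after the migration.
previous_month = {"nat_gateway_gb": 38}
current_month = {"nat_gateway_gb": 16_000}

for item, ratio in find_spikes(previous_month, current_month).items():
    print(f"{item}: {ratio:.0f}x increase")
```

Items that are new this month or were zero last month are skipped here, so in practice you would want to report those separately.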
Special thanks to Allen, my editor, for helping this post become more coherent.
I'm happy to take your comments/feedback on this post. Just comment below, or message me!