My AWS service was over-provisioned by ~1000 times: A 5-minute guide on how to check app resource use on fargate

#metrics #billing #profiling #aws

Very quick backstory:

I built and am working on an app called Kichi. It's an app to help intermediate learners continue to learn Japanese and English at a very fast pace.

I've done a lot of deploys in my software career, and I've found a really good recipe for starting any new deploy. In AWS speak, you want to use RDS with a db.t4g.small wired up to two t4g.medium. This makes sure that you can handle unexpected loads if your service suddenly spikes, and that your DB won't go down assuming that your app is more process-intense than it is DB-intense. You then want your DB to be AZ-multi-instanced so that you don't have to worry about an outage taking you offline and so that your maintenance window doesn't take your service offline.

This is an excellent strategy for launching a service for a company. This is a terrible strategy for launching your own service from scratch. For reference - I knew my service would need to start small, so I set my db to micro and my two containers to 1 GiB memory with .5 vCPU.

How to actually profile your production app

So, something that I found profoundly annoying is that everyone seemed to tell me "profile your app to see how many resources you need", but no one actually told me how to do that. So, I found out.

1) Put your app on Fargate. I only say this because this is my current setup.
2) Click "Enable container insights" when you deploy. This will send a ton of metrics to CloudWatch. Note: this is very, very expensive to maintain long term.
3) Use your service for a little bit. Send a few of what you think might be your most demanding requests.
4) Go to the CloudWatch Management Console, click "Logs", then "Log insights".
5) Click 'Select Log Groups' on the top of the screen, and pick your logs that look something like aws/ecs/containerInsights/{clusterName}/performance
6) Prepare to be shocked at how efficient your app probably is.

Please note that for each query, simply change avg to max or vice-versa to see the average or max for that metric.

Memory Usage (in MiB)

stats max(MemoryUtilized) by bin (30m) as period, TaskDefinitionFamily, TaskDefinitionRevision
                | filter Type = "Task" | sort period desc, TaskDefinitionFamily |  limit 10

The number that comes out from this query will be the maximum memory that your containers have actually consumed, in MiB. I had provisioned 1GB per container. My max was ~299, so I dropped it to 512MB per container. Saved 50%.

CPU Usage

stats max(CpuUtilized) by bin (30m) as period, TaskDefinitionFamily, TaskDefinitionRevision
                | filter Type = "Task" | sort period desc, TaskDefinitionFamily |  limit 10

This one blew me away. So, the number that came out of this search was 0.7. I thought "Okay, so that's like 0.7 vCPU, that's not too bad."

Then I ran a different query to see what the container itself had provisioned for maximum CPU usage, just to double-check (that query can be found in the metrics source link at the bottom). The number that came back was 512.

Not .5. 512.

My app, with all of its OCR and DB access and sync and whatnot, only used about 1/100th of a vCPU in production.

I immediately reduced my docker containers from 2 to 1, and reduced the CPU allotment to 256. This saved me 75%. Then I turned off container insights, because CloudWatch was getting expensive and I wouldn't need to profile again for a long time.

So yeah. That's how you actually profile an app and see how much Memory/CPU it takes, and figure out how over-provisioned you are.

Thanks for reading.

Source for all metrics: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-ECS.html