In this post, I want to raise awareness of AWS CloudWatch costs and help you understand what drives them.
Whenever you build a service or application, logs and metrics are crucial for insight into how it behaves. In case of trouble, logs are the first place to look, while metrics let you spot trends and get notified as soon as numbers start deviating.
The amount of logs and metrics often starts small and grows along with the service. In AWS, metrics and logs are part of AWS CloudWatch.
Over the years, I've noticed the following regarding CloudWatch costs: most of the time, AWS CloudWatch is not the most expensive service on the AWS bill. Usually, CloudWatch doesn't even make it into the top three. CloudWatch seems to be a master of disguise, in a goody-goody way. It often ends up in fourth or fifth place on your AWS bill. That spot -just outside the top three- appears to be the perfect place to eat up a good portion of your budget without becoming a usual suspect in the next cost-saving round. 😓
I'm going to be straight: I suspect AWS CloudWatch is one of AWS's cash cows. Why? Besides being a master at staying under the radar, CloudWatch also has costly defaults. Take log group retention as an example: the default for log groups to never expire only makes sense to AWS! My suspicion is amplified by the fact that it's damn hard (nearly impossible) to get useful cost insights.
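That never-expire default is easy to spot programmatically. Here's a minimal sketch in Python (the helper function and sample data are my own illustration, not from AWS): in boto3's `describe_log_groups` response, a log group without a retention policy simply lacks the `retentionInDays` key.

```python
# Sketch: flag log groups still on the costly "never expire" default.
# boto3's logs.describe_log_groups() omits retentionInDays for such groups.
def never_expiring(log_groups):
    """Return names of log groups that have no retention policy set."""
    return [g["logGroupName"] for g in log_groups if "retentionInDays" not in g]

# Hypothetical excerpt of a describe_log_groups() response:
groups = [
    {"logGroupName": "/aws/lambda/checkout", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/orders"},  # never expires -> pays forever
]
print(never_expiring(groups))  # ['/aws/lambda/orders']
```

Fixing a flagged group is then a one-liner with the CLI: `aws logs put-retention-policy --log-group-name <name> --retention-in-days 30`.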
First, regarding CloudWatch Logs, note that it's not data storage that's costly; it's data ingestion that generates most of the cost. That leads to the first question about your logs: which log groups generate the highest cost? That insight isn't available out of the box, but I figured out an excellent way to get it. I've built a graph showing the distribution of data ingestion across log groups, with the log groups eating up the most money on the left.
Immediately, you'll notice a significant outlier on the far left of the graph. No big surprise; in this case, the clear winner is AWS CloudTrail. Outliers like these make it hard to zoom in on the other elements. To drill down further, start disabling the outliers in the graph's legend on the right.
Setting up the log group data ingestion distribution graph yourself is easy. In CloudWatch, create a new dashboard and add a new widget. The most important thing is to fill in the following query to build the graph:
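A metric-math SEARCH expression does the job here. CloudWatch Logs publishes the `IncomingBytes` metric per log group in the `AWS/Logs` namespace; the statistic and period below are my own choices, not necessarily the original setup:

```
SEARCH('{AWS/Logs,LogGroupName} MetricName="IncomingBytes"', 'Sum', 86400)
```

With the `Sum` statistic and a one-day period (86,400 seconds), each series shows the bytes ingested per log group per day.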
To enhance the contrast between group sizes, you can play around with the graph's period:
Moving the legend to the right helps to select certain groups.
The graph visualization allows you to prioritize which log groups to optimize. Ultimately, the optimization will likely come down to lowering your volume of logs:
- Do you need that many log statements? Challenge yourself!
- Question the log level. A DEBUG level is sometimes great, but using it everywhere, all the time, is probably overkill.
- The CloudWatch log agent is not shy about generating a good amount of logs (and costs) itself.
I know, I'm stating the obvious here 😄
If getting log insights was complex, getting metrics data is even trickier. On top of that, I have the feeling CloudWatch metrics are an even bigger money pit than logs. Worse, there's currently no way to get a per-metric cost breakdown. I contacted AWS support asking for one, but I'm told it's impossible. 😟
So all I can do for now is create awareness about CloudWatch metrics costs. Remember that metric costs can become expensive very fast. For example, at some point, I created a deep health check on a fleet of instances with the help of some custom metrics. I assumed the cost would be negligible, only to discover those deep health checks cost around $900 a month. I can tell you that health is long gone now. 😉
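To see how fast custom metrics add up, here's a hypothetical back-of-envelope calculation. The fleet size, metric count, and call frequency are invented, and the prices are assumptions based on public us-east-1 CloudWatch pricing (roughly $0.30 per custom metric per month in the first tier, and $0.01 per 1,000 PutMetricData requests); check the current price list before relying on them.

```python
# All numbers below are illustrative assumptions, not AWS-confirmed figures.
CUSTOM_METRIC_PRICE = 0.30      # $ per custom metric per month (first tier)
API_PRICE_PER_1000 = 0.01       # $ per 1,000 PutMetricData requests

instances = 250                 # hypothetical fleet size
metrics_per_instance = 10       # e.g. one metric per health-check dimension
calls_per_month = 60 * 24 * 30  # one PutMetricData call per minute

metric_charge = instances * metrics_per_instance * CUSTOM_METRIC_PRICE
api_charge = instances * calls_per_month / 1000 * API_PRICE_PER_1000

print(f"metrics: ${metric_charge:.0f}/month, API calls: ${api_charge:.0f}/month")
```

With these made-up numbers, the metric charge alone is $750 a month before a single API call is billed; a "negligible" health check lands right in the ballpark of the figure above.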
My advice is to check your CloudWatch costs at least monthly. Be aware that CloudWatch costs tend to stay under the radar, and try to find a way to get good insight into them.
Enjoy and until next time!