Serverless lets you deploy applications far away in a data center of a cloud provider. This relieves you of the lion's share of operational burdens. The more you buy into your cloud provider's ecosystem, the less you have to do yourself: no more OS updates or database bugfix installations.
But you still need to do some operations work on your own. For instance, you have to monitor your application to know what's going on in that faraway data center.
Usually, the monitoring journey of a new software product in the cloud goes like this:
The first version gets built with just basic monitoring capabilities or worse, without any monitoring. Then things go wrong as they always do, and nobody is really sure why. After debugging the problem, people get the idea that they didn't have enough metrics. Depending on the severity of the issue, this leads either to more metrics added to the monitoring setup with every new incident or an overkill solution where everything is monitored.
Either way, after a sufficient number of issues with the application, there are so many metrics and alerts set up that you have gone from having no insight into your infrastructure to having so much that the important parts drown in the sheer volume of metrics and alerts.
Why are Too Many Metrics and Alerts Bad?
If you encounter grave failures that threaten your business, you become cautious. You don't want anything to slip through the cracks, so you add everything you can find. But in the end, all that extra data drowns out the important data that could have saved you from the next incident.
If you can't see the forest for the trees, then you're back at square one, with no insights. The crucial question is, how much data can you and your team, the humans looking at the monitoring dashboards, reasonably take in?
All the metrics you don't need can distract you from the important parts. They can also lead you to optimize for metrics that don't matter.
Should You Get Rid of Your Metrics?
Short answer: No, you shouldn't.
Long answer: All these metrics are potentially obscuring important information. Should you stop storing them altogether? Well, the problem isn't that the metrics are saved somewhere. While storage can become a financial and performance problem under certain circumstances, it isn't a direct problem for your operations. The issue is that your teams can't make sense of the metrics. Depending on future requirements, it could be good to have them stored somewhere, so you can display them when needed.
Instead, cut back the dashboards and graphs built on these metrics to relieve your team of the information overload.
Think in Terms of SLAs, SLOs, and SLIs
A Service Level Agreement (SLA) includes promises you made to your customers in your contracts with them. For example, your service will respond within one second for 99% of all requests. You are legally bound to that promise, so it must never be broken. You can look at the AWS Lambda SLA to see what this means in practice. AWS loses real money when their service is down for too long.
A Service Level Objective (SLO) restates an SLA in a form you can measure in a meaningful way. The one-second response example above is easy to measure, but that is not always the case. SLOs are the thresholds you define for your metrics that should not be crossed: the successful response rate should not fall below a certain ratio, the response latency should not rise above a certain value, and so on.
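The threshold idea can be sketched in a few lines of Python. This is only an illustration, not how Dashbird or any other tool implements alerts: the `Slo` class, its field names, and the `slo_violations` helper are hypothetical, and the one-second and 99% values are taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: thresholds that measured values must not cross."""
    min_success_ratio: float  # e.g. 0.99 = at least 99% of requests succeed
    max_latency_ms: float     # e.g. 1000 = average response within one second


def slo_violations(slo: Slo, success_ratio: float, latency_ms: float) -> list[str]:
    """Return the names of all thresholds the measured values break."""
    violations = []
    if success_ratio < slo.min_success_ratio:
        violations.append("success_ratio")
    if latency_ms > slo.max_latency_ms:
        violations.append("latency")
    return violations


api_slo = Slo(min_success_ratio=0.99, max_latency_ms=1000)
print(slo_violations(api_slo, success_ratio=0.97, latency_ms=1200))
```

An empty list means every threshold holds, so a single call answers "are we within our SLO?" without anyone having to read the underlying graphs.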
In Figure 1, you can see how such an SLO gets set as a Dashbird alert. Here the latency should always be under one second on average.
Figure 1: Dashbird alert for API latency. A critical-level alarm will alert you via email, Slack, SMS, etc., when the API response time is above 1,000 milliseconds (1 second)
The Service Level Indicator (SLI) is where the solution to our hoarding problem lies. An SLI consists of one or more important metrics that tell you whether your system is currently breaking any SLOs. "One or more metrics" is the key phrase. If you can calculate a single value from multiple metrics that shows whether your SLOs are met, you don't have to look at every metric to check if things are going well.
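As a rough sketch of "one value from multiple metrics", the snippet below folds two raw metrics per request (success and duration) into a single ratio of good requests. The function name, the tuple layout, and the choice to report 1.0 when there is no traffic are assumptions made for this illustration only.

```python
from typing import Iterable, Tuple


def good_request_ratio(
    requests: Iterable[Tuple[bool, float]],
    max_latency_ms: float = 1000.0,
) -> float:
    """Combine success and latency into one SLI.

    Each request is a (succeeded, duration_ms) pair. A request counts as
    "good" only if it succeeded AND finished within max_latency_ms.
    """
    total = 0
    good = 0
    for succeeded, duration_ms in requests:
        total += 1
        if succeeded and duration_ms <= max_latency_ms:
            good += 1
    # Assumption: with no traffic at all, nothing was violated.
    return good / total if total else 1.0


sample = [(True, 120.0), (True, 950.0), (False, 200.0), (True, 1500.0)]
print(good_request_ratio(sample))  # 2 of 4 requests are good -> 0.5
```

Comparing this one number against the SLO target (for example 0.99) replaces watching separate error-rate and latency graphs.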
In Figure 2, you can see such an SLI, the duration a Lambda function took to execute in milliseconds.
Figure 2: Dashbird function details for AWS Lambda
It's a top-down approach. You write a contract; it explains your SLAs; you define your SLOs based on these requirements and then set up SLIs so you can check if the SLOs are holding.
Smart Triggers Help Solve Real Business Problems
In the end, alerts should trigger when your SLOs are no longer met or, better, well before that point, so you can solve problems that are about to happen.
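One way to alert before an SLO is actually broken is to treat the gap between 100% and the SLO target as a budget and warn once a chosen share of it is used up. The sketch below assumes that approach; the `alert_level` function, the three level names, and the 50% warning share are hypothetical choices, not a description of any particular tool.

```python
def alert_level(sli: float, slo_target: float, warn_fraction: float = 0.5) -> str:
    """Classify an SLI reading against an SLO target.

    sli and slo_target are ratios of good requests (e.g. 0.993 and 0.99).
    The allowed failure budget is 1 - slo_target. We warn once
    warn_fraction of that budget is consumed and go critical once the
    SLO itself is broken.
    """
    budget = 1.0 - slo_target  # share of requests that may fail
    used = 1.0 - sli           # share of requests that did fail
    if used > budget:
        return "critical"      # SLO is broken
    if used >= warn_fraction * budget:
        return "warning"       # SLO still holds, but trouble is near
    return "ok"


for reading in (0.999, 0.993, 0.98):
    print(reading, alert_level(reading, slo_target=0.99))
```

With a 99% target, a reading of 99.3% still satisfies the SLO but already raises a warning, which gives the team time to react before the threshold is crossed.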
Does disk space or CPU utilization play into your SLOs? If not, don't display them.
This doesn't mean you should only define SLOs for the SLAs in your contracts. Your contract may well leave out something that your customers treat as an important service level. After all, contracts aren't perfect, and your customers could still become angry if you fail to deliver something they expect. The missing clause only means they can't sue you, which is important for your company but is just the bare minimum.
Dashbird was Built with Best Practices in Mind
Dashbird comes with plenty of serverless know-how out of the box. After all, it was created because the founders saw firsthand how a lack of observability, or too much information, could get in the way of product development.
After you integrate Dashbird with your AWS account, it starts to collect monitoring data from CloudWatch and builds important metrics for your infrastructure right away without any additional coding.
Dashbird sets up metrics and alerts for all supported AWS resources, so you don't have to. These are based on years of experience with monitoring serverless systems for Dashbird customers and, of course, the AWS Well-Architected Framework, the official resource from AWS for building and maintaining applications on the AWS cloud.
Figure 3 shows an example of these insights: a Lambda function's runtime can be upgraded. This is important because runtime versions aren't supported forever. Dashbird points out the available upgrade well before support for the current runtime version ends.
Figure 3: Dashbird Well-Architected Lens
Dashbird only shows you what's important, so your team doesn't get overwhelmed. This leaves room for your team to add their own SLIs and SLOs for the SLAs defined in your contracts, while still highlighting crucial metrics you should keep in mind.