One of the main responsibilities of a Site Reliability Engineer (SRE) is to ensure applications and systems are performing the way they should be. That means not just making sure they're up and operational, but also asking: are these systems and applications performing the way they should from a speed, reliability, and automation perspective?
Time and time again, organizations worry about keeping systems and services up, but not about how they are performing. Some questions you should ask yourself are:
- How are the customers feeling about the performance of the application?
- Is it slow or bogged down?
- Are CDNs set up for caching across regions (if you have a global app)?
- How many timeouts are in the logs?
- Are the on-call alerts that occur fixable?
- And many more questions...
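To make the timeout question concrete, here is a minimal sketch of counting timeout-looking lines per service in a log file. The log format, the service names, and the `count_timeouts` helper are all assumptions made up for illustration; your logs will almost certainly look different, so treat the pattern and field positions as placeholders.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
# "<ISO timestamp> <LEVEL> <service> <message>"
SAMPLE_LOGS = """\
2024-05-01T10:00:01Z INFO checkout request completed in 120ms
2024-05-01T10:00:05Z ERROR checkout upstream request timed out after 30s
2024-05-01T10:01:12Z ERROR search connection timeout to index node
2024-05-01T10:02:40Z WARN checkout retrying request
2024-05-01T10:03:02Z ERROR checkout upstream request timed out after 30s
"""

# Matches "timed out" or "timeout" anywhere in the message, case-insensitively.
TIMEOUT_PATTERN = re.compile(r"timed out|timeout", re.IGNORECASE)


def count_timeouts(log_text: str) -> Counter:
    """Count timeout-looking lines per service (the third whitespace field)."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) < 4:
            continue  # skip lines that do not fit the assumed format
        _timestamp, _level, service, message = parts
        if TIMEOUT_PATTERN.search(message):
            counts[service] += 1
    return counts


timeouts = count_timeouts(SAMPLE_LOGS)
print(dict(timeouts))
```

Even a rough count like this tells you which service to look at first, which is usually more useful than a single "is it up?" check.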
In this blog post, I'll help answer these questions and list out a few products that will help you in the application and system performance journey.
If your application and/or platform is new, the biggest goal is to keep the lights on. Maybe it's a startup or an open-source project running on some Lambda functions or containers. The goal is to ask yourself one question: did this thing go down? If not, everyone is happy and feeling good... but then a few problems start to creep in.
A few timeouts here and there, some services going down every once in a while, and before you know it, everyone is complaining about the reliability, speed, stability, and performance of the application.
Keeping applications and systems up and operational is of course important, but once you solve that, it's time to start thinking about how applications are performing in the cloud, on virtualized environments, and even on bare metal. It doesn't matter where the application is running. The goal is to ensure it's performing at its best as much of the time as possible (you can't really say all of the time, because there's no realistic way to ensure that).
Luckily for you, there's an entire industry and way of thinking when it comes to performance. You may have heard of it before - Application Performance Monitoring, or APM for short. APM is the management and monitoring of apps, systems, and services. The goal with APM is to detect and diagnose app performance issues. Some APM tools are even starting to implement Artificial Intelligence (AI) to detect these issues before they occur. The industry is beginning to call this AIOps.
APM is the equivalent of a nurse or doctor consistently coming in to check on you if you're at the hospital. If something's wrong, they'll figure it out.
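As a rough illustration of what "detecting a performance issue" can look like, the sketch below checks request latencies against a threshold. Real APM tools do far more than this; the sample latencies, the `percentile` helper, and the 500 ms threshold are all assumptions chosen for the example, not values from any particular product.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 135, 110, 980, 125, 140, 130, 115, 1250, 128]

# Hypothetical threshold: flag the app if the 95th percentile exceeds 500 ms.
P95_THRESHOLD_MS = 500

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
is_slow = p95 > P95_THRESHOLD_MS

print(f"p50={p50}ms p95={p95}ms slow={is_slow}")
```

In this made-up data the typical request looks fine, but the slowest requests are far past the threshold, which is exactly the kind of problem an uptime check alone would miss.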
As discussed in the previous section, the goal of APM is to figure out what's wrong in an application to maintain the expected level of service for your customers and internal leadership/management teams. When you're building an application, scalability and architecture are at the forefront of your mind. Once the app is built, the next step is to think about how to keep it performing as well as possible in any environment.
Below are four tools that help with APM (in no particular order):
- Datadog - https://www.datadoghq.com/product/apm/
- New Relic - https://newrelic.com/products/application-monitoring
- Dynatrace - https://www.dynatrace.com/platform/application-performance-management/
- AppDynamics - https://www.appdynamics.com/
When thinking about performance, it's not just about taking action to solve a problem with an application. Before you can do that, you need the data. Observability monitoring is the practice of collecting as much data as possible so you know what error occurs in an environment and, most importantly, when (time, date, etc.) it happens. Understanding an error is helpful, but knowing when it happens is crucial. You'll be able to tell whether an error occurs under a certain application load or during certain timeframes, and that data can save a ton of time.
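To show why the "when" matters, here is a minimal sketch that groups error events by hour of day to see whether they cluster around a particular timeframe. The event list and the `errors_by_hour` helper are hypothetical; in practice this data would come from whatever your observability tool collects.

```python
from collections import Counter
from datetime import datetime

# Hypothetical error events: (ISO 8601 timestamp, error name).
ERROR_EVENTS = [
    ("2024-05-01T09:15:00", "DBTimeout"),
    ("2024-05-01T09:47:00", "DBTimeout"),
    ("2024-05-01T09:52:00", "HTTP502"),
    ("2024-05-01T14:05:00", "DBTimeout"),
    ("2024-05-02T09:30:00", "DBTimeout"),
]


def errors_by_hour(events):
    """Count errors per hour of day (0-23), regardless of date."""
    counts = Counter()
    for timestamp, _name in events:
        counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts


hourly = errors_by_hour(ERROR_EVENTS)
busiest_hour, busiest_count = hourly.most_common(1)[0]
print(f"Most errors occur in hour {busiest_hour}: {busiest_count} events")
```

If most errors land in the same hour across several days, as they do in this made-up data, that points toward something time-related (a traffic peak or a scheduled job, for example) rather than a random failure.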
Below are four tools that help with observability monitoring (in no particular order):