How to Choose Monitoring Tools for DevOps and SRE


Originally published on Failure is Inevitable.

When developing for reliability or implementing resilient DevOps practices, data is at the heart of your decision-making. Without carefully monitoring key metrics like uptime, network load, and resource usage, you won’t know where to focus development effort or which operational practices to refine. Fortunately, a wide variety of monitoring tools are available to help you collect this data and gain visibility into it.

While it might be tempting to try to monitor absolutely everything in your system, more focused monitoring will be easier to implement and leave you with more actionable data. SRE practices like SLOs are most useful when based on metrics for customer impact. Deciding what and how to monitor is an important decision. We’ll walk you through the basics in this blog post. We’ll also suggest a few popular monitoring tools for your consideration.
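To make the SLO idea concrete, here is a minimal Python sketch of how a customer-facing metric (successful versus failed requests) could be turned into an error budget. The function name `error_budget_remaining` and the numbers are illustrative assumptions, not part of any particular tool.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the success-rate objective, e.g. 0.999 for "three nines".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(round(remaining, 2))  # 0.75
```

With a 99.9% target, one million requests tolerate about 1,000 failures, so 250 failures leave roughly three quarters of the budget unspent.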

Where to implement monitoring

It’s important to decide where in your system architecture you’ll implement monitoring. This will allow you to develop your architecture around the monitoring tool, rather than having to retrofit existing code. Depending on the location of implementation, monitoring tools will be able to observe different types of data. Here’s a breakdown of the most common types of monitoring implementations, along with examples of tools offering that type of monitoring:

Resource monitoring: Also known as server monitoring or infrastructure monitoring, this operates by gathering data on how your servers are running. Resource monitoring tools report on RAM usage, CPU load, and remaining disk space. In architectures with physical servers, information on hardware health, such as CPU temperatures and component uptime, can also help you avoid server failure. In cloud-based environments, aggregate metrics across your virtual servers are more useful than per-machine hardware details.
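As a rough illustration of the kind of data a resource monitoring agent gathers, the following Python sketch reads disk space and CPU load using only the standard library. The function name `resource_snapshot` is a hypothetical helper for this post. RAM usage is left out because the standard library has no portable way to read it; real agents usually rely on a dedicated library or the operating system's own interfaces.

```python
import os
import shutil


def resource_snapshot(path="/"):
    """Collect a few basic host metrics using only the standard library."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    snapshot = {
        "disk_free_fraction": disk.free / disk.total,
        "cpu_count": os.cpu_count(),
    }
    try:
        # 1-minute load average; only available on Unix-like systems.
        snapshot["load_avg_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass  # Not supported on this platform; skip the metric.
    return snapshot


if __name__ == "__main__":
    print(resource_snapshot())
```

A production tool would sample these values on a schedule and ship them to a central store rather than printing them.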

Network monitoring: This looks at the data coming in and out of your computer network. Your monitoring tool captures all incoming requests and outgoing responses across all components such as switches, firewalls, servers, and more. The data collected from network monitoring can be as simple as the total amount of data coming and going or as nuanced as the frequency of particular requests.
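The two ends of that spectrum, total traffic volume and the frequency of particular requests, can be sketched in a few lines of Python. The `(path, bytes)` record format and the sample data are assumptions made for illustration; real network monitors work from packet captures, flow records, or access logs.

```python
from collections import Counter


def summarize_requests(records):
    """Summarize (path, size_in_bytes) records into a total and per-path counts."""
    total_bytes = 0
    per_path = Counter()
    for path, size in records:
        total_bytes += size
        per_path[path] += 1
    return total_bytes, per_path


# Hypothetical sample of observed requests.
sample = [("/login", 512), ("/search", 2048), ("/login", 498)]
total, counts = summarize_requests(sample)
print(total)                  # 3058
print(counts.most_common(1))  # [('/login', 2)]
```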

Application performance monitoring: APM solutions collect data on how an overall service is performing. These tools will send their own requests to the service and track metrics such as the speed and completeness of the response. The goal is to drive detection and diagnosis of application performance issues to ensure services perform at expected levels.
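A single synthetic check of this kind might look like the sketch below, which times a request and verifies that the response contains expected content. The `probe` function and the stand-in `fake_fetch` are hypothetical; in practice `fetch` would wrap a real HTTP call with a timeout.

```python
import time


def probe(fetch, expected_substring):
    """Run one synthetic check: time the call and verify the response body.

    `fetch` is any zero-argument callable that returns the response body
    as a string. Any exception it raises is treated as a failed check.
    """
    start = time.perf_counter()
    try:
        body = fetch()
        ok = expected_substring in body
    except Exception:
        ok = False
    elapsed = time.perf_counter() - start
    return {"ok": ok, "latency_seconds": elapsed}


def fake_fetch():
    # Stand-in for a real request to the service under test.
    return "<html>status: healthy</html>"


result = probe(fake_fetch, "healthy")
print(result["ok"])  # True
```

Passing the request in as a callable keeps the timing and validation logic separate from the transport, so the same check can be pointed at your own service or at an external dependency.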

Third-party component monitoring: This involves monitoring the health and availability of third-party components in your architecture. In this era of microservices, it’s likely that your service depends on the proper functioning of external services, from cloud hosting to ad servers. Like application performance monitoring, tools can check the status of these services with their own requests.

You will likely want to include some of each type of monitoring in your overall solution. Prioritize having robust, redundant monitoring tools to ensure potential issues aren’t missed. At the same time, tie metrics and alerts to specific services so that they stay relevant to business impact.

What you need from your data

Having actionable data isn’t just about the data itself; in order to respond properly to what your monitoring tools are reporting, you need to have that data presented in the most useful way. Here are some things that monitoring tools can do for you:

Trigger alerts when metrics exceed certain thresholds
Create logs of events, highlighting based on parameters
Create graphs of metrics over time
Provide a dashboard of key service health components at a glance
Create databases of logs that can be queried
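The first capability in that list, threshold-based alerting, reduces to a simple comparison. Here is a minimal Python sketch; the metric names and limits are made-up examples, and real tools add features such as alert durations, severities, and notification routing on top.

```python
def check_thresholds(metrics, thresholds):
    """Return an alert message for each metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # Metrics that were not reported are skipped rather than alerted on.
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts


# Hypothetical current readings and configured limits.
metrics = {"cpu_load": 0.93, "disk_used_fraction": 0.41}
thresholds = {"cpu_load": 0.85, "disk_used_fraction": 0.90}
alerts = check_thresholds(metrics, thresholds)
print(alerts)  # ['cpu_load=0.93 exceeds threshold 0.85']
```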

When making development decisions or responding to an incident, try to get in the habit of asking yourself, “What would I need to be looking at right now to make the best choice?” Visualize the view you would want in front of you: what data it would contain and which metrics matter.

Open source vs purchased

Another important point to consider is where you’ll find your monitoring tools and who will maintain them. Both open source and purchasable tools are available, each with its own pros and cons.

Open source monitoring tools

These tools are free, which is an advantage for companies with limited tooling budgets. They’re also completely customizable, allowing you to integrate them into your own architecture. However, this customization will require dedicated development time and perhaps specialized knowledge. Furthermore, there is no SLA guaranteeing availability, security, update frequency, etc. Your team would own these responsibilities.

Purchased monitoring tools

These tools cost money, but they come with a level of vendor accountability that open source tools typically lack. The service provider will be accountable for keeping the tool functioning and up-to-date. The provider will likely offer customer service, training, documentation, and other resources to help you integrate the tool with your stack. In the era of reliability, making investments to ensure your monitoring eyes are always open is worth considering.

Comparison of Monitoring Tools

Here are a few of the most popular monitoring tools for SRE and DevOps to consider for your system.

AppDynamics is a monitoring platform focused on APM. Other features they offer include AI-powered insights, end-user monitoring to model customer journeys, and business monitoring with integrated revenue analysis. You can sign up for a free trial.
Datadog is a monitoring platform targeted at cloud-scale services. It offers robust visualization, alerting, and data consolidation and analysis features, and lets you correlate performance metrics with business impact. Datadog offers a free trial.
Prometheus is a popular open source monitoring tool offering alerting, querying, visualization, and many other useful features. The dedicated development community offers plenty of documentation and instruction to help you get up to speed.
New Relic is a monitoring platform offering several components that can also be used standalone: New Relic APM (application performance monitoring), New Relic Browser, and New Relic Infrastructure. They offer applications for iOS and Android, giving you more options for monitoring.
Nagios offers both an open source option (Nagios Core) and a purchasable option (Nagios XI). They offer a highly customizable interface and monitoring across your entire IT network. They also highlight their ease of use, with configuration wizards to guide users in setting up new monitoring services.
Dynatrace allows for cross-team collaboration with its monitoring platform, offering a single shared repository of monitoring data. It also includes autonomous cloud features and the ability to bring monitoring to the Internet of Things layer of deployment. A free trial is available.
SolarWinds offers several products, each specializing in a different area of monitoring: Network Management, Systems Management, Database Management, IT Security, IT Service Management, Application Management, and Managed Service Providers. Each can be tried for free.
Site24x7 specializes in website monitoring, offering tools such as status pages and diagnostics on the health of web servers and cloud platforms such as AWS and Azure. They also offer synthetic web transaction monitoring, allowing you to simulate usage and collect metrics. They offer several pricing plans depending on the services required.
SignalFx offers a wide array of microservice integrations, allowing you to see a complete picture of service health. This is important if your service contains many third-party components. Their focus is on helping you move your architecture from a monolithic to a microservices model.
PRTG Network Monitor is a complete monitoring service that can integrate into many stages and locations of your architecture. They offer monitoring at the level of networks, individual servers, specific applications, and everything in between. This provider also offers a free version.

No matter what monitoring tools you ultimately use, you’ll want to make the most of the data they provide in the context of a larger reliability solution that drives actionability. Blameless helps you transform monitoring data into SLOs and error budgets, and incorporate it into reliability insights. To see more of what Blameless can do, join us for a demo!