TL;DR
Monitoring can be complex, especially given the number of available tools and the dynamic nature of environments.
In this article, I compare the most popular monitoring tools to help you choose the right one for your specific needs.
Odigos - Open-source Distributed Tracing
Monitor all your apps simultaneously without writing a single line of code!
Simplify OpenTelemetry complexity with the only platform that can generate distributed tracing across all your applications.
We are really just starting out.
Can you help us with a star? Plz? 😽
https://github.com/keyval-dev/odigos
Introduction
With the growing need for scalable applications, Kubernetes emerged as the standard for managing containerized workloads and services.
It makes deploying and running applications on distributed instances easy, but monitoring the infrastructure can be challenging.
Kubernetes monitoring is the practice of tracking and observing the performance, health, and behavior of your applications and the infrastructure providing them.
It involves collecting and analyzing traces, metrics, and logs to help you detect and troubleshoot issues, and even optimize your clusters for better resource management.
Because Kubernetes is such a complex environment, a variety of tools have emerged to make monitoring it easier.
Here we’ll explore some of the main solutions and hopefully help you choose according to your needs.
Key metrics to monitor
First, let’s divide our metrics into two groups:
Resource Utilization
These include CPU, memory, and disk usage at the cluster, node, pod, and container levels, and help you make decisions about decreasing or increasing the size of your cluster. It is also important to monitor your cluster with more general metrics such as node availability and health.
- Cluster is a set of nodes that run containerized applications. It is the highest level of abstraction in Kubernetes.
- Node is a physical or virtual machine that is part of your cluster, for instance a virtual machine in the cloud.
- Pod is a group of one or more containers that share storage and network resources. It is the smallest deployable unit in Kubernetes.
- Container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another.
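To make the resource utilization idea concrete, here is a minimal Python sketch that turns Kubernetes-style quantity strings (such as `250m` CPU or `512Mi` memory) into numbers and computes usage as a percentage of capacity. The function names and the example values are assumptions made for illustration, and the parser covers only the common suffixes, not the full Kubernetes quantity grammar.

```python
# Common memory suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '500m', '12000000n' or '2' into cores."""
    if quantity.endswith("n"):  # nanocores
        return int(quantity[:-1]) / 1_000_000_000
    if quantity.endswith("m"):  # millicores
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' or '1G' into bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def utilization(used: float, capacity: float) -> float:
    """Return usage as a percentage of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100 * used / capacity


# Hypothetical pod usage measured against a hypothetical node capacity.
cpu_pct = utilization(parse_cpu("250m"), parse_cpu("4"))
mem_pct = utilization(parse_memory("512Mi"), parse_memory("8Gi"))
print(f"CPU: {cpu_pct:.2f}%  memory: {mem_pct:.2f}%")
```

The same percentage can be computed at the container, pod, node, or cluster level. Only the numbers you feed in change.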
Application Performance
These will depend on the type of your application and business.
For example, an API will provide metrics like response time, error rates, and throughput.
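As a rough illustration of those API metrics, the sketch below computes throughput, error rate, and 95th-percentile latency from a list of request records. The `Request` type, the choice to count only 5xx responses as errors, and the nearest-rank percentile method are all assumptions. A real monitoring tool would typically derive these from counters and histograms instead.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, pct: int):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% * n
    return ordered[rank - 1]


def summarize(requests, window_seconds: float) -> dict:
    """Summarize the requests observed during one time window."""
    if not requests:
        raise ValueError("requests must not be empty")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_latency_ms": percentile([r.latency_ms for r in requests], 95),
    }


# Hypothetical sample: 20 requests over 10 seconds, one of them failing.
sample = [Request(latency_ms=10.0 * i, status=200) for i in range(1, 20)]
sample.append(Request(latency_ms=200.0, status=500))
print(summarize(sample, window_seconds=10))
```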
With that in mind, let’s get to know some of the solutions we can use.
Tools
Kubernetes Dashboard
https://github.com/kubernetes/dashboard
The Kubernetes Dashboard is a web-based user interface made for monitoring and managing Kubernetes clusters. You can access essential information such as CPU and memory utilization, deploy and manage applications running in the pods, and change the amount of resources in the cluster.
It gives you a basic overview of your cluster, makes common actions easy to perform, and is maintained by the Kubernetes community.
The trade-off for that simplicity is a limited set of visualizations and no advanced resource metrics.
cAdvisor
https://github.com/google/cadvisor
cAdvisor is an open-source tool developed to monitor containers, and since Kubernetes is a container orchestrator, we can use it too. It can help you collect, process, and export container metrics such as CPU and memory usage. It is built into the kubelet, so it runs on every Kubernetes node by default, and it can even expose Prometheus metrics.
Being built into Kubernetes makes it easy to use, but its functionality is limited, so it is usually used together with Prometheus and Grafana.
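To show what "exposing Prometheus metrics" looks like in practice, here is a small Python sketch that parses a few lines of the Prometheus text exposition format. The sample text is made up for illustration (the pod and namespace values are hypothetical), and the parser is deliberately simplified: it ignores timestamps and does not unescape label values.

```python
import re

# metric_name{label="value",...} value
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape output in the style of container metrics.
EXAMPLE = """\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{namespace="default",pod="api-1"} 12.5
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{namespace="default",pod="api-1"} 52428800
"""

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels, value)
```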
Prometheus
Prometheus is the leading open-source monitoring solution and is widely regarded as the de facto standard for monitoring Kubernetes. It is a graduated project of the Cloud Native Computing Foundation (CNCF).
Prometheus consists of three main components:
- Server: This component is responsible for managing and storing the metrics collected.
- Alertmanager: It handles alerting and notification functionalities.
- Exporters: They generate metrics and expose them for collection.
Exporters gather metrics from various sources and expose them over HTTP; the server scrapes them at regular intervals and stores them in its time-series database for analysis and visualization.
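As a sketch of the exporter side, the Python function below renders a gauge in the text format that a Prometheus server reads when it scrapes a target. The metric name `queue_depth` and its label are hypothetical, and a real exporter would normally use an official client library and serve this text over HTTP rather than build the string by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples) -> str:
    """Render (labels, value) pairs as exposition-format text for one gauge."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric describing how many jobs are waiting in two queues.
print(render_gauge(
    "queue_depth",
    "Jobs waiting in the queue.",
    [({"queue": "emails"}, 7), ({"queue": "reports"}, 2)],
))
```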
Prometheus offers several noteworthy features:
- A powerful query language (PromQL).
- Built-in alerting capabilities.
- A thriving and extensive community for support.
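To give a feel for the query side, here is a minimal Python sketch that builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and reads values out of a response. The server address, the PromQL expression, and the sample response are assumptions for illustration. No network request is made.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally reachable server


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def extract_vector(payload: dict):
    """Return (labels, value) pairs from an instant-vector API response."""
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected an instant vector")
    # Each sample value is a [timestamp, "value as string"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# CPU usage per pod over the last five minutes.
QUERY = "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"
print(instant_query_url(QUERY))

# Hypothetical response body, shaped like the API's JSON output.
RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"pod": "api-1"}, "value": [1700000000, "0.25"]}],
    },
}
print(extract_vector(RESPONSE))
```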
One thing to note is that Prometheus ships with only a basic built-in expression browser rather than full dashboards. It is therefore common to complement Prometheus with a tool like Grafana, which is another open-source project. Grafana not only offers pre-built dashboards for Kubernetes but also enables users to create custom visualizations to suit their specific needs.
ELK stack (and OpenSearch)
https://www.elastic.co/pt/elastic-stack
https://opensearch.org/
The ELK stack used to be an open-source monitoring solution for Kubernetes, but in 2021 Elastic moved Elasticsearch and Kibana to licenses that are not recognized as open source (SSPL and the Elastic License).
ELK stands for:
- Elasticsearch: A database engine for storing and searching data.
- Logstash: Captures and processes logs, then sends them to Elasticsearch.
- Kibana: A data visualization tool.
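To illustrate the "ship logs to the search engine" step, the Python sketch below builds a request body in the newline-delimited JSON format used by the `_bulk` indexing API, which both Elasticsearch and OpenSearch accept. The index name `k8s-logs` and the log fields are hypothetical, and this is not how Logstash itself is implemented. It only shows the shape of the data a log shipper ends up sending.

```python
import json


def build_bulk_body(index: str, documents) -> str:
    """Build an NDJSON body: one action line followed by one document line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical log records collected from two pods.
logs = [
    {"pod": "api-1", "level": "info", "message": "server started"},
    {"pod": "api-2", "level": "error", "message": "database timeout"},
]
print(build_bulk_body("k8s-logs", logs), end="")
```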
In response to Elastic's decision, AWS forked Elasticsearch and Kibana to create OpenSearch and OpenSearch Dashboards. As of now, they remain relatively similar to their ELK counterparts.
Advantages:
- OpenSearch, the open-source option, includes some security and analysis features that are paid in the ELK stack.
- Both ELK and OpenSearch have strong communities.
- They are easy to deploy and use with Kubernetes and offer rich analysis capabilities.
However, there are disadvantages:
- They can be challenging to maintain at scale.
- At large data volumes, they often need a buffer such as Apache Kafka in front of them, which adds another component to run.
- While the closed-source ELK has a free tier, payment is required for some features.
Datadog
When exploring beyond open-source options, Datadog emerges as a comprehensive full-stack monitoring solution.
Key Features:
- End-to-end monitoring: Datadog offers robust infrastructure, security, and application monitoring features, covering the entire spectrum of your systems.
- Data insights: You can monitor requests, traces, and logs, and correlate these diverse data sets to derive valuable insights.
- Resource metrics: Datadog provides detailed metrics on resource utilization.
- Data coverage: Once configured, Datadog efficiently gathers data from across your architecture, offering a holistic view of your systems.
But it also has its disadvantages:
- Complex initial setup: The initial setup can be intricate and requires adjusting configuration files.
- Budget considerations: Broad data collection is beneficial, but it can be expensive, so manage what you collect to avoid unnecessary costs.
Dynatrace
Dynatrace is a comprehensive, paid full-stack monitoring solution.
It excels in monitoring the availability and health of your Kubernetes clusters while enabling unified monitoring across a wide array of tools, including services in AWS and Google Cloud.
The platform is known for:
- A user-friendly setup.
- Effective metric tracking in complex, distributed systems.
However, it comes at a significant cost.
For those who prefer not to handle monitoring setup and infrastructure management, Dynatrace offers a compelling option.
It's often chosen over Datadog when the focus is primarily on application monitoring rather than monitoring infrastructure resource usage.
Odigos
Odigos is an open-source platform that generates distributed traces, metrics, and logs instantly, without code changes.
It focuses on the automatic instrumentation and collection of the telemetry data from all your applications and works alongside traditional monitoring solutions.
It’s great for beginners who want to instrument their applications quickly, without touching their code.
It’s also great if you are unhappy or stuck with your current monitoring solution and want to try other options.
Just install Odigos, and it will instantly:
- Automatically instrument all of your applications,
- Collect and manage the telemetry data, and
- Send it to the monitoring vendor of your choice.
It supports all the major open-source and managed solutions and is very easy to install and implement.
Wrapping up
With so many Kubernetes monitoring solutions available, it can be difficult to choose the right one for your needs. Now that you have a basic understanding of the different options available, you can start to narrow down your choices and choose the solution that best meets your specific requirements.
The main factors to consider when choosing a Kubernetes monitoring solution:
- Features: What features are most important to you?
- Cost: Open source (typically free to use) vs paid solutions (can be expensive).
- Ease of use: Some solutions are easier to set up and use than others.
- Scalability: Can the solution scale to meet your needs as your cluster grows?
Top comments (10)
Definitely going to give odigos a spin. That and Prometheus. Both would be for my self hosted server stuff.
Happy to hear any feedback 😃
You also have Zabbix. It takes a little tweaking but is worth the effort. We use Zabbix because it's flexible and the data insert API is so simple a preschooler could write a client; we monitor about 250m data points every 24hrs.
IMHO: in 2023 Zabbix works only for infrastructure monitoring like hypervisors or Windows/Linux servers/desktops, but for microservices or local development it is better to use something like VictoriaMetrics/Prometheus/Netdata. I think you know Zabbix's issues with SQL databases and how fast its database grows from GiB to TiB.
I’m a huge fan of Datadog and their offerings. Check out their latest keynote where they show how they’ve solved a lot of the setup complexity.
Also, can you share what information you used in your research? It would be cool to understand the context of that with your pros and cons.
I recommend trying docs.victoriametrics.com/Single-se... for self-hosted setups, which supports ingesting metrics from DataDog, or victoriametrics.cloud.
I ended up with hyperdx. It's also a good choice, is open source, and can be self-hosted. My previous choice was uptrace, which is also not bad.
Just a note: you start the article talking about K8s but most of these tools are not specific to that technology. Otherwise a good roundup! I'd also add SignalFx/Splunk to the list, though they are expensive (see-also: Datadog)
New Relic is pretty good too. It has synthetic monitoring to test and monitor APIs.
downhound.com/ provides a good overview of external services that are down, in case you use any of those. Monitors several hundred services. (Disclosure: I'm the developer.)