<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aliki92</title>
    <description>The latest articles on DEV Community by Aliki92 (@aliki92).</description>
    <link>https://dev.to/aliki92</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1270102%2Fc9bc5a4e-4ff1-4286-b378-8924e37922c0.jpeg</url>
      <title>DEV Community: Aliki92</title>
      <link>https://dev.to/aliki92</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aliki92"/>
    <language>en</language>
    <item>
      <title>systemd journal: Leveraging Logs for Deeper System Insights</title>
      <dc:creator>Aliki92</dc:creator>
      <pubDate>Thu, 28 Mar 2024 09:01:34 +0000</pubDate>
      <link>https://dev.to/aliki92/systemd-journalleveraging-logs-for-deeper-system-insights-c5a</link>
      <guid>https://dev.to/aliki92/systemd-journalleveraging-logs-for-deeper-system-insights-c5a</guid>
      <description>&lt;p&gt;&lt;em&gt;“Why bother with it? I let it run in the background and focus on more important DevOps work.”&lt;/em&gt; — a random DevOps Engineer at Reddit r/devops&lt;/p&gt;

&lt;p&gt;In an era where technology is evolving at breakneck speed, it’s easy to overlook the tools that are right under our noses. One such underutilized powerhouse is the systemd journal. For many, it’s a mere tool to check the status of systemd service units or to tail the most recent events (journalctl -f). Those who work mainly with containers often ignore its existence altogether.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the purpose of systemd-journal?&lt;/strong&gt;&lt;br&gt;
Yet the systemd journal holds very important information: kernel errors, application crashes, out-of-memory process kills, storage-related anomalies, crucial security intel such as SSH or sudo attempts and security audit logs, connection/disconnection errors, network-related problems, and a lot more. The journal is brimming with data that can offer deep insights into the health and security of our systems, and still many professional system and DevOps engineers tend to ignore it.&lt;/p&gt;

&lt;p&gt;Of course we use log management systems, like Loki, Elastic, Splunk, DataDog, etc. But do we really take on the burden of configuring our logs pipeline (and accept the additional cost) to push systemd journal logs to them? We usually don’t.&lt;/p&gt;

&lt;p&gt;On top of this, what if I told you that there’s an untapped reservoir of potential within the systemd journal? A potential that could revolutionize the way developers, sysadmins, and DevOps professionals approach logging, troubleshooting, and monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But how does systemd-journal work?&lt;/strong&gt;&lt;br&gt;
systemd journal isn’t just a logging tool; it’s an intricate system that offers dynamic fields for every log entry. Yes, you read that right. Each log line may have its own unique fields, annotating and tagging it with any number of additional name-value pairs (and the value part can even be binary data). This is unlike what most log management systems do. Most of them are optimized for logs that are uniform, like a table, with common fields among all the entries. &lt;a href="https://www.netdata.cloud/blog/exploring-systemd-journal-logs/"&gt;systemd journal&lt;/a&gt;, on the other hand, is optimized for managing an arbitrary number of fields on each log entry, without any uniformity. This capability gives the tool remarkable power.&lt;/p&gt;
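
&lt;p&gt;To make the per-entry fields tangible, here is a minimal sketch in Python. It parses one journal entry in the shape that &lt;code&gt;journalctl -o json&lt;/code&gt; prints (the entry itself is a trimmed, illustrative sample, not real output) and shows that each record is just an open-ended bag of name-value pairs:&lt;/p&gt;

```python
import json

# Illustrative sample only: a trimmed journal entry in the shape emitted by
# `journalctl -o json`. Real entries carry many more trusted "_" fields.
sample = ('{"MESSAGE": "Failed password for root from 203.0.113.7 port 22", '
          '"PRIORITY": "5", "SYSLOG_IDENTIFIER": "sshd", "_PID": "4242", '
          '"_SYSTEMD_UNIT": "ssh.service", "_HOSTNAME": "web-01"}')

entry = json.loads(sample)

# No fixed schema: whatever fields the entry carries, we can iterate them.
for field, value in sorted(entry.items()):
    print(f"{field}={value}")
```

&lt;p&gt;Two different entries can carry entirely different field sets; log stores built around uniform columns struggle to represent that.&lt;/p&gt;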

&lt;p&gt;Take coredumps, for example. systemd developers have annotated all application crashes with a plethora of information, including environment variables, mount info, process status information, open files, signals, and everything else that was available at the time the application crashed.&lt;/p&gt;

&lt;p&gt;Now, imagine a world where application developers don’t just log errors, but annotate those logs with rich information: the request path, internal component states, source and destination details, and everything needed to identify the exact case and state in which the log line appeared. How much time would such error logging save? It would be a game-changer, enabling faster troubleshooting, precise error tracking, and efficient service maintenance.&lt;/p&gt;
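
&lt;p&gt;As a sketch of what such annotated logging could look like: with the python-systemd package one would pass arbitrary keyword fields to &lt;code&gt;systemd.journal.send()&lt;/code&gt;; the helper below only assembles the field set (the names &lt;code&gt;REQUEST_PATH&lt;/code&gt;, &lt;code&gt;UPSTREAM&lt;/code&gt; and &lt;code&gt;RETRY_COUNT&lt;/code&gt; are invented for illustration), to show that any context can ride along with the message:&lt;/p&gt;

```python
def annotate(message: str, **fields: str) -> dict:
    """Build a journal-style entry: MESSAGE plus arbitrary extra fields."""
    entry = {"MESSAGE": message}
    # Journal field names are conventionally upper-case.
    entry.update({key.upper(): value for key, value in fields.items()})
    return entry

# Hypothetical error from a service, annotated with its internal state.
entry = annotate(
    "upstream timed out",
    request_path="/api/v1/checkout",
    upstream="payments:8443",
    retry_count="3",
)
print(entry)
```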

&lt;p&gt;All this power is hidden behind a very cryptic journalctl command. So, at &lt;a href="//netdata.cloud"&gt;Netdata&lt;/a&gt; we decided to reveal this power and make it accessible to everyone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp9c4831xv2t6ew2od4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp9c4831xv2t6ew2od4.png" alt="Image description" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>opensource</category>
      <category>linux</category>
    </item>
    <item>
      <title>Netdata vs Prometheus: Performance Analysis</title>
      <dc:creator>Aliki92</dc:creator>
      <pubDate>Mon, 19 Feb 2024 13:51:17 +0000</pubDate>
      <link>https://dev.to/aliki92/netdata-vs-prometheus-performance-analysis-2pe4</link>
      <guid>https://dev.to/aliki92/netdata-vs-prometheus-performance-analysis-2pe4</guid>
      <description>&lt;p&gt;In an era dominated by data-driven decision making, monitoring tools play an indispensable role in ensuring that our systems run efficiently and without interruption. When considering tools like &lt;strong&gt;&lt;a href="https://www.netdata.cloud/"&gt;Netdata&lt;/a&gt; and Prometheus&lt;/strong&gt;, performance isn't just a number; it's about empowering users with &lt;strong&gt;real-time insights&lt;/strong&gt; and enabling them to act with agility.&lt;/p&gt;

&lt;p&gt;There's a genuine need in the community for tools that are not only comprehensive in their offerings but also &lt;strong&gt;swift and scalable&lt;/strong&gt;. This desire stems from our evolving digital landscape, where the ability to swiftly detect, diagnose, and rectify anomalies has direct implications on user experiences and business outcomes. Especially as infrastructure grows in complexity and scale, there's an increasing demand for &lt;strong&gt;monitoring tools&lt;/strong&gt; to keep up and provide clear, timely insights.&lt;/p&gt;

&lt;p&gt;Our ambition is to be the simplest, fastest and most scalable solution in this domain. However, it's essential to approach it with modesty. A performance &lt;strong&gt;comparison between Netdata and Prometheus&lt;/strong&gt; is not a race for the top spot but an exploration of where we stand today and where improvements can be made. Through this, we hope to drive innovation, ensure optimal performance, and ultimately deliver better value to our users.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Configuration
&lt;/h2&gt;

&lt;p&gt;We are primarily interested in data collection performance. So, we wanted Netdata and Prometheus to have exactly the same dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source dataset
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;500 nodes&lt;/strong&gt; running Netdata, all collecting &lt;strong&gt;system and application metrics&lt;/strong&gt; per second. In total there are &lt;strong&gt;40k containers&lt;/strong&gt; running. Since Netdata agents can expose their metrics in the Prometheus format (OpenMetrics), we used them as a data source for both the Netdata Parent under test and Prometheus.&lt;/p&gt;

&lt;p&gt;In total, these nodes collect about &lt;strong&gt;2.7 million metrics, all per-second&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;All these nodes were running on VMs hosted on separate hardware from the centralization points (Netdata Parent and Prometheus) and were connected to the physical server running the test via 2 bonded 20 Gbps network cards.&lt;/p&gt;

&lt;p&gt;In this test we are interested in data ingestion performance, so the servers were not running any queries during the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  Netdata Parent
&lt;/h3&gt;

&lt;p&gt;The Netdata Parent was receiving all metrics in real-time from the 500 nodes, via Netdata streaming (they were pushing metrics to Netdata Parent).&lt;/p&gt;

&lt;p&gt;To emulate the functionality offered by a plain Prometheus installation, we disabled ML (machine learning based anomaly detection) and Health (alerts) in &lt;code&gt;netdata.conf&lt;/code&gt;.&lt;/p&gt;
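
&lt;p&gt;For reference, the change amounts to a small &lt;code&gt;netdata.conf&lt;/code&gt; fragment along these lines (section and option names may vary between Netdata versions; treat this as a sketch, not authoritative configuration):&lt;/p&gt;

```ini
# netdata.conf -- sketch of the two features disabled for this test
[ml]
    enabled = no

[health]
    enabled = no
```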

&lt;p&gt;We allowed the Netdata Parent to run with its default 3 tiers of storage, to showcase the difference in retention between the two systems. This also enabled automatic back-filling of the higher tiers from data available in the lower tiers, which may have affected the CPU utilization of the Netdata Parent.&lt;/p&gt;

&lt;p&gt;We also allowed streaming to run with replication enabled, to avoid any gaps in the data.&lt;/p&gt;

&lt;p&gt;Other than the above, Netdata was running with default settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus
&lt;/h3&gt;

&lt;p&gt;Prometheus was scraping the same 500 Netdata instances, every second, using the &lt;code&gt;as-collected&lt;/code&gt; source type (so that the original data are exposed to Prometheus in their native form).&lt;/p&gt;

&lt;p&gt;We configured retention to 7 days. Other than this, Prometheus was running with default settings.&lt;/p&gt;
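
&lt;p&gt;A minimal sketch of the corresponding Prometheus setup (target names are illustrative, and the exact Netdata scrape path and parameters should be checked against Netdata&#39;s Prometheus-export documentation; retention is set via a command-line flag rather than the config file):&lt;/p&gt;

```yaml
# prometheus.yml -- per-second scraping of Netdata agents (sketch)
global:
  scrape_interval: 1s

scrape_configs:
  - job_name: netdata
    metrics_path: /api/v1/allmetrics
    params:
      format: [prometheus]
      source: [as-collected]
    static_configs:
      - targets: ['node-001:19999']   # one entry per node

# started with: prometheus --storage.tsdb.retention.time=7d
```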

&lt;h2&gt;
  
  
  Hardware
&lt;/h2&gt;

&lt;p&gt;Both Netdata Parent and Prometheus were running on a dedicated VM each, with 100GB RAM, 24 CPU Cores, and a dedicated 4TB SSD disk.&lt;br&gt;
Both VMs were hosted on the same physical server having 2x AMD EPYC 7453 CPUs (28-Core Processors - 112 threads total), 512GB RAM, and separate dedicated SSDs for each of them. The host server was configured to run each VM on a separate NUMA node.&lt;/p&gt;

&lt;p&gt;Other than the above, this physical server was idle, so that nothing outside this test can influence the comparison.&lt;/p&gt;

&lt;p&gt;Screenshots were taken from a Netdata instance running on the host OS of this server, using cgroups and other system metrics.&lt;/p&gt;
&lt;h2&gt;
  
  
  CPU Utilization
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iaqfst8hyjsk8ubqp8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1iaqfst8hyjsk8ubqp8i.png" alt="CPU Utilizations" width="800" height="267"&gt;&lt;/a&gt;&lt;em&gt;Image: 24 hours of CPU utilization: Netdata is the purple line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On average, Netdata was using 4.93 cores (493% of a single core), while Prometheus was using 7.56 cores (756% of a single CPU core).&lt;/p&gt;

&lt;p&gt;Per million of metrics per second:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netdata needs 1.8 CPU cores&lt;/li&gt;
&lt;li&gt;Prometheus needs 2.8 CPU cores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on these observations, Netdata appears to use 35% less CPU resources than Prometheus, or Prometheus to use 53% more CPU resources than Netdata.&lt;/p&gt;
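
&lt;p&gt;The per-million-metrics and percentage figures above follow directly from the measured averages; a quick arithmetic check, using the numbers from this test:&lt;/p&gt;

```python
netdata_cores = 4.93       # average cores used by the Netdata Parent
prometheus_cores = 7.56    # average cores used by Prometheus
millions_of_metrics = 2.7  # metrics ingested per second, in millions

print(round(netdata_cores / millions_of_metrics, 1))     # cores per 1M metrics/s
print(round(prometheus_cores / millions_of_metrics, 1))
print(round((1 - netdata_cores / prometheus_cores) * 100))   # % less CPU
print(round((prometheus_cores / netdata_cores - 1) * 100))   # % more CPU
```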

&lt;p&gt;In the test duration, Prometheus exhibited fluctuating CPU consumption. A closer examination of the last 7 minutes reveals periodic (every 2 minutes) spikes in Prometheus's CPU usage, often exceeding 14 CPU cores. These might be attributed to periodic operations like internal garbage collection or other maintenance tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67l6bxj730ywkd2f94r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj67l6bxj730ywkd2f94r.png" alt="7 minutes of CPU utilization" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 7 minutes of CPU utilization: Netdata is the purple line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We also noticed that Prometheus seemed at times to face short difficulties in scraping sources (CPU utilization dives). In contrast, Netdata showed consistent performance during the same intervals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchrtwv1r9lc042ukn3m8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchrtwv1r9lc042ukn3m8.png" alt="short CPU dives" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: short CPU dives in Prometheus. Netdata is the purple line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Network Bandwidth
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdcv1rne5xruw29nd5cf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdcv1rne5xruw29nd5cf.png" alt="24 hours network bandwidth" width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 24 hours network bandwidth: Netdata is the blue line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During our observation, Netdata utilized on average 226.6 Mbps of bandwidth, while Prometheus utilized 257.2 Mbps.&lt;/p&gt;

&lt;p&gt;The Netdata streaming protocol, although text-based, is designed to be compact, while Prometheus uses the OpenMetrics protocol, which is more “chatty”. However, the Gzip/Deflate compression used when Prometheus scrapes its sources generally provides a superior compression ratio compared to the ZSTD compression employed by Netdata in this test.&lt;/p&gt;

&lt;p&gt;Nevertheless, our data indicates that Netdata utilizes about 12% less bandwidth than Prometheus, or Prometheus is using 13.5% more bandwidth than Netdata.&lt;/p&gt;
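
&lt;p&gt;Beyond the percentage difference, the averages also give a rough on-the-wire cost per sample (a back-of-the-envelope figure: it ignores protocol framing and assumes all 2.7 million samples actually crossed the wire every second):&lt;/p&gt;

```python
netdata_mbps = 226.6     # average bandwidth, Netdata streaming
prometheus_mbps = 257.2  # average bandwidth, Prometheus scraping
samples_per_sec = 2.7e6

def bytes_per_sample(mbps):
    # Mbps -> bytes/s, then divide by the per-second sample count
    return mbps * 1e6 / 8 / samples_per_sec

print(round(bytes_per_sample(netdata_mbps), 1))     # compressed Netdata stream
print(round(bytes_per_sample(prometheus_mbps), 1))  # compressed OpenMetrics scrape
```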

&lt;p&gt;Upon closer inspection of several smaller durations, we noticed some fluctuations in Prometheus's bandwidth usage. This could suggest that there were moments when Prometheus might not have scraped all 500 nodes every second, though the exact reasons would need further investigation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jm0movpxm852m4mpuph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jm0movpxm852m4mpuph.png" alt="7 minutes network bandwidth" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 7 minutes network bandwidth: Netdata is the blue line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In several cases the dives lasted for longer time-frames, like in this screenshot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6hdpyohd8sfo6e6fazm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6hdpyohd8sfo6e6fazm.png" alt="7 minutes network bandwidth:" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 7 minutes network bandwidth: Netdata is the blue line. Prometheus is the brown line.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When Grafana sources data from Prometheus, it fills in missing points from adjacent data points, which can make certain inconsistencies less noticeable in visualizations.&lt;/p&gt;

&lt;p&gt;In terms of Netdata's functionality, it is designed for consistent data ingestion. If communication is interrupted (though it was uninterrupted during our test), Netdata includes a replication feature. This allows the system receiving the samples to negotiate and backfill any missed data on reconnection, ensuring that there are no gaps in the time-series.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory Consumption
&lt;/h2&gt;

&lt;p&gt;Netdata:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog0kg9nr55x3pryj2vy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog0kg9nr55x3pryj2vy2.png" alt="memory usage" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 24 hours Netdata memory usage. Above without shared, below with shared&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Netdata is peaking at 36.2 GiB (without shared memory), or 45.1 GiB (with shared memory).&lt;/p&gt;

&lt;p&gt;Note that there is a single spike to 45.1 GiB, but we are interested in the maximum memory the system will use, even instantly.&lt;/p&gt;

&lt;p&gt;Prometheus:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmumixa560ninm1it1k6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmumixa560ninm1it1k6b.png" alt="Prometheus memory usage" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 24 hours Prometheus memory usage. Above without shared, below with shared&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is peaking at 60.2 GiB (without shared memory), or 88.8 GiB (with shared memory).&lt;/p&gt;

&lt;p&gt;Based on these data, Netdata needs 49% less memory than Prometheus, or Prometheus needs 97% more memory than Netdata.&lt;/p&gt;

&lt;p&gt;Even if we exclude shared memory, Netdata needs 40% less memory than Prometheus, or Prometheus needs 66% more memory than Netdata.&lt;/p&gt;
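
&lt;p&gt;Both memory comparisons can be reproduced from the peak figures above:&lt;/p&gt;

```python
# Peak memory from the screenshots above, in GiB: (Netdata, Prometheus)
peaks = {
    "with shared":    (45.1, 88.8),
    "without shared": (36.2, 60.2),
}

for label, (netdata, prometheus) in peaks.items():
    less = round((1 - netdata / prometheus) * 100)
    more = round((prometheus / netdata - 1) * 100)
    print(f"{label}: Netdata {less}% less, Prometheus {more}% more")
```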
&lt;h2&gt;
  
  
  Storage Footprint
&lt;/h2&gt;

&lt;p&gt;Prometheus was configured to keep the metrics for 7 days, which results in 3.1 TiB of storage:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F134pb90iz9yo5webrdzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F134pb90iz9yo5webrdzt.png" alt="config" width="800" height="41"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netdata was configured to have 3 TB of space, which gives us a variable retention depending on how much data can fit in this storage space. This is what it currently uses:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ex4ge5v1das5zd4o8tx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ex4ge5v1das5zd4o8tx.png" alt="Image description" width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Netdata provides the API &lt;code&gt;/api/v2/node_instances&lt;/code&gt;, at the end of which we can find a breakdown of the storage it uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="s2"&gt;"db_size"&lt;/span&gt;:[&lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"tier"&lt;/span&gt;:0,
            &lt;span class="s2"&gt;"disk_used"&lt;/span&gt;:1677443553144,
            &lt;span class="s2"&gt;"disk_max"&lt;/span&gt;:1677721600000,
            &lt;span class="s2"&gt;"disk_percent"&lt;/span&gt;:99.9834271,
            &lt;span class="s2"&gt;"from"&lt;/span&gt;:1697514184,
            &lt;span class="s2"&gt;"to"&lt;/span&gt;:1698339754,
            &lt;span class="s2"&gt;"retention"&lt;/span&gt;:825570,
            &lt;span class="s2"&gt;"expected_retention"&lt;/span&gt;:825706
         &lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"tier"&lt;/span&gt;:1,
            &lt;span class="s2"&gt;"disk_used"&lt;/span&gt;:838006961012,
            &lt;span class="s2"&gt;"disk_max"&lt;/span&gt;:838860800000,
            &lt;span class="s2"&gt;"disk_percent"&lt;/span&gt;:99.8982145,
            &lt;span class="s2"&gt;"from"&lt;/span&gt;:1694616720,
            &lt;span class="s2"&gt;"to"&lt;/span&gt;:1698339754,
            &lt;span class="s2"&gt;"retention"&lt;/span&gt;:3723034,
            &lt;span class="s2"&gt;"expected_retention"&lt;/span&gt;:3726827
         &lt;span class="o"&gt;}&lt;/span&gt;,&lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"tier"&lt;/span&gt;:2,
            &lt;span class="s2"&gt;"disk_used"&lt;/span&gt;:193843384704,
            &lt;span class="s2"&gt;"disk_max"&lt;/span&gt;:419430400000,
            &lt;span class="s2"&gt;"disk_percent"&lt;/span&gt;:46.2158643,
            &lt;span class="s2"&gt;"from"&lt;/span&gt;:1679670000,
            &lt;span class="s2"&gt;"to"&lt;/span&gt;:1698339754,
            &lt;span class="s2"&gt;"retention"&lt;/span&gt;:18669754,
            &lt;span class="s2"&gt;"expected_retention"&lt;/span&gt;:40396851
         &lt;span class="o"&gt;}]&lt;/span&gt;,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what these numbers mean:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;Tier
(&lt;code&gt;tier&lt;/code&gt;)&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Capacity
(&lt;code&gt;disk_max&lt;/code&gt;)&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Used
(&lt;code&gt;disk_percent&lt;/code&gt;)&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;Retention
(&lt;code&gt;expected_retention&lt;/code&gt;)&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Tier 0
   &lt;/td&gt;
   &lt;td&gt;1.5 TiB
   &lt;/td&gt;
   &lt;td&gt;100%
   &lt;/td&gt;
   &lt;td&gt;9.6 days
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Tier 1
   &lt;/td&gt;
   &lt;td&gt;781 GiB
   &lt;/td&gt;
   &lt;td&gt;100%
   &lt;/td&gt;
   &lt;td&gt;43.1 days
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Tier 2
   &lt;/td&gt;
   &lt;td&gt;391 GiB
   &lt;/td&gt;
   &lt;td&gt;46%
   &lt;/td&gt;
   &lt;td&gt;467.5 days
   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;
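
&lt;p&gt;The retention column is derived directly from the &lt;code&gt;expected_retention&lt;/code&gt; values (in seconds) in the API response above; converting to days reproduces it within rounding:&lt;/p&gt;

```python
# expected_retention (seconds) per tier, from /api/v2/node_instances above
expected_retention = {0: 825706, 1: 3726827, 2: 40396851}

for tier, seconds in expected_retention.items():
    print(f"tier {tier}: {seconds / 86400:.1f} days")  # 86400 seconds per day
```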

&lt;p&gt;Netdata used 1.5 TiB for almost 10 days of high resolution data, while Prometheus used 3.1 TiB for 7 days of high resolution data.&lt;/p&gt;

&lt;p&gt;Based on these data, we can estimate the average number of bytes on disk, per data collection sample:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus collects 2.7 million metrics per second. Over the span of 7 days, it accumulated approximately 1.6 trillion samples. Its storage efficiency can be determined as:
3.1 TiB / 1.6 trillion samples = &lt;strong&gt;2.13 bytes per sample&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's worth noting an earlier observation, that there might be occasions when Prometheus doesn't scrape all endpoints every second. This can affect the calculated efficiency, if the points collected are significantly fewer. &lt;br&gt;
   Prometheus provides a way to get the typical size of a sample on disk:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;(rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[7d])) / rate(prometheus_tsdb_compaction_chunk_samples_sum[7d])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In our case, this reports 2.14 bytes per sample, which confirms our estimate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netdata, on the other hand, also collects 2.7 million metrics per second. Over 9.6 days, it ingested about 2.2 trillion samples. The storage efficiency for Netdata is:
1.5 TiB / 2.2 trillion samples = &lt;strong&gt;0.75 bytes per sample.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
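
&lt;p&gt;Both bytes-per-sample figures can be re-derived from first principles; the small deviations from the numbers quoted above come from rounding the sample counts to 1.6 and 2.2 trillion:&lt;/p&gt;

```python
TiB = 2**40
samples_per_sec = 2.7e6

# Prometheus: 7 days of per-second data in 3.1 TiB
prom_samples = samples_per_sec * 7 * 86400
print(round(3.1 * TiB / prom_samples, 2))      # ~2.1 bytes/sample

# Netdata tier 0: 9.6 days of per-second data in 1.5 TiB
netdata_samples = samples_per_sec * 9.6 * 86400
print(round(1.5 * TiB / netdata_samples, 2))   # ~0.74 bytes/sample
```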

&lt;p&gt;Based on these data, Netdata uses about 65% less storage than Prometheus, or Prometheus uses about 185% more storage than Netdata, for the same dataset.&lt;/p&gt;

&lt;p&gt;Additionally, it's pertinent to mention that Netdata used the saved storage space to support over a year of data retention, albeit at a reduced resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disk I/O
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Disk Writes
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzp1liq4me61abzn2gln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftzp1liq4me61abzn2gln.png" alt="Image description" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 24 hours of &lt;strong&gt;disk writes&lt;/strong&gt;: Netdata is red line, Prometheus is pink line&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netdata has an average write speed of approximately 2.7 MiB per second.&lt;/li&gt;
&lt;li&gt;Prometheus shows a varied write speed, averaging at 55.2 MiB per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disk Reads
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gvhtco0l9xj8hnh1iv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gvhtco0l9xj8hnh1iv6.png" alt="Image description" width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image: 24 hours of &lt;strong&gt;disk reads&lt;/strong&gt;: Netdata is red line, Prometheus is pink line&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Netdata makes minimal disk reads, averaging at 15 KiB per second.&lt;/li&gt;
&lt;li&gt;Prometheus, on the other hand, performs intensive reads, averaging at 73.6 MiB per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the data, it can be inferred that Netdata primarily writes data directly to their final position, given its steady write speed. Prometheus exhibits variable write and read patterns, possibly suggesting mechanisms like Write-Ahead Logging (WAL) or other data reorganization strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Netdata, renowned as a distributed monitoring solution, emphasizes the importance of not being confined to centralization. In our relentless pursuit to enhance and optimize our offerings, we sought to understand how Netdata stands in terms of performance and scalability, especially when juxtaposed with other industry-leading systems.&lt;/p&gt;

&lt;p&gt;Here's a concise overview of our insights:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Netdata Parent&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Prometheus&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Version&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;v1.43.0-105-ga84213ca3 (nightly of Oct 29, 2023)&lt;/td&gt;
&lt;td&gt;2.44.0 (branch: HEAD, revision: 1ac5131f698ebc60f13fe2727f89b115a41f6558)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;&lt;br&gt;(changes to the defaults)&lt;/td&gt;
&lt;td&gt;2.6 TiB storage in 3 tiers&lt;br&gt;Disabled ML&lt;br&gt;Disabled Health&lt;/td&gt;
&lt;td&gt;Retention 7 days&lt;br&gt;Per Second data collection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;br&gt;(VMs on the same physical server)&lt;/td&gt;
&lt;td&gt;24 CPU cores&lt;br&gt;100 GB RAM&lt;br&gt;4 TB SSD&lt;/td&gt;
&lt;td&gt;24 CPU cores&lt;br&gt;100 GB RAM&lt;br&gt;4 TB SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metrics offered&lt;br&gt;(approximately,&lt;br&gt; concurrently collected)&lt;/td&gt;
&lt;td&gt;2.7 million per second&lt;/td&gt;
&lt;td&gt;2.7 million per second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;CPU Utilization&lt;/strong&gt;&lt;br&gt;(average)&lt;/td&gt;
&lt;td&gt;4.9 CPU cores&lt;br&gt;(spikes at 8 cores)&lt;br&gt;&lt;strong&gt;-35%&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;7.3 CPU cores&lt;br&gt;(spikes at 15 cores)&lt;br&gt;&lt;strong&gt;+53%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Memory Consumption&lt;/strong&gt;&lt;br&gt;(peak)&lt;/td&gt;
&lt;td&gt;45.1 GiB&lt;br&gt;&lt;strong&gt;-49%&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;88.8 GiB&lt;br&gt;&lt;strong&gt;+97%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;227 Mbps&lt;br&gt;&lt;strong&gt;-12%&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;257 Mbps&lt;br&gt;&lt;strong&gt;+13.5%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk I/O&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;no reads&lt;br&gt;2.7 MiB/s writes&lt;br&gt;&lt;strong&gt;-98%&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;73.6 MiB/s reads&lt;br&gt;55.2 MiB/s writes&lt;br&gt;&lt;strong&gt;+4770%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Disk Footprint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3 TiB&lt;/td&gt;
&lt;td&gt;3 TiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metrics Retention&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.6 days (per-sec)&lt;br&gt;43 days (per-min)&lt;br&gt;467 days (per-hour)&lt;br&gt;&lt;strong&gt;+37% (per-sec)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;7 days (per-sec)&lt;br&gt;&lt;strong&gt;-28% (per-sec)&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unique time-series on disk&lt;/td&gt;
&lt;td&gt;8 million&lt;/td&gt;
&lt;td&gt;5 million&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Bytes per sample on disk&lt;/strong&gt;&lt;br&gt;(per-sec tier)&lt;/td&gt;
&lt;td&gt;0.75 bytes / sample&lt;br&gt;&lt;strong&gt;-65%&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;2.1 bytes / sample&lt;br&gt;&lt;strong&gt;+185%&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Potential data loss&lt;/strong&gt;&lt;br&gt;(network issues, maintenance, etc)&lt;/td&gt;
&lt;td&gt;No&lt;br&gt;(missing samples are replicated from the source on reconnection)&lt;/td&gt;
&lt;td&gt;Yes&lt;br&gt;(missing samples are filled from adjacent ones at query time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Clustering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;br&gt;(active-active Parents)&lt;/td&gt;
&lt;td&gt;No&lt;br&gt;(not natively,&lt;br&gt;possible with more tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
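&lt;p&gt;These disk-footprint figures can be sanity-checked with simple arithmetic: on-disk size is roughly ingestion rate times retention times bytes per sample. A minimal Python sketch using only numbers from the table above (for Netdata this covers the per-second tier only; the per-minute and per-hour tiers plus metadata occupy the rest of its 3 TiB):&lt;/p&gt;

```python
# Sanity check: on-disk footprint is roughly
#   samples/sec x retention (seconds) x bytes/sample
# using only figures from the comparison table above.

SECONDS_PER_DAY = 86_400
TIB = 2**40  # bytes in a tebibyte

def footprint_tib(samples_per_sec, retention_days, bytes_per_sample):
    """Approximate on-disk size of one storage tier, in TiB."""
    total_samples = samples_per_sec * retention_days * SECONDS_PER_DAY
    return total_samples * bytes_per_sample / TIB

# Prometheus: 2.7M samples/s kept for 7 days at 2.1 bytes/sample
print(round(footprint_tib(2_700_000, 7, 2.1), 1))    # 3.1 -- close to the 3 TiB footprint

# Netdata per-second tier: 9.6 days at 0.75 bytes/sample
print(round(footprint_tib(2_700_000, 9.6, 0.75), 1)) # 1.5 -- the rest of its 3 TiB
                                                     # holds the lower-resolution tiers
```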

&lt;p&gt;Other notable differences between Netdata and Prometheus:&lt;/p&gt;

&lt;h3&gt;
  
  
  Protection from server failures
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prometheus:&lt;/strong&gt; Commits data to disk in 2-hour blocks and utilizes a continuously written write-ahead log, safeguarding against data loss during crashes or server failures. Managing this write-ahead log is probably the primary reason Prometheus consistently performs such intense disk I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Netdata:&lt;/strong&gt; Commits data to disk every 17 minutes with even distribution across metrics and time. This results in minimal disk I/O during metric ingestion. To counter potential data loss from server failures, it employs replication for missing data recovery and re-streaming of received data to another Netdata server (grand-parent, or a sibling in a cluster).&lt;/p&gt;
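&lt;p&gt;For context, the re-streaming described above is configured in Netdata's stream.conf. A minimal sketch with a placeholder destination and API key (worth verifying the exact keys against your Netdata version):&lt;/p&gt;

```ini
# stream.conf on the child (sending) node -- values are illustrative
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

# stream.conf on the parent (receiving) node:
# a section named after the same API key accepts the stream
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```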

&lt;h3&gt;
  
  
  Resource Consumption and Features
&lt;/h3&gt;

&lt;p&gt;The following features of Netdata were disabled during this test:&lt;/p&gt;

&lt;p&gt;Netdata’s Machine Learning (ML) feature provides unsupervised anomaly detection. It performs queries on all metrics to learn their behavior by training mathematical models, which are then used for detecting if any given sample is an anomaly or not. This requires higher CPU, memory, and disk I/O usage.&lt;/p&gt;

&lt;p&gt;Netdata’s Health feature for alerting, which ships with a plethora of predefined alerts, adds to CPU consumption and disk I/O.&lt;/p&gt;

&lt;p&gt;Re-streaming received metrics to a grand-parent or a clustered sibling with compression enabled adds further CPU consumption (the delta is insignificant if compression on re-streaming is disabled).&lt;/p&gt;

&lt;p&gt;Generally, enabling all of these features roughly doubles CPU consumption, and in the ML case Netdata also performs constant disk reads.&lt;/p&gt;
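&lt;p&gt;For reference, the features disabled for this test are toggled in netdata.conf, roughly as sketched below (section names as in recent Netdata releases; double-check them for your version):&lt;/p&gt;

```ini
# netdata.conf -- disable machine learning and health/alerting
[ml]
    enabled = no

[health]
    enabled = no
```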

&lt;h3&gt;
  
  
  Data and Metadata Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prometheus:&lt;/strong&gt; Organizes its data and metadata in 2-hour intervals, dividing its database accordingly. Depending on the query time-frame, it then accesses these 2-hour segments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Netdata:&lt;/strong&gt; Operates with a continuous rolling design. To support a fully automated UI, it maintains an all-encompassing view of metadata available instantaneously across extended retention periods (e.g., months or years). However, in highly transient environments, this can consume significant memory. Improvements have been made, like utilizing a single pointer for any label name-value pair, but challenges remain with extreme transiency paired with lengthy retention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The relentless dedication of the &lt;a href="//app.netdata.cloud"&gt;Netdata&lt;/a&gt; team has birthed significant advancements across all facets of our platform. The extensive rework of the dbengine, our time-series database, stands testament to this progress, ensuring enhanced performance and resilience. As we further innovated, the introduction of advanced ML capabilities, the use of more sophisticated compression algorithms, a revamped SSL layer, and a notably reduced memory footprint have added to Netdata's prowess. Our commitment to the community shines through in our new UI, equipped with innovative features that simplify monitoring and boost its clarity.&lt;/p&gt;

&lt;p&gt;Netdata's open-source nature isn't just a technical classification; it's a philosophy. We view our software as a gift to the global community, underscoring our commitment to democratizing advanced monitoring solutions.&lt;/p&gt;

&lt;p&gt;Writing in C has undeniably posed challenges, especially given the unforgiving nature of the language. Yet, this hurdle has only pushed us to exceed our boundaries. Over the past few years, our dedication has led to the restructuring of our entire codebase, resulting in improved performance and unparalleled stability.&lt;/p&gt;

&lt;p&gt;Recognition, while not our sole driver, does inspire us further. Leading the observability category in the CNCF landscape in terms of stars showcases the immense trust and appreciation we've garnered. This palpable user love propels us forward, driving us to continually meet and surpass expectations.&lt;/p&gt;

&lt;p&gt;Our mission goes beyond mere development; it's about trust. We've tirelessly worked to ensure that user data is not just stored, but managed with the utmost reliability by Netdata. Our pursuit of performance isn't just a benchmark to achieve, but a core tenet of our goal to make monitoring straightforward, affordable, and accessible for everyone.&lt;/p&gt;

&lt;p&gt;In the ever-evolving realm of monitoring, Netdata remains unwavering in its commitment to innovation, excellence, and community.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>opensource</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Monitoring VS Observability</title>
      <dc:creator>Aliki92</dc:creator>
      <pubDate>Mon, 05 Feb 2024 10:18:25 +0000</pubDate>
      <link>https://dev.to/aliki92/monitoring-vs-observability-557h</link>
      <guid>https://dev.to/aliki92/monitoring-vs-observability-557h</guid>
      <description>&lt;p&gt;As systems increasingly shift towards distributed architectures to deliver application services, the roles of monitoring and observability have never been more crucial. Monitoring delivers the situational awareness you need to detect issues, while observability goes a step further, offering the analytical depth to understand the root cause of those issues.&lt;/p&gt;

&lt;p&gt;Understanding the nuanced differences between monitoring and observability is crucial for anyone responsible for system health and performance. In dissecting these methodologies, we'll explore their unique strengths, dive into practical applications, and illuminate how to strategically employ each to enhance operational outcomes.&lt;/p&gt;

&lt;p&gt;To set the stage, consider a real-world scenario that many of us have encountered: It's 3 a.m., and you get an alert that a critical service is down. Traditional monitoring tools may tell you what's wrong, but they won't necessarily tell you why it's happening, leaving that part up to you. With observability, the tool enables you to explore your system's internal state and uncover the root cause faster and more easily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Conceptual Framework&lt;/strong&gt;&lt;br&gt;
Monitoring has its roots in the early days of computing, dating back to mainframes and the first networked systems. The primary objective was straightforward: keep the system up and running. Threshold-based alerts and basic metrics like CPU usage, memory consumption, and disk I/O were the mainstay. These metrics provided a snapshot but often lacked the context needed for debugging complex issues.&lt;/p&gt;

&lt;p&gt;Observability, on the other hand, is a relatively new paradigm, inspired by control theory and complex systems theory. It came to prominence with the rise of microservices, container orchestration, and cloud-native technologies. Unlike monitoring, which focuses on known problems, observability is designed to help you understand unknown issues. The concept gained traction as systems became too complex to understand merely through predefined metrics or logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring: The Watchtower&lt;/strong&gt;&lt;br&gt;
Monitoring is about gathering data to answer known questions. These questions usually take the form of metrics, alerts, and logs configured ahead of time. In essence, monitoring systems act as a watchtower, constantly scanning for pre-defined conditions and alerting you when something goes awry. The approach is inherently reactive; you set up alerts based on what you think will go wrong and wait.&lt;/p&gt;

&lt;p&gt;For instance, you might set an alert for when CPU usage exceeds 90% for a prolonged period. While this gives you valuable information, it doesn't offer insights into why this event is occurring. Was there a sudden spike in user traffic, or is there an inefficient code loop causing the CPU to max out?&lt;/p&gt;
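&lt;p&gt;As a concrete illustration, such a threshold alert in Prometheus might look like the rule below. It assumes node_exporter's node_cpu_seconds_total metric is being scraped; the names and thresholds are illustrative:&lt;/p&gt;

```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCpuUsage
        # fire when average non-idle CPU on a node stays above 90% for 10 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"
```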

&lt;p&gt;&lt;strong&gt;Observability: The Explorer&lt;/strong&gt;&lt;br&gt;
Observability is a more dynamic concept, focusing on the ability to ask arbitrary questions about your system, especially questions you didn't know you needed to ask. Think of observability as an explorer equipped with a map, compass, and tools that allow you to discover and navigate unknown territories of your system. With observability, you can dig deeper into high-cardinality data, enabling you to explore the "why" behind the issues.&lt;/p&gt;

&lt;p&gt;For example, you may notice that latency has increased for a particular service. Observability tools will allow you to drill down into granular data, like traces or event logs, to identify the root cause, whether it be an inefficient database query, network issues, or something else entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Differences between Monitoring &amp;amp; Observability
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt;&lt;br&gt;
Monitoring and observability both rely heavily on three fundamental data types: metrics, logs, and traces. However, the two approaches can differ substantially in how this data is collected, examined, and utilized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics in Monitoring vs Observability&lt;/strong&gt;&lt;br&gt;
Metrics serve as the backbone of both monitoring and observability, providing numerical data that is collected over time. However, the granularity, flexibility, and usage of these metrics differ substantially between the two paradigms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring: Predefined and Aggregate Metrics&lt;/strong&gt;&lt;br&gt;
In a monitoring setup, metrics are often predefined and tend to be aggregate values, such as averages or sums calculated over a specific time window. These metrics are designed to trigger alerts based on known thresholds. For example, you might track the average CPU usage over a five-minute window and set an alert if it exceeds 90%. While this approach is effective for catching known issues, it lacks the context needed to understand why a problem is occurring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability: High-Fidelity, High-Granularity and Context-Rich Metrics&lt;/strong&gt;&lt;br&gt;
Observability platforms go beyond merely collecting metrics; they focus on high-granularity, real-time metrics that can be dissected and queried in various ways. Here, you're not limited to predefined aggregate values. You can explore metrics like request latency at the 99th percentile over a one-second interval or look at the distribution of database query times for a particular set of conditions. This depth allows for a more nuanced understanding of system behavior, enabling you to pinpoint issues down to their root cause.&lt;/p&gt;

&lt;p&gt;A critical aspect that is often overlooked is the need for real-time, high-fidelity metrics, which are metrics sampled at very high frequencies, often per second. In a system where millions of transactions are happening every minute, a five-minute average could hide critical spikes that may indicate system failure or degradation. Observability platforms are generally better suited to provide this level of granularity than traditional monitoring tools.&lt;/p&gt;
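&lt;p&gt;The point about averages is easy to demonstrate with a toy calculation: below, a four-second burst of 5-second responses leaves a five-minute average looking tolerable, while the 99th percentile over per-second data exposes it immediately (all numbers are made up for illustration).&lt;/p&gt;

```python
# A five-minute average can hide a short latency burst that
# per-second data and high percentiles expose (hypothetical numbers).
import statistics

latencies = [20.0] * 300           # 300 s of per-second latencies, in ms
latencies[150:154] = [5000.0] * 4  # a 4-second burst of 5-second responses

five_min_avg = statistics.mean(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile

print(f"5-min average: {five_min_avg:.1f} ms")  # 86.4 ms -- looks tolerable
print(f"p99:           {p99:.1f} ms")           # 5000.0 ms -- the burst is obvious
```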

&lt;p&gt;&lt;strong&gt;Logs: Event-Driven in Monitoring vs Queryable in Observability&lt;/strong&gt;&lt;br&gt;
Logs provide a detailed account of events and are fundamental to both monitoring and observability. However, the treatment differs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring: Event-Driven Logs&lt;/strong&gt;&lt;br&gt;
In monitoring systems, logs are often used for event-driven alerting. For instance, a log entry indicating an elevated permissions login action might trigger an alert for potential security concerns. These logs are essential but are typically consulted only when an issue has already been flagged by the monitoring system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability: Queryable Logs&lt;/strong&gt;&lt;br&gt;
In observability platforms, logs are not just passive records; they are queryable data points that can be integrated with metrics and traces for a fuller picture of system behavior. You can dynamically query logs to investigate anomalies in real-time, correlating them with other high-cardinality data to understand the 'why' behind an issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive vs Reactive&lt;/strong&gt;&lt;br&gt;
The second key difference lies in how these approaches are generally used to interact with the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring: Set Alerts and React&lt;/strong&gt;&lt;br&gt;
Monitoring is generally reactive. You set up alerts for known issues, and when those alerts go off, you react. It’s like having a fire alarm; it will notify you when there’s a fire, but it won’t tell you how the fire started, or how to prevent it in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability: Continuous Exploration&lt;/strong&gt;&lt;br&gt;
Observability, by contrast, is more proactive. With an observability platform, you’re not just waiting for things to break. You’re continually exploring your data to understand how your system behaves under different conditions. This allows for more preventive measures and enables engineers to understand the system’s behavior deeply.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opinionated Dashboards and Charts&lt;/strong&gt;&lt;br&gt;
Navigating the sprawling landscape of system data can be a daunting task, particularly as systems scale and evolve. Both monitoring and observability tools offer dashboards and charts as a solution to this challenge, but the philosophy and functionality behind them can differ significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring: Pre-Built and Prescriptive Dashboards&lt;/strong&gt;&lt;br&gt;
In the realm of monitoring, dashboards are often pre-built and prescriptive, designed to highlight key performance indicators (KPIs) and metrics that are generally considered important for the majority of use-cases. For instance, a pre-configured dashboard for a database might focus on query performance, CPU usage, and memory consumption. These dashboards serve as a quick way to gauge the health of specific components within your system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quick Setup:&lt;/strong&gt; Pre-built dashboards require little to no configuration, making them quick to deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best Practices:&lt;/strong&gt; These dashboards are often designed based on industry best practices, providing a tried-and-true set of metrics that most organizations should monitor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Flexibility:&lt;/strong&gt; Pre-built dashboards are not always tailored to your specific needs and might lack the ability to perform ad-hoc queries or deep dives.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surface-Level Insights:&lt;/strong&gt; While useful for a quick status check, these dashboards may not provide the contextual data needed to understand the root cause of an issue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability: Customizable and Exploratory Dashboards&lt;/strong&gt;&lt;br&gt;
Contrastingly, observability platforms often allow for much greater customization and flexibility in dashboard creation. You can build your own dashboards that focus on the metrics most relevant to your specific application or business needs. Moreover, you can create ad-hoc queries to explore your data in real-time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deep Insights:&lt;/strong&gt; Custom dashboards allow you to drill down into high-cardinality data, providing nuanced insights that can lead to effective problem-solving.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Understanding:&lt;/strong&gt; Because you can tailor your dashboard to include a wide range of metrics, logs, and traces, you get a more contextual view of system behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity:&lt;/strong&gt; The flexibility comes at the cost of complexity. Building custom dashboards often requires a deep understanding of the data model and query language of the observability platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-Consuming:&lt;/strong&gt; Crafting a dashboard that provides valuable insights can be a time-consuming process, especially if you're starting from scratch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Netdata aims to deliver the best of both worlds by giving you out-of-the-box opinionated, powerful, flexible, customizable dashboards for every single metric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Applications: Monitoring vs Observability&lt;/strong&gt;&lt;br&gt;
Understanding the key differences between monitoring and observability is pivotal, but these concepts are best illustrated through real-world use cases. Below, we delve into some sample scenarios where each approach excels, offering insights into their practical applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Performance&lt;/strong&gt;&lt;br&gt;
Monitoring tools are incredibly effective for tracking network performance metrics like latency, packet loss, and throughput. These metrics are often predefined, allowing system administrators to quickly identify issues affecting network reliability. For example, if a VPN connection experiences high packet loss, monitoring tools can trigger an alert, prompting immediate action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Debugging Microservices&lt;/strong&gt;&lt;br&gt;
In a microservices architecture, services are loosely coupled but have to work in harmony. When latency spikes in one service, it can be a herculean task to pinpoint the issue. This is where observability shines. By leveraging high-cardinality data and dynamic queries, engineers can dissect interactions between services at a granular level, identifying bottlenecks or failures that are not immediately obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Case Study: Transitioning from Monitoring to Observability&lt;/strong&gt;&lt;br&gt;
Consider a real-world example of a SaaS company that initially relied solely on monitoring tools. As their application grew in complexity and customer base, they started noticing unexplained latency issues affecting their API. Traditional monitoring tools could indicate that latency had increased but couldn't offer insights into why it was happening.&lt;/p&gt;

&lt;p&gt;The company then transitioned to an observability platform, enabling them to drill down into granular metrics and traces. They discovered that the latency was tied to a specific database query that only became problematic under certain conditions. Using observability, they could identify the issue, fix the inefficient query, and substantially improve their API response times. This transition not only solved their immediate problem but equipped them with the tools to proactively identify and address issues in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Synergy and Evolution: The Future of Monitoring and Observability
&lt;/h2&gt;

&lt;p&gt;The choice between monitoring and observability isn't binary; often, they can complement each other. Monitoring provides the guardrails that keep your system running smoothly, while observability gives you the tools to understand your system deeply, especially as it grows in complexity.&lt;/p&gt;

&lt;p&gt;As we continue to push the boundaries of what's possible in software development and system architecture, both monitoring and observability paradigms are evolving to meet new challenges and leverage emerging technologies. The sheer volume of data generated by modern systems is often too vast for humans to analyze in real-time. AI and machine learning algorithms can sift through this sea of information to detect anomalies and even predict issues before they occur. For example, machine learning models can be trained to recognize the signs of an impending system failure, such as subtle but unusual patterns in request latency or CPU utilization, allowing for preemptive action.&lt;/p&gt;

&lt;p&gt;Monitoring and observability serve distinct but complementary roles in the management of modern software systems. Monitoring provides a reactive approach to known issues, offering immediate alerts for predefined conditions. It excels in areas like network performance and infrastructure health, acting as a first line of defense against system failures. Observability, on the other hand, allows for a more proactive and exploratory interaction with your system. It shines in complex, dynamic environments, enabling teams to understand the 'why' behind system behavior, particularly in microservices architectures and real-world debugging scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Netdata: Real-Time Metrics Meet Deep Insights&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.netdata.cloud/"&gt;Netdata&lt;/a&gt; offers capabilities that span both monitoring and observability. It delivers &lt;a href="https://www.netdata.cloud/features/real-time/"&gt;real-time, per-second metrics&lt;/a&gt;, making it a powerful resource for those in need of high-fidelity data. Netdata provides out-of-the-box dashboards for every single metric as well as the capability to build custom dashboards, bridging the gap between static monitoring views and the dynamic, exploratory nature of observability. Whether you're looking to simply keep an eye on key performance indicators or need to dig deep into system behavior, Netdata offers a balanced, versatile solution.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://app.netdata.cloud/spaces/netdata-demo?_gl=1*1glwth2*_ga*MTQ1OTQ4NjI3NC4xNzA2Njk4Mjg5*_ga_J69Z2JCTFB*MTcwNzEyNzEyOC4xNi4xLjE3MDcxMjgxODYuMzIuMC4w#metrics_correlation=false&amp;amp;after=-900&amp;amp;before=0&amp;amp;utc=Europe%2FAthens&amp;amp;offset=%2B2&amp;amp;timezoneName=E.%20Europe&amp;amp;modal=&amp;amp;modalTab="&gt;Netdata's public demo&lt;/a&gt; space or sign up today for free, if you haven't already.&lt;/p&gt;

&lt;p&gt;Happy Troubleshooting!&lt;/p&gt;

</description>
      <category>linux</category>
      <category>observability</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The Hidden Costs of Monitoring</title>
      <dc:creator>Aliki92</dc:creator>
      <pubDate>Wed, 31 Jan 2024 10:49:05 +0000</pubDate>
      <link>https://dev.to/aliki92/the-hidden-costs-of-monitoring-lmn</link>
      <guid>https://dev.to/aliki92/the-hidden-costs-of-monitoring-lmn</guid>
      <description>&lt;p&gt;When it comes to monitoring IT infrastructure, the costs you see on the price tag of the tool are often just the tip of the iceberg. Below the waterline, a mass of hidden costs can lurk, which can significantly affect the total cost of ownership.&lt;/p&gt;

&lt;p&gt;In this blog post we will analyze two traditional monitoring domains, open-source observability and commercial centralized observability solutions, focusing on the direct and indirect impacts of implementing these solutions. In summary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IT teams face challenges with conventional monitoring tools due to their complexity, time-consuming setup procedures and steep learning curve.&lt;/li&gt;
&lt;li&gt;Traditional monitoring systems often pose a balancing act between data quality and quantity on one hand and system and cost overheads on the other.&lt;/li&gt;
&lt;li&gt;Most commercial monitoring systems may result in unforeseen expenses due to the data transfer (egress) costs associated with their operation.&lt;/li&gt;
&lt;li&gt;Data retention is also often an issue, as storing data for a longer period (on the vendor's cloud) can result in elevated costs (let alone the security aspect).&lt;/li&gt;
&lt;li&gt;Most commercial monitoring systems charge a premium for their advanced features, which can burn a hole in your pocket. Scaling your infrastructure may result in an exponential increase in monitoring costs.&lt;/li&gt;
&lt;li&gt;Detailed monitoring through granular, high-resolution metrics improves the quality of insights, reducing the time to troubleshoot. The cost of a delayed root-cause analysis can reach six figures in a matter of minutes.&lt;/li&gt;
&lt;li&gt;Tools with enhanced usability decrease the need for training or for hiring specialized individuals; people with varying levels of expertise should be able to readily understand and use them.&lt;/li&gt;
&lt;li&gt;The adoption of an optimal tool can lead to significant cost savings, increased transparency, and improved system reliability.&lt;/li&gt;
&lt;li&gt;There is a need to adopt monitoring tools that are simple, customizable, and scalable to address these challenges and hidden costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prometheus &amp;amp; Grafana (Open Source Monitoring references)
&lt;/h2&gt;

&lt;p&gt;While Prometheus and Grafana are both open-source, they come with considerable hidden costs, mainly due to their complexity and the time required to effectively set them up and maintain them.&lt;/p&gt;

&lt;p&gt;Here's a closer look at these:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup Complexity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prometheus and Grafana are powerful, flexible tools, but this flexibility comes with a significant degree of complexity. They both require careful, manual setup and configuration to match your specific use case. Depending on the size and complexity of your environment, this setup could take a significant amount of time and require a high level of expertise, resulting in substantial initial costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These tools are not "set and forget". They require ongoing maintenance to ensure they're running correctly, are secure, and are kept up to date. If you don't have the in-house expertise, this could mean hiring new staff or training your existing team, which could lead to substantial long-term costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integration Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Getting Prometheus and Grafana to play nicely with all the systems and services you want to monitor might take significant time and effort. Each new service might require a new exporter to be installed and configured for Prometheus, drastically increasing the number of moving parts in your monitoring stack. Also, monitoring more components and applications requires additional effort to set up visualizations and alerts for them.&lt;/p&gt;
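&lt;p&gt;To make this concrete: every scraped service needs an entry (and usually an exporter) in Prometheus' configuration, roughly as in this minimal, illustrative prometheus.yml (host names and ports are placeholders):&lt;/p&gt;

```yaml
# prometheus.yml -- a minimal, illustrative scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus        # Prometheus scraping itself
    static_configs:
      - targets: ['localhost:9090']

  - job_name: node              # one exporter per monitored concern
    static_configs:
      - targets: ['host-a.example.com:9100', 'host-b.example.com:9100']
```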

&lt;p&gt;&lt;strong&gt;Scaling Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your infrastructure grows, so too does the time and complexity required to maintain your Prometheus and Grafana setup. Although there are plenty of solutions that allow such a setup to scale, the additional components required make it a lot more complex, adding significantly to the overall cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opportunity Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The time your team spends setting up, configuring, and maintaining Prometheus and Grafana is time they can't spend on other tasks. This is a hidden cost that many companies overlook.&lt;/p&gt;

&lt;p&gt;Prometheus and Grafana are robust tools that have been successfully employed by many organizations around the world, ranging from small businesses to large enterprises. When set up and configured correctly, they can provide comprehensive, detailed insights into your infrastructure and applications. However, their effectiveness, completeness, and efficiency often depend largely on the setup process and ongoing management.&lt;/p&gt;

&lt;p&gt;Here are a few points to consider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Completeness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The level of detail and completeness you get from Prometheus largely depends on how thoroughly it's set up. You need to manually install and configure exporters for each system, component or service you want to monitor. If you miss a system or service, or if an exporter isn't available or misconfigured, you might end up with blind spots in your monitoring.&lt;/p&gt;

&lt;p&gt;The task of creating dashboards in Grafana that provide a comprehensive view of a service or component can be time-consuming and requires a deep understanding of both the data being visualized and Grafana's capabilities. Each service or component could expose hundreds or even thousands of individual metrics, and each one of those could potentially need its own panel on a Grafana dashboard.&lt;/p&gt;

&lt;p&gt;Often, organizations only create dashboards for the most critical or straightforward metrics due to the time and effort involved in creating and maintaining them. As a result, many potentially useful metrics might be collected by Prometheus but are never visualized or monitored effectively in Grafana. This could lead to a significant blind spot in your monitoring coverage, as important information might be missed. It could also mean that you're not making full use of the data you're collecting, reducing the return on your investment in your monitoring setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fit for Purpose&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prometheus and Grafana are flexible, powerful tools, so they can be configured to fit a wide variety of use cases. However, this flexibility comes at the cost of complexity. Achieving a setup that fits your exact needs often requires deep expertise and a significant time investment.&lt;/p&gt;

&lt;p&gt;If the necessary Grafana dashboards haven't been set up in advance for all relevant metrics, incident resolution can be significantly slower than expected. Troubleshooting often involves exploring metrics that might not have seemed important or relevant before the incident. If the dashboards for these metrics aren't already set up in Grafana, teams might have to scramble to create them in the middle of an incident. This can slow down time-to-resolution and add stress to an already high-pressure situation.&lt;/p&gt;

&lt;p&gt;While Grafana's flexibility allows it to create virtually any dashboard you can imagine, this requires foresight and manual effort. If you don't anticipate all the metrics you might need in a future incident, you could find yourself unprepared when one occurs.&lt;/p&gt;

&lt;p&gt;Moreover, not all issues are predictable, and the metrics that might seem non-critical during a peaceful period might suddenly become important during an incident. Thus, a monitoring solution should ideally be comprehensive, and cover a broad range of metrics by default, to cater to these unpredictable scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Efficiency &amp;amp; Effectiveness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Grafana allows users to create a large number of highly customizable dashboards, and each dashboard can have numerous panels. This flexibility is powerful, but without disciplined organization and naming conventions, things can quickly become confusing.&lt;/p&gt;

&lt;p&gt;When multiple users or teams are creating dashboards, it's easy to end up with duplicates, slightly different versions of the same dashboard, or even experimental or 'quick-and-dirty' dashboards that were created for a specific purpose and then forgotten. Over time, this can lead to a cluttered and confusing environment, where finding the right dashboard becomes a task in itself.&lt;/p&gt;

&lt;p&gt;While Grafana does provide features to manage this, such as folders and tags for dashboards, as well as permissions to control who can view or edit them, it relies heavily on user discipline to maintain an organized workspace. However, in a fast-paced environment, maintaining this discipline can be challenging. This could result in users struggling to find the right dashboards when they need them or even creating new ones because they can't find an existing one that suits their needs.&lt;/p&gt;

&lt;p&gt;This underlines the importance of governance, clear conventions, and perhaps even periodic clean-ups to ensure Grafana remains efficient to navigate.&lt;/p&gt;

&lt;p&gt;So, while Grafana can be highly efficient in terms of data visualization, maintaining organizational efficiency can be challenging, even for smaller organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While Prometheus is designed to handle a large amount of data and has built-in support for high availability and federation, managing and maintaining this as your environment grows can be challenging and time-consuming.&lt;/p&gt;

&lt;p&gt;Scaling Prometheus and Grafana is not a straightforward process. It requires planning and careful configuration. Sharding, federation, and remote storage systems are all possible strategies, but each adds its own complexity.&lt;/p&gt;
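&lt;p&gt;As an illustration of that complexity, federation means a "global" Prometheus scrapes selected series from the others via their &lt;code&gt;/federate&lt;/code&gt; endpoint. A minimal sketch (the shard hostnames and match expressions here are hypothetical):&lt;/p&gt;

```yaml
# prometheus.yml on the global instance
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep the original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'          # only pull the series you actually need
        - '{__name__=~"job:.*"}'  # plus pre-aggregated recording rules
    static_configs:
      - targets:
          - 'prometheus-shard-1:9090'
          - 'prometheus-shard-2:9090'
```

&lt;p&gt;Every new shard means another target to add and another set of match expressions to maintain.&lt;/p&gt;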

&lt;p&gt;Additionally, Prometheus is not designed to be a long-term data storage system. By default, it only stores data for 15 days. If you need longer data retention, you'll have to integrate Prometheus with a remote storage solution, which adds another layer of complexity.&lt;/p&gt;
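&lt;p&gt;Concretely, local retention is controlled by the &lt;code&gt;--storage.tsdb.retention.time&lt;/code&gt; flag, and longer-term storage is typically wired up via &lt;code&gt;remote_write&lt;/code&gt;. A minimal sketch, with a hypothetical endpoint:&lt;/p&gt;

```yaml
# prometheus.yml — ship samples to an external long-term store
# (Thanos, Cortex/Mimir, VictoriaMetrics, etc.); the URL is illustrative
remote_write:
  - url: 'http://remote-store.example.com/api/v1/write'
```

&lt;p&gt;The remote store itself then becomes one more system to deploy, secure, and operate.&lt;/p&gt;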

&lt;p&gt;If you take the simple path and set up multiple Prometheus servers to handle your environment's scale, you might lose a centralized view of your metrics unless you carefully configure federation or use a global view query layer like Thanos or Cortex.&lt;/p&gt;

&lt;p&gt;In larger setups, it is common to find yourself spending more and more time managing the monitoring system itself. For example, you'll need to configure and maintain each Prometheus instance, manage data retention policies, and ensure that your Grafana dashboards are kept up to date with any changes in your environment.&lt;/p&gt;

&lt;p&gt;So, while Prometheus and Grafana are capable of scaling, it requires considerable effort, planning, and resources to do so effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both Prometheus and Grafana alerting systems are based on defining rules and conditions to trigger alerts, and these rules are separate from the dashboards you create for visualizing your data. In other words, you need to create all alerts, one by one, by hand, as you do for dashboards.&lt;/p&gt;
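&lt;p&gt;Each alert is a hand-written rule file loaded by Prometheus, entirely independent of any dashboard. A sketch of a single rule, assuming node_exporter memory metrics; the thresholds are illustrative:&lt;/p&gt;

```yaml
groups:
  - name: memory
    rules:
      - alert: HighMemoryUsage
        # fires when more than 90% of memory has been in use for 10 minutes
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"
```

&lt;p&gt;Multiply this by every metric worth alerting on, and the per-rule effort becomes apparent.&lt;/p&gt;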

&lt;p&gt;It is easy to create an alert for any metric that Prometheus collects, whether or not you have a corresponding Grafana dashboard for it. This can lead to situations where you are alerted about a problem with a metric that you don't visualize, making troubleshooting a lot more time-consuming: you might have to create a new panel, or even a new dashboard, to visualize the problematic metric and understand what's going wrong.&lt;/p&gt;

&lt;p&gt;Striking the right balance with alerting is tricky - you don't want so many alerts that it creates noise and leads to alert fatigue, but at the same time, you don't want so few alerts that critical issues go unnoticed.&lt;/p&gt;

&lt;p&gt;Engineers often take a conservative approach, where the fear of alert noise, the complexity and the time required to set up alerts, or even the lack of expertise, leads to under-monitoring. This can leave an infrastructure vulnerable and significantly increase the time it takes to detect and resolve issues, which can, in turn, impact service quality, customer experience, and even revenue.&lt;/p&gt;

&lt;p&gt;In conclusion, while Prometheus and Grafana are powerful, comprehensive monitoring solutions, achieving a setup that is complete, fit for purpose, and efficient often requires considerable expertise, time, and ongoing effort.&lt;/p&gt;

&lt;h2&gt;Datadog, Dynatrace, NewRelic (Commercial Centralized Monitoring references)&lt;/h2&gt;

&lt;p&gt;"&lt;em&gt;The Datadog pricing model is actually pretty easy. For 500 hosts or less, you just sign over your company and all its assets to them. If &amp;gt;500 hosts, you need to additionally raise VC money.&lt;/em&gt;" - wingerd33 on reddit&lt;/p&gt;

&lt;p&gt;Commercial monitoring SaaS solutions can offer considerable benefits, including ease of use, reduced maintenance, seamless integrations, and access to advanced features. However, these benefits come with their own challenges, including potentially high costs, the risk of vendor lock-in, data ownership, and privacy concerns.&lt;/p&gt;

&lt;p&gt;So, although a commercial solution is usually faster to deploy and easier to use, there are many hidden costs associated with it.&lt;/p&gt;

&lt;p&gt;Let’s see a few of them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complicated Cost Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Commercial solutions often come with substantial costs that scale with the size of your monitored environment. These costs can quickly add up, especially for larger infrastructures or for organizations that require access to high resolution data.&lt;/p&gt;

&lt;p&gt;Vendors often use their pricing models to differentiate their offerings, charging based on a variety of factors: the number of hosts, containers, or services being monitored; the volume of data ingested; the number of custom metrics; the number of high-resolution metrics; the length of data retention; and many more.&lt;/p&gt;

&lt;p&gt;On top of this, certain features or capabilities are often only available at a premium tier or for an additional charge, frequently leading to unexpected costs.&lt;/p&gt;

&lt;p&gt;All these make it hard to predict your costs accurately. It's a common complaint among users that the pricing of monitoring providers makes it challenging to predict and control costs. This means that optimizing your monitoring for cost can require a considerable amount of effort and expertise. You'll need a deep understanding of the tool's pricing model and the specifics of your environment. And as your environment and monitoring needs evolve, you'll likely need to revisit these decisions regularly.&lt;/p&gt;

&lt;p&gt;While these practices are not universal, and many vendors strive to be transparent and fair in their pricing, they are still not rare. When using commercial monitoring providers, cost optimization is an ongoing task that usually requires a level of effort and expertise comparable to maintaining an on-premises monitoring solution like Prometheus and Grafana.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hidden Indirect Costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Egress cost (or data transfer cost) can be a substantial hidden cost when using cloud-based monitoring services. Egress costs are the charges that you incur when transferring data from your cloud environment to another location, such as the monitoring provider's servers.&lt;/p&gt;

&lt;p&gt;Cloud providers like Amazon AWS, Google Cloud, or Microsoft Azure typically don't charge for inbound data transfer (ingress) but do charge for outbound data transfer (egress). These charges may seem negligible at first, particularly if data volumes are low or if the cloud provider offers a certain amount of free outbound transfer. However, when you're continuously streaming monitoring data out of your cloud environment to a SaaS monitoring solution, these costs can add up quickly.&lt;/p&gt;

&lt;p&gt;The specific costs will vary based on the cloud provider, the amount of data being transferred, and the geographic locations of the data source and destination. In some cases, transferring data between regions or out of the cloud provider's network entirely can be more expensive.&lt;/p&gt;

&lt;p&gt;These costs are typically separate from the fees charged by the monitoring solution itself. While the monitoring provider may charge based on the volume of data ingested or the number of hosts, services, or containers being monitored, they generally don't cover the costs of transferring the data to their service. That means you'll need to account for both the monitoring service's fees and your cloud provider's data transfer costs when calculating the total cost of the solution.&lt;/p&gt;

&lt;p&gt;So, if your monitoring strategy involves sending large volumes of data from your cloud environment to a SaaS monitoring service, it's essential to factor in the potential egress costs. Otherwise, you may find yourself facing unexpectedly high cloud bills.&lt;/p&gt;
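&lt;p&gt;A back-of-the-envelope calculation shows how quickly this adds up. The per-GiB price and per-host data rate below are purely illustrative, not any provider's actual price list:&lt;/p&gt;

```python
def monthly_egress_cost_usd(hosts, kib_per_host_per_sec, usd_per_gib=0.09):
    """Rough monthly cost of streaming metrics out of a cloud environment."""
    seconds_per_month = 30 * 24 * 3600          # ~2.6 million seconds
    total_kib = hosts * kib_per_host_per_sec * seconds_per_month
    total_gib = total_kib / (1024 ** 2)         # KiB -> GiB
    return total_gib * usd_per_gib

# 200 hosts, each streaming ~5 KiB/s of metrics off-cloud
print(round(monthly_egress_cost_usd(200, 5), 2))  # → 222.47
```

&lt;p&gt;Over 200 USD a month in transfer fees alone, before the monitoring vendor has charged anything at all.&lt;/p&gt;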

&lt;p&gt;&lt;strong&gt;Vendor Lock-In&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Vendor lock-in is another significant consideration when evaluating monitoring solutions, as it can lead to hidden costs down the line. It happens when a business becomes overly reliant on a particular vendor and finds it difficult to switch to another solution due to technical, financial, or contractual constraints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technical constraints are often the most obvious form of vendor lock-in. These occur when a business's systems are so tightly integrated with a particular vendor's product that switching to a different solution would be technically challenging. Examples of technical constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Proprietary Technology:&lt;/strong&gt; If the vendor's product uses proprietary technology or standards, it might not be compatible with other systems or tools. This can make it difficult to switch vendors without undergoing a significant technical overhaul.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Portability:&lt;/strong&gt; If the vendor's product doesn't offer a way to easily export your data, or if it stores data in a proprietary format, moving that data to a new system could be a significant technical challenge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Complexity:&lt;/strong&gt; If your monitoring setup is complex or highly customized, replicating it in a new system could be technically challenging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Financial Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Financial constraints can also contribute to vendor lock-in. These are situations where the cost of switching vendors is prohibitively high. Examples of financial constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upfront Costs:&lt;/strong&gt; Switching to a new vendor might require a significant upfront investment, for example, in new hardware or software.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transition Costs:&lt;/strong&gt; These include the costs of migrating data, reconfiguring systems, and training staff to use the new product.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Penalties and Fees:&lt;/strong&gt; Some vendors may charge fees for early termination of a contract or for exceeding certain usage limits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contractual Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Contractual constraints occur when the terms of a contract with a vendor make it difficult to switch to a different product. Examples of contractual constraints include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long-Term Contracts:&lt;/strong&gt; Some vendors might require you to commit to a multi-year contract, which can make it difficult to switch vendors before the contract term is up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Termination Clauses:&lt;/strong&gt; Some contracts might include termination clauses that require you to pay a fee if you end the contract early.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exclusivity Clauses:&lt;/strong&gt; Some contracts might prohibit you from using a competing product for a certain period of time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many cases, users fail to carefully consider these factors when choosing a vendor, and usually don’t have a contingency plan in place for switching vendors if necessary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can Netdata Help?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.netdata.cloud/"&gt;Netdata&lt;/a&gt; is designed with efficiency, scalability, and flexibility in mind, aiming to address most of the challenges associated with both open-source tools and commercial SaaS offerings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rapid and Easy Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netdata’s automated features allow for &lt;a href="https://www.netdata.cloud/features/rapid-setup-full-automation-easy-to-use/"&gt;quick setup&lt;/a&gt; and minimal maintenance. Its auto-discovery capabilities identify metrics data sources instantly, while its pre-configured alerts and ML-powered anomaly detection facilitate automated oversight. Furthermore, Netdata supports Infrastructure as Code (IaC) and offers templates for automated deployment of alerts, fostering an environment for advanced automation. This rapid setup minimizes infrastructure cost and time investment, eliminating the complexity associated with tools like Prometheus and Grafana and offering, for free, an experience similar to commercial solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy to Use, Opinionated&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unlike Prometheus and Grafana, where you have to build every dashboard metric by metric, Netdata visualizes every single metric collected, without exceptions. Metrics are also organized so that it is easy to navigate to and find any of them. Of course, some level of familiarity is required to understand this organization, but it is only familiarity: no skills are needed beyond a basic understanding of the table of contents and the search functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Easy Troubleshooting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While most tools, including commercial solutions, require extensive skills to create or modify dashboards, Netdata employs the NIDL (Nodes, Instances, Dimensions, and Labels) framework, allowing both newcomers and experts to quickly filter, slice, and dice any chart in all imaginable ways, in a point-and-click fashion. Furthermore, Netdata comes with a unique scoring engine (used by metric correlations and the anomaly advisor) to quickly reveal how different metrics relate to and affect one another, minimizing time to resolution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-Time, High-Fidelity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netdata's emphasis on &lt;a href="https://www.netdata.cloud/features/real-time/"&gt;real-time capabilities&lt;/a&gt; ensures rapid detection and response to incidents. Its high-resolution data with 1-second granularity allows you to uncover micro-behaviors of applications and services. The system's ability to collect and visualize data in less than a second significantly accelerates troubleshooting, offering significant time and cost efficiencies. Both of these features are offered as standard for every single metric, outperforming most other open-source and commercial solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability and Flexibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Netdata, you're not bound by the size or complexity of your infrastructure. Its &lt;a href="https://www.netdata.cloud/features/scalable-flexible/"&gt;vertical and horizontal scalability&lt;/a&gt; capabilities can meet the most demanding environments. &lt;/p&gt;

&lt;p&gt;Additionally, the smart database storage management for long term retention and the ability to set up multiple centralization points offer both efficiency and geographical freedom, addressing data sovereignty concerns and eliminating traffic egress related costs. With Netdata your data is always managed by the open-source Netdata agents you install inside your infrastructure. Your data, your way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customizable and User-Friendly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netdata provides extensive customization options, from a wide range of integrations to the ability to segment infrastructure into rooms and assign user roles and permissions. Additionally, Netdata is designed to be user-friendly, catering to both novice users and experts. The learning curve is gentle, requiring only familiarization with the organization of the dashboard and the configuration methodology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secure and Private&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netdata prioritizes &lt;a href="https://www.netdata.cloud/features/secure-by-design/"&gt;user data protection, being designed with a security-first approach&lt;/a&gt;. Its adherence to Open Source Security Foundation Best Practices ensures reliable and private operations, countering data privacy and security concerns often associated with commercial SaaS solutions. Even when users use Netdata Cloud, their metrics data is not transferred and is not stored in the cloud. Metrics data pass through Netdata Cloud for the needs of visualizations only when users view Netdata Cloud dashboards, and only for the charts they view.&lt;/p&gt;

&lt;p&gt;By combining all the above, Netdata not only meets but often exceeds the capabilities of other tools at a fraction of the cost.&lt;/p&gt;

</description>
      <category>netdata</category>
      <category>monitoring</category>
      <category>observability</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
