Listen to a silence

#monitoring #observability

Monitoring and observability start with collecting telemetry and other related data. Providers have spent years and millions of man-hours designing, advancing, and polishing their data processing and analysis capabilities. Databases from multiple vendors have been tuned almost to perfection, which lets those systems store and consume more and more data. Various collectors and protocols have been designed to help gather data for that processing and analysis. And the more complex the systems, applications, and integrations we bring into existence, the more telemetry data they produce. So total observability seems directly tied to processing ever-larger quantities of data. At least, cohorts of Architects, Sales Engineers, Technologists, and other Monitoring and Observability specialists constantly try to convince us that this is the case. And I am not saying they are all wrong.
In most cases, you need data to support your discoveries and conclusions. But I assure you that you only get a complete picture of your data once you also begin to look at the gaps in it. The data is a valuable resource that helps you reach specific findings about the state of your environment, but “no data,” the gaps in the data, is equally valuable for those assessments.

What is “no data”? A gap in the data means that your environment was expected to deliver metrics to your observability platform but failed to do so for one reason or another. We observe a data gap if a source provides no data for some time. And the gap in the data, otherwise called “no data,” is data too. A gap has a time at which we discover the “no data” condition, a period that defines how long we have not been receiving data, and a key indicating which data we are not receiving. Clearly, “no data” is an event. Can every gap in receiving a specific data type be treated as one? Of course not. Not all data types can be considered a source of “no data” events. Regular metric collection can, but if the data itself is irregular, such as a log or an irregular event, you cannot use its absence as an indication of a gap. If desired, though, comparing the timestamp of when that non-periodic data was generated with when it was accepted can still be an authoritative source of a “no data” event.
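To make the idea of a gap as an event more concrete, here is a minimal Python sketch. It assumes a regularly collected metric with a known collection interval; the names `Gap` and `find_gaps` and the `tolerance` factor are my own illustrative choices, not part of any particular observability product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    key: str         # which data we are not receiving
    start: float     # timestamp of the last sample before the silence
    duration: float  # how long no data arrived, in seconds


def find_gaps(key, timestamps, interval, tolerance=1.5, now=None):
    """Return a Gap for every silence longer than tolerance * interval.

    Assumes `timestamps` are arrival times (in seconds) of a metric that is
    normally collected every `interval` seconds.
    """
    ordered = sorted(timestamps)
    if now is not None:
        # Treat "now" as the far edge so an ongoing silence is also caught.
        ordered.append(now)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > interval * tolerance:
            gaps.append(Gap(key=key, start=prev, duration=curr - prev))
    return gaps


# Hypothetical metric collected every 60 seconds, with two missed samples.
gaps = find_gaps("cpu.load", [0, 60, 120, 300, 360], interval=60)
```

The tolerance factor is there so that ordinary jitter in arrival times is not reported as a gap; the right value depends on how regular your collection really is.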

So, “no data” is data. But what can “no data” tell us? The “no data” condition can mean multiple things, but most commonly one of two: either the data could not be collected, leading to a “true gap,” or the data could not be delivered, leading to a “delivery gap.” What is the difference between those two? A “true gap” means that the data was never collected. There is no data, and it will not appear later on. A “true gap” will remain a gap for all eternity. A “delivery gap” usually means that, due to some issue, collected telemetry data could not reach its intended destination in time. A “delivery gap” does not necessarily lead to a “true gap,” because some solutions in the Monitoring and Observability domain cache collected data and repeatedly try to deliver it until delivery is possible again.
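The difference between the two kinds of gaps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: every sample carries both a generation timestamp and a reception timestamp, and the function name `classify_gap` is hypothetical.

```python
def classify_gap(gap_start, gap_end, samples):
    """Classify a silence between gap_start and gap_end.

    `samples` is an iterable of (generated_at, received_at) pairs.
    If any sample was generated inside the window but only received once
    the window was over, the data existed all along and was merely late.
    """
    for generated_at, received_at in samples:
        if gap_start < generated_at < gap_end and received_at >= gap_end:
            return "delivery gap"
    return "true gap"


# Cached samples generated during the silence and delivered afterwards.
late = classify_gap(120, 300, [(180, 310), (240, 311)])
# Nothing was ever generated inside the window.
missing = classify_gap(120, 300, [(60, 61), (360, 361)])
```

Keep in mind that a verdict of "true gap" is only as final as your patience: cached data may still show up later, so it makes sense to wait for the retry window of your collectors before treating the gap as permanent.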

Detection of the “true gap” is either straightforward or extremely difficult. How should your Observability platform detect and react to a “true gap” in the data? Suppose the source is generally detectable, for example, the source host is pingable, but some data is missing. This indicates a situation that causes a “true gap.” The complexity of finding the origins of “true gaps” comes from the difficulty of seeing problems within the integrations that collect the data from the source. Commonly, “true gaps” are caused by failures in the integration layer, so you must flag an issue and verify that the integration is functioning correctly. If you collect telemetry through a proxy, a “true gap” in the telemetry could also point to a problem with the collector proxy. Checking the proxy's status and health, and restarting it if necessary, could be a reasonable option. But as I mentioned, detecting the root causes of “true gaps” can be challenging. Sometimes very challenging. Frequently, the root causes are hidden in third-party integration configurations and implementations. Establishing some form of monitoring over the integrations should provide a path to combat those challenges.
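The triage order described above could look roughly like the following Python sketch. The function `triage_missing_data` and its returned labels are hypothetical, and real root-cause analysis is rarely this tidy; the point is only the order in which the suspects are checked.

```python
def triage_missing_data(source_reachable, proxy_healthy=None):
    """Suggest where to look first when expected data is missing.

    `proxy_healthy` is None when telemetry is not collected through a proxy.
    """
    if not source_reachable:
        # The host itself does not answer, so the source is the first suspect.
        return "source"
    if proxy_healthy is False:
        # The source is up, but the collector proxy is not.
        return "proxy"
    # The source is up and the proxy (if any) looks fine:
    # the integration layer is the most likely culprit.
    return "integration"


first_suspect = triage_missing_data(source_reachable=True, proxy_healthy=True)
```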

Detection of the “delivery gap” is usually much more straightforward than detection of the “true gap,” as it is commonly tied to problems on the network or with collection proxies. Monitoring your network performance and availability is one of your key responsibilities. If you detect that part of your network is inaccessible, or that your telemetry delivery proxy is not online, be ready for a “delivery gap.” Ensure your alert policies have dependencies, and silence “delivery gap” detection if a known root cause has already been detected. When you choose your observability solutions, make sure you pick one that supports telemetry caching.
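Alert dependencies can be as simple as a lookup of what each metric relies on. The sketch below is an assumption of mine about how such a policy might be expressed; `should_alert`, the dependency map, and the component names are all made up for illustration.

```python
def should_alert(gap_key, dependencies, known_down):
    """Return False when an upstream component of gap_key is already known to be down.

    `dependencies` maps a metric key to the components its delivery relies on.
    `known_down` is the set of components with an already detected outage.
    """
    upstream = dependencies.get(gap_key, ())
    return not any(component in known_down for component in upstream)


# Hypothetical dependency map: the metric travels through a proxy and a network segment.
dependencies = {"host-a.cpu": ["proxy-1", "net-east"]}

silenced = should_alert("host-a.cpu", dependencies, known_down={"proxy-1"})
raised = should_alert("host-a.cpu", dependencies, known_down=set())
```

With a policy like this, one proxy outage produces one alert about the proxy instead of a flood of “delivery gap” alerts for everything behind it.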

Detecting either a “true gap” or a “delivery gap” in the data and reacting properly to those conditions must be among the standard solutions you implement in your Observability platform. Your platform will function accurately only if you guarantee proper data delivery and adequately respond to and address all related issues.
