Amarachi Iheanacho for Eyer

Originally published at eyer.ai

The role of baselines in anomaly detection

Artificial intelligence and machine learning are quickly making their way into every facet of life, including art, customer service, engineering, and, more recently, anomaly detection, particularly through tools like Eyer.

Anomaly detection was once a repetitive and labor-intensive task, involving countless hours of poring over large datasets to identify irregularities. Tools like Eyer now leverage artificial intelligence to automate reading and analyzing those datasets to detect anomalies. But how do you determine whether a data point is an anomaly? What constitutes normal behavior? Baselines provide answers to these questions, and this article explores what baselines are and how they are used in anomaly detection and other fields.

What is a baseline?

In anomaly detection, baselines are reference points or models that represent how a system or dataset behaves under normal conditions. They are crucial for identifying deviations in the data that may indicate anomalies or outliers. A baseline includes lower and upper boundaries, creating a band within which a metric is expected to stay under normal conditions.

Baselines are typically created from historical data. They can be derived from the mean or median of the dataset, or defined using percentiles. For example, any data point outside the 5th or 95th percentile, which in this case serve as the lower and upper thresholds of the baseline, might be flagged as an anomaly.
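For illustration, here is a minimal sketch of such a percentile band in Python with NumPy; the values are made up, and the 5th/95th percentile cut-offs follow the example above:

import numpy as np

# Historical observations of a metric (illustrative values).
history = np.array([42, 45, 44, 47, 43, 46, 44, 45, 90, 44, 43, 46])

# Define the baseline band from the 5th and 95th percentiles.
lower, upper = np.percentile(history, [5, 95])

def is_anomaly(value: float) -> bool:
    """Flag any point that falls outside the percentile band."""
    return value < lower or value > upper

print(f"baseline band: [{lower:.1f}, {upper:.1f}]")
print(is_anomaly(90))  # True: far above the upper threshold
print(is_anomaly(45))  # False: inside the band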

You can also derive baselines with machine learning models such as linear regression or decision trees, which capture relationships in the data and highlight deviations from those relationships. Clustering algorithms like K-means can likewise define normal clusters of data points; points that don't fit well into any cluster can be considered anomalies.

While baselines are normally derived from historical data, note that they are not static: they are continuously updated as new data flows in.

Now that you understand baselines, let's dive into the various methods for identifying them for your dataset.

What is baseline detection?

As the name suggests, baseline detection is the process of discovering what the baseline for a dataset is. As briefly introduced in the previous section, there are several methods of baseline detection (two of which are sketched in code after the list below):

  • Statistical methods: These techniques rely on statistical properties of the data to define a range of normalcy; these techniques include:

    • Mean and standard deviation: This approach defines a normal range based on the mean value and its standard deviation. Data points outside a certain number of standard deviations from the mean can be considered anomalies.
    • Percentiles: This approach defines normal behavior using percentiles. For example, the 5th percentile and the 95th percentile might represent the lower and upper bounds of normal behavior. Points that fall outside this range are flagged as anomalies.
  • Time series analysis: When dealing with data collected over time, specific methods can be used to identify the underlying baseline trend:

    • Autoregressive models: These models predict future values based on past data points, essentially creating a baseline for what the next data point should look like.
    • Moving average: This method smooths out short-term fluctuations by averaging a series of past data points. This helps highlight the longer-term trends, making it easier to identify deviations from the baseline.
  • Machine learning models: Machine learning offers powerful tools to automatically learn the baseline from your data. Some of these tools are:

    • Simple models: Linear regression, for instance, can establish a baseline by capturing the underlying relationships within the data. Deviations from this baseline might indicate anomalies.
    • Clustering: Clustering algorithms like K-means can group similar data points together. Points that don't fit well into any cluster are potential outliers or anomalies.
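As referenced above, here is a compact sketch of two of these methods: one statistical (mean and standard deviation) and one from time series analysis (a moving average). The two-standard-deviation and 50% relative-deviation thresholds, like the data, are arbitrary choices for illustration:

import numpy as np

values = np.array([10.2, 10.5, 9.8, 10.1, 10.4, 18.9, 10.0, 9.9, 10.3, 10.2])

# Statistical method: flag points more than 2 standard deviations from the mean.
mean, std = values.mean(), values.std()
stat_anomalies = np.where(np.abs(values - mean) > 2 * std)[0]

# Time-series method: compare each point to a moving average of the
# preceding window and flag large relative deviations.
window = 3
moving_avg = np.convolve(values, np.ones(window) / window, mode="valid")
# moving_avg[i] averages values[i : i + window]; compare the next point to it.
ts_anomalies = [
    i + window
    for i, avg in enumerate(moving_avg[:-1])
    if abs(values[i + window] - avg) > 0.5 * avg
]

print("mean/std anomalies at indices:", stat_anomalies)   # [5]
print("moving-average anomalies at indices:", ts_anomalies)  # [5]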

Why does baseline detection matter?

Baselines and baseline detection matter for a range of applications, including:

  • Anomaly detection: This is one of the primary use cases that comes to mind when discussing baseline detection. By identifying data points or events that stray significantly from the established norm, anomaly detection helps us spot potential problems. This is crucial in domains like observability, where tools like Eyer leverage baselines to flag anomalies for further investigation.
  • Quality control: In manufacturing processes, baselines can be established for various parameters like temperature, pressure, or component dimensions. Baseline detection helps identify products deviating from these expected values, potentially indicating defects. This allows for early intervention and ensures product quality.
  • Predictive maintenance: Baseline detection can be used to monitor equipment performance over time. By establishing baselines for normal operating parameters such as vibration levels, temperature, and energy consumption, deviations can be identified before they become critical failures. This allows for proactive maintenance, minimizing downtime and repair costs.

Finding these deviations from the norm is precisely what makes baseline detection so valuable. Next, let's take a closer look at a specific use case, anomaly detection with Eyer, and dissect how this tool approaches baselining.

Eyer’s approach to baselining using multiple baselines

Eyer is an AI-powered observability tool that leverages baselining to discover anomalies in a system. Eyer approaches baselining in an interesting way: it recognizes that each metric is unique and treats it as such. For each metric, Eyer builds baselines using a combination of autoregressive and clustering models. These baselines, built from historical data, consist of upper and lower thresholds.

The term "baselines" is intentional because Eyer can build up to three baselines for a single metric: a primary (or main) baseline and one to two secondary baselines. These baselines can account for different normal behaviors of the same metric on the same day. For example, on some Mondays at noon, CPU utilization might be at 30%, while on others, it could be at 70%, and both are considered normal. However, if 30% utilization is slightly more frequent, it will be the primary baseline, with 70% as a secondary baseline.

The main baseline represents the most frequent behavior and is considered anomaly-free. The secondary baselines represent less frequent behaviors that could still be normal but might occasionally conceal some anomalies.
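To make the idea concrete, here is a minimal sketch of checking a metric value against a primary and a secondary baseline. This is not Eyer's implementation; the bands and values are invented to mirror the CPU example above:

from dataclasses import dataclass

@dataclass
class Baseline:
    lower: float  # lower threshold of the band
    upper: float  # upper threshold of the band

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper

# Hypothetical bands for CPU utilization at noon on Mondays: the primary
# baseline covers the most frequent behavior (~30%), a secondary baseline
# covers a rarer but still normal mode (~70%).
primary = Baseline(lower=25.0, upper=35.0)
secondaries = [Baseline(lower=65.0, upper=75.0)]

def classify(value: float) -> str:
    if primary.contains(value):
        return "normal (primary baseline)"
    if any(b.contains(value) for b in secondaries):
        return "normal (secondary baseline)"
    return "outside all baselines"

print(classify(31.0))  # normal (primary baseline)
print(classify(70.0))  # normal (secondary baseline)
print(classify(95.0))  # outside all baselines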

The thresholds that make up baselines are learned automatically and are dynamic: they are learned and relearned from past behavior, and they adapt whenever the system's behavior changes. There is therefore no need for manual work to set up the monitoring thresholds, as the algorithm learns them by itself.

But what role does baselining play in an Eyer anomaly alert?

How does Eyer build out an anomaly alert using baselining?

With these multiple baselines defining normal behavior, it becomes easier to spot anomalies in the data.

It is easy to assume that any data point outside the established baselines is an anomaly, but it isn't always marked as one. The data point's behavior needs to meet a couple of requirements before being classified as an anomaly.

The verification phase determines whether a deviation is an anomaly. In the first part of this phase, some deviations can be ruled out through trend analysis. For example, if the data points are only slightly outside the baselines but the overall trend appears normal, they are not treated as deviations and are therefore not flagged as anomalous.

After this, a 15-minute verification window is used to monitor data for anomalies. If data deviates from normal behavior for at least 8 minutes within this window, that behavior is classified as anomalous, and the corresponding data point is flagged as an anomaly.

Conversely, if a data point falls outside the baseline for less than 8 minutes within the 15-minute verification window, the anomaly is considered closed.
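The window rule fits in a few lines of code. This sketch assumes one sample per minute, which the description above does not specify:

WINDOW_MINUTES = 15
MIN_DEVIATING_MINUTES = 8

def is_anomalous(deviation_flags: list[bool]) -> bool:
    """deviation_flags: per-minute booleans over the 15-minute window,
    True when the sample fell outside the baselines."""
    assert len(deviation_flags) == WINDOW_MINUTES
    return sum(deviation_flags) >= MIN_DEVIATING_MINUTES

# 9 of 15 minutes outside the baselines -> classified as an anomaly.
print(is_anomalous([True] * 9 + [False] * 6))   # True
# Only 4 deviating minutes -> the anomaly is considered closed.
print(is_anomalous([True] * 4 + [False] * 11))  # False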

However, identifying a data point as anomalous is just the beginning. The next step is figuring out how anomalous that data point really is.

Classification of anomaly alerts
An alert can include anomalies on several metrics. Each anomaly on each metric has an assigned severity. The overall severity of the alert is based on the severity of the anomalies contained in the alert.

The severity of the anomaly on a single metric
After confirming a data point as an anomaly, Eyer assigns it a weight based on how significantly it deviates from the baselines. These weights are categorized as follows:

  • Maximum weight: A data point receives a maximum weight of 2 if it exists far outside all predefined baselines.
  • Medium weight: This weight, valued at 1, is assigned to a data point that exists beyond the primary baseline but remains within one of the secondary baselines.
  • Zero weight: When a data point temporarily returns to the main baseline after deviating, it receives a weight of zero.

Over however long an anomaly remains outside the main or secondary baselines, these weighted deviations are averaged to form the anomaly's history. This average of weighted deviations is then translated into an anomaly score ranging from 0 to 100, where 0 indicates a critical anomaly and 100 indicates an anomaly-free state.
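Here is a hedged sketch of that calculation. The linear mapping from the average weight to a 0-100 score is an assumption for illustration; the exact scaling Eyer uses is not documented here:

# Weight 0 = back inside the main baseline, 1 = within a secondary
# baseline, 2 = outside all baselines (per the list above).
def anomaly_score(weights: list[int]) -> float:
    avg = sum(weights) / len(weights)  # average weighted deviation
    # Assumed linear scaling: average weight 0 -> score 100 (normal),
    # average weight 2 -> score 0 (critical).
    return 100.0 * (1.0 - avg / 2.0)

print(anomaly_score([0, 0, 1, 0]))  # mostly normal -> high score (87.5)
print(anomaly_score([2, 2, 2, 1]))  # far outside baselines -> low score (12.5)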

This anomaly score, which you can refer to as AS, describes the severity of a data point's behavior and the likelihood that it is anomalous and potentially impactful. The higher the AS, the less likely the behavior is anomalous. Here's a breakdown of what the AS signifies:

  • AS > 85: No anomaly. Anomaly scores above 85 indicate that the behavior in the data point can be thought of as primary expected behavior, with minor deviations.
  • 60 < AS <= 85: Low severity. If the anomaly score is greater than 60 and less than or equal to 85, it indicates a low-severity anomaly. This means the data point exhibits minor anomalous behaviors similar to those observed in recent days, weeks, and months. Although the likelihood of the behavior being an anomaly is low, it may occasionally conceal anomalous behavior.
  • 30 < AS <= 60: Medium severity. If the anomaly score is between 30 and 60, it indicates a medium-severity anomaly. This means that the data point behavior may be anomalous but also resembles patterns seen previously, making it less certain as an anomaly.
  • AS <= 30: Severe. If the anomaly score is less than or equal to 30, the anomaly is severe. This means there is a prevalence of new, previously unseen behavior.
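These bands translate directly into a small classification function:

def severity(as_score: float) -> str:
    """Map an anomaly score (AS) to the severity bands listed above."""
    if as_score > 85:
        return "no anomaly"
    if as_score > 60:
        return "low"
    if as_score > 30:
        return "medium"
    return "severe"

print(severity(87.5))  # no anomaly
print(severity(12.5))  # severe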

In addition to classifying anomalies by their severity, another strength of Eyer's anomaly detection is that metrics are not learned only in isolation. Eyer also uses correlations to group related metrics and their anomalies together, combining them into a single alert and making root cause analysis easier.

Correlations in Eyer alerts
Correlations help describe the degree to which two or more variables move in relation to one another. In Eyer, correlations help identify how different metrics influence each other or exhibit similar patterns.

Most metrics have a natural correlation. For example, Process CPU is correlated with the number of executions. This is because each execution of a process consumes CPU resources. As the number of executions increases, the cumulative CPU load from these executions also increases.
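The strength of such a relationship can be quantified with a Pearson correlation coefficient, as in this short sketch with made-up values:

import numpy as np

# Illustrative series: process executions per minute and process CPU load.
executions = np.array([10, 12, 15, 14, 20, 22, 25, 24])
cpu_load   = np.array([0.21, 0.25, 0.31, 0.29, 0.42, 0.45, 0.52, 0.50])

# Pearson correlation coefficient: close to 1 means the metrics move together.
r = np.corrcoef(executions, cpu_load)[0, 1]
print(f"correlation: {r:.3f}")  # ~0.99 for these made-up values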

After using these baselines to identify anomalies in a metric, determining the severity of those anomalies, and understanding which metrics might be affected by correlations, Eyer packages all this information and sends it out in a comprehensive and succinct alert.

You can see an example of an Eyer anomaly alert in the code block below:

{
  "new": [],
  "updated": [
    {
      "severity": "medium",
      "started": "2024-06-26T18:43:00Z",
      "ended": null,
      "updated": "2024-06-26T19:27:00Z",
      "id": "667c6193d58419f64f4cb403",
      "items": [
        {
          "node": {
            "id": 64,
            "name": "Operating System. undefined",
            "system": {
              "id": 1,
              "name": null
            }
          },
          "metrics": [
            {
              "id": "2ce746c5-1ee3-45d1-b23f-bae56bc5d51a",
              "name": "Committed Virtual Memory Size",
              "metric_type": "int",
              "aggregation": "avg",
              "severity": "severe",
              "started": "2024-06-26T18:42:00Z",
              "updated": "2024-06-26T19:12:00Z"
            },
            {
              "id": "5523ee20-2af2-4b8e-8390-3d2cb4410018",
              "name": "System CPU Load",
              "metric_type": "double",
              "aggregation": "avg",
              "severity": "medium",
              "started": "2024-06-26T19:25:00Z",
              "updated": "2024-06-26T19:26:00Z"
            },
            {
              "id": "a59df24a-e9ec-4c4c-a087-ea1375d4b9c7",
              "name": "Process CPU Load",
              "metric_type": "double",
              "aggregation": "avg",
              "severity": "medium",
              "started": "2024-06-26T19:26:00Z",
              "updated": "2024-06-26T19:27:00Z"
            }
          ]
        }
      ]
    }
  ],
  "closed": [
    {
      "severity": "low",
      "started": "2024-06-26T18:49:00Z",
      "ended": "2024-06-26T19:37:00Z",
      "updated": "2024-06-26T19:37:00Z",
      "id": "667c62f7d58419f64f4cb426",
      "items": []
    }
  ]
}

In the alert above, you have the new, updated, and closed arrays of alerts. Check out Alerts - structure and data explained to understand the structure of the alerts.

According to this alert, an anomaly update has happened in the Operating System node.

This anomaly alert has an overall severity of medium. It contains one severe anomaly, on the Committed Virtual Memory Size metric, while the other metrics in the alert, System CPU Load and Process CPU Load, have medium anomalies.

The metrics array, which contains both affected and correlated metrics, shows anomalies in the Committed Virtual Memory Size, System CPU Load, and Process CPU Load metrics.
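If you want to process such alerts programmatically, a few lines of Python are enough to walk this structure (assuming the payload above is saved as alert.json):

import json

# Pull out each updated alert's overall severity and its per-metric
# severities; the field names match the example payload above.
with open("alert.json") as f:
    alert = json.load(f)

for entry in alert["updated"]:
    print(f"alert {entry['id']}: overall severity {entry['severity']}")
    for item in entry["items"]:
        for metric in item["metrics"]:
            print(f"  {metric['name']}: {metric['severity']}")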

Conclusion

This article has helped you understand the role that baselining plays in machine learning, specifically anomaly detection using historical data.

While "baseline" might seem like a simple reference point, it is the foundation upon which many crucial models and their results are built. Anomaly detection tools like Eyer use baselines to determine if a data point's behavior is anomalous and to gauge the extent of the anomaly. This discernment sets the stage for proactive monitoring and timely intervention, ensuring system reliability and performance.

To learn more about Eyer baselines and start using the Eyer anomaly detection solution, visit the Eyer website.
