ObservabilityGuy

Posted on Feb 26

From Symptoms to Root Causes: How MetricSet Explorer Reinvents the Metric Analysis Experience

#analytics #ai

This article introduces MetricSet Explorer, a metric analysis platform that shifts from passive display to proactive root cause discovery via intellig.

The Shift from Observing Metrics to Understanding Them

1.1 Metric Flood and Analysis Deficit
With the full-scale migration of business to the cloud and the widespread adoption of microservices architectures, we are entering an era of "hyper-scale observability." Every part of a system generates massive volumes of metric data, which is crucial for gauging system health. However, this abundance of data has given rise to a new challenge: the "metric flood." O&M teams and site reliability engineers (SREs) find themselves drowned in countless monitoring dashboards and endless alerts, suffering from "alert and dashboard fatigue."

At their core, conventional monitoring systems are merely data display platforms. They accurately retrieve data from time-series databases, plot it as curves, and present it to users.

This approach implies a key assumption: that the user knows what to look for and can independently figure out the root cause from a maze of complex curves. That assumption holds for small systems with few dimensions.

Today, however, a single service can easily encompass hundreds or even thousands of instances, each with dozens of dimensional labels, such as region, zone, and version number. This means a single metric could represent tens of thousands or even millions of independent time series. When an issue occurs, relying on the human eye to sift through them all is like searching for a needle in a haystack. We face a severe "analysis deficit": possessing vast amounts of data, yet lacking the ability to efficiently extract actionable insights from it.

1.2 From Passive Displays to Proactive Navigators
To overcome this dilemma, monitoring tools must undergo a fundamental paradigm shift: an evolution from passive "data displays" to proactive "analytical navigators." We believe that the value of a modern metric analysis platform lies not merely in displaying metrics, but more centrally in helping users understand them. It should function like an experienced SRE—capable of proactively detecting anomalies in massive datasets and guiding users step-by-step to pinpoint root causes.

MetricSet Explorer is designed precisely around this philosophy. Its core idea is to combine proven machine learning algorithms with the troubleshooting expertise of O&M specialists and to productize and automate complex analysis processes. We have built three intelligent analysis engines that together form a powerful analytical "funnel," helping users rapidly filter, focus on, and locate issues within the sea of metric data.

● Anomaly detection engine: Acting as the entry point to the funnel, it automatically inspects all metrics, using statistical algorithms to identify those exhibiting unusual behavior patterns, and highlighting them for the user. It serves as the first filter that differentiates the anomalous from the normal.

● Time-series clustering engine (smart grouping): When a user needs to understand the behavior patterns of different individuals within a dimension, such as the CPU utilization of thousands of pods, this engine automatically groups hundreds or thousands of curves based on pattern similarity. This helps the user quickly identify different "classes of players" within the system, accomplishing pattern recognition from individuals to groups.

● Root cause localization engine (smart drill-down): This is the narrowest part of the "funnel" and the most technically sophisticated component. Once a user selects an anomalous time period, the engine analyzes the contribution of all possible dimension combinations to the overall anomaly. It then directly tells the user which dimension combination is the "culprit" by providing a "root cause score."

These three engines work in concert, transforming the highly manual, experience-dependent analysis process of conventional monitoring into an automated, reproducible analytical workflow.

Interface Layout and Functional Areas

The product interface is divided into three primary areas: the top toolbar, the metric overview area, and the detailed analysis area. This layout ensures a clear information hierarchy while allowing users to quickly switch between different analysis scenarios.

The top toolbar serves as the central control hub for the entire system. From left to right, it contains a storage selector, a metric search bar, a label filter, and an advanced features area. The storage selector allows users to switch between multiple data sources, which is particularly useful for cross-cluster or cross-environment analysis. The metric search bar supports fuzzy matching, enabling users to quickly locate target metrics by searching for their ID or name in Chinese or English.

The label filter is a powerful yet user-friendly feature. In the field of observability, labels are the core dimensions of data, such as service name, region, instance ID, and so on. MetricSet Explorer's global label filter can be applied to all metrics simultaneously, enabling users to easily focus on data within a specific scope.
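To make the idea of a global label filter concrete, here is a minimal sketch in Python. The data shape (a list of dictionaries with `metric`, `labels`, and `values` keys) and the function name `apply_label_filter` are assumptions made for illustration; they are not MetricSet Explorer's actual data model or API.

```python
def apply_label_filter(series_list, label_filter):
    """Keep only series whose labels match every key/value in label_filter."""
    return [
        s for s in series_list
        if all(s["labels"].get(k) == v for k, v in label_filter.items())
    ]


# Hypothetical series from two different metrics.
all_series = [
    {"metric": "request_latency", "labels": {"service": "order", "region": "south"}, "values": [120, 130]},
    {"metric": "request_latency", "labels": {"service": "cart", "region": "north"}, "values": [80, 82]},
    {"metric": "error_rate", "labels": {"service": "order", "region": "south"}, "values": [0.01, 0.02]},
    {"metric": "error_rate", "labels": {"service": "order", "region": "north"}, "values": [0.0, 0.0]},
]

# A global filter is applied to every metric at once, not to a single chart.
scoped = apply_label_filter(all_series, {"service": "order", "region": "south"})
print([s["metric"] for s in scoped])
```

Because the filter is evaluated against the labels rather than the metric name, one scope selection narrows every chart on the page in the same way.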

The following table describes three practical features provided in the advanced features area.

Metric Overview Modes

When you enter the system, the metric overview tab is displayed by default. Two display modes are supported: normal view and anomaly view.

In the normal view, metrics are displayed in two categories: golden metrics and basic metrics. Golden metrics are typically the core indicators most representative of system health, such as request latency, error rate, and throughput. This categorization method, rooted in SRE best practices, helps users quickly grasp the key state of the system.

When anomaly detection is enabled, the interface automatically switches to the anomaly view. In this mode, the system runs anomaly detection algorithms on all metrics and sorts them by their anomaly score in descending order. For each metric, anomalies are highlighted with a special color, and the anomaly score is clearly marked. This feature is particularly useful for troubleshooting. When an alert is triggered, O&M engineers can quickly enable anomaly detection, and the system will automatically prioritize the metrics most likely to be problematic.
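The ranking behavior of the anomaly view can be sketched as follows. The article does not say which statistical algorithm the product uses, so this example assumes a plain z-score of the most recent points against the earlier baseline; the function names and sample metrics are hypothetical.

```python
from statistics import mean, pstdev


def anomaly_score(values, recent=3):
    """Largest |z-score| of the last `recent` points against the earlier baseline."""
    baseline, tail = values[:-recent], values[-recent:]
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        sigma = 1e-9  # flat baseline: avoid dividing by zero
    return max(abs(v - mu) / sigma for v in tail)


def rank_by_anomaly(metrics, recent=3):
    """Return (metric name, score) pairs sorted by score in descending order."""
    scored = [(name, anomaly_score(vals, recent)) for name, vals in metrics.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical metric samples; only the latency series spikes at the end.
metrics = {
    "order_create_latency_ms": [100, 102, 98, 101, 99, 100, 180, 210, 190],
    "cpu_utilization_pct": [40, 42, 41, 39, 40, 41, 43, 42, 44],
    "throughput_rps": [500, 505, 495, 500, 498, 502, 499, 501, 500],
}

for name, score in rank_by_anomaly(metrics):
    print(f"{name}: {score:.1f}")
```

A production detector would also need to handle seasonality and trend, which a fixed baseline window does not.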

Each metric card on the overview tab not only displays the curve but also provides shortcuts for quick actions. Clicking a card takes you into the detailed analysis mode to begin a deeper exploration.

Detailed Analysis Mode

Detailed analysis is the core capability of MetricSet Explorer. After you select one or more metrics, the interface enters detailed analysis mode, displaying a larger chart and three powerful analysis tabs: drill-down analysis, smart grouping, and smart drill-down.

4.1 Drill-down Analysis
Drill-down analysis is the most commonly used exploration method. Its logic is intuitive: from the whole to the parts, layer by layer.

For example, suppose we notice a spike in the request latency metric. First, we click the metric on the overview tab to enter detailed analysis, where we see the globally aggregated curve. Next, we select a dimension to drill down by, for example, grouping by "service." The system immediately displays the latency curve for each service, and we are likely to find that one particular service has exceptionally high latency.

To dive deeper, we select this anomalous service and then drill down further by "call type."

By analyzing layer by layer, we can eventually pinpoint the specific problematic call. MetricSet Explorer supports multi-level drill-down, with each level retaining the filters from the previous one, forming a complete analytical chain.

It also supports drill-down in ALL mode, a highly practical feature. In ALL mode, the system automatically analyzes all drillable dimensions to find the ones with the most significant data distribution differences. This is particularly helpful when many dimensions are involved and you are unsure where to begin your analysis.
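One way to picture what ALL mode does is to rank dimensions by how much their groups differ from one another. The sketch below assumes a simple measure, the standard deviation of per-group means divided by the overall mean; the product's real criterion for "distribution differences" is not documented in this article, and the function names and rows are illustrative.

```python
from collections import defaultdict
from statistics import mean, pstdev


def dimension_spread(rows, dimension, value_key="latency_ms"):
    """Spread of per-group means for one dimension, normalized by the overall mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[value_key])
    group_means = [mean(vals) for vals in groups.values()]
    overall = mean(row[value_key] for row in rows)
    return pstdev(group_means) / overall if overall else 0.0


def rank_dimensions(rows, dimensions, value_key="latency_ms"):
    """Sort dimensions so the one that separates the data most comes first."""
    scores = {d: dimension_spread(rows, d, value_key) for d in dimensions}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical latency observations: services differ a lot, regions barely.
rows = [
    {"service": "order", "region": "north", "latency_ms": 400},
    {"service": "order", "region": "south", "latency_ms": 420},
    {"service": "cart", "region": "north", "latency_ms": 90},
    {"service": "cart", "region": "south", "latency_ms": 110},
]

print(rank_dimensions(rows, ["service", "region"]))
```

Here "service" ranks first, which matches the intuition that drilling down by service is the more informative next step.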

4.2 Smart Grouping
Sometimes, our focus is not on the performance of a specific dimension value but on discovering patterns or groups within the data. The Smart Grouping feature is designed precisely for this purpose.

Smart Grouping is based on time-series clustering algorithms. The user selects one or more dimensions to analyze, and the system clusters all time series based on the similarity of their patterns. The final result is presented as several groups, with each group containing curves of similar patterns.

This feature is particularly valuable for capacity planning and resource optimization scenarios. For instance, when analyzing the CPU utilization of numerous instances, Smart Grouping helps you quickly identify high-load, medium-load, and low-load instance groups, allowing for targeted resource adjustments.
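As a rough illustration of grouping instances by load level, the sketch below runs plain k-means with Euclidean distance on raw CPU curves. This is an assumption for the sake of the example: the article only says that Smart Grouping uses time-series clustering algorithms, and the pod names and values are made up.

```python
from math import dist
from statistics import mean


def kmeans_series(series, k, iterations=20):
    """Cluster equal-length series with plain k-means on Euclidean distance."""
    names = sorted(series, key=lambda n: mean(series[n]))
    # Deterministic init: pick k series evenly spaced by average level.
    step = (len(names) - 1) / (k - 1) if k > 1 else 0
    centroids = [list(series[names[round(i * step)]]) for i in range(k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            n: min(range(k), key=lambda c: dist(series[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [series[n] for n in names if assignment[n] == c]
            if members:
                # New centroid is the point-wise mean of the member curves.
                centroids[c] = [mean(col) for col in zip(*members)]
    groups = {}
    for n, c in assignment.items():
        groups.setdefault(c, []).append(n)
    return list(groups.values())


# Hypothetical CPU utilization curves (percent) for six pods.
cpu = {
    "pod-a": [85, 88, 90, 87],
    "pod-b": [82, 86, 91, 89],
    "pod-c": [50, 52, 49, 51],
    "pod-d": [48, 51, 50, 53],
    "pod-e": [10, 12, 11, 9],
    "pod-f": [12, 11, 10, 13],
}

groups = kmeans_series(cpu, k=3)
print(groups)
```

Clustering on raw values groups curves by level, which suits the high/medium/low-load example. To group by shape instead, each series could be normalized (for example, to zero mean and unit variance) before clustering.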

The clustering results are presented in a table, where each row represents a group. The table columns include:

● Group ID: the automatically assigned group number.

● Member: the number of time series in the group and the typical dimension values of its members.

● Curve Preview: a representative curve for that group.

You can click any group to expand it and view a detailed member list and a full curve comparison.

4.3 Smart Drill-down
Smart drill-down is the most technically advanced feature of MetricSet Explorer, capable of performing automated root cause localization.

To use this feature, you first need to select an anomalous time period on the timeline. Based on this period, the system runs the series_drilldown algorithm, automatically analyzing all dimension combinations to identify the specific dimension values that contribute most to the anomaly.

The final results are presented in a table, sorted by root cause score in descending order. Each row contains:

● Root cause pattern: The dimension combination that caused the anomaly, such as "Region=North China,Zone=Zone A."

● Confidence: The contribution level of this pattern to the overall anomaly, expressed as a value ranging from 0 to 1.

● Impact curve: The data curve for this pattern.

● Comparison baseline: The aggregated curve of all other series, excluding the pattern.

This feature dramatically reduces fault localization time. Using conventional methods, an O&M engineer might need to test over a dozen dimension combinations to find the root cause, whereas smart drill-down delivers an answer in just a few seconds.
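The internals of series_drilldown are not described here, so the following is only an illustrative heuristic, not the product's algorithm. It scores each dimension combination by the share of the total change it explains minus the share of baseline volume it holds, so that a small slice explaining most of the anomaly scores highest. The function name, field names, and sample rows are all assumptions.

```python
from itertools import combinations


def root_cause_candidates(rows, dimensions, max_order=2):
    """Score every dimension-value combination of up to `max_order` dimensions."""
    total_base = sum(r["baseline"] for r in rows)
    total_delta = sum(r["anomalous"] - r["baseline"] for r in rows)
    results = []
    for order in range(1, max_order + 1):
        for dims in combinations(dimensions, order):
            patterns = {tuple(r[d] for d in dims) for r in rows}
            for values in sorted(patterns):
                members = [r for r in rows if tuple(r[d] for d in dims) == values]
                delta = sum(r["anomalous"] - r["baseline"] for r in members)
                # Share of the overall change explained by this pattern.
                explained = delta / total_delta if total_delta else 0.0
                # Share we would expect from its baseline volume alone.
                expected = sum(r["baseline"] for r in members) / total_base if total_base else 0.0
                score = max(0.0, min(1.0, explained - expected))
                label = ",".join(f"{d}={v}" for d, v in zip(dims, values))
                results.append((label, round(score, 3)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical error counts per series, before and during the anomalous period.
rows = [
    {"region": "South China", "db": "db-05", "baseline": 10, "anomalous": 110},
    {"region": "South China", "db": "db-06", "baseline": 10, "anomalous": 12},
    {"region": "North China", "db": "db-05", "baseline": 10, "anomalous": 11},
    {"region": "North China", "db": "db-06", "baseline": 10, "anomalous": 9},
]

for label, score in root_cause_candidates(rows, ["region", "db"])[:3]:
    print(label, score)
```

In this toy data the combined pattern "region=South China,db=db-05" outranks either dimension alone, because it explains almost all of the change while holding only a quarter of the baseline.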

Advanced Features and Tips

5.1 Multi-metric Comparison
In detailed analysis mode, you can add multiple metrics simultaneously for comparison. This is highly useful when analyzing the correlation between metrics. For example, viewing CPU utilization and request latency together can help you intuitively determine if a performance bottleneck is resource-related.
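The visual comparison can be backed by a number. As a minimal sketch, the Pearson correlation coefficient between two aligned series indicates how closely they move together; the sample values below are hypothetical, and the article does not state that MetricSet Explorer computes this itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(vx * vy)


# Hypothetical samples taken at the same timestamps.
cpu_utilization = [35, 40, 55, 70, 85, 90]
request_latency_ms = [110, 115, 140, 190, 260, 300]

r = pearson(cpu_utilization, request_latency_ms)
print(f"correlation: {r:.2f}")
```

A value close to 1 suggests that latency rises with CPU load, which supports (but does not prove) a resource-related bottleneck.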

5.2 Query Statement Viewing
For technical users, MetricSet Explorer provides a feature to view query statements. Clicking Query in the upper-right corner of a chart reveals the complete query statement used to generate it. This not only helps in understanding the data source but also facilitates the migration of analysis logic to other platforms or scripts.

5.3 Chart Interaction

You can perform a rich set of interactive operations on the charts:

● Zoom: You can select a specific time range with your mouse to zoom in for a closer view.

● Hover tooltip: You can hover the mouse over the chart to view precise values and timestamps.

● Legend control: You can click items in the legend to hide or show the corresponding curves.

● Collapse/Expand: You can collapse the chart area to focus on the analysis results.

Typical Use Cases

Let's illustrate the value of MetricSet Explorer through a few practical scenarios.

Use Case 1: Rapid Fault Localization
An e-commerce platform receives numerous alerts during a promotional event, indicating that the order service response time has exceeded the threshold. An O&M engineer opens MetricSet Explorer and performs the following steps:

1. Enable anomaly detection. The system automatically ranks the order creation latency metric at the top.
2. Enter detailed analysis, select the anomalous time period, and initiate smart drill-down.
3. After analysis, the system reports the root cause: Region=South China + Database Instance=db-05.
4. Confirm that the db-05 instance is experiencing a disk I/O bottleneck and immediately perform a traffic switch to resolve the issue.

Use Case 2: Capacity Planning
An SRE team needs to evaluate whether to scale up its Redis cluster. They use the smart grouping feature:

1. Select the redis_memory_usage metric and perform smart grouping by the instance dimension.
2. The system identifies three groups: 15 high-load instances, 40 medium-load instances, and 25 low-load instances.
3. The team decides to reroute traffic from the high-load instances to the low-load ones, temporarily postponing the need for scaling.
4. They use the time-based comparison feature to verify the effect of the adjustment.

Use Case 3: Change Impact Evaluation
A development team has released a new version and needs to assess its performance impact. They use the time-based comparison feature:

1. View the core metrics and enable a 1-day-ago time comparison.
2. The curves from before and after the release are overlaid for display.
3. They discover that the P99 latency of a specific API has increased by 20%.
4. Combined with drill-down analysis, they pinpoint a newly added database query as the bottleneck.
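The 1-day-ago comparison in this scenario amounts to pairing each point with the value observed 24 hours earlier and measuring the relative change. The sketch below assumes a series keyed by Unix timestamp in seconds; the function names and sample P99 values are hypothetical.

```python
DAY_SECONDS = 86_400


def compare_with_offset(series, offset_s=DAY_SECONDS):
    """Pair each point with the value observed `offset_s` seconds earlier."""
    return {
        ts: (series[ts - offset_s], value)
        for ts, value in series.items()
        if ts - offset_s in series
    }


def relative_change(before, after):
    """Fractional change from `before` to `after` (0.2 means +20%)."""
    return (after - before) / before


# Hypothetical hourly P99 latency (ms): three points before the release,
# and the same three hours one day later, after the release.
p99_latency_ms = {
    0: 200, 3_600: 210, 7_200: 205,
    86_400: 240, 90_000: 252, 93_600: 246,
}

for ts, (before, after) in sorted(compare_with_offset(p99_latency_ms).items()):
    print(ts, f"{relative_change(before, after):+.0%}")
```

Points without a counterpart one day earlier are skipped, so the overlay only covers the window where both curves have data.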
