ARMS Continuous Profiling Upgrade for Efficient and Accurate Performance Bottleneck Localization

#tooling #microservices #performance #monitoring

Introduction
As software technology continues to evolve, many enterprise software systems have shifted from monolithic architectures to cloud-native microservices. On one hand, this transformation enables applications to achieve high concurrency, easy scalability, and high development agility. On the other hand, it produces increasingly long call chains and growing dependencies on external components, which makes troubleshooting specific issues extremely challenging.

Although distributed systems and their observability technologies have evolved rapidly over the past decade and resolved many problems, locating specific issues remains difficult. The following figures show several typical examples of common production issues.

Figure 1 Continuous CPU usage peaks

Figure 2 Heap memory space usage

Figure 3 Trace fails to locate the root cause of high latency

Continuous profiling is a technique that collects the method stacks of application threads at the moments they request resources such as CPU time or memory. It then uses visualizations such as flame graphs to show how the usage of each resource is distributed across the code, which helps identify the root cause of resource fluctuations in a given time period.
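To make the idea concrete, the following is a minimal sketch of how sampled stacks can be aggregated into the tree behind a flame graph. It is not the ARMS implementation; the class name FlameNode, the method names, and the sample stacks are all hypothetical. Each node counts how many sampled stacks pass through it, and that count corresponds to the width of the frame in the rendered graph.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical flame-graph tree: each node counts the samples whose stack passes through it. */
class FlameNode {
    final String frame;
    long samples;                                    // width of this frame in the flame graph
    final Map<String, FlameNode> children = new TreeMap<>();

    FlameNode(String frame) {
        this.frame = frame;
    }

    /** Adds one sampled stack, ordered from outermost caller to innermost callee. */
    void addStack(List<String> stack) {
        FlameNode node = this;
        node.samples++;
        for (String f : stack) {
            node = node.children.computeIfAbsent(f, FlameNode::new);
            node.samples++;
        }
    }

    /** Renders the tree as indented text with each frame's integer share of all samples. */
    String render(long total, int depth) {
        StringBuilder sb = new StringBuilder();
        sb.append("  ".repeat(depth)).append(frame)
          .append(' ').append(100 * samples / total).append("%\n");
        for (FlameNode child : children.values()) {
            sb.append(child.render(total, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        FlameNode root = new FlameNode("all");
        // Made-up samples for illustration only.
        root.addStack(List.of("main", "handleRequest", "loadOrders"));
        root.addStack(List.of("main", "handleRequest", "loadOrders"));
        root.addStack(List.of("main", "handleRequest", "renderPage"));
        root.addStack(List.of("main", "flushLogs"));
        System.out.print(root.render(root.samples, 0));
    }
}
```

The widest child at each level is where most samples landed, which is what you follow when reading a flame graph from the root toward the hotspot.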

Out-of-the-box Continuous Profiling
As an application-observability suite of Alibaba Cloud, ARMS introduced continuous profiling as early as 2022 to help users locate common complex performance issues.

Continuous profiling provides the following three core features:

Code hotspot: locates issues by associating wall clock hotspots with trace information.

● If your business is too complex to reproduce occasional slow calls, the code diagnostics feature can simulate code execution and method calls.

● If traces lack instrumentation for methods at non-framework layers, the code diagnostics feature helps you recover the time consumed by those uninstrumented methods.

CPU hotspot: locates issues by periodically collecting snapshots of the method stacks of threads that are running on the CPU.

● When the CPU utilization of your system is high, this feature can quickly locate the business logic method stacks that cause high CPU consumption.
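As an illustration of periodic stack sampling, here is a rough sketch that uses only the standard Thread.getAllStackTraces() API. It is a simplification and not how ARMS collects data: the class CpuSampler is hypothetical, and a thread in the RUNNABLE state is only an approximation of "running on the CPU", because the JVM also reports threads blocked in native I/O as RUNNABLE. Production profilers rely on lower-overhead mechanisms.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sampler: counts the top frame of RUNNABLE threads across periodic snapshots. */
class CpuSampler {

    /** Takes one snapshot and increments the hit count of every RUNNABLE thread's top frame. */
    static void sampleOnce(Map<String, Integer> hits) {
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] stack = e.getValue();
            if (e.getKey().getState() != Thread.State.RUNNABLE || stack.length == 0) {
                continue; // skip waiting or blocked threads and threads with no Java frames
            }
            String top = stack[0].getClassName() + "." + stack[0].getMethodName();
            hits.merge(top, 1, Integer::sum);
        }
    }

    /** Takes the given number of snapshots, pausing intervalMillis between them. */
    static Map<String, Integer> sample(int snapshots, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < snapshots; i++) {
            sampleOnce(hits);
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A daemon thread that burns CPU so the sampler has something to observe.
        Thread busy = new Thread(() -> {
            long x = 0;
            while (true) {
                x += System.nanoTime() % 7;
            }
        }, "busy-worker");
        busy.setDaemon(true);
        busy.start();

        Map<String, Integer> hits = sample(20, 10);
        hits.forEach((frame, n) -> System.out.println(n + " " + frame));
    }
}
```

Frames with the highest hit counts are the ones most often found executing, which is the signal a CPU flame graph visualizes. Note that the sampling thread itself is RUNNABLE and therefore shows up in its own results.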

Memory hotspot: locates issues by capturing a method stack snapshot, together with the allocated memory size and the number of allocations, each time a thread exceeds the heap memory threshold and triggers a memory allocation.

● When the heap memory usage of your JVM is high, this feature can quickly locate the business logic method stacks that request large amounts of heap memory or make a large number of allocation requests.

After extensive customer adoption and continuous optimization over the past few years, continuous profiling has been upgraded for better usability. The following sections introduce each of the key upgrades.

Optimized Storage and Computing Engine: Smooth and Efficient Data Retrieval
Flame graph data structures are complex, which makes both large-volume storage and aggregate computation challenging. As a result, many products in the industry only let you enable profiling temporarily, collect data for a limited period, and aggregate it over short intervals. Although this workflow is cumbersome, it can help locate common performance bottlenecks to some extent. For issues that are difficult to reproduce, however, the cost of problem localization becomes extremely high.

In this upgrade, we significantly optimized both the data format and the query engine, and we expanded the query intervals and target objects that continuous profiling supports. Previously, data could only be aggregated over 1-, 5-, or 15-minute windows. Now, second-level aggregation is supported across dimensions including daily granularity, multiple instances, and multiple threads. Continuous profiling therefore offers both a broader aggregation scope and finer-grained dimensions, which makes it more effective at locating a wide range of performance issues.

AI Copilot-powered Flame Graph Analysis: One-click Insight into Performance Hotspots
User feedback has shown that although flame graphs are highly effective for troubleshooting performance issues, interpreting them is a significant barrier for many customers. To address this, the latest version of continuous profiling supports AI Copilot-powered flame graph analysis, which lets users without flame graph expertise identify performance bottlenecks easily.

Demo

After enabling continuous profiling for an application, select the application in the console. Taking a CPU hotspot issue as an example, click the purple magic wand icon in the upper-right corner of the flame graph page, shown in the following figure, to trigger AI Copilot-powered analysis.

Copilot quickly provides an analysis report and recommendations for the flame graph.

The report shows that the java.util.LinkedList.node(int) method accounts for a large share of the flame graph and consumes a large amount of CPU.
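The article does not show the application code behind this report, so the following is only a sketch of one common pattern that produces such a hotspot. In the JDK, LinkedList.get(int) calls node(int), which walks the list from the nearer end, so an index-based loop over a LinkedList does quadratic work. The class and method names below are hypothetical.

```java
import java.util.LinkedList;
import java.util.List;

/** Hypothetical example of how LinkedList.node(int) can come to dominate a CPU flame graph. */
class LinkedListHotspot {

    /** Indexed loop: every get(i) walks the list from one end, so this is O(n^2) on a LinkedList. */
    static long sumByIndex(List<Integer> values) {
        long sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += values.get(i);
        }
        return sum;
    }

    /** Iterator loop: each element is visited exactly once, so this is O(n) on any List. */
    static long sumByIterator(List<Integer> values) {
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = new LinkedList<>();
        for (int i = 0; i < 50_000; i++) {
            values.add(i);
        }

        long t0 = System.nanoTime();
        long byIndex = sumByIndex(values);
        long t1 = System.nanoTime();
        long byIterator = sumByIterator(values);
        long t2 = System.nanoTime();

        System.out.printf("index loop:    sum=%d in %d ms%n", byIndex, (t1 - t0) / 1_000_000);
        System.out.printf("iterator loop: sum=%d in %d ms%n", byIterator, (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same result. If the real code resembles the indexed loop, typical fixes are to iterate instead of indexing or to switch to an ArrayList when random access is needed.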

In addition to analysis and recommendations, the model can also provide targeted code optimization suggestions if you provide code snippets with key information redacted.

After the application code is adjusted and optimized, the feature can also perform regression verification of the optimization results based on the flame graphs generated by continuous profiling.

Differential Flame Graphs: Accurately Comparing Performance Differences
Differential flame graphs compare performance data from two time periods and render the difference as a single graph, in which red frames indicate performance degradation and blue frames indicate performance improvement. This helps identify functions whose performance changed significantly between the two periods, which is extremely helpful when an application's performance shifts over time. The new version of continuous profiling provides out-of-the-box differential flame graph analysis capabilities.
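The comparison itself can be sketched as follows. This is an illustrative simplification rather than the ARMS algorithm: each profile is assumed to be a map from a semicolon-separated stack to its sample count, and the class FlameDiff and its methods are hypothetical. Counts are normalized to shares so that two periods with different total sample counts remain comparable.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical diff of two profiles, each given as stack -> sample count. */
class FlameDiff {

    /** Returns, per stack, the change in its share of samples (comparison minus baseline). */
    static Map<String, Double> diff(Map<String, Long> baseline, Map<String, Long> comparison) {
        double baseTotal = total(baseline);
        double cmpTotal = total(comparison);

        Set<String> stacks = new HashSet<>(baseline.keySet());
        stacks.addAll(comparison.keySet());

        Map<String, Double> delta = new TreeMap<>();
        for (String stack : stacks) {
            double before = baseTotal == 0 ? 0 : baseline.getOrDefault(stack, 0L) / baseTotal;
            double after = cmpTotal == 0 ? 0 : comparison.getOrDefault(stack, 0L) / cmpTotal;
            delta.put(stack, after - before);
        }
        return delta;
    }

    /** Sums all sample counts in a profile. */
    static double total(Map<String, Long> profile) {
        return profile.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Maps a delta to the color convention described above: red is worse, blue is better. */
    static String color(double delta) {
        return delta > 0 ? "red" : delta < 0 ? "blue" : "neutral";
    }

    public static void main(String[] args) {
        // Made-up sample counts for illustration only.
        Map<String, Long> before = Map.of("main;handleRequest;loadOrders", 40L, "main;flushLogs", 60L);
        Map<String, Long> after = Map.of("main;handleRequest;loadOrders", 150L, "main;flushLogs", 50L);
        diff(before, after).forEach((stack, d) ->
                System.out.printf("%-32s %+.2f %s%n", stack, d, color(d)));
    }
}
```

A stack whose share grew between the two periods is colored red and is a candidate for the regression. A stack whose share shrank is colored blue.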

Demo

After enabling continuous profiling for an application, select the application in the console. Taking CPU hotspot localization as an example, click the "Data Comparison" button in the upper-left corner of the page to generate a differential flame graph that compares the performance of two time periods.

You can then use AI Copilot to analyze the generated differential flame graph and view the hotspots that differ between the two time periods.