1. Background
In modern cloud-native and distributed system O&M scenarios, Prometheus has become the de facto standard for monitoring and alerting systems. With its powerful time-series data analysis capabilities, PromQL serves as a core tool for O&M engineers and developers to diagnose system performance and locate faults. However, the complex syntax and highly structured nature of PromQL place high demands on the user's expertise. The steep learning curve of PromQL prevents O&M personnel from utilizing its functions efficiently, resulting in low monitoring efficiency and misoperation risks. Additionally, exponential growth in metric volumes, driven by cloud-native ecosystem expansion, has rendered manually written query statements inadequate for massive data and dynamic scenarios.
PromQL Copilot, built on Alibaba Cloud Observability platform infrastructure (SLS and CMS) and the Dify framework, implements an end-to-end closed loop from natural language understanding, knowledge graph, query generation, to execution verification. The system delivers comprehensive functionality, including PromQL generation, interpretation, diagnosis, and metric recommendation, and has been deployed in the CloudMonitor console and observability MCP service. This provides users with an intelligent monitoring query experience, empowering enterprises to lower the O&M threshold and improve AIOps capabilities.
2. Challenges and Solutions of Generating PromQL from Natural Language
The core objective of generating PromQL from natural language is to convert a user's natural language input (such as "check instances with service latency exceeding the threshold") into a precise PromQL query statement. However, this process faces multiple technical challenges spanning natural language understanding, domain knowledge integration, and performance optimization over large-scale metric data. The following sections analyze each core challenge and the corresponding solution.
- Polysemantic Resolution: The Contradiction Between Ambiguity of Natural Language and Certainty of PromQL
One of the core challenges for PromQL Copilot is converting highly ambiguous natural language intents into precise PromQL query statements. The contradiction between the ambiguity of natural language and the certainty of PromQL runs through the whole system design process.
In Prometheus scenarios, the ambiguity of natural language mainly manifests in the following forms:
- Context dependency: The same query "service exception" may refer to "HTTP error rate" or "queue backlog" in different scenarios.
- Implicit intent: a vague question such as "How is the system load?" does not specify which metrics to query.
- Label semantic mapping: in "query the CPU usage of each instance," the word "instance" may map to the label `instance` or `instanceID`.
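To make the context-dependency problem concrete, the same phrase "service exception" can resolve to very different queries per domain. The queries below are illustrative sketches: the CDN metric name is hypothetical, while the RDS and Kafka metrics follow common mysqld_exporter and kafka_exporter conventions and may differ in a given environment.

```promql
# CDN: service exception ≈ HTTP 5xx ratio (hypothetical metric name)
sum(rate(cdn_http_requests_total{status=~"5.."}[5m]))
  / sum(rate(cdn_http_requests_total[5m]))

# RDS/MySQL: service exception ≈ slow-query rate (mysqld_exporter convention)
rate(mysql_global_status_slow_queries[5m])

# Kafka: service exception ≈ consumer-group backlog (kafka_exporter convention)
sum by (consumergroup) (kafka_consumergroup_lag)
```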
Solutions:
- In different contexts, the corresponding metrics vary. For example, ECS and RDS both have descriptions of CPU, memory, and disk usage. In Alibaba Cloud CDN, service exceptions prioritize errors in HTTP status codes. However, in RDS, service exceptions focus more on slow queries; in Kafka, they monitor queue consumer offsets. In addition to guiding users to provide more detailed texts (specifying the cloud environment), we have built a metric knowledge base RAG covering all aspects of cloud services and open-source metrics, which can recommend accurate metric metadata for users' questions. For more information about the metric knowledge base, see section 3 of this chapter.
- When the user intent is unclear, we rewrite the query into a standardized question that better fits the Prometheus ecosystem. This removes as much ambiguity as possible and improves the accuracy of data recall in RAG and the LLM. During query rewriting, information is enriched from the user's multiple rounds of questions, extracting the actual question and the corresponding domain labels; for example, the domain labels "Kubernetes" and "container" are extracted from a question about containers.
- The description of labels in natural language cannot be directly converted to PromQL because of possible ambiguity with actual labels. Based on SLS SPL, we have implemented the feature of obtaining the actual metric labels. When using large models to generate PromQL, we can ensure the accurate conversion from natural language to precise labels.
- Domain Knowledge Enhancement: How to Enable Large Models to "Understand" the Prometheus Ecosystem
The Prometheus metric system, query language (PromQL), and ecosystem components form a highly structured domain knowledge network. However, general large models (such as LLaMA and ChatGLM) lack in-depth understanding of this field, leading to issues like semantic deviation, syntax errors, or missing context when directly generating PromQL.
Examples:
- Misuse of functions: translating "average latency" as `avg()` rather than `rate()`.
- Ignoring labels: omitting key labels such as `job="api-server"`, resulting in empty query results.
- Time-series logic errors: confusing instant vectors with range vectors.
Solutions:
- We summarize our experience in writing PromQL statements in the knowledge base, covering common natural language problems and corresponding PromQL writing guidelines. For example, when recalling "How do I use PromQL to find the pod with the largest egress traffic?", the system provides the following PromQL writing instructions. This prompts the large model to achieve higher accuracy.
To find the pod with the largest egress traffic, first calculate the egress traffic rate of each pod. Use `max by (pod_name)(rate(container_network_transmit_bytes_total{}[1m]))` to obtain the traffic rate, and then apply the `topk` function to select the pod with the largest traffic. The final PromQL statement is as follows:
topk(1, max by (pod_name)(rate(container_network_transmit_bytes_total{}[1m])))
- Large models like the Qwen series inherently possess certain capabilities regarding PromQL operators in the Prometheus ecosystem. Therefore, it is more important for large models to understand metrics. To this end, we have built a metric knowledge base that covers Alibaba Cloud service metrics and common open-source metrics (such as Kubernetes system and Kafka).
- Query engine safeguard: PromQL syntax, such as time ranges and label regular expressions, involves many symbol operations, so errors from large models are inevitable. We precheck the generated PromQL with a pre-run query; if it fails, the "PromQL diagnosis" feature repairs it.
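The pre-run safeguard ultimately executes the query against the backend, but a cheap local structural check can reject obviously malformed output first. The sketch below, an assumption about one possible precheck rather than the production code, only validates bracket and quote balance, which is where symbol-heavy PromQL most often breaks.

```python
def precheck_promql(query: str) -> list[str]:
    """Lightweight structural precheck for a generated PromQL string.

    Returns a list of error descriptions; an empty list means the
    brackets and string literals are balanced. Semantic validation
    still requires executing the query against the backend.
    """
    errors: list[str] = []
    pairs = {")": "(", "]": "[", "}": "{"}
    stack: list[str] = []
    in_string: str | None = None
    for i, ch in enumerate(query):
        if in_string:
            if ch == in_string:
                in_string = None
            continue
        if ch in ('"', "'"):
            in_string = ch
        elif ch in "([{":
            stack.append(ch)
        elif ch in ")]}":
            if not stack or stack.pop() != pairs[ch]:
                errors.append(f"unmatched '{ch}' at position {i}")
    if in_string:
        errors.append("unterminated string literal")
    errors.extend(f"unclosed '{c}'" for c in stack)
    return errors
```

A query that fails this check is sent straight to the repair step without wasting a backend round trip.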
- Metric Knowledge Base: Build a Full Cloud Metric Knowledge Base

The Prometheus metric system has shown explosive growth: from Kubernetes core components (such as kube-apiserver and etcd) to custom metrics of microservices, and then to Node Exporter data at the hardware layer, scaling to tens of thousands of metrics. Faced with such a complex metric system, the metric knowledge base has become a core component of PromQL Copilot. It not only serves as a semantic bridge for converting natural language to PromQL but also provides key support for intelligent query recommendation, root cause analysis of exceptions, and cross-service dependency tracing.
Solutions:
- Metric source: Alibaba Cloud Observability has been working on Prometheus for many years. It provides a set of complete infrastructure, covering standard and structured metric systems across cloud services (such as ECS, RDS, and CDN) and open-source components (such as Kubernetes and Istio service mesh). In the AI era, these resources can be quickly integrated into a large model-friendly vectorized format.
- Knowledge base design: The core goal is to accurately convert user questions into metric data, with a large model-friendly data format. Logically, we have designed structures covering the metric field, problems that metrics can solve, and metric metadata (name, description, type, unit, label name, and label meaning).
- Dynamic metric updates: The development of the open-source community and emerging technologies drive frequent changes in metric definitions. For example, with the rapid development of large AI models, the metrics of inference frameworks such as vLLM and SGLang are continuously iterated and updated. The knowledge base of PromQL Copilot achieves dynamic metric updates based on the observability infrastructure and the Dify framework API, synchronizing the latest metrics of cloud services and open-source communities with zero investment.
- In addition, we are also exploring the relationship between entity resources and metrics, as well as AIOps capabilities in areas such as root cause analysis of exceptions and alert analysis.
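The knowledge-base design described above (metric field, problems the metric can solve, and metric metadata) can be sketched as a structured entry. The field names below are illustrative assumptions about the schema, not the production format; the metric itself is the standard cAdvisor counter used earlier in this article.

```python
# A hedged sketch of one knowledge-base entry. Each entry is kept in a
# large-model-friendly structure so it can be serialized into a text
# chunk for embedding and RAG retrieval.
entry = {
    "domain": "Kubernetes",
    "problems": [
        "find the pod with the largest egress traffic",
        "per-pod network throughput",
    ],
    "metric": {
        "name": "container_network_transmit_bytes_total",
        "description": "Cumulative bytes transmitted by the container "
                       "network interface",
        "type": "counter",
        "unit": "bytes",
        "labels": {
            "pod_name": "name of the pod the container belongs to",
            "interface": "network interface name",
        },
    },
}

# Serialize the entry into one retrievable text chunk.
chunk = " | ".join([
    entry["domain"],
    "; ".join(entry["problems"]),
    entry["metric"]["name"],
    entry["metric"]["description"],
])
```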
- Metric Validity: Hit the User's Actual Stored Data

The core value of PromQL Copilot lies in converting natural language intent into executable and valid PromQL queries. However, a syntactically correct PromQL statement does not necessarily hit the data the user actually stores. This challenge breaks down into the following issues:
- The metric knowledge base cannot cover user-defined metrics, and users may have customized business requirements. We cannot directly perceive user data, nor can we store it on the AI side due to security and privacy constraints.
- Diverse Prometheus ecosystem: in addition to cloud service or open-source metrics, users can use recording rules or scheduled SQL to pre-compute data and emit new metrics, which also count as user-defined metrics.
- Runtime label churn: the label key-value pairs of metrics are updated at runtime, and the labels of custom metrics likewise cannot be stored on the AI side.
Solutions:
In addition to using the metric knowledge base, we can also develop data query tools for large models to directly query metrics and label key-value pairs in the user store, and input the user's real-time online data to the large model, thereby obtaining more accurate PromQL.
In large-scale metric queries, the user's real-time online metric metadata may exceed the model's context limit and slow responses, even though two or three metrics are usually enough to answer the question. We therefore designed a filtering method based on domain keywords, balancing efficiency and accuracy.
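The domain-keyword filtering step can be sketched as follows. This is a minimal illustration under assumed data shapes (a list of metadata dicts with `name` and `description` fields); the production filter is more elaborate, but the idea is the same: score each metric by keyword overlap and keep only the top few before building the prompt.

```python
def filter_metrics(metrics: list[dict], domain_keywords: list[str],
                   top_k: int = 3) -> list[dict]:
    """Keep only the top_k metrics whose metadata matches the domain
    keywords extracted during query rewriting, so the prompt stays
    within the model's context limit.
    """
    def score(meta: dict) -> int:
        text = " ".join([meta["name"], meta.get("description", "")]).lower()
        return sum(1 for kw in domain_keywords if kw.lower() in text)

    ranked = sorted(metrics, key=score, reverse=True)
    # Drop entries with no keyword overlap even if top_k is not filled.
    return [m for m in ranked[:top_k] if score(m) > 0]
```

Only the surviving handful of metric descriptions is injected into the prompt, keeping both latency and token cost bounded.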
3. Procedure and Common Scenarios
- Using PromQL Copilot in the CloudMonitor Console

PromQL Copilot has been fully deployed in CloudMonitor and can be used in Prometheus instances.

- Using the Observability MCP Server
Alibaba Cloud Observability provides a unified MCP service, which can be obtained from the code repository on GitHub. This section takes Cherry Studio as an example to show how to use natural language to generate PromQL tools and query data.
Observability MCP Server Configuration
Configuring the observability MCP server is no different from configuring any common MCP server. For details, see the open-source documentation.
Introduction to PromQL MCP Tools
After configuring the MCP Server, we can directly use observability-related tools. The following table shows the tools used in this section:
PromQL Generation from Natural Language and Query Execution
Using the qwen-max-latest model locally, enter "Query the number of pods in the arms-prom namespace in the last 15 minutes" to watch the large model think, plan, and execute based on the prompts from the MCP tools.
The details are as follows. The tool execution error occurs because the large model does not handle the symbol issues well when using PromQL as an input parameter. However, the large model itself has a self-repair capability, enabling it to execute smoothly and obtain results.
- Common Scenarios of Generating PromQL from Natural Language

PromQL Copilot effectively answers users' questions about both open-source metrics and cloud service metrics. The following are some usage examples.
Open-source Metric Query and Calculation
This demonstrates the use of cAdvisor and kube-state-metrics (KSM) metrics in the Kubernetes scenario.
Container scenarios: query the memory usage of pods
Container scenarios: query unhealthy pods
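Representative queries for the two container scenarios above. These are sketches using standard cAdvisor and kube-state-metrics metric names; label names (`pod` vs. `pod_name`) and phase conventions vary by environment, which is exactly the kind of variation the label-resolution step handles.

```promql
# Per-pod memory usage (cAdvisor working-set bytes)
sum by (pod) (container_memory_working_set_bytes{container!=""})

# Unhealthy pods: pods stuck outside the Running/Succeeded phases (KSM)
kube_pod_status_phase{phase=~"Pending|Failed|Unknown"} == 1
```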
Varying resource granularity in open-source metrics may yield different yet valid answers to the same question.
Cloud Service Metric Query and Calculation
CloudMonitor supports monitoring of hundreds of cloud services and middleware, each with its own defined metrics. We select Elastic Compute Service (ECS), ApsaraDB RDS, Alibaba Cloud Content Delivery Network (CDN), and Server Load Balancer (SLB), spanning computing, storage, and networking, for demonstration.
Query the CPU usage of an ECS instance.
Query the memory usage of an ECS instance.
Query the CPU usage of ApsaraDB RDS for MySQL.
As the core resource of a database, CPU is a key focus in daily O&M. Sustained high CPU utilization slows database responses and can cause business losses.
Query the traffic usage of the CDN service.
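The cloud-service queries above follow the same pattern. The metric names below are hypothetical placeholders for illustration only; the actual names come from the CloudMonitor metric knowledge base for each service.

```promql
# ECS: average CPU usage per instance (hypothetical metric name)
avg by (instanceId) (ecs_cpu_utilization)

# RDS for MySQL: CPU usage per instance (hypothetical metric name)
avg by (instanceId) (rds_cpu_usage)

# CDN: total traffic rate over 5 minutes (hypothetical metric name)
sum(rate(cdn_traffic_bytes_total[5m]))
```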
4. Outlook
In the future, we will continue to optimize the accuracy, latency, and user experience of generating PromQL from natural language. Here are the aspects being optimized:
- Output the thinking process of large models to reduce the latency of the first token.
- Enhance the metric knowledge graph with PromQL operator recommendations and multi-metric association recommendations.
- Refine prompts and domain knowledge to improve the output accuracy of PromQL.
- Optimize the output format of the MCP tool for PromQL execution.