ObservabilityGuy

Posted on Jul 2

From API to AI Agent: Alibaba Cloud CloudMonitor Command-line Interface (CLI) + Agent Skill in Practice

#cloudmonitor #ai #beginners

This article introduces the Alibaba Cloud CloudMonitor CLI and Agent Skill to enable AI agents to automate observability O&M workflows.

You can use the official CloudMonitor CLI + Agent Skill to allow an AI Agent to securely execute observability O&M jobs.

30-Second Overview

Alibaba Cloud CloudMonitor CLI (aliyun cms2) consolidates the integration, configuration, query, alerting, and management event capabilities of the Cloud Monitor Service (CMS) 2.0 console into a single command-line entry point. CMS Agent Skill organizes these commands into business workflows for AI Agents.

In the past, O&M automation often started from APIs: reading documentation, assembling parameters, writing scripts, and invoking APIs. With the CloudMonitor CLI + Agent Skill, these capabilities are organized into standardized workflows that AI Agents can understand, execute, and validate.

For O&M engineers, the value is not "one more tool". You describe an O&M target in natural language, and the AI Agent handles scenario understanding, CLI invocation, API execution, and result validation. Repetitive, multi-step, error-prone observability operations become confirmable, auditable, and reusable automated flows.

Why You Need CLI + Agent Skill

As business scale and cloud infrastructure keep growing, observability O&M spans the end-to-end flow of resource integration, metric and log collection, alert governance, trace troubleshooting, root cause analysis, and stability operations, and the workload and operational complexity rise accordingly. At the same time, AI Agents, with their strong language understanding and task orchestration capabilities, are becoming a new entry point for O&M collaboration. More and more teams are handing repetitive, standardized, multi-step jobs to Agents for assisted execution, and handing complex troubleshooting to AI for assisted analysis.

However, for AI Agents to truly join the production O&M closed loop, they cannot stop at understanding problems and generating suggestions or scripts. They also need a stable execution entry point for CloudMonitor capabilities, standardized domain workflows, necessary manual confirmations, and verifiable execution results. The CloudMonitor CLI + Agent Skill is the capability suite built for exactly this requirement.

CLI + Skill Solutions

Alibaba Cloud CloudMonitor CLI (aliyun cms2) provides a unified, stable, and auditable capability entry point. CMS Agent Skill consolidates the business semantics and operation flows of the CloudMonitor domain into workflows that AI Agents can understand and execute. Working together, they let an AI Agent start from a natural language instruction such as "help me integrate this Container Service for Kubernetes (ACK) cluster into CloudMonitor" and automatically complete scenario detection, parameter generation, CLI invocation, API execution, and result validation.

Unified command tree:
The CLI already covers the Integration Center, Prometheus service, application monitoring, Real User Monitoring, alerting center, and Event Center capabilities of the CMS 2.0 console. Synthetic Monitoring and Grafana dashboards are planned next, with the goal of complete coverage of the CMS 2.0 console.
Native adaptation for AI Agents:
- It provides standardized, clear, and detailed --help information, and supports auxiliary options such as --show-schema and --show-example-body to help the AI handle different business scenarios accurately.
- By default, it uses -o text to output compact comma-separated values (CSV), which significantly reduces AI token consumption.
- It returns structured JavaScript Object Notation (JSON) error codes, so Agents can decide on and apply a fix automatically based on the cause of the failure.
Skill-driven:
The accompanying Skill documents capture complete business workflows, allowing Agents to complete complex multi-step operations without hard-coding.
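
To make the compact CSV output and the structured JSON errors more concrete, here is a minimal Python sketch of how an Agent-side wrapper might consume them. Everything specific in it is an assumption for illustration: the column names in the sample, the presence of a header row, the top-level `code` field, and the error code strings are hypothetical and should be checked against what the CLI actually returns.

```python
import csv
import io
import json

# Hypothetical mapping from an error code to the Agent's next step.
# These code names are illustrative, not the documented cms2 error codes.
NEXT_ACTION = {
    "Throttling": "retry",
    "InvalidParameter": "fix_parameters",
    "Forbidden": "ask_user",
}


def parse_text_output(text: str) -> list[dict[str, str]]:
    """Parse compact CSV output, assuming the first row is a header."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def decide_next_action(error_json: str) -> str:
    """Choose a follow-up action from a structured JSON error.

    Assumes a top-level "code" field. Unknown or malformed errors are
    escalated to the user instead of being retried blindly.
    """
    try:
        error = json.loads(error_json)
    except json.JSONDecodeError:
        return "ask_user"
    if not isinstance(error, dict):
        return "ask_user"
    code = error.get("code")
    if not isinstance(code, str):
        return "ask_user"
    return NEXT_ACTION.get(code, "ask_user")


# Hypothetical sample output; real column names may differ.
sample_output = (
    "instanceId,name,region\n"
    "prom-1,prod,cn-hangzhou\n"
    "prom-2,test,cn-hangzhou\n"
)
rows = parse_text_output(sample_output)
print([row["instanceId"] for row in rows])
print(decide_next_action('{"code": "Throttling", "message": "slow down"}'))
```

Escalating unknown errors to the user by default keeps the Agent from looping on a failure it does not understand.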

CLI + Skill Workflow

For O&M engineers, the most visible change is that they no longer start from console entries or API parameters. They start from a clear O&M target, and the Agent carries out the subsequent execution and validation according to standard flows. The core of this workflow is "controllable automation": the Agent does not bypass the O&M system, but executes operations through the unified CLI entry point and the business rules consolidated in the Skill. This reduces repetitive labor while retaining the necessary permissions, confirmations, and audit boundaries.
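
One way to picture "controllable automation" is a confirmation gate in front of every state-changing command. The Python sketch below is an illustration under stated assumptions, not the Skill's actual mechanism: the verbs come from the command tree in the appendix, while the split into mutating and read-only verbs and the helper names (`needs_confirmation`, `run_with_gate`) are hypothetical.

```python
from typing import Callable

# Verbs listed in the appendix command tree that change state.
# Treating exactly these as "needs confirmation" is this sketch's assumption.
MUTATING_VERBS = {
    "create", "update", "patch", "delete",
    "enable", "disable", "start", "stop", "apply",
}


def command_verb(argv: list[str]) -> str:
    """Return the last positional token that appears before the first flag."""
    positional = []
    for token in argv:
        if token.startswith("-"):
            break
        positional.append(token)
    return positional[-1] if positional else ""


def needs_confirmation(argv: list[str]) -> bool:
    """True when the command's verb is one that modifies resources."""
    return command_verb(argv) in MUTATING_VERBS


def run_with_gate(
    argv: list[str],
    confirm: Callable[[list[str]], bool],
    execute: Callable[[list[str]], None],
) -> str:
    """Run read-only commands directly; ask before running mutating ones."""
    if needs_confirmation(argv) and not confirm(argv):
        return "skipped"
    execute(argv)
    return "executed"


audit_log: list[list[str]] = []
deny_all = lambda argv: False  # stand-in for a human who declines

print(run_with_gate(["aliyun", "cms2", "alert", "rule", "list"], deny_all, audit_log.append))
print(run_with_gate(["aliyun", "cms2", "alert", "rule", "delete"], deny_all, audit_log.append))
```

Passing `confirm` and `execute` as callables keeps the gate independent of how the confirmation is collected and how the CLI is invoked, and the `execute` hook doubles as a simple audit trail.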

Installation and Configuration
Install Skill/CLI

You can open the alibabacloud-cms-manage Skill on the Alibaba Cloud Agent Skills portal and follow the instructions on the interface to install the Skill.

After the installation is complete, when the AI Agent uses the Skill, it automatically checks for the Alibaba Cloud CLI and the cms2 plugin and guides you through installing or updating them to the required version. You do not need to handle environment dependencies manually.

# Verify that the CLI installation succeeded
aliyun version
# Verify that the cms2 plugin is active
aliyun cms2 --help
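
If you want to run the same two checks from a script, for example before handing a task to an Agent, a small preflight function is enough. This is a sketch only: it assumes that a non-zero exit code signals a missing or broken installation, which is worth confirming against your CLI version.

```python
import shutil
import subprocess


def check_cli(binary: str = "aliyun") -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    if shutil.which(binary) is None:
        return [f"{binary} not found on PATH"]

    problems = []
    # The same two commands as the manual verification above.
    for args in (["version"], ["cms2", "--help"]):
        command = " ".join([binary, *args])
        try:
            result = subprocess.run(
                [binary, *args],
                stdin=subprocess.DEVNULL,  # never wait for interactive input
                capture_output=True,
                text=True,
                timeout=30,
            )
        except (subprocess.TimeoutExpired, OSError) as exc:
            problems.append(f"'{command}' could not be run: {exc}")
            continue
        # Assumption: a non-zero exit code means the check failed.
        if result.returncode != 0:
            problems.append(f"'{command}' exited with code {result.returncode}")
    return problems


print(check_cli() or "aliyun CLI and cms2 plugin respond")
```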

Configure Credentials
Multiple credential types are supported, such as AccessKey pairs and Security Token Service (STS) tokens. For more information, see Configure identity credentials for Alibaba Cloud CLI.

# Interactive configuration (recommended for first-time use)
aliyun configure

# Non-interactive configuration
aliyun configure set \
--access-key-id YOUR_AK \
--access-key-secret YOUR_SK \
--region cn-hangzhou

Practical Scenario 1 (Integration Center): Integrate Container Service for Kubernetes (ACK) Clusters into CloudMonitor

Business Scenario
The Site Reliability Engineering (SRE) team created an ACK cluster to deploy microservices. The SRE team needs to integrate the metrics of the cluster, such as nodes, pods, and containers, into CloudMonitor.

Usage
You only need to enter the following text in the AI Agent conversation:

Help me check which container clusters in Hangzhou do not have observability capabilities, and help me integrate them.

The Agent automatically completes the entire integration flow. Users only need to confirm at key steps.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for Integration Center
Integrate by resource group: Integrate all Relational Database Service (RDS) instances in the Beijing region under the default resource group into {workspace} of CloudMonitor.

Integrate by tag: Integrate all Elastic Compute Service (ECS) instances that match the tag key={tagKey} and value={tagValue} into {workspace} of CloudMonitor.

Integrate across accounts: Integrate all AI gateways in the Shanghai region of {resource directory member accounts uid} into CloudMonitor.

Monitoring widget deployment: Add the ACK Cost Insight widget to the integration policy {policy id/name}.

Metric collection target check: Check whether the apiserver-related scrape targets of the ACK cluster {cluster Id/name} are normal.

Custom collection rule query: Query the serviceMonitor/podMonitor/customJob list of the integration policy {policy id/name}.

Practical Scenario 2 (Alerting Center): Intelligent Alert Rule Management

Business Scenario
The SRE team needs to build a comprehensive alerting system for the production environment, for example by configuring professional-grade alert rules for the nodes of a container service cluster.

Usage
The following is a typical conversation sample:

What recommendations do you have for container alerting? Then help me apply them.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for Alerting Center
Intelligently analyze alert rules: Analyze whether the existing alerts are configured reasonably and whether there is alert noise. If a configuration is unreasonable, modify it with one click.

Query alert rules: Query all running alert rules of cloud service monitoring in the workspace {workspace}.

Modify alert rule contacts: Change the notification recipient of the alert rule {rule ID/Name} to {contact}.

Delete an alert rule: Delete the {rule Name} alert rule of the Prometheus instance {instance ID/Name}.

Query alerting history: Query the alerting history of the alert rule {rule ID/Name} within the past week.

Practical Scenario 3 (Prometheus Service): Prometheus Instance Management and Data Queries

Business Scenario
The O&M team needs to manage multiple Prometheus instances, analyze metrics and business health status, and configure Recording Rules to pre-aggregate high-frequency metrics.

Usage
The following are typical dialogue samples:

Help me check which Prometheus instances are available in Hangzhou, and group them by workspace.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for the Prometheus Service
Modify the storage duration of a Prometheus instance: Change the storage duration of the Prometheus instance {instance ID/Name} to 90 days and the archive duration to 180 days.

Create a recording rule: Create a recording rule under the Prometheus instance {instance ID/Name} to pre-aggregate the 5-minute average CPU utilization of each edge zone.

Stop a recording rule: Stop the {aggregation Job Name} pre-aggregation job under the Prometheus instance {instance ID/Name}.

Create a Prometheus aggregation view: Create an aggregation view {aggregation view Name} that contains all Prometheus instances in the {area Name} region under the workspace {workspace}.
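
For the recording rule prompt, it helps to see what "pre-aggregate the 5-minute average CPU utilization of each zone" could translate to. The sketch below builds a standard Prometheus recording rule (a `record` name plus a PromQL `expr`). The metric name `cpu_utilization` and the label `zone` are hypothetical placeholders, and how the cms2 recording-rule command expects the rule to be passed is not covered here.

```python
def recording_rule(metric: str, group_label: str, window: str = "5m") -> dict[str, str]:
    """Build a recording rule that averages a gauge per group over a window.

    avg_over_time() averages each series over the window, then
    avg by (...) averages those results across the series in each group.
    """
    expr = f"avg by ({group_label}) (avg_over_time({metric}[{window}]))"
    # Follows the common Prometheus naming pattern level:metric:operations.
    record = f"{group_label}:{metric}:avg_{window}"
    return {"record": record, "expr": expr}


# Hypothetical metric and label names, for illustration only.
rule = recording_rule("cpu_utilization", "zone")
print(rule["record"])
print(rule["expr"])
```

Dashboards and alerts can then query the short `record` name instead of recomputing the aggregation on every evaluation.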

Practical Scenario 4 (Application Performance Monitoring (APM)): Application Monitoring/AI Observability Onboarding

The onboarding flow for this scenario includes steps such as initializing the APM infrastructure, obtaining credentials, registering the application, obtaining the configuration template, and verifying the onboarding. The traditional procedure is relatively complex. The CLI + Skill greatly simplifies the flow and enables onboarding through natural language interaction.

Practical Scenario 5 (Data Query): Metadata, Prometheus Query Language (PromQL), and CloudMonitor Basic Metric Queries

Business Scenario
You can query metadata, Prometheus metric data, and CloudMonitor Basic metric data to analyze how the business is running and to troubleshoot faults.

Usage
The following are typical dialogue samples:

List of ECS instances with the highest CPU utilization: Find the 10 ECS instances with the highest CPU utilization in the last half hour.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for Data Queries
RDS slow queries: Query the trend in the number of slow queries with a running time exceeding 1 second in the past 30 minutes.

Waste of container resource requests: Find "zombie" resources in the container cluster that have overly large resource requests but very little actual usage in the past 7 days.

Suspected container pod memory leak: List the pods under {ns} of the container cluster {cluster Name/ID} whose memory usage has continuously increased in the past 1 hour and whose current value exceeds 90% of the limit.
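
The last two prompts describe checks that are easy to state precisely once the metric samples are in hand. The Python sketch below is one possible reading of them, not the Agent's actual logic: "continuously increased" is interpreted as "never drops and ends higher than it started", and the 10% utilization cut-off for wasted requests is an arbitrary illustrative default.

```python
def looks_like_memory_leak(
    samples: list[float], limit: float, threshold: float = 0.9
) -> bool:
    """Flag a pod whose memory never drops over the window and ends near its limit.

    samples: memory usage in time order (for example, the past hour).
    limit:   the pod's memory limit, in the same unit as the samples.
    """
    if len(samples) < 2 or limit <= 0:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    grew = samples[-1] > samples[0]
    return never_drops and grew and samples[-1] > threshold * limit


def is_overprovisioned(
    requested: float, usage: list[float], max_utilization: float = 0.1
) -> bool:
    """Flag a workload whose peak usage stays far below its resource request."""
    if requested <= 0 or not usage:
        return False
    return max(usage) / requested < max_utilization


print(looks_like_memory_leak([700, 800, 900, 950], limit=1000))
print(is_overprovisioned(requested=4.0, usage=[0.1, 0.2, 0.15]))
```

Real memory series are noisy, so a production check would likely tolerate small dips or fit a trend line instead of requiring a strictly non-decreasing series.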

Summary

The Alibaba Cloud CloudMonitor CLI (aliyun cms2) and the accompanying CMS Agent Skill do more than move console and API capabilities to the command line: they give AI agents a standard operation interface for observability O&M. Together they unify capabilities scattered across integration, configuration, query, alerting, and management events, so that O&M engineers can express targets more naturally, execute operations in a more controllable manner, and complete verification and auditing along a clearer path.

For O&M teams, this means that observability work is gradually moving from a manual stage in which "people find entry points, people piece together parameters, and people verify results" to a collaborative stage in which "people define targets, agents orchestrate flows, the CLI executes operations, and AI validates results".

AI is not intended to override O&M judgment. It is meant to significantly reduce the cost of repetitive operations, cross-system collaboration, and complex flow execution, improve the efficiency of troubleshooting and fault localization, and let SREs devote more energy to higher-value work such as stability design, alert governance, and incident review.

In the future, we will continue to expand the scope of the CLI and Skill until they comprehensively cover CloudMonitor business scenarios. In the AI era, the CloudMonitor CLI and Skill aim to become a stable, trusted, and extensible observability capability base between O&M engineers and AI agents, moving automated and intelligent O&M from isolated experiments to large-scale adoption.

Appendix—CMS CLI Command Tree

aliyun cms2
│
│               # Integration domain
├── integration                 Integration (covers the full lifecycle of integration policies, add-on widgets, collection rules, etc.)
│   ├── policy                  Integration policy management, including commands such as create, get, update, delete, and list.
│   ├── storage                 Query Prometheus storage instances attached to integration policies, including commands such as list.
│   ├── dashboard               Query Grafana dashboards associated with integration policies, including commands such as list.
│   ├── resource                Query resources of container service integration policies, including commands such as list.
│   ├── job-target              Query the status of scrape targets of collection jobs of integration policies, including commands such as list.
│   ├── service-monitor         Query Kubernetes ServiceMonitor collection rules of integration policies, including commands such as list.
│   ├── pod-monitor             Query Kubernetes PodMonitor collection rules of integration policies, including commands such as list.
│   ├── custom-job              Query custom Prometheus collection jobs of integration policies, including commands such as list.
│   ├── addon-release           Management of deployed widget instances of integration policies, including commands such as create, get, update, delete, and list.
│   └── addon                   Query the catalog of available integration widgets, including commands such as get and list.
├── workspace                   Workspace Management, including commands such as create, get, list, update, and delete.
│
│               # Application management domain
├── prometheus                  Prometheus service management (includes Prometheus instances, aggregation views, recording rules, etc.)
│   ├── instance                Prometheus instance management, including commands such as create|get|update|delete|list
│   ├── view                    Prometheus aggregation view management, including commands such as create|get|update|delete|list
│   └── recording-rule          Recording rule pre-aggregation management, including commands such as create|get|update|start|stop|delete|list
├── apm                         Application Performance Monitoring (APM) management
│   ├── service                 APM application service management, including commands such as create|get|update|delete|list
│   └── configuration           APM configuration management, including commands such as get|create
├── rum                         Real User Monitoring (RUM) management
│   ├── service                 RUM application service management, including commands such as create|get|update|delete|list
│   └── configuration           RUM configuration management, including commands such as get|create
│
│               # Alerting and management event domain
├── alert                       Alerting center management (including alert rules, alert templates, alert history, etc.)
│   ├── rule                    Alert rule management, including commands such as create|get|update|patch|delete|list|enable|disable
│   ├── template                Alert rule template management, including commands such as list|get|create|update|delete|apply
│   └── history                 Alert trigger and recovery history management, including commands such as list
├── notification-channel        Notification channel management
│   ├── contact                 Alert contact (email, text message, and DingTalk) management, including commands such as list
│   ├── robot                   Alert robot (DingTalk/Lark/WeCom group robot) management, including commands such as list
│   └── webhook                 Webhook address management, including commands such as list
├── event-hub                   Event Center management, including commands such as list|get
│
│               # Data query domain
├── metric                      Metric query
│   ├── promql                  Prometheus Query Language (PromQL) instant/range query and metadata retrieval, including commands such as query|query-range|labels|label-values|series
│   └── basic                   CloudMonitor 1.0 Metric query, including commands such as points|latest|range|top|export
├── trace                       Trace data query, including commands such as search and tree
├── entity                      Cloud resource and EntityStore query, including commands such as query
└── meta                        Metadata query, including commands such as metrics, namespaces, and events
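
An Agent can use a tree like this as a guardrail: before running a generated command, check that its path exists. The Python sketch below transcribes an excerpt of the tree above into nested data and validates a command path against it. The tree lists commands "such as" these, so the real CLI may accept more; treat the excerpt as illustrative, and `aliyun cms2 --help` as the authoritative source.

```python
# Excerpt of the command tree above. Dicts are command groups,
# sets are the verbs available at that level.
COMMAND_TREE = {
    "integration": {
        "policy": {"create", "get", "update", "delete", "list"},
        "addon": {"get", "list"},
    },
    "prometheus": {
        "instance": {"create", "get", "update", "delete", "list"},
        "recording-rule": {"create", "get", "update", "start", "stop", "delete", "list"},
    },
    "alert": {
        "rule": {"create", "get", "update", "patch", "delete", "list", "enable", "disable"},
        "history": {"list"},
    },
    "event-hub": {"list", "get"},
    "metric": {
        "promql": {"query", "query-range", "labels", "label-values", "series"},
        "basic": {"points", "latest", "range", "top", "export"},
    },
    "trace": {"search", "tree"},
}


def is_known_command(path: list[str]) -> bool:
    """Check that path (the tokens after 'aliyun cms2') ends on a known verb."""
    node = COMMAND_TREE
    for token in path:
        if isinstance(node, dict) and token in node:
            node = node[token]   # descend into a group
        elif isinstance(node, set) and token in node:
            node = None          # reached a verb; nothing may follow it
        else:
            return False
    return node is None


print(is_known_command(["alert", "rule", "patch"]))
print(is_known_command(["alert", "rule", "frobnicate"]))
```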

DEV Community