DEV Community: ObservabilityGuy

Alibaba Cloud Launched a Global Intelligent O&M Platform from "Passive Fire Fighting" to "Active Autonomy"

ObservabilityGuy — Mon, 06 Jul 2026 05:23:52 +0000

This article introduces Alibaba Cloud's STAROps, an AI-native intelligent operations platform that leverages autonomous agents to transition IT manage...

On May 20, Alibaba Cloud officially released the AI-native global intelligent O&M platform STAROps.

The platform uses large model and agent technology as the core engine and Alibaba Cloud observable product system as the data base to deeply integrate cross-domain observable data with large language model reasoning capability. Users only need to define operation and maintenance objectives in natural language, and operation and maintenance agents can independently complete the full closed loop of dynamic planning, safe execution and result verification.

STAROps is designed around four capability dimensions: Sense global perception, Target goal orientation, Autonomy autonomous operationand maintenance, and Resilience business continuity. It provides three core functions:

Intelligent Assistant directly converts natural language into unified query and diagnosis results of cross-domain observation data. Alert analysis, metric interpretation, and log diagnosis are completed in one dialogue window without multi-platform switching.
The long-term task mechanism allows Agent to take over high-frequency repetitive work such as inspection, alarm analysis, periodic reporting, etc., and can independently execute cross-day or even cross-month asynchronous operation and maintenance plans once aligned with predefined objectives.
Digital employees enable enterprises to build dedicated SRE intelligence for each team, customize responsibilities, permissions, and tool sets, and solidify expert experience into role rules at one time.

At the technical architecture level, the competitiveness of STAROps is reflected in four dimensions.

Unified Observability Data

Unifies logs, metrics, traces, events, topology, and changes with PB-scale daily ingestion, EB-scale storage, low-latency analysis, multi-AZ deployment, and 99.95% reliability.

Operational Digital Twin

Builds a unified graph model (UModel) from entities, relationships, observability data, and operational knowledge, helping agents understand systems, trace blast radiuses, and reason about root causes in a shared context.

AI Analytics Operators

Supports anomaly detection, log clustering, trace analysis, performance profiling, and change analysis, reducing the cost of processing massive raw data while improving diagnostic efficiency and result stability.

Continuous Improvement Flywheel

Builds a realistic evaluation loop with simulation, fault injection, diagnostic assessment, and feedback, creating a measurable, roll-back-ready system for continuous agent improvement.

The essence of cloud computing lies in orchestrating computing resources as a service in an efficient way, and what STAROps is doing is extending this principle to operations and maintenance. Manpower-intensive O&M tasks are intelligently performed by using agents to schedule large-scale O&M operations. The digital employee mechanism of STAROps provides enterprises with this progressive path: it not only supports embedding AI in existing processes to improve efficiency, but also supports building a new agent native O&M mode.
**
In terms of access form, STAROps provides a variety of access solutions such as **OpenAPI and MCP integration, page embedding, and mainstream IM access. Enterprises can release value in existing workflows at the lowest migration cost. The built-in manual approval mechanism of the platform ensures that key decision nodes are still under manual control, striking a balance between the efficiency of agent independent execution and security compliance.

Along with the product release, Alibaba Cloud synchronizes the open source UModel unified data model project with the RCA-100 evaluation benchmark set, and jointly launched the "Enterprise Common Semantic Standard Industry Initiative" with more than 10 industry partners and academic institutions such as the Institute of Information and Communications Technology, Xiaopeng Automobile, and the Software Institute of the Chinese Academy of Sciences.

Currently, STAROps has been officially launched on the Alibaba Cloud official website. As AI reshapes every aspect of software development, O&M, as the last line of defense to ensure business resilience, is ushering in a paradigm transition from tool assistance to agent autonomy. Alibaba Cloud uses STAROps as a starting point to push Agentic Ops from concept to production-level implementation.

From API to AI Agent: Alibaba Cloud CloudMonitor Command-line Interface (CLI) + Agent Skill in Practice

ObservabilityGuy — Thu, 02 Jul 2026 10:14:34 +0000

This article introduces the Alibaba Cloud CloudMonitor CLI and Agent Skill to enable AI agents to automate observability O&M workflows.

You can use the official CloudMonitor CLI + Agent Skill to allow an AI Agent to securely execute observability O&M jobs.

30 Second Overview

Alibaba Cloud CloudMonitor CLI (aliyun cms2) consolidates capabilities such as integration, configuration, queries, alerting, and management events in the Cloud Monitor Service (CMS) 2.0 console into a unified command line entry. CMS Agent Skill organizes these commands into business workflows for AI Agents.

In the past, O&M automation often started from APIs: querying documents, assembling parameters, writing scripts, and invoking APIs. Now, by using the CloudMonitor CLI + Agent Skill, these capabilities can be organized into standardized workflows that AI Agents can understand, execute, and validate.

For O&M engineers, its value is not "one more tool", but allowing you to describe O&M targets in natural language, and having the AI Agent complete scenario understanding, CLI invocations, API executions, and result validations, transforming repetitive, multi-step, and error-prone observability O&M operations into confirmable, auditable, and reusable automated flows.

Why You Need CLI + Agent Skill

With the continuous growth of business size and infrastructure on the cloud, observability O&M runs through the end-to-end flows of resource integration, metric/log collection, alerting administration, link troubleshooting, Root Cause Analysis, and stability operations, and the O&M workload and operation complexity rise accordingly. At the same time, relying on powerful language understanding and job orchestration capabilities, AI Agents are becoming a new O&M collaboration entry. More and more teams are starting to try handing over repetitive, standardized, and multi-step jobs to Agents for assisted execution, and handing over complex problem troubleshooting to AI for assisted analysis.

However, to allow AI Agents to truly enter the production O&M closed loop, they cannot just stay at the stage of "understanding problems, generating suggestions, or scripts". A stable execution entry for CloudMonitor capabilities, standardized realm flows, necessary manual confirmations, and verifiable execution results are also required. The CloudMonitor CLI + Agent Skill is exactly the capability suite built for this requirement.

CLI + Skill Solutions

Alibaba Cloud CloudMonitor CLI (aliyun cms2) provides a unified, stable, and auditable capability entry. CMS Agent Skill consolidates the business semantics and operation flows in the CloudMonitor realm into workflows that AI Agents can understand and execute. By coordinating the two, AI Agents can start from natural language instructions such as "help me integrate this Container Service for Kubernetes (ACK) cluster into CloudMonitor", and automatically complete scenario detection, parameter generation, CLI invocations, API executions, and result validations.

Unified command tree:
The CLI has covered capabilities such as Integration Center, Prometheus service, application monitoring, Real User Monitoring, alerting center, and Event Center in the CMS 2.0 console. In the future, it will continue to cover capabilities such as Synthetic Monitoring and Grafana dashboards to achieve complete coverage of the CMS 2.0 console.
Native adaptation for AI Agent:
- It provides standardized, clear, and detailed --help information, and supports auxiliary capabilities such as --show-schema and --show-example-body to help AI accurately process various business scenarios.
- By default, it uses -o text to output compact Comma-Separated Values (CSV), significantly reducing AI token consumption.
- Through structured JavaScript Object Notation (JSON) error codes, it supports Agents in automatically making decisions and repairing based on fault reasons.
Skill driver:
The supporting Skill documents consolidate complete business workflows, allowing Agents to complete complex multi-step operations without hard coding.

CLI + Skill Workflow

For O&M engineers, the most intuitive change is: instead of starting operations from console entries or API parameters, they start from a clear O&M target, and the Agent completes subsequent executions and validations according to standard flows. The core of this link is "controllable automation": the Agent will not bypass the O&M system, but executes operations through the unified CLI entry and the business rules consolidated in the Skill. This can not only reduce repetitive labor, but also retain necessary permissions, confirmations, and audit borders.

Installation and Configuration
Install Skill/CLI

You can open the alibabacloud-cms-manage Skill on the Alibaba Cloud Agent Skills portal and follow the instructions on the interface to install the Skill.

After the installation is complete, when the AI Agent uses the Skill, the AI Agent automatically detects and guides you to install or update the Alibaba Cloud command-line interface (CLI) and the cms2 plugin to the required version. You do not need to manually handle environment dependencies.

# Verify that the CLI installation succeeded
aliyun version
# Verify that the cms2 plugin is active
aliyun cms2 --help

Configure Credentials
Multiple credential types such as AccessKey and Security Token Service token are supported. For more information, see Configure identity credentials for Alibaba Cloud CLI.

# Interactive configuration (recommended for first-time use)
aliyun configure

# Non-interactive configuration
aliyun configure set \
--access-key-id YOUR_AK \
--access-key-secret YOUR_SK \
--region cn-hangzhou

Practical Scenario 1 (Integration Center): Integrate Container Service for Kubernetes (ACK) Clusters into CloudMonitor

Business Scenario
The Site Reliability Engineering (SRE) team created an ACK cluster to deploy microservices. The SRE team needs to integrate the metrics of the cluster, such as nodes, pods, and containers, into CloudMonitor.

Usage
You only need to enter the following text in the AI Agent conversation:

Help me check which container clusters in Hangzhou do not have observability capabilities, and help me integrate them.

The Agent automatically completes the entire integration flow. Users only need to confirm at key steps.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for Integration Center
Integrate by resource group: You can integrate all Relational Database Service (RDS) instances in the Beijing area under the default resource group into {workspace} of CloudMonitor.

Integrate by label: You can integrate all Elastic Compute Service (ECS) instances that match the label key={tagKey} and value={tagValue} into {workspace} of CloudMonitor.

Integrate across accounts: You can integrate all AI gateways in the Shanghai area of {resource directory member accounts uid} into CloudMonitor.

Monitoring widget deployment: You can add the integration of the ACK cost Insight widget in the integration policy {policy id/name}.

Metric collection target check: You can check whether the apiserver-related scrape targets of the ACK cluster {cluster Id/name} are normal.

Custom collection rule query: You can query the serviceMonitor/podMonitor/customJob list of the integration policy {policy id/name}.

Practical Scenario 2 (Alerting Center): Intelligent Alert Rule Management

Business scenario
The SRE needs to establish a comprehensive alerting system for the production environment. For example, the SRE configures professional node alert rules for container service cluster nodes.

Usage
The following is a typical conversation sample:

What recommendations do you have for container alerting? Then help me apply them.

Core Capabilities Supported by the AI Agent

Common Scenarios and Prompt Samples for Alerting Center
Intelligently analyze alert rules: You can analyze whether the existing alerts are configured reasonably and whether alert noise exists. If the configuration is unreasonable, you can modify the configuration with one-click.

Query alert rules: You can query all running alert rules of cloud service monitoring in the workspace {workspace}.

Modify Alert Rule contacts: You can change the Notification Recipient of the Alert Rule {rule ID/Name} to {contact}.

Delete an Alert Rule: You can delete the {rule Name} Alert Rule of the Prometheus instance {instance ID/Name}.

Query alerting history: You can query the alerting history of the Alert Rule {rule ID/Name} within 1 week.

Practical Scenario 3 (Prometheus service): Prometheus Instance Management and Data queries

Business Scenario
The O&M team needs to manage multiple Prometheus instances, analyze metrics and business health status, and configure Recording Rules to pre-aggregate high-frequency metrics.

Usage
The following are typical dialogue samples:

Help me check which Prometheus instances are available in Hangzhou, and group them by workspace.

The Following Is an Overview of the Core Capabilities Supported by the Agent

Common Scenarios and Prompt Samples for the Prometheus Service
Modify the storage duration of a Prometheus instance: You can modify the storage duration of the Prometheus instance {instance ID/Name} to 90 Days, and the archive duration to 180 Days.

Create a Recording Rule: You can create a Recording Rule under the Prometheus instance {instance ID/Name} to pre-aggregate the 5 minute average CPU utilization of each edge zone.

Stop a Recording Rule: You can stop the {aggregation Job Name} pre-aggregation Job under the Prometheus instance {instance ID/Name}.

Create a Prometheus aggregation view: You can create an aggregation view {aggregation view Name} that contains all Prometheus instances in the {area Name} area under the {workspace} Space.

Practical Scenario 4 (Application Performance Monitoring (APM)): Application Monitoring/AI Observability Onboarding

The onboarding flow for this scenario includes steps such as initializing the APM infrastructure, obtaining Credentials, registering the application, obtaining the configuration template, and authenticating the onboarding. The traditional onboarding procedure is relatively complex. Using the command-line interface (CLI) + Skill can greatly simplify the flow and achieve natural language interactive onboarding.

Practical scenario 5 (Data query): metadata, Prometheus Query Language (PromQL), and CloudMonitor Basic metric queries

Business Scenario
You can query metadata, Prometheus metric Data, and CloudMonitor Basic metric Data to analyze business running conditions and troubleshoot faults or problems.

Usage
The following are typical dialogue samples:

List of ECS instances with the highest CPU utilization: You can find the 10 ECS instances with the highest CPU utilization in the last half hour.

The Following Is an Overview of the Core Capabilities Supported by the Agent

Common Scenarios and Prompt Samples for Data Queries
RDS slow queries: You can query the Quantity Trend of slow queries with a running time exceeding 1 second in the past 30 minutes.

Waste of container resource Requests: You can find "zombie" resources in the container cluster that have overly large resource requests but very little actual usage in the past 7 Days.

Suspected container pod memory leak: You can find the List of pods under {ns} of the container cluster {cluster Name/ID} whose memory usage has continuously increased in the past 1 hour and whose current value exceeds 90% of the limit.

Summary

Alibaba Cloud CloudMonitor command-line interface (CLI) (aliyun cms2) and the accompanying CMS Agent Skill do not just migrate console and API capabilities to the command line, but also build a standard operation interface for AI agents for observable O&M. It unifies the capabilities scattered across scenarios such as provisioning, configuration, query, alerting, and management events. This allows O&M engineers to express targets more naturally, execute operations in a more controllable manner, and complete authentication and auditing with a clearer link.

For O&M teams, this means that observability construction is gradually moving from the manual stage of "people finding entrances, people piecing together parameters, and people performing authentication" to the collaborative stage of "people defining targets, agents orchestrating flows, CLI executing operations, and AI validating results".

AI is not intended to override O&M judgments, but to significantly reduce the costs of repetitive operations, cross-system collaboration, and complex flow execution, improve the efficiency of troubleshooting and fault localization, and allow SREs to devote more energy to higher-value work such as stability design, alerting administration, and fault review.

In the future, we will continue to enrich the capability scope of CLI and Skill, and comprehensively overwrite CloudMonitor business scenarios. In the AI era, CloudMonitor CLI and Skill hope to become a stable, trusted, and extensible observability capability base between O&M engineers and AI agents, promoting automated and Intelligent O&M from single-point attempts to large-scale implementation.

Appendix—CMS CLI Command Tree

aliyun cms2
│
│                # Integration domain
├── integration                 Integration (includes the full lifecycle of provisioning policies, add-on widgets, collection rules, etc.)
│   ├── policy                  Integration policy Management, including commands such as create, get, update, delete, and list.
│   ├── storage                 Query Prometheus storage instances attached to provisioning policies, including commands such as list.
│   ├── dashboard               Query Grafana dashboards associated with provisioning policies, including commands such as list.
│   ├── resource                Query resources of container service class provisioning policies, including commands such as list.
│   ├── job-target              Query the Status of scrape targets of Collection Tasks of provisioning policies, including commands such as list.
│   ├── service-monitor         Query Kubernetes ServiceMonitor collection rules of provisioning policies, including commands such as list.
│   ├── pod-monitor             Query Kubernetes PodMonitor collection rules of provisioning policies, including commands such as list.
│   ├── custom-job              Query Custom Prometheus Collection Jobs of provisioning policies, including commands such as list.
│   ├── addon-release           Management of deployed widget instances of provisioning policies, including commands such as create, get, update, delete, and list.
│   └── addon                   Management of active provisioning widget folders, including commands such as get and list.
├── workspace                   Workspace Management, including commands such as create, get, list, update, and delete.
│
│               # APP Application Management domain
├── prometheus                  Prometheus service Management (includes Prometheus instances, aggregation views, RecordingRules, etc.)
│   ├── instance                Prometheus instance management, including commands such as create|get|update|delete|list
│   ├── view                    Prometheus aggregation view management, including commands such as create|get|update|delete|list
│   └── recording-rule          RecordingRule pre-aggregation management, including commands such as create|get|update|start|stop|delete|list
├── apm                         application performance monitoring management
│   ├── service                 application performance management (APM) application service management, including commands such as create|get|update|delete|list
│   └── configuration           APM configuration management, including commands such as get|create
├── rum                         Real User Monitoring management
│   ├── service                 Real User Monitoring (RUM) application service management, including commands such as create|get|update|delete|list
│   └── configuration           RUM configuration management, including commands such as get|create
│
│               # alerting and management event domain
├── alert                       alerting center management (including Alert Rule, alerting template, alerting History, etc.)
│   ├── rule                    Alert Rule management, including commands such as create|get|update|patch|delete|list|enable|disable
│   ├── template                Alert Rule template management, including commands such as list|get|create|update|delete|apply
│   └── history                 alerting trigger and recovery History management, including commands such as list
├── notification-channel        notification channel management
│   ├── contact                 alert contact (Email, text message, and DingTalk) management, including commands such as list
│   ├── robot                   alerting robot (DingTalk/Lark/WeCom group robot) management, including commands such as list
│   └── webhook                 Webhook address management, including commands such as list
├── event-hub                   Event Center management, including commands such as list|get
│
│               # Data query domain
├── metric                      Metric query
│   ├── promql                  Prometheus Query Language (PromQL) instant/range query and metadata retrieval, including commands such as query|query-range|labels|label-values|series
│   └── basic                   CloudMonitor 1.0 Metric query, including commands such as points|latest|range|top|export
├── trace                       Trace data query, including commands such as search and tree
├── entity                      Cloud resource and EntityStore query, including commands such as query
└── meta                        Metadata query, including commands such as metrics, namespaces, and events

Enabling Boundless Intelligent Collaboration: UModel Now Open Source, Launching the Universal Semantics Standard Initiative

ObservabilityGuy — Thu, 02 Jul 2026 09:48:30 +0000

Let data speak the same language and enable boundless intelligent collaboration.

Hangzhou, May 20, 2026 — At the 2026 Alibaba Cloud Summit, Alibaba Cloud officially open sourced Unified Model (UModel) and launched the Universal Semantic Standard (USS) industry initiative, aiming to break down the fundamental barriers of semantic fragmentation in enterprise data and build a semantic foundation for large-scale AI implementation.

Every system in an enterprise faithfully records facts. The alerting system records anomalies, the log platform saves fault samples, the Trace system records invocation chains, Kubernetes manages running status, the release systemrecords changes, the Configuration Management Database maintains configurationrelationships, CRM tracks customers, and ERP manages orders and inventory.However, enterprise digitalization and intelligent transformation faces a long-underestimated fundamental barrier — semantic fragmentation.

Pervasive data silos: According to industry research, enterprises use more than 30 Software as a Service tools and internal systems on average.Data is scattered across hundreds of "data chimneys" such as O&M monitoring, business systems, financial platforms, and customer management. A "checkout-service error rate spike" may simultaneously be associated with a recent release, a downstream dependency timeout, abnormal restarts of three pods, and an SLO violation — these clues are scattered across five or six systems with no unified object border, relationship structure, or evidence chain between them. Engineers rely on experience to jump between systems and piece together context, spending over 80% of their time "finding info" rather than "making judgments."
Semantic inconsistency as a pervasive problem: For the same "sales" indicator, finance, operations, and e-commerce each have their own set — three numbers, three meanings, and three definitions sharing the same name. The same problem is equally common in the O&M realm: different monitoring platforms use different collection methods, formulas, and time windows for "error rate." Cross-platform troubleshooting often becomes a semantic translation marathon. To snap the different expressions of the concept "customer" across three systems (account, buyer, and customer_id), data analysts may need to spend two days on caliber mapping before they can begin actual analysis work.
Large-scale AI implementation blocked: Without unified semantic context, AI agents cannot reliably understand the meaning of data across platforms and frequently hallucinate, significantly undermining intelligent decision-making quality. In the past, human experience could bridge this semantic gap at the cost of low efficiency but controllable results. However, when enterprises begin involving AI agents in O&M, customer service, analysis, and automated decision-making, the gap shifts from an "efficiency problem" to a "capability problem" — an agent can invoke ten tools and retrieve ten sets of data, but cannot determine whether they belong to the same service, the same change, or the same causal chain. Without semantics, AI cannot establish cross-system data associations and causality. It can only summarize single-source data and cannot support end-to-end intelligent decision-making.
Collaboration costs increase exponentially: Data semantics across departments, systems, and tools are disconnected, resulting in persistently high communication friction. A simple data analytics task can take 3–5 days to snap definitions, translate fields, and confirm meaning — less than 20% of the time is actually spent on analysis and decision-making. Businesses don't lack data, and they don't lack tools. What's missing is a unified semantic runtime that makes the enterprise world understandable to people, systems, and AI. To solve the above problems, Alibaba Cloud officially open sourced Unified Model (UModel) at the Cloud Summit on May 20.

1.UModel: Object Graph Semantic Runtime for Enterprise AI

As the core technical solution built by Alibaba Cloud to solve the above problems,UModel is an object graph semantic runtime for enterprise AI. It uses objects and relationships to describe the enterprise world, making these descriptions queryable, authenticatable, and programmable for Agent invocation. It is not another observability tool, Configuration Management Database, or knowledge graph. It sits on top of these systems and organizes the existing facts within them into a unified object graph. By combining the corresponding capabilities, it transforms enterprise data from "being recorded separately by each system" to "being uniformly organized, queried, authenticated, and invoked around objects." For the core problems of enterprises, UModel provides clear solutions:

(I) Design Choices of UModel
The UModel approach differs from most data integration solutions. The following three design choices determine its engineering form — first define the object, make the specification verifiable, and then connect existing systems based on a stable specification.

Object First: Define borders before mapping data A common override path is "data first" — connecting data from multiple systems to the same platform and then trying to aggregate objects from it. The problem with this path is that the same business entity has different expressions in different data sources, each aggregation requires remapping, and the mapping result is unstable — the "service A" aggregated today and the one aggregated tomorrow may not be the same thing.

UModel chooses to bear the upfront modeling cost: an object is defined once, and its identity and relationship structure remain stable regardless of pod reconstruction, data source switching, or new system integration. This cost is naturally amortized across multiple scenarios—the same EntitySet can be shared by O&M agents, analysis agents, and Customer Service agents.

Specification as Code: Turning Semantic Specifications into Verifiable Engineering Assets Most enterprises also have data dictionaries or semantic specifications—usually wiki pages or Excel sheets. The problem is: outdated documents don't throw errors, renamed fields don't trigger diff tooltips, and there's no way to verify whether two teams understand the same concept consistently. Over time, specifications gradually drift from reality and become untrusted artifacts that no one relies on.

UModel manages semantic specifications as code: model changes go through PR review, imports include schema checksum validation, and whether different implementations understand the same set of semantics is automatically verified using Conformance Cases. Specifications aren't "valid just because they're written"—they "only count when tests pass," solving the disconnect between specifications and reality at the root.

Connect, Don't Replace: No Data Migration, Just Semantic Bridges The object graph stores "who the object is, what the relationship is, and where the evidence can be found."

UModel connects to existing data sources through semantic mapping — describing "where to look up an object's indicators and where to find its logs" instead of moving data over. Data stays in place, semantically organized in UModel. Compared with aggregating all data to a unified platform (high cost, long epoch, and fragile MPS queue), the cost of connecting existing systems is far lower than replacing them. Enterprises can implement incrementally without refactoring their existing IT architecture.

2.Technical Advantages of UModel

Object graph traversability: automatically assemble complete context along relationships This is the most core technical capability of UModel. In the object graph, each entity is connected to other entities through typed relationships: a devops.service is associated with the devops.deployment that deploys it, the devops.slo that measures it, the devops.team it belongs to, and the k8s.workload that runs it. These relationships are not diagrams drawn in documents — they are topology data queryable at runtime.

Given any object, its complete relation network can be traversed through a .topo query. The agent does not need to know in advance "which systems should be checked to troubleshoot a service failure" — it starts from the failure object and traverses the graph to find associated deployments, changes, SLO violations, and downstream dependencies. The causal chain emerges along the graph structure, without relying on prompt engineering guesses.

Link system: turning "where is the data" into a property of an object UModel doesn't just define "what objects exist" — it connects objects and their evidence completely through the Link system. DataLink connects entities with observed data: a devops.service is associated with its indicator set, log set, and trace set through DataLink, and describes how to locate specific monitoring data from the entity through field mapping (for example, entity field service_id corresponds to indicator label monitored_service_id). StorageLink connects datasets with physical storage: which MetricStore the indicators are stored in and which LogStore the logs are stored in are described by StorageLink — this makes "where the data is" part of the object graph instead of being scattered in configuration files or in the memory of O&M engineers. EntitySetLink defines the topology relationship semantics between objects.

The three types of Links combine to make the object graph a complete semantic description of "who the object is, where the evidence is, and how to query the data." After obtaining an object, an agent can follow Links to find the query paths and storage positions of its indicators, logs, and traces — even if the agent currently needs to execute queries against the corresponding data sources itself, it at least knows where to query and what conditions to use.

Vendor-neutral, specification-first: the semantic layer is not bound to any platform
Neither the semantic definition nor the runtime services of UModel are tied to a specific vendor or platform:
GraphStore Provider abstracts the storage backend — currently providing three implementations: memory, file.memory, and local.ladybug. Enterprises can develop their own providers to connect to existing graph databases or storage systems.
Multi-domain coexistence — Models for DevOps, Kubernetes, and business systems can be defined in different domains of the same workspace and associated through cross-domain EntitySetLink. In the previous traversable object graph example, devops.service → k8s.workload is a cross-domain relationship.
Model definitions and public contracts (OpenAPI, MCP schema, and software development kit types) are standard format files that do not depend on specific toolchains.
This design allows enterprises to adopt incrementally: first define the model for one realm, authenticate the value, and then extend to other realms — the evolution of the semantic layer is not affected by underlying storage changes or platform switching.

3.Umodel Evolution Route Planning

This open source release and initiative is just the first step for UModel. The following capabilities have been internally authenticated and will be gradually available in subsequent versions:

USearch:unified semantic search. The current Query Service provides structured object graph queries (.umodel, .entity, .topo), suitable for exact access to known object types and IDs. USearch complements the ability to quickly locate entities based on entity fields — supporting inverted index exact match, vector semantic approximate search, and hybrid retrieval. When an Agent faces blur clues (a service name fragment, a fault description), USearch allows it to quickly converge to a specific entity, then expand the context through the object graph.
Platform as a Service API: object-level semantic query. The open source version of Query Service queries the object graph itself (model, entity, topology). The Platform as a Service query proxy solves the next problem: given an entity, how to obtain its associated data? Based on the mapping between DataLink and StorageLink, it translates object-level intents such as "query checkout-service fault rate indicators" into executable search statements — but is not responsible for execution. The returned search statements are executed by the AI Agent or upper-layer system in the corresponding data source.
More realm models and providers: Continuously expand model packages across realms such as databases, MSMQ, Cloud Products, network devices, and business systems. At the same time, more GraphStore Provider implementations are available to support enterprises in persisting object graphs to the Graph Database or storage backend of their choice.

4.Administration Models and Participation Paths

UModel is now officially open source: the core closed loop can run locally, semantic specifications and Query Service are stable, and AgentGateway / MCP is active. Production-level graph storage providers, broader realm model coverage, and large-scale performance authentication are the directions that need to be advanced by the community together — which also means now is the most effective participation window to influence the project direction.

1 minute: Star the project and follow the latest developments.
5 minutes:Try the demo. Pull the code and execute make quickstart. Open the Web UI to browse example object graphs and get an intuitive understanding of the solution capabilities.
Half a day: Make your first commit. Read the model definitions in examples/quickstart multidomain/→ Define 3 EntitySets for a realm you are familiar with → Add relationships and sample data → Submit a PR.
Ongoing contribution directions (see project documents for detailed contribution guidelines):
Realm model: Contribute EntitySet and relationship definitions for realms such as databases, MSMQ, Cloud Products, and business systems.
GraphStore Provider: Connect to a new graph storage backend to drive production-grade storage capabilities
Agent tools: Contribute MCP tools, query templates, and prompts to reduce agent exploration cost
Conformance Case: Write consistency authentication use cases to ensure different implementations understand the same set of semantics

5.Quick Start

Quick Start
This open source release provides a complete closed loop that can run locally: define the model → import workspace → write entities and relationships → unified query → agent exploration → consistency authentication. The repository contains the full test suite (contract / integration / e2e / golden tests) constraining the public contract.

Startup

make quickstart

Three portals are active after startup:

API: localhost:8080 (REST + MCP)
Web UI: localhost:5173 (browse object graph, execute queries, view agent tool list)
Preloaded data: demo workspace with multi-domain model examples including DevOps, Kubernetes, and supply chain

Developer Perspective: Command-Line Interface (CLI) and Web UI

Query the object graph with the command-line interface (CLI):

# What does the checkout-service object look like?
umctl query run demo \
  ".entity with(domain='devops', name='devops.service', query='checkout') | limit 10"
# What objects is it directly related to?
umctl query run demo \
  ".topo | graph-call getDirectRelations([(:\"devops@devops.service\" {__entity_id__: '10000000000000000000000000000101'})]) | limit 10"

You can also open the Web UI directly — browse object types and instances in the Explorer view, interactively execute the above queries in the Query view, and visualize the object relationship topology in the Graph view. For complete query syntax, see the document. Agents do not need to handwrite these statements when invoking through MCP.

AI Deep Scenario: Agent Systematic Troubleshooting Through MCP + Runbook

The checkout-service example above is single-cause, interference-free — the agent can find the answer by walking a few steps along the graph. However, failures in real production environments are often more complex: the cause is not a single change but a superposition of multiple changes, misleading clues interfere with judgment, and key evidence is scattered across different systems managed by different teams. When there are more than 5 clues involving more than 3 management domains, pure graph traversal is not enough — you need a structured diagnostic protocol to ensure the completeness and correctness of the troubleshooting.

Scenario: 02:17 AM, payment-gateway P99 latency breaches SLO

The on-duty SRE is woken up by an alert. payment-gateway is a platinum-level service — every minute of abnormal behavior means direct business loss. A quick determination is needed: roll back the most recent deployment? Scale out? Or is there another cause?

Timeline:

T-24h: The retry configuration of checkout-service was changed from 2 to 5 (a routine Configuration Change; traffic was normal at the time and no issues occurred)
T-4h: The 618 promotion was activated, and traffic began climbing to 3.5x
T-0: Alerting triggered, P99 latency > 2000ms What makes this difficult? The root cause is not a single management event, but the superposition of two independent changes within a specific time window: the Configuration Change alone would not cause issues (4000 × 2.5 = 10,000 QPS, within capacity); the promotion traffic alone would not cause issues either (4000 × 3.5 = 14,000 QPS, acceptable). But when both take effect simultaneously: 4000 × 3.5 × 2.5 = 35,000 QPS, 8.75 times the normal capacity — payment-gateway is overwhelmed.

Trickier still: there was a deployment 12 hours ago. An SRE's first instinct is to suspect it—"what changed recently" is a cognitive shortcut for troubleshooting. But this time it's a red herring: the change was a log format adjustment, unrelated to latency. If you intuitively rollback this deployment first, the problem won't be resolved, and you'll have wasted valuable time.

Moreover, key clues belong to different management domains: alerting and deployment records are in the Platform domain (managed by the O&M team), while promotions are in the Business domain (managed by the business team). In traditional troubleshooting, SREs may not even know a promotion is underway—until they call the business team.

payment-gateway (degraded, platinum SLO)
  ← calls ← checkout-service
               ← affects ← cfg-checkout-retry (max_retries 2→5, 24h ago)
                              ← triggers ← 618 Flash Sale (3.5x traffic)
Exclusion: payment-gw v3.2.1 (12h ago, trivial logging change)
Root cause: 4000 × 3.5 × 2.5 = 35,000 QPS → 8.75x overload

UModel's answer: Runbook—a diagnostic protocol executable by agents

UModel solves such complex scenarios through Runbook. A runbook is not a document or a wiki—it is a structured protocol that agents can execute programmatically, comprising three layers:

Observation (what to check): Defines the specific steps for each check item—which entities to query, which relationships to traverse, and which fields to compare.
Conclusion (how to judge): Define matching conditions for each observation—what conditions are met to draw what conclusions, and what the severity level is.
Knowledge (why it is a problem): Provides interpretations and compute formulas for failure patterns to help the Agent understand rather than just match.
Unlike retrieval-augmented generation, which allows the Agent to infer on its own after retrieving documents (prone to omissions and non-reproducible), Runbook ensures two things: completeness—it will not forget to check the key factor of service traffic; determinism—the same data, executed at different times by different models, yields consistent conclusions.

In this scenario, the platform.service.ops Runbook defines three observations for the Agent: upstream_retry_amplification (check upstream retry configuration), recent_deployment_correlation (exclude or confirm recent deployment), and business_traffic_pressure (detect service traffic pressure). The Agent executes them one by one according to the protocol and does not skip subsequent checks just because there was a "recent deployment."

From UModel to Universal Semantic Standards: Something No Single Enterprise Can Accomplish Alone

Alibaba Cloud serves more than 4 million enterprise customers worldwide and has deep product expertise in observability, data governance, and enterprise intelligence. Through years of serving enterprise customers, we have come to recognize the systematic barriers that semantic fragmentation poses to unlocking data value and implementing AI. Enterprise semantics spans too many realms, too many scenarios, and too many existing systems — O&M has its own object models, business has its own entity definitions, and each industry has its own domain knowledge. This is by no means something a single company can accomplish alone. We firmly believe that unification and openness are the only keys to unlocking the full value of enterprise data and the full potential of AI.

Therefore, alongside the open source release of UModel, Alibaba Cloud officially launches the Universal Semantic Standard (USS) industry initiative — an open source collaboration schedule for the entire industry that aims to fundamentally solve the "semantic dialect" problem of enterprise data and reshape the way humans interact with data and data interacts with applications.

We have joined hands with the first batch of founding members, including industry benchmark enterprises such as Chanjet, Shenzhou Shanglong, XPeng, Zhuoyu Technology, and JLCPCB, as well as industry institutions such as CAICT and the Chinese Academy of Sciences, covering ecosystem partners in cloud computing, observability, BI analysis, enterprise software, and other fields, to jointly lay the foundation for building a more open, interconnected, and intelligent enterprise data ecosystem. Through the USS initiative, we aim to work with partners to confront the core obstacle to artificial intelligence in the enterprise domain — the lack of unified semantic standards. This is an industry collaboration that transcends competition, with the goal of jointly solving the fundamental problem of enterprise data intercommunication and mutual recognition.

(I) Core goals and principles of the Universal Semantic Standard initiative
The core objectives of the USS initiative are: to establish a set of common, vendor-neutral enterprise-level semantic model specifications that provide unified semantic representation for all types of enterprise data, including indicators, logs, traces, management events, and business entities; to enable seamless semantic interoperability among enterprise business software, O&M tools, and AI agents; and to accelerate digital transformation and large-scale AI implementation across industries. By providing unified, standard semantic specifications, the initiative ensures that all enterprise data is defined consistently and value-aligned across platforms, departments, and systems, completely eliminating semantic ambiguity across tools, departments, and systems. The initiative follows five core principles:

Standardization: Establishes a unified enterprise semantic model definition language and data structure, covering core elements such as business entities, indicator definitions, log fields, trace labels, and alerting rules, ensuring consistency and interpretability across tools and systems.
Interoperability: Facilitates seamless data exchange and cross-platform application of semantic models among diverse enterprise systems such as application performance management, log platforms, artificial intelligence for IT operations, BI tools, CRM, and ERP.
Scalability: The semantic model supports flexible extension and customization, meeting evolving requirements from monolithic applications to cloud-native microservices, from traditional IT to large-scale distributed systems, and from general business to industry-specific scenarios.
Open source collaboration: Driven by the open source community, encouraging global developers, vendors, and enterprises to participate and contribute, ensuring the frame maintains sustained vitality and relevance amid rapid technological iteration.
Domain-specific modeling: Provides standardized semantic representations for typical enterprise scenarios—such as O&M golden signals, SLO/Service-Level Agreement definitions, business indicator systems, customer entity models, and supply chain relationships—simplifying the integration and reuse of multi-source data. USS fully respects and is compatible with existing excellent industry standards. It is not intended to override them, but to provide a unified semantic framework at a higher abstraction layer. OpenTelemetry defines the naming conventions for traces/metrics/logs in the observability realm but does not cover business data semantics—USS extends upon this foundation to cover all enterprise scenarios. Open Semantic Interchange (OSI) addresses the problem of semantic model exchange between BI/analysis platforms—USS collaborates and complements it, covering a wider range of enterprise application scenarios. The Schema.org/W3C Semantic Web standards provide common semantic tagging and ontology modeling capabilities—USS draws on their design philosophy but focuses on the practical requirements of an enterprise's internal data ecosystem. USS serves as a semantic unified layer for all enterprise scenarios, bridging downward to domain-specific proprietary standards while providing a unified semantic context across systems for AI agents.

(II) Expected Results
Through the USS initiative, we will jointly achieve:

Accelerate enterprise AI scaling: A unified semantics standard enables AI models to truly "understand" the meaning of data from different platforms and departments, significantly improving the accuracy and deployment speed of AI applications across all scenarios, including artificial intelligence for IT operations, intelligent customer service, intelligent analysis, and intelligent prediction.
Reduce data governance costs by an order of magnitude: A universal semantics standard provides a common language for enterprise data architectures spanning multiple data sources, tools, and systems, freeing data teams from tedious metric snap and field translation so they can focus more on data value mining.
Ensure vendor neutrality and freedom of choice: The frame is independent of any specific platform or AI tool vendor, giving enterprises true freedom of choice when building digital infrastructure and avoiding vendor lock-in.
Build an enterprise-grade "semantics operating system": Upgrade from a passively consulted "data dictionary" to a live, proactive semantics runtime that can be programmatically invoked by AI agents, laying the foundation for future large-scale multi-agent collaboration across enterprises.

6.Conclusion

At this pivotal moment when AI is penetrating enterprise production systems at scale, semantic unification of data has become the core foundation for industrial upgrade. The UModel and USS initiatives are our answer for this era — ensuring data is no longer scattered fragments and enabling AI to truly understand how enterprises run.

The core philosophy of UModel is clear and resolute: reconstruct an enterprise's business and data worlds into a unified object graph, enabling AI to understand the complete context along the relationships between entities and make reliable judgments based on traceable evidence. As the capability boundaries of agents continue to expand, what truly constrains the release of their value will no longer be the upper limit of the model itself, but whether the enterprise's business world can be truly understood and perceived by AI. The mission of UModel is to bridge this critical semantic gap — so far, we have taken the lead in completing the implementation and validation of this path in the observability realm, proving its feasibility.

Achieving this goal is by no means something a single team can accomplish alone. That is why we chose to open source the project and launch the Universal Enterprise Semantic Standards initiative — we firmly believe that the right path for this undertaking is rooted in the spirit of open collaboration, and we uphold three open principles: open standards and specifications, open technology implementation, and open effect verification. Starting from observability scenarios, we have completed the initial implementation of our methodology. Now, we look forward to working hand in hand with community partners to gradually extend the capabilities of the enterprise object graph from this realm to the broader landscape of data governance, business systems, and cross-domain collaboration.

We sincerely invite every industry peer — whether you are an enterprise software vendor, an AI solutions provider, a developer in the open source community, or an enterprise user across industries — to join us on this pivotal journey to build a more open, more connected, and more intelligent enterprise data ecosystem. We believe that through open collaboration, we will jointly drive the transformation of enterprise AI from "fragment summarization" to "deep understanding," and together create a truly AI-perceptible intelligent enterprise future.

Let data speak the same language. Let intelligence collaborate without boundaries.

Alibaba Cloud Releases RCA Benchmark, the Industry's First Open Source Root Cause Analysis Benchmark System for Agentic Ops

ObservabilityGuy — Tue, 16 Jun 2026 01:59:01 +0000

This article introduces Alibaba Cloud's open-source RCA Benchmark for evaluating AI agents in IT operations.

Alibaba Cloud has released RCA Benchmark to build a standardized root cause analysis evaluation dataset and evaluation protocol system for Agentic Ops. It is also the industry's first open source benchmark project that addresses the evaluation of AI Agent diagnostics capabilities for distributed system failures at the system level. Alibaba Cloud has partnered with institutions across the realms of observability, artificial intelligence for IT operations, and cloud-native infrastructure — including CAICT, the Institute of Software/Computer Network Information Center of the Chinese Academy of Sciences, Tsinghua University, Fudan University, and Nankai University — to jointly build an industrial ecosystem and establish a standardized, trustworthy O&M agent evaluation system, laying a solid foundation for large-scale industry adoption.

Based on long-term product implementation and service practices in the observability and artificial intelligence for IT operations realms, Alibaba Cloud has recognized that rootcause analysis is the most complex and hardest-to-standardize core process in O&M agent capability evaluation. Unlike tasks with fixed inputs and ground truth such as text Q&A and code generation, RCA Agents operate against continuously running distributed complex architectures. They must proactively filter valid info from multi-source observability data such as indicators, logs, Tracing Analysis, and system events, trace abnormal propagation paths based on service dependencies and entity topology relationships, and ultimately locate the root cause of failures. The industry has not yet established a unified, systematic evaluation benchmark, making it impossible to objectively compare the fault diagnostics capabilities of various AI Agents or to quantify the effectiveness of technology evolution and capability iteration.

Industry urgently needs a unified RCA evaluation standard

As enterprise Agentic Ops enters the stage of large-scale implementation, the lack of an evaluation system has become a key constraint on industry development, and the traditional evaluation paradigm can no longer meet the development demands of artificial intelligence for IT operations:

Traditional evaluation mode completely fails
Root cause analysis is not a simple text processing task. AI Agents must perform real-time indicator queries, log analysis, Tracing Analysis, and change management event assessment, and conduct cross-tool collaborative diagnostics. Traditional evaluation methods that rely on static log fragments and a single label cannot distinguish whether an agent completes full logical reasoning-based diagnostics or merely achieves an accidental hit based on alerting appearances, resulting in significant shortcomings in evaluation effectiveness.
Multi-source observable data is difficult to standardize
RCA evaluation involves multi-source observable signals such as indicators, logs, Tracing Analysis, and system events. These data types are coupled across time and entity dimensions, and failure impact propagates layer by layer along business traces. Taking a database slow query failure as an example, it triggers a chain reaction: increased MySQL query time, increased invoked service latency, upstream service timeout, and frontend 5xx errors. Single-dimension observable data can only render partial symptoms and cannot revert the complete failure propagation logic.
Causal propagation chains easily lead to evaluation misjudgment
The industry commonly confuses abnormal symptoms with failure root causes. Frontend alerting mostly reflects the end of a failure trace, while the real root cause often lies in downstream databases, caches, MSMQ, or the container scheduling layer. If a dataset does not fully depict the causal propagation path and a diagnosis is deemed correct simply by hitting services around the alert, evaluation aliasing is highly likely.
Cross-domain entity identity lacks a unified specification
The same business entity has completely fragmented naming systems across different O&M systems: naming conventions for the same business entity are fragmented across the application performance management, Kubernetes, and cloud resource layers. Evaluation can only rely on character matching or manual subjective judgment, resulting in problems such as unstable scoring, non-reproducible results, and unauditable flows.

In this context, Alibaba Cloud states clearly that building a systematic, standardized RCA Agent evaluation benchmark has evolved from an academic research topic into essential infrastructure for the large-scale implementation of Agentic Ops.

RCA Benchmark Core Definition

RCA Benchmark is not a single-file dataset, but a benchmark suite evaluation system with a complete architecture and closed-loop logic. It consists of three modules: runtime environment, structured sample set, and evaluation protocol.

Runtime Environment: Build a microservice simulation system capable of generating real failure signals, supporting interactive diagnostic queries by AI agents, and completely eliminating the traditional pattern of providing only standard log fragments.
Structured Sample Set: Build a fault sample library with Layer 4 structured ground truth. Each case fully covers four core elements: fault type, normalized root cause entity, causal propagation chain, and key evidence checkpoints.
Evaluation Protocol: Define standardized scoring rules to convert AI agent outputs into quantitative fractions for horizontal comparison. Centered on deterministic rules, this minimizes dependency on Large Language Model (LLM) review and ensures fair and objective scoring.

The project covers all mainstream scenarios, including microservices model failures, database and intermediary failures, Container Orchestration and cloud-native platform failures, cloud resource layer failures, and LLM and agent runtime failures.

RCA Benchmark Core Design Principles and Overall Technical Architecture

RCA Benchmark takes real-world native simulation as its core design concept. It builds a benchmark foundation based on an E-commerce microservice architecture deployed in Kubernetes clusters, containing over 40 business services with call chains up to 7 layers deep. It does not use synthetic data and fully covers typical business dependencies such as synchronous RPC, asynchronous messages, databases, caches, MSMQ, and gateways. With full-domain access to the observability foundation, it supports agents in retrieving seven categories of observation data: indicators, logs, Tracing Analysis, alerting, resource topology, Kubernetes events, and performance profiling. By continuously injecting differentiated background traffic, it replicates production day-night fluctuations, business peaks, and scheduled batch processing payload features to establish a reliable pre- and post-failure comparison baseline.

The project innovatively introduces a four-layer structured ground truth system, abandoning the traditional single root cause label pattern. It completes standardized definitions spanning failure types, normalized entities, causal propagation chains, and key evidence edge zones, with a complementary root cause identification, boundary demarcation, and procedure three-dimension weighted scoring frame that calculates composite scores at 40%, 30%, and 30% weights. Nearly 70% of scores rely on deterministic quantization computation based on failure type topology semantics distance and entity topology distance. Multi-dimensional graded evaluation covers failure semantics matching, topology positioning accuracy, diagnostic evidence, and causal logic completeness, systematically avoiding evaluation bias from random hits. The entire process features transparent rules, reproducible results, and auditable flows.

The platform achieves full-scenario coverage of over 40 failure types across 6 categories at the application layer, intermediary layer, container platform layer, and cloud resource layer through four injection channels: chaos engineering tools, Kubernetes-native O&M, switch configuration, and Alibaba Cloud service APIs. It builds a failure coverage graph across vertical and horizontal dimensions to ensure comprehensive and balanced evaluation scope. To address the industry pain point of fragmented cross-domain entity identities, the platform incorporates a unified entity model (UModel) that assigns cross-domain unique primary keys to all entities, completing multi-domain entity mapping and topology distance calculation through a standardized normalization flow, enabling end-to-end traceability, reproducibility, and auditability.

The system also establishes a four-layer GSTO Quality Gate with multiple admission checks covering structure specifications, signal validity, time windows, and open adaptability settings, strictly filtering invalid samples with failure chain aliasing. Over 200 compliance samples have been accumulated to date, covering all failure type categories and classified into four difficulty levels (L1–L4), with L2 and L3 medium-to-high difficulty scenarios serving as the core evaluation focus.

The project adheres to the principle of open source co-construction. Core capabilities including the evaluation frame, failure directory, scoring protocol, and Quality Gate are fully open source, with co-construction channels open to observability vendors, Agentic Ops developers, and enterprise SRE teams. Reserved non-public test samples and compliance gates prevent data contamination and ensure the fairness and credibility of industry evaluation rankings.

Alibaba Cloud's open source RCA Benchmark establishes a standardized, reproducible, and auditable unified capability ruler for Agentic Ops in the industry, enabling objective benchmarking and quantitative measure of diagnostics capabilities across different agents. Leveraging the tiered difficulty system and full-scenario failure coverage, it supports enterprises in technology selection and business implementation iterations. By open-sourcing core capabilities, it significantly reduces the cost of building in-house evaluation systems. Through dynamic dataset updates, saturation monitoring, and a closed-loop scenario feedback mechanism, the benchmark continuously iterates its capabilities, co-building a long-term evolving, open, and shared O&M intelligent agent industry ecosystem.

Make every failure assessment evidence-based, and every diagnostics capability quantifiable, benchmarkable, and evolvable.

From Black Box to Transparent: Alibaba Cloud Agent Observability and Audit Data Collection in Practice

ObservabilityGuy — Thu, 11 Jun 2026 02:24:01 +0000

This article introduces Alibaba Cloud's LoongSuite solution for comprehensive AI agent observability and audit data collection using extended OpenTelemetry GenAI semantic conventions.

I. Introduction

In 2025, AI agents are moving from the lab to large-scale production. From code assistants used by developers daily to intelligent customer service in enterprise service scenarios, to multi-agent collaboration systems of ever-increasing complexity, AI agents are reshaping software development and business operations at an unprecedented pace.

However, once agents are actually running, a critical problem emerges: the actual runtime behavior of AI agents is difficult to observe, trace, and govern.

A coding agent autonomously and without authorization modifies core configuration files overnight, with no way to know what changed or why. An intelligent customer service agent autonomously issues a "cancel order" instruction, yet the decision logic, tool calling chain, and token resource consumption cannot be reviewed. A multi-agent collaborative job fails midway, and the failure node and root cause are difficult to pinpoint.

These issues point to a common requirement: AI agents need comprehensive observability. Moreover, this observability cannot remain at the shallow statistical dimension of "request success/failure" — it must deeply cover AI agent-specific runtime aspects such as LLM invocation, tool execution, multi-round inference, and memory retrieval.

Based on the OpenTelemetry (OTel) community standard and its in-depth practices in observability fields, Alibaba Cloud has developed a complete data collection solution that covers three types of agent forms. Building on the OTel GenAI semantic conventions, Alibaba Cloud has released the LoongSuite GenAI semantic conventions for observability. This paper will systematically introduce the design concept, technical implementation and use of this scheme.

II. Agent Form Classification and Observability Challenges

The AI agent market is thriving and highly diverse. The runtime models, deployment environments, and use cases of different agent types vary significantly, and their observability and audit needs differ accordingly. We classify mainstream AI agents on the market into three categories:

2.1 Three Major Forms of Agent

2.2 Three Core Challenges
No matter what form is adopted, AI agents will encounter three common problems after large-scale use:

The execution process is black-boxed. The execution process of the agent involves LLM calls, tool execution, multi-round reasoning, and memory retrieval. The traditional Metrics, Log, and Trace methods cannot effectively describe this new computing paradigm. For example, in a round of Agent tasks that contain 10 rounds of ReAct reasoning, the traditional solution can only identify 10 independent HTTP requests and cannot restore a complete hierarchical and orderly decision-making process.
The behavior trajectory is difficult to trace. The agent has high independent operation permissions and can read and write local files, run system commands, and call third-party API operations. Without special audit capabilities, all operations of agents cannot be traced. This poses high risks in enterprise security and compliance control scenarios.
Cost is hard to quantify. Token consumption of large models is the main cost source of agents. Multiple rounds of iterations and tool calls will exponentially increase consumption. Without the ability to fine-tune cost splitting by agent, user, and task, enterprises will not be able to carry out budget control and input-output evaluation.

III. A Differentiated Collection Approach: Adapting to Agents' Native Runtime Forms**

Core design principle: Adapt the data collection capability to the native running mode of the AI Agent instead of forcing the Agent to adapt to the data collection tools.

3.1 Coding Agent: LoongSuite Pilot, a Lightweight Client-Side Data Collector
Coding agents run on the developer's local machine, where all core behaviors — code edits, file creation, terminal command execution — happen in the local environment, completely invisible to traditional server-side agents. To address this, we built LoongSuite Pilot, a client-side data collection platform purpose-built for coding agents.

Core Advantages

One-time deployment, full coverage. Pilot is not a solution exclusive to a single agent, but a unified platform. It currently supports five mainstream coding agents: Claude Code, Codex, Cursor, Qoder, and QoderWork. Developers only need to install it once to automatically collect data from all code assistants in use, with no repeated configuration required.
Silent background execution with zero disruption. Pilot runs as a local daemon process in the background, automatically detecting installed coding agents on the device and deploying capabilities. Developers do not need to modify agent configurations or change usage habits at any point. All behaviors, including LLM invocations, tool execution, and code modifications, are seamlessly recorded.
Resumable collection for stable and reliable data. A built-in breakpoint-resumable collection mechanism handles unstable scenarios such as network fluctuations on local devices, device restarts, and terminal shutdowns. After a process is abnormally interrupted and restarted, no data duplication or data loss occurs, ensuring data integrity.
Flexible collection granularity that balances observability and data security. Different teams have different data security requirements. Pilot supports flexible configuration of collection granularity by agent type. For complete audit needs, detailed info such as message content and tool parameters can be collected. In data-sensitive scenarios, only metadata (model name, token consumption, duration, etc.) is reported, achieving a precise balance between observability requirements and data security.
Plugin architecture, quickly compatible with new agents. Pilot uses a plugin architecture and provides out-of-the-box collection base classes for different agent data formats, such as hook logs, IDE snapshots, SQLite databases, and session files. Integrating a new Coding Agent requires implementing only 2-3 abstract methods, enabling you to quickly keep up with ecosystem iterations. Supported Coding Agents and Coverage

3.2 Personal General-Purpose Assistant: One-Line Command for Full Observability and Audit

Personal general-purpose assistants usually run as standalone services, providing end users with dialogue and task-execution capabilities. For this type of agent, we provide a dedicated plugin that enables full tracing with a single command.

Design philosophy

Take OpenClaw as an example. Although its built-in diagnostics-otel extension can output Metrics and some Trace, it adopts an event-driven architecture. Span is created independently for each event, and there is no parent-child relationship between each other and Trace Context propagation. In essence, it is a group of "standalone data points". The openclaw plug-in of LoongSuite is a complete distributed tracing by design-all Span share the same traceId and are connected together into a call tree through an explicit parent-child relationship.

Span Semantic Model

Each type of span is connected to a complete trace tree by using parent-child relationships. O&M personnel can view the number of large model calls, token consumption, tool call list, time-consuming nodes, and fault information of a single request.

Essential differences from built-in observability

Compared with the built-in observability capabilities of OpenClaw, LoongSuite plug-ins are different in two aspects:

Link integrity. Built-in observability is usually flat and independent, and there is no correlation between events. However, our plug-in is based on the OTel Context propagation mechanism to ensure that ENTRY → AGENT → STEP → LLM / TOOL forms a complete call tree, which can restore the complete picture of a request.

Data richness. Built-in observability often only records basic metrics such as model usage, while our plug-ins fully record fields such as gen_ai.input.messages, gen_ai.output.messages, gen_ai.system.instructions, gen_ai.tool.call.arguments, and gen_ai.tool.call.result to meet the needs of in-depth audit and troubleshooting.

The same plug-in mechanism already covers personal general-purpose assistants such as Hermes Agent and QwenPaw.

3.3 High-and-Low-Code Framework Agent: Zero-Code Instrumentation with the LoongSuite Python Agent
For agent applications built on frameworks such as LangChain, AgentScope, and Dify, the runtime behaves like a traditional Python application. We provide the LoongSuite Python Agent (deeply customized from OpenTelemetry Python Contrib), which achieves zero-code automatic instrumentation with a single command.

Quick start

# 1. Install the LoongSuite Python Agent pip install loongsuite-distro
# 2. Auto-detect and install the required instrumentation libraries
loongsuite-bootstrap
# 3. Start with one command; probes are injected automatically
loongsuite-instrument \
  --traces_exporter otlp \
  --service_name my-agent-app \
  python my_agent_app.py

loongsuite-bootstrap automatically scans for installed frameworks (such as langchain, dashscope, and mcp) in the current environment and installs the corresponding instrumentation packages-developers do not need to manually select and install them.

Framework Coverage

At present, 16 instrumentation libraries have been covered in the LoongSuite Python Agent, covering the mainstream AI agent development framework:

Automatically Recognized Span Types

The probe automatically detects and generates multiple GenAI span types, covering the entire agent lifecycle:

ENTRY: Request entry
AGENT: Agent execution unit
STEP: ReAct reasoning-action iteration step
LLM: LLM invocation, including request parameters, token consumption, and input/output messages
TOOL: tool calling, including tool name, parameter, and result
MCP: MCP protocol invocation
CHAIN: chained invocation orchestration
RETRIEVER: retrieval operations
EMBEDDING: embedding operations
RERANKER: reranking operations
WORKFLOW: workflow orchestration

IV. Observability and Audit Results

After accessing the preceding collection capabilities, users can obtain observability views in the following dimensions. Take Claude Code as an example. If you want to enable Agent Observability, you only need to log in to CloudMonitor 2.0 Console, click the corresponding card in the access center and follow the steps to complete the installation and access with one line of command.

4.1 End-to-End Agent Call Chain View
The complete execution process of the agent is presented in the form of a trace tree, from the user request entry (ENTRY) to the agent decision (AGENT), inference step (STEP), LLM call (LLM), and tool execution (TOOL). The hierarchical relationship is clear at a glance. For complex tasks with multiple rounds of ReAct, you can use Step Span to quickly locate which iteration has a problem, and then go to the LLM or Tool Span in the round to analyze the root cause.

Troubleshooting pattern: When an agent executes a 10-round ReAct process, you can first use Step Span to identify which round of the problem occurred, and then analyze the specific step in the round. This top-down troubleshooting method greatly improves the fault locating efficiency of complex agents.

4.2 Token Usage and Cost Tracking
Based on gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens , as well as cost fields extended by Alibaba Cloud (input_cost, output_cost, and total_cost), you can:

Token usage details for a single request
Cost aggregation by agent / user / time
Cache token fields (cache_read.input_tokens, cache_creation.input_tokens) to evaluate cache policy effectiveness

4.3 Session and Multi-Turn Conversation Tracking
Through gen_ai.session.id, gen_ai.turn.id and gen_ai.step.id to build a three-level identification system to achieve:

Full conversation traceability across multiple rounds of conversation

Step-level fine-grained analysis in a single-round dialogue

Session path analysis and user behavior insights

4.4 Tool Call Audit
You can record the tools that are called by the agent, the parameters that are specified, the results that are returned, and the duration. For the Coding Agent, this means that every file read or write and every command execution is documented. For MCP protocol calls, complete request-response auditing is also provided.

Behavior Analysis Dashboard**
**
The top count card divides tool calls into dimensions such as command execution, file reading and writing, search, web browsing, and MCP calls by behavior type, and marks the categories with abnormally high call volume with striking red or orange colors to provide a quick snapshot of the overall behavior composition. The right side displays the number of active sessions and the number of users at the same time, which is convenient for correlating the behavior popularity with the usage scale. The session statistics table below is expanded by session and records the number of calls in each session in each dimension of behavior. This allows you to locate the sessions and users in which high-frequency operations are concentrated.

Tool Call Distribution

The tool invocation distribution page presents the tool usage structure from two perspectives. The pie chart on the left shows the type proportion of all tool calls (such as Read, Write, Bash, TodoWrite, etc.) to help the team understand which tool capabilities the agent relies on most. The pie chart on the right shows the distribution of MCP tool calls independently, revealing which external capabilities are frequently called in cross-system integration. The trend comparison chart below shows the changes in the number of calls for each tool type in a timeline, making it easy to identify phased changes in call patterns-for example, a surge in Bash calls on a certain day may indicate batch script tasks or abnormal behavior.

Security Audit Overview

The Overview page compresses the security situation of AI agents into a screen-readable risk snapshot based on the multi-dimensional high-risk operation count within a specified time window. The funnel on the left side gradually converges from full sessions to sessions with security risks. This visually shows the proportion of risk surfaces. On the right side, metrics such as high-risk command execution, outbound web requests, outbound command-line requests, sensitive file access, and prompt injection are displayed side by side. With the comparison data, the security team can quickly determine whether the current risk level is abnormal without in-depth details.

What is particularly noteworthy is the count of high-risk operations after the prompt injection event. Ordinary high-risk operations may originate from the reasonable requirements of the task itself, while high-risk behaviors triggered by injection are strong threat signals-this means that the injected malicious instructions have driven the Agent to execute. Even if there is a false positive, such signals should trigger a manual review at the highest level, rather than waiting for further confirmation. Therefore, the “number of tool-calling sessions following prompt injection” is the highest-confidence Indicator of Compromise (IoC) in the entire overview. The priority of 3 such sessions is often higher than that of hundreds of ordinary high-risk commands.

High-Risk Session Tracing

Two-stage drill-down capability is provided below. The upper layer is a high-risk session risk score table, which aggregates the risk counts of each dimension (injection hits, high-risk operations, sensitive file accesses, and outbound information) by session, and automatically sorts the comprehensive risk score to present the sessions that require the most manual intervention. The security team does not need to screen logs one by one. Instead, the security team directly starts tracing from the session with the highest risk, greatly reducing the time window from discovery to response.

The lower layer is a high-risk event summary table, which drills risk down to individual event granularity-specific time, user, session, event type, tool name involved, threat type, and complete context content, providing security analysts with the original evidence required for final characterization.

V. Deep Extensions Based on the OTel GenAI Semantic Conventions

The data capabilities of the observability system of Alibaba Cloud AI Agent are built based on the self-developed LoongSuite GenAI Observability Semantic conventions. This specification is based on the OTel GenAI standard in the community and fills the semantic gaps in real business scenarios.

5.1 Why Extend Beyond Community Standards
As early as the beginning of 2024, OpenTelemetry started driving GenAI semantics specification development, aiming to establish a unified observability data language. Community standards have laid an important foundation:

gen_ai.operation.name: Standardized operation types (chat, embeddings, execute_tool, etc.)
gen_ai.span.kind: Differentiates span types such as LLM, CHAIN, AGENT, TOOL, and RETRIEVER
gen_ai.request.model / gen_ai.response.model: Model identity
gen_ai.usage.input_tokens / output_tokens / total_tokens: Token usage
gen_ai.input.messages / gen_ai.output.messages: Input and output messages
gen_ai.response.finish_reasons: Model stop reason

However, community standards inherently need to balance broad applicability with long-term stability, resulting in a relatively cautious pace of evolution. The current OTel GenAI semantic conventions is still in Development status, and many new concepts and scenarios are still being absorbed and converging.

In practice at Alibaba and Ant Group, we encountered many more complex and granular real-world scenarios. For example, a seemingly simple scenario of "ordering milk tea with Qwen" actually involves cross-domain coordination among multiple business systems, including Qwen Agent, Flash Sale Agent, Amap Agent, and Alipay Agent. These scenarios place higher demands on semantic expressiveness.

To this end, based on the OTel GenAI community standard and drawing from extensive internal hands-on experience, we released the LoongSuite GenAI Observability Semantic conventions. In 2026, the specification was officially open-sourced as a vendor extension standard for OTel GenAI, with plans to gradually contribute optimization capabilities upstream to the community.

5.2 Selected Core Extensions
Extension 1: Entry Span and Step Span — Making Complex Agent Call Chains Readable

Problem background: When an agent executes a long-running job, a single trace may contain hundreds or even thousands of spans. The native standard cannot distinguish business levels, making call chains cluttered and difficult to analyze.

Semantic Modeling:

Entry Span (gen_ai.span.kind = ENTRY ): Created at the entrance of the agent call, used to restore the original input and output of the model and the user to form the dialogue history. Ensure that when processing downstream tasks, the data is not polluted by System Prompt or framework Prompt, and the most original customer request can be obtained.
Step Span (gen_ai.operation.name = react ): represents the hierarchical expression of Agent in each ReAct process. Each ReAct completes the cycle of "reflection → tool invocation → model invocation", identifying the turn by gen_ai.react.round. The round-by-round span structure makes the trajectory of each loop clear at a glance. This semantic conventions has been implemented in multiple scenarios such as OpenClaw, QwenPaw, and Hermes Agent.

Extension 2: Skill Semantics — Making Business Function Domains Observable

Background: In scenarios such as e-commerce shopping assistants, commands are routed to the corresponding Skill after the agent understands the intent. Existing semantic conventions lack an abstraction of the business function aggregation layer of Skill.

Semantic Modeling: gen_ai.skill.* attribute family is added:

At the current stage, these attributes are attached to the execute_tool Span and quickly landed. At the same time, we have implemented an independent invoke_skill Span scheme and submitted a proposal to the OTel community (#3540).

Downstream value: Observability Platform can be aggregated and analyzed by functional domain to quickly identify "which Skill has the highest error rate", compare "whether the latency of the new version of Skill is degraded after it is launched", and measure "the proportion of Skill execution time spent on LLM calls".

5.3 Engineering Implementation: GenAI Utils
The value of semantic conventions lies not only in documents, but also in engineering implementation. We implemented GenAI Utils in the probe as an engineering capability layer for the LoongSuite SemConv:

Data extraction only at the instrumentation layer: Each framework instrumentation library intercepts framework calls by using hooks or Monkey-Patch, and fills data into the corresponding Invocation data object.
GenAI Utils unified telemetry output: All span creation, attribute mounting, metrics recording, event sending, and context management are completed by the ExtendedTelemetryHandler.
Only one specification update: When LoongSuite SemConv adds new fields or adjusts the structure, you only need to modify GenAI Utils. All downstream instrumentation libraries automatically take effect.

Supported Invocation types include LLMInvocation, InvokeAgentInvocation, CreateAgentInvocation, ExecuteToolInvocation, EmbeddingInvocation, RetrieveInvocation, RerankInvocation, and MemoryInvocation, covering the entire lifecycle of GenAI.

GenAI Utils has versions of Python, Node.js, and Go, and the Java version will be released soon. Among them, Python and Node.js versions have been open-sourced, and the rest will be open source one after another.

VI. Summary

The Alibaba Cloud Agent observability and audit solution is applicable to the following scenarios:

The popularity of AI agents has greatly improved production and office efficiency, and also put forward new requirements for observability, auditability, and governance capabilities. Different from traditional microservices and web applications, AI Agent integrates new operation modes such as LLM calls, tool execution, and multi-turn reasoning. It must support exclusive data collection and semantic standards.

The Alibaba Cloud LoongSuite solution provides full coverage for the following types of mainstream agents:

LoongSuite Pilot eliminates blackboxes for locally running coding agents such as Claude Code, Cursor, Codex, Qoder, and QoderWork.
Dedicated plug-ins (OpenClaw, Hermes Agent, QwenPaw) give personal general-purpose assistants full tracing capabilities.
The LoongSuite Python Agent, which is open source and uses 16 framework instrumentation libraries, allows agent applications developed based on frameworks such as LangChain, AgentScope, Dify, and MCP to implement zero-code access.

More importantly, the LoongSuite GenAI Observability Semantic conventions, which is based on the OTel GenAI Semantic conventions, is open source. It uses key semantic extensions such as Entry, Step Span, and Skill semantics to fill the semantic gaps of community standards in real business scenarios. With the engineering package of GenAI Utils, this ensures unified standard implementation and efficient iteration.

The ultimate goal of a unified semantic conventions is not to produce a single document, but to enable all users and vendors who use the specification to see, analyze, govern, and evolve the rapidly growing GenAI applications.

Related links:

Cloud Monitor 2.0 console: https://cmsnext.console.aliyun.com/
AgentLoop console: https://agentloop.console.aliyun.com/
Semantic conventions: https://github.com/alibaba/loongsuite-semantic-conventions-genai/
Python Agent: https://github.com/alibaba/loongsuite-python-agent

Beyond 'Demo-Grade' Architecture: Building a Highly Available Production Foundation for Dify with SAE SLS

ObservabilityGuy — Fri, 05 Jun 2026 03:10:52 +0000

This article introduces Alibaba Cloud SAE, a serverless platform that simplifies application modernization and accelerates AI deployment with zero node management.

Introduction
When facing complex microservice operations and volatile AI traffic patterns, building an elastic, maintenance-free "compute foundation" is also crucial.This article expands the scope from data architecture to full-stack infrastructure, introducing the ultimate production-grade solution built on Alibaba Cloud SAE × SLS.

With the explosive growth of LLM-powered applications, Dify—with its powerful workflow orchestration and user-friendly visual interface—is becoming the go-to platform for building enterprise AI applications. However, when applications move from local demos to large-scale production, developers often hit two "hidden" challenges: skyrocketing operational complexity and data architecture performance bottlenecks.

This article provides a deep analysis of these architectural bottlenecks and introduces the joint solution built on Alibaba Cloud SAE (Serverless App Engine) and SLS (Simple Log Service)]. Through the dual engines of "fully managed compute" and "storage-compute separation," we build a highly elastic, cost-efficient Dify production environment with deep data insights.

1.Current State and Challenges: Architectural Bottlenecks in Scaling Dify

During the single-machine demo phase, deploying with Docker Compose and the default PostgreSQL storage is perfectly adequate. But once you enter production, these two pieces of infrastructure are often the first to become performance and scalability bottlenecks.

▍Operational Complexity
Dify is a microservice architecture composed of multiple components: API service, Worker, Web frontend, KV cache, relational database, and vector database. In production, this architecture poses significant operational challenges:

· Lack of resource elasticity: AI applications typically exhibit pronounced traffic peaks and valleys. With self-managed Kubernetes or ECS clusters, scaling responses lag behind demand—users queue during peaks, while massive resource waste occurs during off-peak hours, driving up costs.

· High maintenance costs: Ensuring high availability, configuring load balancing, handling node failures, and performing blue-green or canary deployments—this foundational infrastructure work carries a high technical bar and consumes significant engineering effort that should be spent on business innovation.

· Performance bottlenecks: The default deployment provides limited QPS capacity, making it difficult to support high-concurrency scenarios—especially under inference-intensive workloads, where it easily becomes a system bottleneck.

▍Database Capacity Explosion
By default, Dify stores all data—including business metadata and runtime logs—in PostgreSQL. As business volume grows, the mismatch between data characteristics and the storage engine becomes increasingly apparent:

• Logs "bloat" the database: Every workflow node execution generates a complete record of inputs, outputs, prompts, reasoning processes, and token statistics. In high-concurrency production scenarios, this data consumes the vast majority of database resources, causing tablespace to expand rapidly.

• Core business degradation: High-frequency, high-throughput log writes consume database connection pools and I/O resources, severely interfering with core business operations (such as creating applications, knowledge base retrieval, and conversation context management), leading to response delays, timeouts, and even service unavailability.

2.Synergistic Empowerment: SAE and SLS Core Advantages

To address these bottlenecks, SAE and SLS work in tandem—SAE focuses on elastic compute scheduling, while SLS specializes in massive log storage—together building a high-performance, highly available runtime foundation for Dify.

▍SAE: A Fully Managed, Elastically Scalable Runtime for Dify
SAE handles more than just orchestrating Dify's core microservices (API, Worker, Sandbox). Through one-click templates, it integrates the complete cloud ecosystem required to run Dify.

• One-click full-stack delivery: Developers no longer need to manually build complex environments. Using pre-built templates, you can deploy a complete microservice cluster with a single click, automatically creating and integrating SLS (workflow log storage), Tablestore (vector storage), Redis (caching), and RDS for PostgreSQL (metadata storage)—no need to purchase and configure each service individually, delivering a "production-ready out of the box" experience.

• Enterprise-grade high availability: Instances are automatically distributed across multiple availability zones, combined with health checks and self-healing mechanisms to prevent single points of failure. Canary deployments ensure smooth, seamless traffic shifts during frequent workflow iterations.

• Sub-second compute elasticity: A perfect fit for the "tidal" characteristics of AI workloads. SAE supports auto-scaling based on CPU/memory utilization or QPS metrics. During inference peaks, Worker instances spin up in seconds to absorb pressure; during off-peak periods, idle resources are automatically released, keeping compute costs strictly within the "actual usage" range.

• Deep performance tuning: SAE has applied end-to-end, code-and-architecture-level tuning to Dify—not only patching Redis cluster compatibility and slow SQL issues at the infrastructure layer, but also fine-tuning runtime parameters and aligning resource specifications. This full-stack optimization drives a 50x throughput leap from 10 QPS to 500 QPS, ensuring silky-smooth AI responses.

▍SLS: A "Storage-Compute Separation" Solution for Massive Data
SLS is not simply a database replacement—it is cloud-native infrastructure purpose-built for log scenarios. Compared to PostgreSQL, SLS delivers architectural upgrades across four dimensions in the Dify context:

• Extreme storage elasticity: Unlike databases that require resource provisioning based on peak loads, SLS as a SaaS service natively supports sub-second elastic scaling. Whether it's a late-night trough or a sudden inference spike, it adapts automatically—no need to worry about sharding or capacity limits.

• Architectural decoupling and load isolation: By leveraging append-only write patterns, SLS avoids the random I/O and lock contention common in databases, easily supporting 10,000+ TPS throughput. By completely offloading the log workload to the cloud, it ensures that massive log writes do not affect Dify's core business response times.

• Tiered storage for cost-efficient retention: Powered by high compression ratios, hot data is analyzed in real time while cold data automatically sinks to archive storage. This meets long-term audit and retrospective needs at costs far below database SSD pricing.

• Out-of-the-box business insights: The built-in OLAP analysis engine supports real-time SQL queries, visual dashboards, and alert monitoring, helping developers transform dormant log data into actionable business insights.

3.Effortless Deployment: Define a Production-Grade Foundation in 1 Minute

The SAE App Center includes a deeply optimized Dify production template. With simple parameter configuration, you can deploy a highly available runtime environment in a single click—no more tedious YAML writing and environment debugging.

Step 1: Select a deployment template
Log on to the SAE console, go to the App Center, and select "Dify Community Edition - Serverless Deployment."

Step 2: Configure parameters and select specifications
Three templates are currently available: Dify High-Performance Edition, Dify High-Availability Edition, and Dify Test Edition.

For high-concurrency production scenarios, we recommend the Dify High-Performance Edition, which includes deep optimizations specifically for the api image and plugin-daemon image, resulting in higher runtime efficiency. Configuration is streamlined—simply fill in the passwords for each cloud service and select the VPC and vSwitch. The system then provides a total estimated price for the selected cloud resources, ensuring cost transparency.

Step 3: Submit and access the service
Click Submit, and the system automatically completes the deployment of core services and cloud resource associations.

After deployment, enter the service address provided by the console—${EXTERNAL-IP}:${PORT}—directly in your browser to begin your Dify application orchestration journey.

Note: After Dify starts and is running, the SLS plugin automatically creates the relevant logstores and index configurations. No manual intervention is required—simply navigate to the corresponding project in the SLS console to query and analyze workflow logs in real time.

4.50x Performance Leap: SAE's Journey from 10 QPS to 500 QPS

Dify Community Edition's default configuration supports only 10 QPS, but that's just the starting point. Scaling from "getting started" to 500 QPS production capacity isn't a matter of simply throwing more server resources at the problem—it's a step-by-step "boss fight." Every time you try to increase throughput, you hit a new invisible ceiling—from basic parameter limits to deep architectural bottlenecks. The SAE team used full-stack load testing to map out and conquer the two core checkpoints on this progression, making high-performance deployment a well-charted path.

▍Bottleneck 1: Breaking the 10 QPS Limit—Coordinated Tuning of Component Concurrency and Database Connections
1.Why does the default configuration cap at 10 QPS?
Dify Community Edition's default configuration is designed for quick developer tryout, not large-scale production. The default parameters for its core component dify-api are extremely conservative:

SERVER_WORKER_AMOUNT (worker processes): 1
SERVER_WORKER_CONNECTIONS (max connections per process): 10

These two parameters directly cap the throughput of a single node. But in production, you cannot simply "multiply by ten"—increasing application-layer concurrency immediately triggers a chain reaction in downstream databases.

2.The "connection pool" domino effect
As QPS grows, components like dify-api and dify-plugin-daemon open massive numbers of connections to PostgreSQL. Without end-to-end parameter coordination, the system easily collapses:

• Connection exhaustion: PostgreSQL has a finite total connection limit. Blindly increasing component concurrency drains database connections, causing subsequent requests to fail outright.

• Connection contention between components: SQLAlchemy's connection pool uses a "lazy loading" mechanism, and idle connections are not released until they expire. If misconfigured, non-critical components can hoard large numbers of idle connections while critical components starve for resources during peak traffic.

Solution: A battle-tested "production-grade configuration matrix"
To prevent users from falling into a cumbersome parameter trial-and-error cycle, the SAE team conducted multiple rounds of full-stack load testing in real production environments. They identified the production-grade configuration matrix mapping API concurrency, database connection pool sizes, and component resource specifications across different traffic tiers. Users don't need to worry about parameter calculations—simply select the specification tier matching your estimated traffic to ensure every unit of compute translates into actual business throughput.

Note: The load testing scenarios do not include the code execution (Code Sandbox) path. Please evaluate and adjust the specifications and quantity of the dify-sandbox component based on the complexity of code execution in your actual business.

Configuration reference: https://help.aliyun.com/zh/sae/dify-performance-optimization

▍Bottleneck 2: From 200 QPS to 500 QPS — Redis Single-Point Bottleneck and Read-Write Separation
1.Integrating ARMS tracing to identify performance bottlenecks
After optimizing database connections and stabilizing QPS at 200, the system throughput could not be pushed further. To locate the bottleneck, the SAE team used ARMS application monitoring deeply integrated into the SAE platform to perform trace analysis on the dify-plugin-daemon component—on the SAE console's application details page, click "Application Monitoring" to view the slowest call chains.

Trace data revealed that downstream Redis SET/DEL operations were failing frequently. The SAE team attempted to vertically scale the Redis instance to the maximum specification (64 cores), but the effect was minimal: the QPS ceiling did not improve, indicating that the bottleneck was not in capacity, but in the single-point architecture itself.

2.dify-plugin-daemon's high-frequency Redis reads and writes causing single-point congestion
Code analysis revealed that this was a conflict between Dify's business logic and Redis's single-point architecture:

• dify-plugin-daemon generates a new Session ID for every data pipeline request and writes it to Redis. This session data is then read and verified on every subsequent request. This creates a pattern of high-frequency, small-payload read-write operations concentrated on a single key space.

• In the default architecture, all session read-write requests are concentrated on a single Redis node. Under 200+ QPS high-concurrency pressure, the single node becomes a throughput bottleneck—not due to insufficient memory, but because the network I/O and single-threaded command processing of a standalone Redis instance cannot handle the concurrent connection load.

Solution: Cluster transformation for read-write separation
To break through the single-machine architecture limitation, the SAE team went deep into the component internals and performed cluster adaptation for dify-plugin-daemon:

• Cluster protocol support: To address the native component's lack of Redis Cluster support, the SAE team modified the underlying code to fully support the Redis Cluster protocol, including hash-slot-aware key routing and cluster node auto-discovery.

• Read-write separation: Through architectural upgrade, the massive requests originally concentrated on a single machine were distributed across the cluster. The cluster's multi-node characteristics enable load distribution and read-write separation.

This transformation completely eliminated the single-point bottleneck, successfully supporting a smooth throughput increase from 200 QPS to 500 QPS.

5.Unlocking Full-Stack Data Value: SLS Transforms "Black Box Operations" into "Deep Insights"

Once Dify is live, how do you assess model costs and performance? How do you analyze business trends? Powered by SLS's robust OLAP analysis engine, you can perform deep mining of Dify's workflow logs without pre-defining table schemas, building comprehensive dashboards covering both technical and business metrics.

▍Infrastructure Perspective: LLM Cost and Performance Transparency
For Dify's LLM nodes, the process_data field in workflow_node_execution logs contains detailed model invocation data, enabling sub-second multi-dimensional analysis of model usage.

Scenario A: Token Consumption and Cost Auditing
Real-time monitoring of token consumption trends is key to controlling AI costs. You can track input tokens (prompt_tokens), output tokens (completion_tokens), and total tokens over time, precisely identifying anomalous traffic.

Sample SQL:

node_type:llm | select
  sum(
json_extract_long(process_data, '$.usage.prompt_tokens')
) prompt_tokens,
sum("process_data.usage.completion_tokens") completion_tokens,
sum("process_data.usage.total_tokens") total_tokens,
date_trunc('minute', __time__) t
group by
  t
order by
  t
limit
  all

Note: Fields within JSON can be extracted directly in SQL using json_extract_xxx functions, such as json_extract_long(process_data, '$.usage.prompt_tokens'). For frequently used fields, we recommend creating additional JSON sub-indexes so you can reference the column name directly in SQL, such as "process_data.usage.completion_tokens", for more efficient statistical analysis.

Scenario B: Time-to-First-Token (TTFT) Percentile Analysis
LLM response speed directly impacts user experience. By analyzing the P50, P90, and P99 percentiles of time_to_first_token, you can objectively evaluate model response stability under different loads, providing data support for model routing or inference acceleration decisions.

Sample SQL:

node_type:llm| select
  date_format(__time__-__time__ % 60, '%m-%d %H:%i') as time,
   approx_percentile("process_data.usage.time_to_first_token", 0.25) as Latency_p25,
  approx_percentile("process_data.usage.time_to_first_token", 0.50) as Latency_p50,
  approx_percentile("process_data.usage.time_to_first_token", 0.75) as Latency_p75,
  approx_percentile("process_data.usage.time_to_first_token", 0.99) as Latency_p99,
  min("process_data.usage.time_to_first_token") as Latency_min
group by
  time
order by
  time
limit
  all

▍Business Operations Perspective: User Intent and Conversion Insights
Beyond low-level model metrics, SLS can help you understand business logic at a deeper level. Using an "e-commerce AI customer service assistant" Dify application as an example, you can use SQL to dissect workflow node inputs and outputs to support operational decisions.

Scenario A: User Intent Distribution Trends
By analyzing the output of the "intent recognition" node in the workflow, you can quantify the most frequent user inquiry categories (e.g., returns/exchanges, shipping inquiries, coupons), and observe how these demands change over time—guiding knowledge base optimization efforts.

Sample SQL:

* and title: User intent recognition | select
  json_extract(outputs, '$.text') as "user intent",
  count(1) as pv
group by
  "user intent"

Scenario B: Anomaly Diagnosis and Funnel Analysis
By tracking error rates for specific nodes or analyzing the downstream flow of specific intents, you can build funnel charts to quickly identify nodes causing user drop-off. For example, analyzing the "empty result" rate of the "product search" node can indicate whether the product knowledge base needs expansion.

You can use funnel charts to analyze and observe which intermediate workflow nodes have a high failure rate.

Sample SQL:

status:succeeded | select
title,
count(distinct workflow_run_id) cnt
group by
  title
order by
  cnt desc

6.Conclusion: Let AI Applications Focus on What Matters

From "functional" to "production-ready," Dify's journey to production-grade deployment requires solid infrastructure support. The SAE × SLS joint solution is not just a simple combination of two cloud products—it delivers a full-stack Serverless architectural transformation for Dify through deep integration of "compute management" and "storage decoupling":

• Full-stack elasticity: The compute layer scales in seconds with traffic, the storage layer handles burst throughput effortlessly—a perfect match for the tidal characteristics of AI workloads.

• Structural cost reduction: Eliminates idle resource waste completely. Replaces expensive database expansion with low-cost tiered storage, maximizing ROI.

• Extreme stability: A fully managed, maintenance-free foundation combined with physical I/O isolation completely eliminates single-point-of-failure risks and database performance black holes.

• Deep insights: Breaks the "black box" between infrastructure monitoring and business data analytics, using token cost and user intent data to fuel business evolution.

With this solution jointly released by SAE and SLS, Dify developers no longer need to worry about underlying resources and architecture. A single, simple configuration gives you a highly available, high-performance, cost-efficient AI application environment—allowing you to truly focus on business innovation and prompt tuning.

Get started now: Log on to the Alibaba Cloud SAE console[1], go to the App Center, search for the Dify template, select the Dify High-Performance Edition, and start your one-click managed deployment journey.

▍https://alidocs.dingtalk.com/i/nodes/gvNG4YZ7Jnxop15OC9ZogOKgW2LD0oRE?utm_scene=team_space
Alibaba Cloud Serverless App Engine (SAE) is a one-stop containerized application hosting platform built for the AI era, with the core philosophy of "supporting traditional applications and accelerating AI innovation." It simplifies operations, ensures stability, reduces costs by up to 75% through idle resource optimization, and enhances operational efficiency through an AI-powered assistant.

For AI workloads, SAE integrates mainstream frameworks like Dify, supporting one-click deployment and elastic scaling. In the Dify scenario, it achieves a 50x performance improvement and over 30% cost optimization.

Product Strengths
With eight years of technical refinement, SAE was named a Global Leader in the 2025 Gartner Magic Quadrant for Cloud-Native Platforms—ranked #1 in Asia—helping enterprises achieve zero node management and focus purely on business innovation. SAE serves as both a "hosting platform" for traditional application modernization and an "acceleration engine" for large-scale AI application deployment.

1.Traditional Application Operations: The "Simplify, Stabilize, Save" Approach

• Simplify: Zero operational overhead — focus on business innovation

• Stabilize: Enterprise-grade high availability with built-in comprehensive protection

• Save: Extreme elasticity that brings costs down to measurable levels

2.Accelerating AI Innovation: From Rapid Exploration to Efficient Deployment

• Rapid exploration: Built-in templates for Dify, RAGFlow, OpenManus, and other popular AI applications — ready out of the box, with POC up and running in minutes;

• Reliable deployment: Production-grade AI runtime with performance optimizations (e.g., 50x performance boost for Dify), seamless upgrades, and multi-version management for enterprise-grade reliable delivery;

• Easy integration: Deep integration with gateways, ARMS, metering, and auditing capabilities to accelerate the intelligent transformation of traditional applications.

Who is it for?
✅ Startups: No dedicated ops team, need to launch quickly
✅ SMBs: Looking to cut costs and embrace cloud-native
✅ Large enterprises: Requiring enterprise-grade stability and compliance
✅ Global businesses: Needing China + worldwide deployment
✅ AI innovation teams: Looking to rapidly deploy AI applications

Learn more
Product page: https://www.alibabacloud.com/product/severless-application-engine

Related Links:
[1] Alibaba Cloud SAE console
https://saenext.console.aliyun.com/overview?accounttraceid=db100a4af9c7405e88dcfb89e81c5281ibby

Alibaba & Ant Group LoongSuite GenAI Observability Semantics Specification: From Unified Data Language to Large-scale Implementation

ObservabilityGuy — Fri, 05 Jun 2026 02:32:06 +0000

This article introduces LoongSuite GenAI SemConv, a unified observability specification extending OpenTelemetry with enhanced semantics for AI agents, skills, and token-level inference.

Background
With the rapid development of AI, especially generative AI (GenAI), a large number of new core concepts emerge in AI Agent systems, such as models, prompts, tokens, tool calling, agents, memory, and sessions. These concepts have become the observation objects that algorithm engineers, O&M engineers, and observability platform users care about the most. They need to be collected, displayed, and consumed in a standardized manner, in the same way as HTTP requests and database invocations in traditional systems. This allows system maintainers to clearly understand the invocation procedure and efficiently troubleshoot issues.

Based on this, OpenTelemetry (OTel) began to promote the construction of GenAI semantics specifications as early as the beginning of 2024. It hopes to establish a unified data collection specification, Semantic Conventions (SemConv), for these new objects. This aims to solve problems in related realms, such as the lack of observable data collection standards and inconsistent calibers.

SemConv Positioning and Value
Observable data collection tools, such as auto instrumentation or SDKs for various languages such as Java, Go, and Python, may be considered the core value of the OTel community by many people who are new to OTel.

However, after you deeply understand the community, you will find that compared to SemConv, these collection capabilities play more of a role of "tactics." They serve the true "philosophy" of OTel, which is to establish a unified observable data language through SemConv. **OTel SemConv is a set of observable data collection standards jointly designed and continuously evolved by dozens of top observability vendors and hundreds of realm experts around the world. **Over the past few years, after communicating with core maintainers and co-founders of the community at multiple KubeCon conferences, we learned that in their eyes, SemConv is the soul of OTel. Promoting its gradual improvement and moving towards Stable is the most important work of the community.

A unified observability SemConv can achieve the following effects:

Unified data language to resolve inconsistent calibers

Take GenAI semantics as an example. Its common scenarios naturally span across models, frameworks, and platforms. When there is no unified semantics specification, different teams often record information such as "model name," "input length," "token count," and "response content" separately. Field naming and statistical calibers cannot be aligned. The core value of OTel GenAI SemConv lies in providing standardized fields for these common concepts, such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens.

Once these key fields are standardized, different businesses, different infrastructures, and different observation backends can share the same analysis method. This truly achieves "explaining the same category of problems with the same set of data." This is also the most basic and important value of semantics specifications.

Support the unified administration of performance, cost, quality, and security

The target of observability construction is not only troubleshooting but also the continuous governance of performance, efficiency, security, and output behavior. For example, in the GenAI SemConv scenario, only after the unified SemConv standardizes key information such as model parameters, response metadata, and token usage, can the team more easily track performance, cost, and security-related issues.

For large enterprises, this means that the following practical demands can be resolved based on a unified standard:

● Technical troubleshooting: You can view the complete trace across agents through the Trace ID, and locate various problems at the minute level, such as abnormal invocation latency of a certain business model.

● Business analysis: Effect data is comparable across businesses and can be directly used for product decisions. This greatly improves the efficiency of roles such as BI, product, and data science when they perform cross-business analysis.

● Evaluation:The real user trajectories are continuously accumulated to automatically build evaluation datasets, especially for the end-to-end evaluation of multi-agent collaboration scenarios.

● Compliance: A unified audit trace meets the rigid requirements of security ICP filing.

If there is no unified semantics, these problems can only be analyzed locally within a single system, and group-level administration capabilities cannot be formed.

Reduce access costs and promote infrastructure reuse

One of the design Targets of OpenTelemetry (OTel) is to allow telemetry Data to reuse the same Collection and administration link through components such as standard protocols, semantics specifications, SDK, automatic instrumentation, and Collector. In Generative Artificial Intelligence (GenAI) scenarios, the value of unified semantics specifications is particularly evident here: once fields, Span structures, event models, and context propagation methods are clearly defined, non-intrusive instrumentation, SDK encapsulation, platform Analysis, Dashboards, and alert policies can all be reused.

This means that businesses do not need to start thinking about "what fields to collect" every time. Instead, businesses can directly integrate capabilities based on existing specifications to reduce overall construction costs.

Introduction to LoongSuite GenAI SemConv
Background
As the current de facto standard in the observability industry, although OTel started the discussion and design of GenAI semantics specifications as early as early 2024, the overall Update pace is relatively slow because the early human resource investment was limited and the community standard emphasized broad applicability and long-term stability. In contrast, Alibaba Group has a large number of Large Language Model (LLM) application implementation scenarios internally and has encountered a large number of case problems in real scenarios. Therefore, Alibaba Group has the requirement to abstract related problems into a unified standard.

2025: The observability teams of Alibaba Cloud, Alibaba Holding, and Ant Group jointly Started to perform semantics modeling on the Content that OTel has not yet covered in internal scenarios based on OTel GenAI semantics, and promoted the implementation and application of internal observability Collection tools based on this.

2026: After the communication with the main Maintainers of GenAI in the OTel community is completed, because the related Content is extensive and the iterations are fast, under the suggestion of the community Maintainers, the results are first open sourced under the Alibaba LoongSuite observability Brand as a vendor enhancement standard for OTel GenAI SemConv, and will be gradually contributed to the OTel upstream at an appropriate time later.

Content and Implementation
Currently, this specification has been implemented in multiple core scenarios within the group, forming full-stack observability capabilities from the Agent layer to the infrastructure layer. For example, the following is some enhanced Content of the related Loongsuite GenAI SemConv compared to OTel GenAI SemConv:

New Entry/Step Span
Problem Background
In the practice procedure of AI Agent, we found that when the Agent executes long-term Jobs, the execution logic of the Agent becomes increasingly complex. It will contain multiple rounds of tool calling and model invocations, causing a single Trace to contain hundreds or thousands of Spans. These Spans appear very lengthy when the Spans are displayed in the same link, making it difficult to clearly observe the invocation chain trajectory. To solve this problem, we introduced the following two key designs:

Entry Span: A Span is created at the entry point of the Agent invocation, and is used to revert the original inputs and Outputs of the model and the User to form a dialogue History. This ensures that when Downstream Tasks are executed, the processed Data is not interfered with by the System Prompt or the frame Prompt, and the most original Customer Requests can be retrieved.
Step Span: Step represents the hierarchical expression of the Agent during each ReAct procedure. During each ReAct procedure, the Agent needs to complete the loop of "reflection → tool calling → model invocation". When problems are troubleshooted, a Top-down approach is usually adopted to locate the execution status of the Agent. The specific flow is: you can first observe the overall situation. For example, when the Agent executes a procedure containing 10 rounds of ReAct, you can first locate which round has a problem, and then deeply analyze which specific step in that round is wrong. Through this round-by-round Span structure, the multiple rounds of actions, reflections, and corresponding execution Results of the Agent can be clearly displayed, making the trajectory of each loop clear at a glance. Semantics Modeling The definitions of the newly added Entry and Step Span Types are as follows:

Implementation Effect
Currently, this semantics specification has been implemented in multiple Agent scenarios, including OpenClaw, QwenPaw, and Hermes Agent. The following is the effect after the semantics specification is implemented and integrated in the OpenClaw scenario:

New Skill Semantics
Problem Background
In Agent scenarios such as E-commerce shopping assistants, after the intent of each instruction of the User is understood by the AI Agent, the instruction is routed to the corresponding Skill to complete the execution. A Skill is the smallest reusable unit of business features, which internally orchestrates a group of LLM invocations and tool callings to complete specific Jobs, such as searching for Products, adding to the shopping cart, and requesting Refunds.

Existing OpenTelemetry (OTel) Generative Artificial Intelligence (GenAI) semantics conventions have covered Span Types such as Agent, Large Language Model (LLM), and Tool, but lack abstraction for the business feature aggregation layer of Skill. A Skill is neither a single Tool invoke nor a complete Agent, but an orchestration unit between the two. The lack of observability in the Skill dimension means that when Performance Fluctuation occurs, you can only see a heap of execute_tool and inference Spans. The lack of Skill observability leads to three core pain points:

● Inability to Attribute to the feature domain: When Performance Fluctuation occurs, you can only see a heap of execute_tooland inference Spans, and you cannot quickly determine which feature domain has a fault.

● Inability to calculate Skill health Metrics: Metrics such as P99 latency, Succeeded rate, and invoke frequency at the Skill granularity are missing.

● Trace obfuscation when multiple Skills are concurrent: The ownership of LLM or Tool Spans of different Skills cannot be distinguished in the Trace tree.

Semantics Modeling
To implement the Collection of Skill information, we added a group of gen_ai.skill.* properties in LoongSuite GenAI SemConv to identify the identity and Version information of a Skill:

At the current stage, these properties are attached to the existing execute_tool Span, which can be quickly implemented without the need to Import new Span Types.

At the same time, based on the group business, we implemented the solution of an independent invoke_skill Span, and committed a proposal to the OTel community to cover the complete lifecycle of a Skill from load to execution completion, supporting end-to-end Analysis by feature domain.

Implementation Effect
Through the Skill semantics properties, the observability platform can perform aggregation and analysis by feature domain: quickly locate "which Skill has the highest Error Rate", compare "whether the latency deteriorates after the new Version of Skill is published", and measure "the proportion of LLM invoke Duration to the total Skill Duration".

In addition, the same set of gen_ai.skill semantics conventions can also cover various frames, such as OpenClaw, Langchain, and Spring AI. The following is the instrumentation effect in the OpenClaw scenario:

New Token-level Inference Observation
Problem Background
In the first half of 2025, the Ant observability team built a full-link observability system around the Ant inference Alibaba Cloud service, covering the core widgets of the inference Alibaba Cloud service, and Built multi-language and multi-protocol distributed tracing Trace capabilities from the client to the DPI engine end. Among them, Ant collaborated with the Alibaba Cloud team to contribute basic DPI engine observability Traces to the community's three major inference DPI engines, vLLM, SGLang, and TensorRT-LLM, forming a de facto observability Trace standard at the Ant and Alibaba Group level. The entire observability system is an important stability foundation for the Ant inference Alibaba Cloud service.

However, with the vigorous development of the business, the pressure on the inference Alibaba Cloud service has intensified, and a large number of difficult problems related to the inference DPI engine have exhibited emergent behavior. The DPI engine Trace at the Request level can no longer effectively locate problems at a deeper depth. We deeply studied the underlying principles of the inference DPI engine, combined with actual production cases, and summarized the following problems:

Performance abnormality: The slow response of a single Request is often because certain Tokens are slow to Generate, and the slow Generation of Tokens is highly likely caused by the concurrent interference of other Requests.
Precision abnormality: Precision problems such as repetition, irrelevant answers, and garbled characters often start to be abnormal from a certain Token, and subsequent Tokens continue to make faults under this Impact. Therefore, the essence of the problem lies in the Token Generate procedure. From this, it is naturally inferred that the localization and demarcation of inference request problems must be supported by Token-level observable data.

Therefore, in the second half of 2025, the Ant observability team took the lead to Build the industry's first observability product that covers multiple inference DPI engines and Supports Token-level depth Trace, sinking observability from the macro Request down to the micro Token dimension. It not only follows whether a single Request Succeeded, but also deeply observes:

The Generate Duration and sub-stage procedure of each Token.
The mutual impact of multiple concurrent requests within the same infer instance when slow tokens are generated.
The Top-K candidate distribution behind each generated token helps pinpoint accuracy issues.
The core value of this work lies in that it decomposes many originally "black box" procedures inside the infer engine down to the token granularity for the first time, creating a transparent, explainable, and attributable white-box System.

Semantics Modeling
Brief introduction to how the infer engine works: The infer engine is essentially a System that executes an infinite loop of iterations. In each iteration, a batch of requests is selected based on resource conditions and the schedule policy to form a Batch, which serves as the execution Target of the current iteration for Batch Processing. After the iteration is completed, each selected Request usually generates a token. Then, it enters the next iteration, going through the same procedure of selecting requests to form a batch and then executing the batch. This loop continues in this way.

Token Performance Data Collection: At the token granularity of each Request, we collect the UNIX timestamps for entering and exiting the iteration. With these two UNIX timestamps, the scheduled time, actual running time, and total Duration of User Perception for each token can be deduced. In addition, the Request corresponding to each token is in a Batch. The total number of requests in the Batch (especially the total number of tokens) characterizes the payload of the batch processing, which further determines the Duration of token generation. Therefore, we define the following related properties that characterize the Performance Data at the token granularity:

Token accuracy data Collection: At the token granularity of each Request, we collect the probability distribution of the candidate Top-K tokens corresponding to each token. This distribution can be used to judge the Outputs quality of the model. For a model with poor quality, its Top candidate tokens are less likely to meet expectations. If the model Outputs meet expectations but the selected token is not in the Top-K, the issue points to the sampling parameters specified by the User, such as temperature. Therefore, we define the following properties related to the candidate token probabilities:

Implementation Results
Based on the GenAI specifications designed above, we collect and output standard Data on three major engines. Relying on this standard Data, a consistent feature interface is presented to the User. Ultimately, we have built an engine microscope product to provide the depth observation capabilities of the infer engine at the engine concurrency and token levels.

● Engine token Analysis: You can switch to a high-power microscope, focus on a single Request, and observe the Duration of each step in its internal token generation, as well as the probability distribution of the top candidate tokens, to accurately pinpoint the root cause of latency and abnormal accuracy issues.

● Engine concurrency profiling: You can use a wide-angle lens to clearly render the concurrency, competition, and collaboration relationships of all requests in the engine, and quickly detect resource contention and bottlenecks.

The token-granularity Performance Data from the engine token Analysis can reveal which tokens are slow. The engine concurrency Analysis further answers why these tokens are slow. In addition, the probability distribution Data at the token granularity can reveal whether the Large Language Model (LLM) Outputs of abnormal tokens are Normal or the sampling parameters setting is unreasonable. After the product was published, it went through the year-end sales promotions and successfully helped the engine, SRE, and business teams pinpoint multiple stability issues on the stability battlefield, accelerating the issue demarcation efficiency by 10 times. It truly achieved both speed and accuracy, and further provided optimization suggestions. Some typical cases are selected below to illustrate the product features and business value.

Case 1: Slow token localization and quick detection of cross-Request resource interference
You may often encounter a specific Request breaching the threshold in the production environment, such as the TPOT (Time Per Output Token) indicating the token Outputs speed breaching the threshold. For the User, this will be perceived as stuttering in the Outputs. The following case describes how the token Analysis and engine concurrency profiling help demarcate and pinpoint the issue in this scenario.

After we obtain the TraceId of the abnormal Request, we open the token Analysis Page as shown in the following graph. We can see that the 125th token took 6.8 s, which far exceeds expectations, ultimately causing the TPOT to reach up to 54.77 ms.

You can click Engine Concurrency Analysis in the upper-right corner of Token Analysis, and you are redirected to the concurrent profiling page of the corresponding engine instance. You can search for and locate the abnormal request based on Time or TraceId. This request is Request 2 in the following graph. We can see that Request 1 spent more than 6 s to generate the first Token (prefill phase) - the bright green block, which interrupted Request 2 to decode and generate the 125th Token (the yellow block). This is consistent with Token Analysis. In summary, the root cause is that the prefill of requests from other tenants interrupted the decode procedure of the current request. A possible solution is to perform PD separation to prevent the prefill and decode of different requests from affecting each other.

Case 2: Token-level observation to accurately locate the root cause of irrelevant answers
The following case is a typical "irrelevant answer" case. For example, the user asks a medical question, but the Large Language Model (LLM) replies with a LeetCode solution.

You can open the Token Analysis page of the abnormal Trace as shown in the following graph, and we can see at a glance that the first Token is "begin_of_sentence". This Token is a special Token, abbreviated as BOS. It is used to separate two completely unassociated corpora. In other words, once BOS appears, the subsequent answer is completely unassociated with the previous prompt, and naturally the answer is irrelevant. Therefore, it is obvious that BOS should not appear in the answer under any circumstances. Then the problem is delimited to why this BOS appears. For this case, "begin_of_sentence" will not be displayed in the reply of the user, the engine log, or the gateway log. Instead, it will only be displayed as an empty string. Therefore, without Token Analysis, the localization procedure will become complicated. Later, we further investigated and discovered that the output of BOS is a bad case of the LLM. The solution is to adjust the model or wait for subsequent model version optimization and Update.

Use GenAI Utils to quickly implement LoongSuite GenAI SemConv
Background
In the previous text, we introduced the semantics modeling of LoongSuite GenAI SemConv in multiple dimensions such as Agent, Skill, and Token Level Inference in detail. However, for developers of various Instrumentation libraries that implement LoongSuite GenAI SemConv, they face a common engineering challenge:

Each GenAI framework Instrumentation library needs to implement a complete set of telemetry Collection logic—creating Spans, mounting semantics properties, recording Metrics, sending Events, and managing Context propagation—and this logic is highly repetitive among different framework Instrumentations. More importantly, when the semantics specification is iteratively upgraded (such as adding fields or adjusting the Span structure), if each Instrumentation library maintains its own implementation, the upgrade cost will increase exponentially.

Take an Agent framework Instrumentation as an example. If a common tool layer is not used, the developer needs to manually complete the following operations: create the invoke_agent Span and set SpanKind, mount dozens of properties such as gen_ai.agent.name, gen_ai.agent.id, and gen_ai.usage.input_tokens one by one, decide whether to collect the message Content based on the configuration, handle abnormal situations and set the Error Status, and record Duration and Token Usage Metrics. This boilerplate Code is similar in each Instrumentation library.

To solve this problem, we implemented GenAI Utils in the probe. As the engineering capability layer of LoongSuite GenAI SemConv, it encapsulates the complexity of the semantics specification into concise APIs, so that Instrumentation library developers only need to focus on "what Data to fetch from the framework", without worrying about "how to Output telemetry Data according to the specification". The following are some GenAI Utils implementations that we Support:

The corresponding implementation for LoongSuite Python is LoongSuite-utils-genai.
The corresponding implementation for LoongSuite JS is LoongSuite-utils-genai. Architecture Design The overall architecture of GenAI Utils follows the design principle of "layered decoupling and unified convergence":

*Core design concepts:
*
The Instrumentation layer only performs Data extraction: Each framework Instrumentation library intercepts framework invocations through Hook or Monkey-Patch, and populates the Data into the corresponding Invocation Data object, without directly operating the OTel API.

GenAI Utils unifies the convergence of telemetry Outputs: All Span Creation, property mounting, Metrics recording, Event sending, and Context Management are completed internally by ExtendedTelemetryHandler.

Only one modification is required for a specification upgrade: When new fields are added or the structure is adjusted in LoongSuite GenAI SemConv, you only need to modify the Span Utils and Metrics modules in GenAI Utils, and all downstream instrumentation libraries automatically take effect.

API Usage
GenAI Utils provides the corresponding Invocation data class and Context Manager method for each GenAI operation covered by LoongSuite GenAI SemConv. This forms a unified "populate data + hand over to Handler" programming model. Next, you can take the GenAI Utils tool library in Python as an example to see how to use it:

Step 1: Obtain a Handler singleton

from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler  

handler = get_extended_telemetry_handler(  
    tracer_provider=tracer_provider,  
    logger_provider=logger_provider,  
)

ExtendedTelemetryHandler inherits from the upstream TelemetryHandler of OpenTelemetry (OTel) (which is responsible for basic Large Language Model (LLM) operations), and based on this, it extends the new operation types added by LoongSuite, such as Agent, Tool, Embedding, Retrieve, Rerank, and Memory. It also integrates multimodal asynchronous processing capabilities. This inheritance design ensures that no conflicts occur during synchronization with the upstream community code.

Step 2: Select the corresponding Invocation data class, and populate the business data
GenAI Utils defines the corresponding Invocation data class for each operation. Instrumentation library developers only need to populate it with the data fetched from the framework:

Step 3: Use Context Manager to complete telemetry outputs
You can take the typical Agent framework instrumentation as an example to see how to use GenAI Utils to quickly implement complete observability collection:

from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai.extended_types import (
    InvokeAgentInvocation, ExecuteToolInvocation
)
from opentelemetry.util.genai.types import InputMessage, OutputMessage, Text
handler = get_extended_telemetry_handler()
# ========== Agent invocation ==========  
with handler.invoke_agent() as invocation:
    invocation.provider = "dashscope"
    invocation.request_model = "qwen-max"
    invocation.agent_name = "ShoppingAssistant"
    invocation.agent_id = "agent-001"
    invocation.input_messages = [
        InputMessage(role="user", parts=[Text(content="Recommend a laptop for me")])
    ]
    # ... Actually invoke the Agent framework ...  
    invocation.output_messages = [
        OutputMessage(
            role="assistant",
            parts=[Text(content="I will search for you. Please wait a moment...")],
            finish_reason="tool_calls"
        )
    ]
    invocation.input_tokens = 42
    invocation.output_tokens = 18
# ========== Tool execution ========== 
with handler.execute_tool() as invocation:
    invocation.tool_name = "search_products"
    invocation.tool_call_arguments = {"query": "laptop", "category": "electronics"}
    # ... Actually execute the tool ...  
    invocation.tool_call_result = {"products": [{"name": "MacBook Pro", "price": 12999}]}

In the preceding Code, the Developer does not directly perform an operation on any OpenTelemetry (OTel) API. Manual Creation of a Span, Settings of SpanKind, mount of the gen_ai.agent.name property, or record of Duration Metrics is not required. These are all automatically completed by ExtendedTelemetryHandler during the enter and exit procedures of Context Manager. If an exception is thrown during the invocation procedure, Handler automatically catches it and sets the error.type property and fault Status on Span. For the detailed usage procedure, you can see the References.

Currently supported instrumentation
Based on GenAI Utils, LoongSuite Python Agent has implemented instrumentation for the following GenAI frames and model services, which cover mainstream GenAI ecosystems domestically and internationally:

The core telemetry logic of these instrumentation libraries all reuses GenAI Utils for implementation. When new semantics are added to LoongSuite GenAI SemConv or specifications are adjusted, you can simply upgrade the opentelemetry-util-genai package, and all downstream instrumentation libraries can take effect uniformly.

Conclusion: From unified fields to unified infrastructure
The observability construction in the GenAI era has evolved from "adding log instrumentation for model invocations" to "establishing unified semantics for the full trace of Prompt, infer, retrieve, tools, and Agent". OTel has provided a standardized direction for this, and promotes the formation of GenAI observability capabilities through semantic specifications and instrumentation libraries.

The significance of Alibaba and Ant Group co-building the GenAI observability semantic specifications lies in further engineering, platformizing, and scaling this standardized direction. On the one hand, unified semantics are used to reduce business access costs. On the other hand, unified Data is used to drive the reuse of the observability platform, Analysis Service, and administration capabilities. The ultimate Target is not to "produce a specification document", but to enable all vendors and Users that use this set of specifications to truly achieve visibility, analyzability, administrability, and evolvability for rapidly growing GenAI applications.

Community
The publish of LoongSuite GenAI SemConv this time is just a beginning. In the future, we will continue to make efforts in the following aspects:

More agile: Quickly respond to domestic AI ecosystem demands, and continuously extend the plugin matrix.
More efficient: Provide more comprehensive multimodal processing, more Span/Metric Types, and updated semantic specifications through LoongSuite GenAI Utils.
End-to-end: Unified tracking of AI invocations and microservice invocations makes the full-trace observability of multiple Agents possible.
Collaboration with upstream: Discuss specification and implementation construction by holding regular meetings with upstream Maintainers, synchronize with upstream regularly, and contribute downstream practices back to the OpenTelemetry community.

If you are building an AI application and care about observability, you are welcome to try, provide feedback, and contribute. For LoongSuite GenAI SemConv and corresponding probe implementations, you can join the following DingTalk group for communication:

Related Links
[1] Loongsuite GenAI SemConv
https://github.com/alibaba/loongsuite-semantic-conventions-genai

[2] OTel GenAI SemConv
https://github.com/open-telemetry/semantic-conventions-genai

[3] LoongSuite-utils-genai
https://pypi.org/project/loongsuite-util-genai/

[4] LoongSuite-utils-genai
https://www.npmjs.com/package/@loongsuite/opentelemetry-util-genai

[5] Document
https://github.com/alibaba/loongsuite-python-agent/blob/main/util/opentelemetry-util-genai/README-loongsuite.rst

Add Enterprise Memory to OpenClaw, and Your Agent Finally Doesn’t Have to Ask Again

ObservabilityGuy — Tue, 26 May 2026 03:12:47 +0000

This article introduces AgentLoop MemoryStore, a fully managed, enterprise-grade memory solution designed to give AI Agents long-term, reliable memory for production environments.

Presumably every AI developer has experienced such a scenario: your intelligent Agent is finally online. Demo ran smoothly, the internal review passed smoothly, and the boss nodded his approval. After two months of hard work, the team finally pushed it into the production environment. In the first week, user feedback was acceptable. But by the second week, you receive a user message like this: "The last time I explicitly said I wanted to return it, why is your robot still asking me if I want to exchange it?" You go through the conversation log, and what the user said is true-in the last round of dialogue, the intention to return was very clear. However, Agent has no impression. Every conversation is like meeting for the first time. You suddenly realize: Agent online is only the starting point, the real key is that it must "remember". And the pain behind this is far deeper than imagined.

The First Layer of Pain: Users Would Not Like to Say It Again
This is the most direct experience of harm, but also the most silent reason for the loss of users. Users don't care about your technical architecture or which big model you use. All they know is that what they said yesterday will be repeated today. In the customer service scene, the user has already explained the order problem, the receiving address and the return request, but he has to repeat it from the beginning when he enters the line again. The experience collapses instantly and the customer complaint rate rises sharply. In the sales scene, the customer made it clear that "the budget has not been approved" before, and Agent still repeatedly pushes the quotation scheme, which will only make the customer feel that the assistant is not listening at all. In the learning scene, the next day, the system still repeatedly questions as weak items, which will only make people feel that the product is perfunctory.

Users will not complain about "your memory system is not working", they will only lose it silently, or be prepared before the next use-it can't remember what I said anyway.

The Second Layer of Pain: On the Road to Self-Study, You Have to Step on the Pits Yourself
After noticing the problem, many teams chose to develop their own memory system, only to find that the road was far more difficult than expected. Originally three weeks to complete the memory function, eventually evolved into three months of the underlying infrastructure reconstruction.

● Easy to store but difficult to recall: It is not complicated to store the dialogue history in the vector database. The difficulty is to accurately recall the "most relevant information" in the next round, rather than bringing back a bunch of invalid noise. If the retrieval quality is not up to standard, the memory will be useless, recalling five pieces of information and four pieces of interference, but will bias the model judgment.

● Only increase but not decrease, memory confusion: users prefer concise answers last month, and this month they want to explain in more detail. If the system only adds but not updates, the two contradictory information coexist, and the more dirty data they use, the more inconsistent judgments.

● Context stacking and effect reversal: Some people directly put all the history into the Prompt, which seems simple, but leads to double the token cost and slow response. The model filters valid content from redundant information, and the accuracy does not increase but decreases. Long context doesn't equal good memory, and many times it's just more expensive noise.

● Demo is smooth and production is unstable: The memory of a single machine performs well in the testing phase. In the first production phase, problems occur frequently, such as the memory of multi-instance deployment does not communicate with each other, the memory of instance destruction is lost, and the memory extraction of high concurrency slows down the main link...

The Third Layer of Pain: The Function Is Done, but I Dare Not Go Online the Main Link
This is the most hidden and most realistic pain point. The memory function can be realized technically, but after landing, the problem ensues: who will maintain the vector database? How do I troubleshoot and locate exceptions? User historical memory involves privacy. How can data isolation be ensured? Compliance requires that the memory can be traced and deleted. Can the existing scheme be supported? Will the memory assembly line drag down the entire service if the traffic surges tenfold? Before these questions are clearly answered, any prudent technical leader dares to connect the core agent to the primary link. Memory is not unable to do it, but after it is done, no one dares to be really responsible. As a result, a large number of agents in the team are in an awkward position: the functions are already available, the project is not ready, and the business is slow to deliver.

In the past few years, memory ability has almost become the most crowded track in agent infrastructure. Simply storing conversations, enabling vector retrieval, and recording user preferences are no longer scarce capabilities. What is really scarce is an enterprise-level memory system that allows enterprises to quickly access, fit business scenarios, and run stably in the production environment. This is the core problem AgentLoop MemoryStore want to solve. As a fully managed enterprise-level memory management agent, AgentLoop MemoryStore has three advantages: out-of-the-box, flexible customization, and serverless O&M-free. It is equipped with core capabilities such as multi-dimensional memory retrieval, intelligent memory update, asynchronous pipeline architecture, and hierarchical precision retrieval. It no longer asks "memory weight is not important"-the answer you already know. What it needs to solve is: why the enterprise has been slow to put the core agent online, and how this key point is completely broken.

For agents, the value of memory goes far beyond "preserving historical conversations." It determines whether the agent can upgrade from a one-time question and answer tool to a long-term collaboration partner that continuously understands users, reuses context, and deposits business experience. Without memory, each round of Agent dialogue is like a first meeting. With reliable memory, Agent can truly understand "who you are, what happened, and how to continue judgment".

For enterprises, memory is never an additional function, but a watershed of whether Agent can really be used. Does the customer service robot remember the user's last work order? Does the sales assistant remember the customer's decision-making progress and historical objections? Can the learning assistant dynamically adjust the content according to the learning progress? The core of these problems is not how personified the model is, but whether the entire memory system is sufficiently engineered, operational, and scalable.

However, to really solve these pain points, it is far from enough to rely on scattered memory functions. A complete solution designed for the production environment from access, use, operation and maintenance, and compliance is needed. AgentLoop MemoryStore starts from the real pain points of enterprises and uses a set of out-of-the-box, flexible, open, stable and reliable memory system to turn "usable" agents into "daring and easy-to-use" agents.

Out-of-the-Box: No Duplication of Infrastructure Construction, so That Memory Capabilities Directly Into the Existing Business
Many teams are not unable to make Memory Demo, but are stuck in the access cost. A self-built memory system often means that you must simultaneously process vector storage, structured storage, model invocation, asynchronous tasks, monitoring and alerting, permission isolation, and SDK encapsulation. Technically, it is not impossible, but the pace of product launch will be seriously slowed down. The first value of AgentLoop MemoryStore is not how cool the feature is, but how convenient it is:

a. out-of-the-box: you do not need to create a self-built vector database, MSMQ, or background task system. you can activate it and use it in a one-stop manner. it provides the ability to write and store raw data to long-term memory recall. Enterprise agents only need to focus on their own agent development, without the need to focus on the complex memory extraction process.

b. Multiple docking solutions: It provides a complete API and SDK for data writing and memory recall. The client can be seamlessly connected. In addition, AgentLoop MemoryStore allows you to consume trace data collected by observable probes. You only need to load the probes in the program to collect user interaction information in a non-intrusive manner without modifying the original business logic. For teams with existing memory-related code, the product is also compatible with the Mem0 API, enabling zero-cost migration. In addition, it also supports multiple access forms such as MCP Server and OpenClaw plug-ins, which can be easily integrated into various mainstream Agent frameworks, allowing existing systems to quickly have long-term memory capabilities.

c. Cross-device memory sharing: provides SaaS hosting services. Memory sharing is supported across machines, instances, and sessions. Compared with the open-source standalone memory system, AgentLoop Memory provides memory sharing across devices. In an enterprise-level agent, the agent generally runs in a sandbox for permission isolation. If the memory system is a stand-alone version, it will disappear with the destruction of the agent instance. However, based on AgentLoop Memory, the agent instance can be destroyed at any time, but the memory can be forever.

Business Scenario Example: Intelligent Customer Service
A typical customer service Agent, most afraid of is "talked yesterday, today all forget". The user explained the order problem, receiving preference and communication habits yesterday. When entering the line again today, the system started asking questions from scratch and the experience would collapse immediately. After you connect to the AgentLoop MemoryStore, the customer service team does not need to rewrite the entire memory logic. Mem0-compatible interfaces or OpenClaw plug-ins can be used to recall and write memories into existing processes. When users consult again, Agent can first see key information such as "last ticket progress", "users' common addresses" and "preferred communication methods". Naturally, answers are more continuous and manual transfer is more efficient. Compared with many open source memory solutions that are more suitable for local experiments or single-machine deployment, the SaaS-based AgentLoop MemoryStore also has a very practical advantage: memory is not tied to a single machine, but can be continuously shared among different devices, different instances, and different service nodes. If the user communicates with the Agent on the web page in the morning and moves to the mobile terminal in the afternoon, or the request is routed to another machine, the system can still continue the same memory. This cross-machine sharing capability is closer to the way enterprises operate real online services.

The focus of this type of value is not "technically achievable", but "how long the business team can use it". For many enterprises, going online as soon as a week is often more meaningful than one more concept function.

Flexible and Open: Memory Is Not Only Stored, but Also Supports Business Processing and Precise Retrieval
After solving the problem of "fast access", the next key is to make the memory really fit the business, rather than simply piling up historical conversations. Memory is prone to homogenization because many products only solve the "storage" problem, but do not really solve the "how to remember, what to remember, when to take" problem. In an enterprise scenario, memory is never a static file, but a set of dynamic assets that are updated with business changes. The core difference of AgentLoop MemoryStore is that it is open enough to "memory processing" and "memory retrieval": it supports multi-dimensional memory extraction, not only retains the original dialogue content, but also automatically extracts structured memories such as user preferences, factual information, and scene summaries, so that memories are no longer scattered chat records. At the same time, it supports the dynamic update of memory rather than a mere addition, when the user's preference changes, the system will automatically update the old memory, from the source to reduce the accumulation of dirty data. It also supports flexible custom rules, whether it is the global extraction policy of the entire memory base or the special processing rules of a single message, which can be flexibly defined according to business requirements, so that the memory fully fits your business logic. In addition, it also provides a hierarchical retrieval strategy from L1 to L3, covering basic hybrid retrieval, refined Rerank to deep Agetic Search, taking into account the response speed, recall accuracy and deep semantic understanding capabilities in all aspects. The most important point here is that enterprises do not have to accept a "black box Memory" default understanding, but can inject their own business judgment into it.

Business Scenario: Sales Assistant
The key memory in the sales scenario is often not a "customer is interested in the product", but more detailed structured information: the current procurement stage of the customer, who is the decision maker, whether the budget is approved, what objections were raised in the last phone call, and what actions were agreed next. If you just put all the chat records back into context, the cost is high, the noise is much, and the effect is not stable. A more effective way is to extract information such as "organizational structure", "business opportunity stage", "historical objection" and "next action" into renewable long-term memory, and then cooperate with hierarchical retrieval to recall only the most relevant parts in the current round. In this way, Agent gives not only a "chat" reply, but more like a sales colleague who has really followed up the customer process.

Business Scenario: Learning Assistant
In the learning scene, the more memory, the better. The system needs to distinguish between "long-term stable learning goals" and "short-term changes in knowledge mastery". For example, a user prefers video explanation at the beginning and then makes it clear that he prefers topic-driven learning. Another example is that after several rounds of practice, the old memory should be corrected instead of being kept as "weak points in learning".

AgentLoop MemoryStore supports separate processing by memory type and extraction strategy, allowing Learning Assistant to not only remember users, but also "remember changes." This improvement of the personalized experience is often more direct than simply expanding the context window.

Serverless, Elastic, and O&M-Free: Memory Does Not Act as a System Bottleneck and Does Not Add Infrastructure Burden
Memory function is easy to use, flexible is not enough, once on the production, stability and operation and maintenance costs become the key to determine whether the landing. Once Memory enters the production environment, the real test is often not "whether it can be extracted", but "whether the main link will be slowed down during high concurrency". Many solutions work well in the Demo phase, but problems will be exposed when they reach the real business traffic: synchronous extraction is too slow, call queuing, upstream and downstream timeout, resource expansion depends on manual work, and monitoring and alerting are not systematic. AgentLoop MemoryStore is designed to be "production-ready": It uses the memory pipeline architecture of asynchronous writing to process time-consuming memory retrieval in the background to minimize the impact on the main process. Relying on the data processing pipeline developed by AgentLoop, it can also perform multi-dimensional deduplication for large-scale interactive data, covering lexical deduplication, hash deduplication, and semantic vector deduplication, reducing redundant dirty data from the source. At the same time, it completely decouples the storage, calculation and retrieval modules. Each module can be expanded independently according to the actual load and can be easily adapted to the Auto Scaling capacity no matter how the business traffic fluctuates. In addition, it natively supports multi-tenant isolation, complete audit logs, and end-to-end observability to fully meet the O&M and compliance requirements of enterprises.

Business Scenario: Customer Service and Shopping Guide During the Promotion Period
When e-commerce is promoted, the pressure on customer service and shopping guide agents is usually several times or even dozens of times higher than usual. If the memory retrieval is executed in full synchronization, each dialogue has to wait for the model extraction and writing to be completed, and the latency of the main link will increase rapidly, eventually affecting the whole site experience. A more reasonable approach is to leave "the most critical recall to the user's reply" in the real-time path and put "more complex memory processing and precipitation" into the asynchronous pipeline. In this way, the Agent can respond in a timely manner without blocking the foreground service due to background memory processing. For enterprises, this is not a simple architecture optimization, but a question of whether they can stabilize service quality at critical moments.

The significance of Serverless and O&M-free is also here. What the enterprise team really wants to save is not only a few machines, but also a whole set of maintenance costs around Memory: expansion, monitoring, exception troubleshooting, task backlog, data isolation, and permission control. If you do all of this on your own, Memory will quickly go from being an "empowerment" to a "new burden."

Why AgentLoop Memory Is More Suitable for Production Environment: Not Only Can Remember, but Also Can Be Verified, Managed and Audited
The access is fast, flexible, and stable. Eventually, it must be quantifiable, controllable, and compliant before it can truly enter the core link of the enterprise. When enterprises choose Memory, they will not only look at the concept, but also look at the results. Don't look at the advertisement, look at the curative effect, whether the effect is good or not, go to Benchmark to run and see. Based on a unified Benchmark, it is the touchstone for measuring different Memory systems. In the Locomo Benchmark evaluation, the accuracy score of AgentLoop Memory reaches 84.07%. At the same time, compared with EverMemos, the recalled memory volume is 30% less. This means that it doesn't just "remember more", but gives more efficient hit results with less context overhead.

In addition to the effect, enterprises are also concerned about the long-term operation. AgentLoop MemoryStore also provides several capabilities that are critical to the production environment: in addition to the effect, enterprises are also concerned about long-term operation. AgentLoop MemoryStore also provides several critical capabilities for the production environment: it has built-in multi-tenant data isolation capabilities to meet enterprise-level security boundary requirements; it also provides complete audit logs to support the full tracking of memory additions, deletions, modifications, and checks to meet the requirements of compliance audits. It also supports comprehensive observability and cost analysis capabilities. You can easily view the latency, token consumption, request volume, and storage volume, and quickly troubleshoot problems. It also supports multiple integration methods and reduces the access threshold for different technology stacks.

In other words, it wants to deliver not just a "memory agent", but a memory infrastructure that enterprises can confidently incorporate into their core business links.

Best Practice: OpenClaw + AgentLoop MemoryStore - Low-threshold Access to Long-term Memory
To enable more teams to use reliable long-term memory, OpenClaw is further integrated with AgentLoop MemoryStore. This allows developers to quickly provide stable, reusable, and operational enterprise-level memory capabilities to existing agents without the need to build memory modules from scratch. If you are already using OpenClaw, the cost of accessing AgentLoop MemoryStore will be lower. We have packaged the integration solution as a separate npm package openclaw-plugin-agentloop-memory that, once installed and configured, can add enterprise-class long-term memory to OpenClaw without modifying the OpenClaw code itself.

Prerequisites
Before you perform the migration, make the following preparations:

■ You have an Alibaba Cloud account and have activated the AgentLoop MemoryStore service.

■ Create a Workspace and MemoryStore in the AgentLoop MemoryStore console

■ The AccessKey ID and AccessKey secret of your Alibaba Cloud account.

Installation
Execute in the OpenClaw project directory:

npm install openclaw-plugin-agentloop-memory
Configure
After the installation is complete, enable the plug-in in the OpenClaw configuration and specify the connection parameters. Typical configurations are as follows:

{
  "memory-agentloop": {
    "endpoint": "cms.cn-hangzhou.aliyuncs.com",
    "accessKeyId": "${ALIBABA_CLOUD_ACCESS_KEY_ID}",
    "accessKeySecret": "${ALIBABA_CLOUD_ACCESS_KEY_SECRET}",
    "workspace": "my-workspace",
    "memoryStore": "my-memory-store"
  }
}

The following table describes the core parameters :

■ endpoint: the API endpoint address of AgentLoop MemoryStore. Enter the endpoint address based on the region where the instance is located, for example, cms.cn-hangzhou.aliyuncs.com

■ accessKeyId /accessKeySecret: Alibaba Cloud access credential, supports environment variable injection to avoid plaintext storage

■ workspace: Name of the workspace created in the AgentLoop MemoryStore control

■ memoryStore: The name of the memory bank in the workspace.

The plug-in also provides the following optional configurations:

■ userId /agentId: used for user-level and agent-level data isolation, applicable to multi-tenant scenarios

■ autoCapture: On by default, it automatically extracts valuable information from the conversation and writes it to the memory bank.

■ autoRecall: On by default, it automatically retrieves relevant memories and injects context before each conversation starts.

■ inferOnAdd: This feature is enabled by default. Intelligent extraction is enabled when you write data to the memory. Multi-dimensional memory extraction and deduplication are automatically performed.

Capabilities provided by the plug-in
After installation, the plug-in adds three types of capabilities to OpenClaw:

■ Agent tools: three memory operation tools: registration memory_recall, memory_store and memory_forget, which are convenient for Agent to actively retrieve, write and delete memory during dialogue.

■ Automated hooks: When autoRecall and autoCapture are enabled, memory recall and asynchronous precipitation are automatically completed to reduce business code transformation.

■ CLI command: provides openclaw agentloop command line capabilities to facilitate developers to search, add, list, and delete memories directly in the terminal, and perform connectivity checks.

SDK for Python Quick Experience Demo
If you want to quickly verify the effect first, you can also experience it directly through the Python SDK:

1.Get AgentLoop Memory SDK

pip install agentloop-memory
2.Run the sample program

from agentloop_memory import Config
from agentloop_memory.client import AgentLoopMemoryClient
import os
import time
def main():
    # 1. Init memory store client
    config = Config(
        access_key_id=os.getenv("ALIYUN_ACCESS_KEY_ID"),
        access_key_secret=os.getenv("ALIYUN_ACCESS_KEY_SECRET"),
        endpoint=os.getenv("CMS_ENDPOINT", "cms.cn-shanghai.aliyuncs.com"),
    )
    client = AgentLoopMemoryClient(
        config,
        workspace=os.getenv("CMS_WORKSPACE"),
        memory_store=os.getenv("CMS_MEMORY_STORE"),
    )
    # 2. Create memory store
    result = client.create_memory_store(
        description="Example memory store",
        extraction_strategies=["FACT"],
    )
    print("create_memory_store:", result)
    time.sleep(5)
    # 3. Add memory
    result = client.add(
        messages="I live in Hangzhou and love visiting West Lake",
        user_id="user123",
    )
    print("add:", result)
    time.sleep(120)
    # 4. Search memory
    result = client.search(
        query="Where do I live?",
        user_id="user123",
    )
    print("search:", result)
    # 5. Get all memories
    result = client.get_all(
        user_id="user123",
        page=1,
        page_size=10,
    )
    print("get_all:", result)
    # 6. List memory stores
    result = client.list_memory_stores(max_results=10)
    print("list_memory_stores:", result)
if __name__ == "__main__":
    main()

Sample result

{'status_code': 200, 'headers': {'server': 'AliyunSLS', 'content-length': '0', 'connection': 'keep-alive', 'access-control-allow-origin': '*', 'date': 'Mon, 02 Feb 2026 03:27:53 GMT', 'x-log-time': '1770002873', 'x-log-requestid': '698019B5FA0F42BA63073DF6'}}
{'results': [{'event_id': '800c03bc-dc54-42de-bd07-153421f88259', 'message': 'Memory processing has been queued for background execution', 'status': 'PENDING'}]}
{'results': [{'created_at': 1770002874, 'hash': '55566d2fdec59e0a3bf8870b1cb17bfd', 'id': '019c1c65-9745-7773-92f8-189a2b4a3721', 'memory': 'lives in Hangzhou, 'score': 0.5316177221048695, 'updated_at':: updated_at': 1770002874, 'user_id': 'user_0.46264787090919 ', '74 createdy': at': 177a' 1770002874, 'user_id': 'user123'}, {'created_at': 1770002874, 'hash': '7b869aba23294ab37679c5f7e7465921', 'id': '019c1c65-990e-7381-8ba4-794867a634bd', 'memory': 'like the scenery of hangzhou', 'score': 0.4317308740071, 'updated_at': 1770002874, ''user_id': 'user12l':} 3'
{'results': [{'created_at': 1770002874, 'hash': '55566d2fdec59e0a3bf8870b1cb17bfd', 'id': '019c1c65-9745-7773-92f8-189a2b4a3721', 'memory': 'Lived in Hangzhou, 'updated_at': 1770002874, 'user_id': upered': 'user12y', {'7b869aba23294ab37679c5f7e7465921' 'user123'}, 'hash': 170002874', 'hidat ', 'hash' 'hash' 'hash' 1770002874, 'hash': '939ed9d15f907d252363fd0e2cffb9a9', 'id': '019c1c65-9ac3-7cd1-afea-1f091dcdc6fe', 'memory': 'frequent visit to the West Lake ', 'updated_at': 1770002874, 'user_id': 'user123'}], 'relations': []}

After the memory is added, the system automatically extracts and stores three key pieces of information:

■ "I live in Hangzhou"

■ "Love the scenery of Hangzhou"

■ "I often go to the West Lake to play."

When querying "Where do I live?", the system will accurately return "live in Hangzhou" and return other associated memories based on the relevance. The whole process without manual annotation, memory extraction and retrieval can be done automatically.

Summary
Today's Memory market does not lack new concepts, but solutions that can really help enterprises run agents, run stably, and run out of business value. The focus of AgentLoop MemoryStore is not to make "memory" more mysterious, but to do the three most realistic things well: to connect to the existing system faster, to fit the specific business more flexibly, and to run in the production environment more carefully. For teams that are already doing customer service, sales, learning, shopping guide and other agents, such Memory is really worth seeing and being connected to the main link.

Don't let your agents have only seven seconds of memory. Immediate access to AgentLoop MemoryStore so that data is truly deposited into reusable business wisdom:

https://cmsnext.console.alibabacloud.com/agentloop/home

LoongCollector + ACS Agent Sandbox: Build a Production-grade AI Agent Runtime Platform

ObservabilityGuy — Tue, 26 May 2026 02:56:58 +0000

This article introduces AgentLoop MemoryStore, a fully managed, enterprise-grade memory solution designed to give AI Agents long-term, reliable memory for production environments.

1.Security and Observability Challenges of AI Agents
With the rapid development of Large Language Models (LLMs), AI Agents are moving from the lab to production. From intelligent customer service to code assistants, and from data analytics to automated O&M, AI Agents are transforming how we work. However, unlike traditional applications, AI Agents possess two distinct characteristics:

● Unpredictable behavior: The same input might generate different outputs and invoke different toolchains.

● Execution capability: Agents don't just "talk"; they "act"—accessing data, invoking APIs, and executing operations.

These two characteristics present entirely new challenges.

Core challenge 1: Runtime security (What are Agents permitted to do? Who defines the boundaries?)
Consider this scenario: A customer service Agent answering a query is subjected to a prompt injection attack. It accidentally accesses another user's order information, or even triggers a refund API. This is a real-world security risk, not science fiction.

AI Agent security risks primarily stem from two areas:

1.Lack of strong isolation in execution environments

Agents require data access and tool invocation at runtime. Without strict permission controls, prompt injections or accidental triggers can lead to unauthorized access, data leaks, or unintended operations—such as an Agent bypassing security checks to access a restricted database.

2.Lack of control over external capabilities

The greatest threats often arise from the abuse of external capabilities—such as abnormal outbound calls, SSRF/intranet probing, or sensitive data persistence and exfiltration. For example, an Agent might be tasked with "checking the weather" but actually initiates a scan of internal network services.

Core Challenge 2: Full-link Observability (What did the Agent do? Why did it do it? How effective was it?)
Traditional applications are deterministic; the same input yields the same output. AI Agents, however, may make different decisions each time, leading to three major observability hurdles:

1.Behavior is hard to reproduce and troubleshoot

For the same query, an Agent might use Tool A today, Tool B tomorrow, or simply provide a direct answer the day after. When errors occur, identifying the exact point of failure is difficult.

2.Difficulty in cost control and attribution

Costs are driven by LLM token consumption and external API calls, both of which fluctuate significantly. It is often unclear which users, tasks, or models are driving up expenses.

3.Quality is hard to measure and optimize

Output quality depends on model capability, prompt design, and retrieval data. Because these factors change constantly, it is difficult to pinpoint what is working, what isn't, and how to optimize.

Why Is a Specialized Solution Necessary?
Traditional monitoring and security solutions fall short in AI Agent scenarios:

This is why a runtime platform and observability solution specifically designed for AI Agents are essential. Let's explore how ACS Agent Sandbox and LoongCollector address these challenges.

2.ACS Agent Sandbox and LoongCollector: Comprehensive Security and Observability
ACS Agent Sandbox provides a secure execution environment based on Kubernetes, while LoongCollector acts as a telemetry data collector to provide agents with comprehensive monitoring and analysis. Together, their deep integration forms a complete production-grade execution platform for AI Agents.

2.1 ACS Agent Sandbox: Providing Runtime Security
Alibaba Cloud Container Service (ACS) Agent Sandbox is a specialized environment launched by Alibaba Cloud. Built on Kubernetes, it provides a secure, isolated, and scalable platform for running AI Agents.

2.2 LoongCollector: Providing Sandbox Observability
LoongCollector is a unified telemetry collector open-sourced by the Alibaba Cloud Observability team. Designed for cloud-native and high-performance scenarios, it offers unique advantages for AI Agent use cases:

Extreme Performance and Ultra-low Overhead
AI Agents are compute-intensive, so observability components must be lightweight to avoid impacting business operations:

● Zero-copy architecture: Utilizes Memory Arena and zero-copy to minimize unnecessary memory overhead.

● Event pooling and reuse: High-frequency object pooling reduces memory allocation and Garbage Collection (GC) pressure.

● High single-core throughput: A single core can support log collection throughput of up to 500 MB/s.

Unified Collection: Full Coverage of Logs, Metrics, and Traces
● Logs: Supports stdout/stderr and file logs; automatically associates Kubernetes metadata such as Pods, Namespaces, and Labels.

● Metrics: Native support for Prometheus Exporter, system metrics (CPU, memory, network, and disk I/O), and GPU metrics (NVIDIA DCGM).

● Traces: Full support for OpenTelemetry.

Edge Computing: Moving Processing to the Data Source
Beyond collection, it performs edge-side preprocessing to reduce transmission and storage costs:

● High-performance C++ plugins and Structured Process Language (SPL) engine.

● Supports complex processing: Filtering, transformation, and aggregation.

● Edge-side dimensionality reduction: Minimizing noise and data volume at the source.

Enterprise-Grade Reliability: Ensuring Zero Data Loss and Stable Operations
Data reliability

● At-least-once delivery semantics.

● Local disk caching: Persisting data to disk during network anomalies and retransmitting upon recovery.

● Automatic retry and exponential backoff.

● Backpressure and rate limiting: Protects the system during downstream congestion.

Operational reliability:

● Multi-tenant pipeline isolation.

● Priority scheduling: Ensuring critical data is processed first.

● Hot updates and graceful changes: Configuration changes take effect without restarts or service interruptions.

Unified Management for Large-Scale Elastic Scenarios
● ConfigServer: Centralized configuration management supporting tens of thousands of Agents.

● Remote configuration delivery: Changes take effect in real-time without requiring manual login.

● Status and performance monitoring: A unified view of health and resource overhead.

2.3 Deep Integration: LoongCollector Provides Zero-Intrusion, Automated, and Highly Reliable Observability for Sandbox

● ACS management automatically injects the LoongCollector container into the Sandbox.

● Via shared file path mounting.

● Use the Pod network to perform Prometheus scraping on AI Agents or receive OpenTelemetry data.

Through the deep integration of ACS Agent Sandbox and LoongCollector, we have built a comprehensive production-grade platform for AI Agents:

3.Running OpenClaw Using ACS Agent Sandbox and LoongCollector
OpenClaw is a trending AI application that redefines the boundaries of AI assistants. Its core value is no longer just answering questions, but understanding intent, planning steps, and invoking tools to complete tasks—acting as an "always-on" digital employee. Next, let's explore how to run OpenClaw securely and with full observability using ACS Agent Sandbox and LoongCollector.

*3.1 Enabling Sandbox LoongCollector Injection for ACK and ACS Clusters
ACK clusters
*
Note: Install the following components in advance:

● Install the LoongCollector component in Components and Add-ons.

● Install the ACK Virtual Node component in Components and Add-ons.

● Install ack-agent-sandbox-controller components in Components and Add-ons.

● To expose services via EIP, install the ack-extend-network-controller component from the Marketplace. Refer to the help document for specific configuration steps.

Modify the eci-profile ConfigMap in the kube-system namespace. The slsMachineGroup parameter defines the Sandbox machine group identifier; we recommend using a unique identifier different from the ACK DaemonSet group.

ACS clusters

Note: Install the following components first:

● Go to Components and Add-ons and install the ack-agent-sandbox-controller component (version ≥0.5.3).

● To expose services via EIP, go to Components and Add-ons in the ACK cluster and install the ack-extend-network-controller component.

● Go to Components and Add-onsand install the in alibaba-log-controller component.

The machine group identifier is the unified ACS cluster group ID: k8s-log-${cluster_id}

3.2 Deploying OpenClaw in ACS Agent Sandbox
Enable the OpenTelemetry (OTel) plugin for OpenClaw

Note

● Ensure extensions/diagnostics-otel is included when packaging the OpenClaw image.

● You must enable diagnostics-otel in the configuration to report metrics and trace data.

Configure ~/.openclaw/openclaw.json

Note: The endpoint configured here will be required for the LoongCollector collection configuration later.

{  
  "plugins": {  
    "allow": ["diagnostics-otel"],  
    "entries": {  
      "diagnostics-otel": { "enabled": true }  
    }  
  },  
  "diagnostics": {  
    "enabled": true,  
    "otel": {  
      "enabled": true,  
      "endpoint": "http://127.0.0.1:4318",  
      "protocol": "http/protobuf",  
      "serviceName": "openclaw-gateway",  
      "traces": true,  
      "metrics": true,  
      "logs": true,  
      "sampleRate": 1,  
      "flushIntervalMs": 60000  
    }  
  }  
}

OpenClaw sandbox deployment example

Below is a simplified example of creating an OpenClaw sandbox directly using a Sandbox CR:

apiVersion: agents.kruise.io/v1alpha1  
kind: Sandbox  
metadata:  
  name: openclaw  
  namespace: default  
spec:  
  template:  
    metadata:  
      labels:  
        alibabacloud.com/acs: 'true'  
        app: openclaw  
    spec:  
      containers:  
        - name: openclaw  
          # Replace with the actual OpenClaw image address  
          image: <open-claw image address>   
          imagePullPolicy: IfNotPresent   
          resources:  
            limits:  
              cpu: '4'  
              memory: 8Gi  
            requests:  
              cpu: '4'  
              memory: 8Gi  
          securityContext:  
            readOnlyRootFilesystem: false  
          terminationMessagePath: /dev/termination-log  
          terminationMessagePolicy: File  
      dnsPolicy: ClusterFirst  
      paused: true  
      restartPolicy: Always  
      schedulerName: default-scheduler  
      securityContext: {}  
      terminationGracePeriodSeconds: 1

3.3 Full Observability Collection Configuration
As described in Is Your OpenClaw Really Running Under Control?, the observability data for OpenClaw is as follows:

Session logs

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-session-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        # This path varies depending on the run path of the openclaw image.  
        FilePaths:  
          - /home/node/.openclaw/agents/main/sessions/*.jsonl  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on the OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-session-log  
    sample: ''  
  # Replace this with the sandbox machine group name of the ACK or ACS cluster.  
  machineGroups:  
    - name: <your-sandbox-machine-group>  
  # The project to which logs are collected.  
  project:  
    name: k8s-log-xxx  
  # The Logstore to which logs are collected.  
  logstores:  
    - name: openclaw-session-log

Application logs

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-app-log  
spec:  
  config:  
    aggregators: []  
    global: {}  
    inputs:  
      - Type: input_file  
        FilePaths:  
          - /tmp/openclaw/*.log  
        MaxDirSearchDepth: 0  
        FileEncoding: utf8  
        EnableContainerDiscovery: true  
        # Filter containers based on OpenClaw sandbox information.  
        ContainerFilters:  
          K8sPodRegex: ^(openclaw.*)$  
    processors:  
      - Type: processor_parse_json_native  
        SourceKey: content  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-app-log  
    sample: ''  
  # Replace this with the name of the sandbox machine group for your ACK or ACS cluster.  
  machineGroups:  
    - name: <your-sandbox-machine-group>  
  # The destination project for data collection.  
  project:  
    name: k8s-log-xxx  
  # The destination Logstore for data collection.  
  logstores:  
    - name: openclaw-app-log

OpenTelemetry

apiVersion: telemetry.alibabacloud.com/v1alpha1  
kind: ClusterAliyunPipelineConfig  
metadata:  
  name: openclaw-otel-config  
spec:  
  config:  
    # This corresponds to the logstores below. It distributes and stores OpenTelemetry logs, metrics, and trace data.  
    aggregators:  
      - Type: aggregator_opentelemetry  
        MetricsLogstore: openclaw-otel-metrics  
        TraceLogstore: openclaw-otel-traces  
        LogLogstore: openclaw-otel-logs  
    global: {}  
    inputs:  
      - Type: service_otlp  
        Protocals:  
          HTTP:  
            # Corresponds to the diagnostics-otel Endpoint enabled in OpenClaw.  
            Endpoint: '127.0.0.1:4318'  
            ReadTimeoutSec: 10  
            ShutdownTimeoutSec: 5  
            MaxRecvMsgSizeMiB: 64  
    processors: []  
    flushers:  
      - Type: flusher_sls  
        Logstore: openclaw-otel-logs  
  # Replace with the Sandbox machine group Name for the ACK or ACS cluster.  
  machineGroups:  
    - name: <your-sandbox-machine-group>  
  # The project for Collection.  
  project:  
    name: k8s-log-xxx  
  # The Logstore for Collection. Note that OpenTelemetry has three Data Types. You must define three Logstores.  
  # For metrics Data, set telemetryType to Metrics.  
  logstores:  
    - name: openclaw-otel-logs  
    - name: openclaw-otel-metrics  
      telemetryType: Metrics  
    - name: openclaw-otel-traces

3.4 Summary: Fully Resolving OpenClaw Security Challenges
Sandbox runs OpenClaw securely and in isolation

● Each Sandbox runs in an isolated kernel environment, preventing malicious code from attacking host system programs.

● Each Sandbox uses an isolated temporary file system to prevent unauthorized reading, tampering, or deletion of host files.

LoongCollector enables full-stack observability for OpenClaw

4. Summary and Outlook
The production-readiness of AI Agents is not a matter of "if," but "how." Security and observability are not optional—they are essential requirements.

If you are building an AI agent application:

● Start now by prioritizing runtime security and observability.

● Choose the right tools instead of reinventing the wheel.

● Establish best practices and promote them within your team.

● Continually learn and optimize to ensure your Agents create real value.

Both ACS Agent Sandbox and LoongCollector are open platforms; we invite you to try them and share your feedback. Together, let's build a more secure, reliable, and efficient production environment for AI Agents. We hope this article provides valuable reference and inspiration for your observability journey.

Human-Robot Half Marathon: The Large-Scale O&M Challenge for Embodied Intelligence Beyond the Racecourse

ObservabilityGuy — Wed, 20 May 2026 02:38:23 +0000

This article introduces an Alibaba Cloud-powered O&M observability system tackling humanoid robot challenges in large-scale, outdoor, and long-distance scenarios.

A special half marathon has just concluded in Beijing. More than 300 humanoid robots competed alongside humans, vying across dimensions such as autonomous navigation, dynamic balance, and multi-robot coordination, setting a global record for the scale of human-robot co-running events. When hundreds of robots collectively run 21 kilometers, what we see is not just a race, but a large-scale public stress test for the realm of embodied intelligence. As the race ends, a bigger challenge has emerged beyond the racecourse—

In the face of new embodied intelligence scenarios characterized by clustering, mobility, and complexity, the industry urgently needs a standardized, reusable, integrated O&M system that adapts to outdoor weak-network and multi-device heterogeneous environments. Leveraging Alibaba Cloud's full-spectrum observability capabilities, with Simple Log Service (SLS), CloudMonitor (CMS), and Application Real-Time Monitoring Service (ARMS) as the core foundation, a collaborative O&M observability system for humanoid robots has been built. This system precisely matches the requirements of typical scenarios involving long-distance movement, multi-robot formation coordination, and full environment variable interference, providing a practical reference for the industry to solve large-scale O&M challenges.

Three Dilemmas: New Challenges in Embodied Intelligence O&M Observability
The 21-kilometer open course of the half marathon is an extreme stress test of the comprehensive stability of humanoid robots. It also exposes the three core bottlenecks in deploying embodied intelligence clusters at scale — a common challenge across all outdoor large-scale scenarios.

● Environmental uncertainty is the primary challenge of outdoor operations. In open scenarios, temperature, humidity, and lighting conditions change in real time, while uncontrollable factors such as road bumps, ramps, curves, pedestrian crossings, and wireless signal fluctuations persist, continuously interfering with sensor detection accuracy, communication transmission stability, and power system payload balance. Especially under high-temperature conditions, prolonged high-load operation of robot active joints, computing power modules, and battery components accelerates hardware aging and significantly increases component failure rates. Device operation remains in a state of Dynamic fluctuation, where a single environmental disturbance can trigger cascading abnormalities.

● Hidden damage and coupling threats from highly integrated devices further amplify operational risks. Humanoid robots tightly integrate motion modules, multiple sensor types, edge computing, AI inference, wireless communication, and other multilayer systems with precise structure and high interdependency. Minor vibrations and low-speed collisions during movement do not cause obvious skin damage but can easily lead to irreversible hidden issues such as slight displacement of lidar and vision cameras, loose joint wiring, and micro-deformation of internal support structures, which in turn cause navigation and obstacle avoidance inaccuracy, intermittent signal breaks, task execution bias, and other problems. Combined with individual device differences introduced by manual assembly, a minor abnormality in one device can quickly propagate to the entire formation, causing coordination disorder, rhythm desynchronization, and even cluster-level security risks.

● Traditional O&M patterns are completely unable to adapt to new scenarios. Previously, fixed devices relied on post-incident emergency repair, manual offline troubleshooting, and standalone independent management — a passive pattern with delayed response, entirely unsuitable for humanoid robots that operate with Dynamic mobility, all-weather jobs, and multi-robot collaboration. To support stable operation of large-scale clusters, it is essential to break down data silos among hardware indicators, system logs, algorithm links, and environmental data, move beyond experience-based manual O&M, and complete the transformation from passive remediation to active defense through full-dimension status visualization, proactive threat prediction, and rapid abnormal loss containment.

Cloud-edge Collaborative Data Collection Adapted to the Core O&M Features of Humanoid Robots
Based on the natural properties of humanoid robots — large-scale movement, unstable network environments, multi-brand heterogeneity, and long-duration continuous operations — the ideal O&M architecture for the industry must balance low-latency edge self-healing with cloud-based global unified management. By adopting a Layer 3 cloud-edge collaborative design spanning terminal body, edge gateway, and cloud platform, the solution reasonably separates the responsibilities of data collection, local management, computing power processing, and global analysis. Built around the three core O&M modules of real-time status monitoring, intelligent failure prediction, and hierarchical emergency response, Alibaba Cloud observability products form a complete capability matrix integrating indicators, traces, and logs to address industry pain points such as fragmented embodied device logs, difficulty in quantifying hardware indicators, and difficulty in troubleshooting hidden algorithm faults.
At the data access layer, the solution provides two highly available and flexible deployment modes to adapt to different outdoor conditions and network environments.

● The lightweight LoongCollector and Simple Log Service software development kit direct collection mode features extremely low resource usage on the device side and high compression and transmission efficiency. It meets high real-time monitoring requirements and supports dynamic adjustment of collection policies from the cloud, eliminating the need for frequent OTA upgrades on devices. LoongCollector is a new-generation Database Collector launched by Alibaba Cloud Simple Log Service that integrates performance, stability, and programmability. It extends and integrates the observability technology stack, breaking the single-scenario limitations of traditional log collectors, and supports the collection, processing, ingress, and sending of Logs, Metrics, Traces, Events, and Profiles.

● Based on the S3 protocol + Simple Log Service architecture, this mode is suitable for weak network and intermittent connectivity scenarios. Data is cached and encrypted locally and uploaded during off-peak hours. It is low-cost, highly reliable, not attached to a single vendor, and more extensible.

Both modes are fully compatible with 5G, Wi-Fi, IoT, and other communication methods, fully adapting to the complex and dynamic network environment of mobile robots.

Full-Domain, All-Dimension Observability for a Transparent Robot Cluster Operation System

Whether for outdoor formation movement or routine commercial deployment, the foundation for stable operation of large-scale embodied intelligence clusters lies in full-dimension, full-epoch, and full-link observability.

● At the hardware level, core indicators such as joint motor payload, current temperature, power supply health status, compute unit resource usage, inertial navigation calibration accuracy, sensing device data streams, sensor readings, and network quality are continuously collected to fully grasp the health status of core components and detect hardware threats such as overload, overheating, abnormal power supply, and sensor attenuation in advance.

● At the business and algorithm level, the running status of underlying core processes is monitored in real time, and various management events are managed at different levels, with a focus on intercepting faults and fatal exceptions. Key indicators such as perception and decision inference latency, path planning efficiency, and collaborative execution success rate are continuously tracked to fully restore algorithm running health and detect performance degradation and logical exceptions in a timely manner.

● At the scenario and environment level, full-epoch job info, device running status transitions, outdoor temperature and humidity environment data, physical collision management events, and other real-scene information are recorded. Through multi-dimension data cross-referencing, different failure root causes such as environmental interference, mechanical damage, algorithm bugs, and human operations are quickly distinguished, providing an objective basis for daily O&M and post-event review.

For the above observation scenarios, the three core dimensions of indicator monitoring, Tracing Analysis, and log administration are built in depth to form a full-coverage, strongly collaborative, and closed-loop global observability capability, targeting industry pain points such as invisible operation of embodied devices, difficulty in detecting exceptions, and difficulty in tracing failures.

● Indicator monitoring focuses on the model training realm, covering full-dimension timing monitoring and visualization management of AIBoost cluster AI infrastructure. Through continuous statistics on training resource payload, hardware conditions, environment parameters, and cluster running status, the training procedure can be quantified and abnormal threats can be warned in advance, ensuring the stability and reliability of AI model iteration from the ground up.

● Tracing Analysis provides deep, end-to-end visibility into service operations, enabling full-link visualization and tracing across the CDN mapping system, motion control services, AI inference links, and cross-device interface interactions. It accurately captures hidden application layer failures such as algorithm drift, background service stuttering, remote instruction blocking, and multi-machine collaborative scheduling conflicts, making previously invisible software and algorithm issues fully transparent and significantly improving the efficiency of troubleshooting soft abnormal issues.

● Log Administration: provides unified collection and standardized administration of end-to-end logs, including hardware operational logs, system process logs, AI module operation records, edge node management events, and job operation traces. It effectively addresses the challenges of scattered logs from heterogeneous devices, inconsistent formats, fragmented data, and difficulty in correlating and tracing issues. With high-throughput ingestion and second-level retrieval capabilities, it delivers complete, objective, and verifiable data support for failure review, root cause analysis, accountability determination, and batch issue tracing.

With global visualization and management capabilities, you can gain a macro-level view of overall cluster status, device online status, and overall payload fluctuations, while also drilling down into individual device details, achieving bidirectional integration between macro management and micro-level positioning. Combined with dynamic thresholds and intelligent anomaly detection, real-time alerts are triggered for high-frequency threats such as sudden power drops, high-temperature overloads, network disconnections, and data drift, enabling true proactive threat prevention and control.

Multi-Field Dependency Analysis to Resolve Incremental Hidden Threats with Predictive O&M
Compared with obvious hardware corruption, the slow attenuation of sensor accuracy, line contact fatigue, chronic component aging, algorithm performance degradation, and hidden structural hazards caused by long-term vibration are the key factors affecting the long-term stable operation of humanoid robots. Such progressive issues cannot be detected through manual inspection and require multi-source data field dependency analysis to implement data-driven predictive O&M.

Leveraging full-volume timing indicator data, this capability accumulates long-term insights into basic resource O&M, model training and inference efficiency evaluation, device payload changes, environmental impact patterns, and hardware aging trends to form a quantifiable health assessment baseline. Through end-to-end Tracing Analysis, the complete flow logic of instruction routing, service invocation, and algorithm computation is fully restored to quickly locate coordination bottlenecks and program anomalies. Combined with unified log administration, system events, error records, environmental changes, and external interference before and after an anomaly are correlated to fully reconstruct the failure scene.

Multi-dimension data association and cross-validation enable accurate discovery of potential patterns in device operation and early detection of hidden risks. Combined with a tiered alerting mechanism that filters invalid fluctuations and duplicate alerts, threats are escalated and handled by tiering. During the early stage of failure emergence, proactive intervention through parameter automatic rotation tuning, run policy optimization, and remote fine-grained control effectively extends the stable operation epoch of devices, reducing failure rates and burst maintenance costs at the source.

The deeper value of observability goes beyond ensuring current stable operation — it uses data from real, complex scenarios to feed back into product R&D and process upgrades, paving the way for long-term commercialization of humanoid robots. By leveraging comprehensive data accumulation, you can horizontally compare operational differences across devices of the same model and batch, quickly identify common issues caused by component batch bugs, schema design shortcomings, and manual assembly process bias, and help manufacturers optimize supply chains and production flows. Through quantitative analysis of algorithm performance, component payload, and sensing stability under different operating conditions, hardware limitations and algorithm bottlenecks are precisely distinguished, helping R&D teams optimize motion control, autonomous navigation, and coordination policies in a targeted manner.

Meanwhile, massive scenario data such as real road conditions, crowd interference, complex lighting, extreme temperature and humidity, and collision anomalies can continuously enrich the simulation training sample library, narrow the gap between the simulation environment and real outdoor scenarios, accelerate algorithm iteration and real-machine adaptation efficiency, and enable humanoid robots to move faster from competition demonstration scenarios to normalized, large-scale deployment.

Tiered Closed-Loop Emergency Response System for High Fault Tolerance Operation Assurance in Complex Scenarios
Open outdoor scenarios inherently involve uncertainty. Instantaneous environmental changes, accidental mechanical disturbances, and short-term network anomalies cannot be completely eliminated. A standardized, tiered, and automated emergency response mechanism is the key line of defense for ensuring continuous and stable cluster operation. Based on the business characteristics of multi-robot formation operation, a comprehensive three-level failure handling logic is established: minor individual anomalies, local coordination failures, and systemic major failures. O&M resources are reasonably allocated through tiered control to avoid excessive response or delayed handling.

When an abnormal event occurs, leverage the observability system to quickly locate the root cause: troubleshoot algorithm and schedule issues through business trace analysis, pinpoint the scope of hardware, power supply, and network anomalies using timing indicators, and restore the complete on-site context with full logs, significantly reducing failure troubleshooting and fix time. After each abnormal event is handled, the complete failure timeline, alerting records, root cause conclusions, and handling reports are automatically accumulated and archived. This not only forms an O&M closed loop, but also builds reusable practical experience for optimizing handling policies and iterating management rules for similar scenarios in the future.

Summary and Outlook
The Beijing Yizhuang Humanoid Robot Half Marathon vividly demonstrates the rapid rise of China's humanoid robot industry and clearly signals that clustering, outdoor operation, and scenario-based deployment are the inevitable direction for the future development of embodied intelligence. As hardware integration and AI algorithms continue to break through, O&M capabilities are becoming a key variable that widens the industry gap. Multi-robot collaboration, hidden threat prevention, and full lifecycle management in open and complex environments are common challenges that all humanoid robot companies must address.

Alibaba Cloud's full-domain observability solution for embodied intelligence, built on a cloud-edge collaboration architecture, integrates three core capabilities: indicator monitoring, Tracing Analysis, and log analysis. It fully addresses the scenario features of humanoid robots, including mobile operations, cluster formation, weak network adaptation, and long-duration runs. Rather than being limited to a single event application, it provides a mature, standardized, and replicable O&M capability frame for similar outdoor cluster, dynamic operation, and large-scale deployment scenarios across the industry.

In the future, as the mass production scale of humanoid robots continues to expand and application scenarios keep extending, data-driven artificial intelligence for IT operations, proactive predictive protection, and full-link observability systems will become the core foundation for high-quality development of the embodied intelligence industry, continuously helping China's humanoid robot technology advance from technical demonstration to full-scale commercial deployment.

Related Products
Simple Log Service: https://www.alibabacloud.com/en/product/log-service
CloudMonitor: https://www.alibabacloud.com/en/product/cloud-monitor

Put a Microscope on Hermes: Full Visibility into Agent Execution

ObservabilityGuy — Wed, 20 May 2026 02:26:18 +0000

Alibaba Cloud's OpenTelemetry-based observability plugin brings full visibility to Hermes AI agent execution, enabling traceable costs, performance, and security auditing.

Hermes is an autonomous AI agent runtime frame developed by Nous Research. Rather than a one-shot Q&A pair-style model encapsulation, it is an agent runtime that continuously runs, invokes tools, accumulates experience, and grows throughout the usage procedure.

When an AI agent truly starts solving a problem — whether it completes correctly or exhibits bias — the real challenge is often not whether the result is right, but what exactly it did.

A single run of Hermes is not an ordinary model invocation. A seemingly simple interaction may involve multiple rounds of inference, tool calling, result reinjection, context expansion, and new inference loops. The model decides whether a tool is needed for the next step, and tool results in turn affect the subsequent inference path. Cost, latency, and faults often occur in the middle of this procedure.

If the system can only provide a final reply, a few scattered logs, or a usage summary for a single invocation, Hermes remains a black box. You know it completed the job, but you can hardly tell how. You know the request consumed a lot of tokens, but you can hardly tell which step drove up the cost. You know the user experience has slowed down, but you can hardly determine whether model generation slowed, tool execution went abnormal, or ReAct (Reasoning + Acting) loops spiraled out of control.

This is exactly our starting point for building observability into Hermes.

This article introduces a set of observability plugin solutions provided by Alibaba Cloud for Hermes. It can revert the real execution procedure of Hermes into a structured invocation chain: where a session starts, how many rounds of inference it goes through, which tools are invoked, how many tokens are spent, which step is the most time-consuming, and at which edge zone a fault occurs. Which operations are malicious, and how much sensitive data has been leaked.

If you are using Hermes for real-world jobs, you will almost certainly encounter these problems:

● Why is it so expensive this time?

● Why is it so slow this time?

● Did it actually invoke that tool?

● Did the tool it used leak data?

What these problems have in common is that they are not "results" but "procedures". So, if we can only see the last reply, then from an observational point of view, Hermes is still not interpretable.

What Exactly Are We Trying to Solve
The Alibaba Cloud Hermes observability plugin focuses on solving the following four types of problems.

The first is that the procedure is invisible.

After integrating an LLM, many systems still only show user input, final output, and a usage summary. But the real run of Hermes is far more than that. Behind a single response, there may be multiple rounds of inference, multiple tool executions, continuous context expansion, and new inference loops. Without a call chain, the intermediate procedure is essentially empty. The first thing we did was fill in that gap.

The second is that costs are not attributable.

The token bill itself isn't the hardest problem — the hardest part is not knowing where the money actually goes. A Hermes run can be expensive because the context suddenly explodes in a certain round, a tool returns an oversized result, the final round produces overly long output, or a certain class of jobs naturally triggers more steps. Without visibility into the tokens for each round of model invocation, cost analysis is nothing more than guesswork.

The third category is that performance cannot be broken down.

Users will only tell you "it's getting slower," but "slow" by itself carries no useful info. What you really need to distinguish is: is the first token slow, or is overall generation slow? Is tool execution slow, or is multi-round ReAct inference itself running too long? Only by separating these stages can a "slowdown" become a problem you can actually pinpoint.

The fourth category is that results cannot be reviewed.

Often the hardest issues to deal with are not clear-cut faults, but cases where "it looks like it succeeded, but the result is wrong." This is very common in agent systems: Hermes invokes the wrong tool, the tool returns incomplete results, Hermes continues to infer based on partial info, and ultimately produces an answer that seems reasonable on the surface but has already gone off track. Without traces, post-mortem review is nearly impossible. With traces, the problem shifts from "guessing the cause" to "examining the path."

What We Did
What we built for Hermes is a set of OpenTelemetry (open telemetry frame)-based Tracing Analysis capabilities.

The core idea is straightforward: install runtime instrumentation in the Python environment where Hermes runs, establish spans around the key execution borders of Hermes, and then report traces and indicators to the observability backend through OTLP (OpenTelemetry Protocol), a standard protocol.

Our focus is not on "what the last row of reply looks like", but on the running procedure of Hermes itself.

This Solution Has Several Advantages Worth Highlighting
It is worth mentioning that this set of plugins is not a temporary instrumentation script thrown together, but is designed along the OpenTelemetry system.

First, it follows the GenAI standard specification as closely as possible at the semantics layer. The currently reported trace data preferentially snaps to the OpenTelemetry GenAI semantic conventions. For structures in the Agent runtime that are closer to the execution procedure, extensions are made in combination with LoongSuite Semantic Conventions. Instead of defining a batch of field names that can only be understood internally, we try to use a set of standard, reusable, and portable semantic expressions. In other words, this is not a makeshift approach, but a well-structured observability design that follows industry best practices.

Second, it provides not only traces but also basic metrics signals. In addition to the call chain of a single request, you can also view trends such as the number of invocations, number of faults, invocation duration, and token usage. This way, you can replay a single request along a trace, or observe cost fluctuations, performance changes, and abnormal trends from a global perspective.

Third, it records time to first token (TTFT) separately for streaming scenarios. In many cases, when users perceive something as "slow", it is not necessarily that the entire generation is slow, but rather that the first token takes too long to return. With TTFT, performance issues can be further broken down from "feels slow" into "slow first token" or "slow overall generation".

Fourth, it is not attached to a single Alibaba Cloud service on the backend. The current solution can be directly connected to Alibaba Cloud ARMS, but it uses the OTLP standard protocol underneath and is not designed to be locked into a private data structure. Connecting to ARMS works today, and if you need to connect to other OTLP-compatible backends in the future, migration space is preserved.

Fifth, it supports security audits of important behaviors in Hermes. By collecting full operation logs, access records, and user behavioral data from the Hermes system, and combining outlier detection algorithms to build a dynamic audit model, it can accurately detect suspicious behaviors such as unauthorized access, abnormal data exporting, and malicious prompt injection.

What Can Already Be Seen
The observability capability of the current version of Hermes can revert a real agent run into a ReAct structured trace.

The core pipeline is as follows:

invoke_agent Hermes  
└── react step  
    ├── chat   
   └── execute_tool <tool_name>

If a job contains multiple rounds of inference and multiple tool calls, the pipeline naturally expands:

The significance of this pipeline is not that there are more spans, but that the actual execution of Hermes becomes visible for the first time.

How many rounds an execution ran, which round triggered the tool, and how the tool affected subsequent inference — all of this can now be viewed in the same trace.

Call a Model
Each chat span can currently record:

● gen_ai.request.model

● gen_ai.usage.input_tokens

● gen_ai.usage.output_tokens

● gen_ai.usage.total_tokens

● gen_ai.response.time_to_first_token

This means we can finally view tokens and latency per "actual model invocation" instead of only looking at the aggregate of an entire session. Especially in streaming scenarios, TTFT (time to first token,first-token latency) can help us further distinguish whether the first token is slow to return or the overall generation procedure is slow.

Tool Calling
Each execute_tool span can currently record:

● gen_ai.tool.name

● gen_ai.tool.call.arguments

● gen_ai.tool.call.result

Tools are no longer empty edge zones in the procedure. We can see when Hermes decided to invoke a tool, which tool was invoked, what parameters were passed, and what results were returned.

Agent-Level Summary
The root vertex invoke_agent Hermes span can now record the aggregation results of the entire run, including:

● Cumulative Token

● Final output message

● Total time consumption info

Important Behavior Audit
Records agent behavior across the full chain, intelligently generates audit views, and exposes high-risk operations.

Quick Observability Integration: Deployment in a Few Steps
The integration path for Hermes observability is streamlined into a straightforward flow: get the command from the console, copy it to the terminal and execute it, enable the plugin, start Hermes, and begin reporting.

Tracing Integration
Go to the console to obtain the installation command
Log on to the CMS 2.0 (Cloud Monitor Service 2.0) console, go to the corresponding application monitoring workspace, choose Integration Center > AI Application Observability, and click Hermes.

In the sidebar, enter the application name and click Get to immediately generate the integration command. Click the icon in the upper-right corner to copy it with one click.

One-line command to start installation
Open the terminal on the machine where Hermes is located, paste the copied command, and execute it:

curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/hermes-agent-cms-plugin/hermes-cms.sh | bash -s -- install \  
  --x-arms-license-key "auto" \  
  --x-arms-project "Your project" \  
  --x-cms-workspace "Your Workspace" \  
  --serviceName "hermes" \  
  --endpoint "https://Your ARMS-OTLP address/apm/trace/opentelemetry"

When you execute the installation command for the first time, in addition to installing the plugin itself, the system also registers the hermes-cms command on the local machine for subsequent operations such as enable, disable, and uninstall.

If the following message appears in the terminal, the plugin has been installed successfully:

════════════════════════════════════════════════════

✅ hermes-agent-cms-plugin installed successfully!

════════════════════════════════════════════════════

Throughout the procedure, you do not need to manually edit the configuration file. The script will first match the current environment. Only when the current environment does not meet the requirements will it resume trying the official default installation position.

Turn on observability, and then start Hermes
After the installation is complete, don't rush to check the console.

The first step is to turn on the observability switch:

hermes-cms enable

Then start Hermes.

To run in the foreground, execute directly:

hermes

Run executable in background:

hermes gateway install

hermes gateway start

How to confirm that instrumentation is actually working
If the following tooltip appears in the terminal after startup, the observability instrumentation has taken effect:

loongsuite-site-bootstrap: started successfully (OpenTelemetry auto-instrumentation initialized).

After confirming that the instrumentation has taken effect, send a few test requests to Hermes to run a real job that triggers multiple rounds of inference and tool calling. After a minute or two, return to the CMS 2.0 console, and you will see your Hermes application in AI Application Observability.

At this point, Hermes is no longer just a black box responder — it becomes a running system that can be expanded, tracked, and analyzed.

Enter our observability application to view not only the number of Hermes model invocations, token consumption trends, request fluctuations, and the average number of LLM invocation rounds per request, but also the latency and invocation distribution across AGENT, LLM, and TOOL phases. You can also trace a complete Trace to revert the actual execution procedure of Hermes, clearly seeing how many rounds of inference a job went through, which tools were invoked, which step took the longest, and which round consumed the most tokens.

View the demo examples and the hermes_agentloop_support example at https://sls.aliyun.com/doc/en/playground/cmsdemo.html

Want to shut down or uninstall? It's straightforward.
To temporarily shut down observability, execute:

hermes-cms disable

To completely uninstall the plugin, execute:

hermes-cms uninstall

Log Ingestion
Configure application info on the access Card
Next, click the "Log Access" page, set a custom application name, click Initialize Resources, enter the previously configured Project name, and configure the machine group as prompted to complete the Hermes Audit Feature with one click.

Auto-generated Audit dashboard
After the access is complete, in the left sidebar, choose Audit > Hermes Insight > Hermes Audit to view the audit dashboard of your Hermes agent.

Summary and Outlook
This solution can reliably address Tracing Analysis, token attribution, and basic performance breakdown, while also providing basic metrics signals for trend analysis. However, this does not mean that all observability work for Hermes is complete.

Next, we will continue to push forward in several directions.

● On the data plane, continue to expand from traces, span properties, and basic indicators to more complete log audit and runtime diagnostics capabilities.

● On the link plane, continue to refine Hermes-specific execution phases beyond agent, step, llm, and tool, such as memory lifecycle, delegation orchestration, and runtime recovery.

● On the governance plane, continue to strengthen content collection control, finer-grained data governance capabilities, and unified desensitization and security policy development.

Today, we already have an active runtime observability infrastructure, and the next goal is to further evolve it into a more complete, more detailed Agent observability system that is better suited for real production environments.

From Observable to Understandable: Building Agent-Native Code Knowledge Graphs with UModel

ObservabilityGuy — Mon, 11 May 2026 06:57:40 +0000

UModel builds agent-native code knowledge graphs using deterministic AST parsing and cross-domain associations for deeper AI code understanding.

Background
In recent years, AI agents (Cursor, Copilot, Claude Code, Codex, etc.) have become deeply involved in software development. From code completion to cross-file refactoring, from bug localization to architecture design, agent capabilities are growing stronger. From Prompt Engineering to Context Engineering to Harness Engineering, the ways to harness AI continue to evolve, and the capability boundaries of agents continue to expand.

However, when we hand a real enterprise-level project to an agent, an overlooked question begins to surface: Does the agent really understand your project?

The way agents currently understand code is diverging into two distinct schools:

● No-index school: Claude Code follows the Unix philosophy and performs no pre-indexing at all — it searches the file system in real time using grep, rg, and glob. Anthropic's internal tests found that agentic search outperforms retrieval-augmented generation across the board, by a lot. It is concise, real-time, and free of privacy issues, but each session starts from scratch and is costly for large repositories.

● CodeIndex School: Cursor, Windsurf, and Copilot follow the vector index route: using tree-sitter for semantic text segmentation, generating embeddings and storing them in a vector database (such as Turbopuffer), then using Merkle tree for incremental synchronization. Qodo and Augment Code go a step further by overlaying a code dependency graph and commit history index on top of the vector index.

Both schools have their own strengths, but they still struggle with the following problems:

● I want to change the Adapter interface of pkg/a2a. What is the scope of impact?

Vector similarity search cannot find the dependency chain, and grep-based file-by-file search is inefficient and incomplete.
● In production, the vibeops-xxx SLO has been breached with a large number of pending requests. What is the cause? Is it a code change?

The code index only covers the code domain; O&M domain data is not in the graph.
● Are there any abnormal dependencies in the project that cross architecture borders?

Without architecture level modeling, crossing borders cannot be defined.
What these problems have in common is that they require deterministic structural relationships, cross-domain entity associations, and change history across the time dimension.

The author has been working in the observable field for more than ten years, reviewing the development of observable, especially with the increasing complexity of cloud native and AI native systems, observable has long faced not only "looking at a log and staring at a monitoring chart", but also putting the scattered objects such as applications, services, containers, databases, alarms, changes and events back into the same context, answer "who is related to whom", "how the impact is spread" and "when did the problem begin to occur".

Because of this, Alibaba Cloud can observe the gradual evolution from the collection and display of scattered data such as logs, indicators, and links to the unified modeling of object-oriented, relationship, and time series. UModel is precipitated under this practical background.

This is strikingly similar to the trajectory of the observability realm: from viewing logs to unified modeling, observability evolved from fragmented data to the UModel knowledge graph. Yet code understanding, even with the most advanced CodeIndex solution, remains at the stage of helping agents find relevant snippets — the snippets are found, but the structure is not understood.

Five Paradigms of Code Understanding
Before diving into the technical solution, it is necessary to clarify the complete landscape of current code understanding. The five paradigms represent the evolution from stateless search to stateful inference.

Paradigm 1: Agentic Search (Claude Code School)
Claude Code is currently the most extreme index-free route. Anthropic founding engineer Boris Cherny publicly shared the story behind this decision: early versions of Claude Code used retrieval-augmented generation + a local vector library, but internal tests found that agentic search won comprehensively — by a lot, and this was surprising.

Its approach is pure to the point of elegance:

Agent receives a question  
  → Glob: pattern matching by file name (near-zero token cost)  
  → Grep (ripgrep): regex search by content (low token cost)  
  → Read: read the complete file (high token cost)  
  → Evaluate → next round of search or provide an answer

Tools are tiered by token cost, and the agent independently determines the search policy — like an experienced developer using rg + cat in the terminal to troubleshoot issues. This Unix-philosophy method has several real advantages:

● Zero pre-processing: no index build time required — open the project and start working immediately

● Always Fresh: No index expiration issues. Every search reflects the real-time file system status.

● Privacy-Friendly: Code never leaves your local machine — no embeddings are generated, and nothing is uploaded to any server.

● Simple and Reliable: The dependency chain is extremely short: Agent + file system + ripgrep. No vector database to crash.

But the ceiling of this approach is equally clear:

● No Structure Awareness: rg HandleRequest can find all occurrences, but cannot distinguish definitions from invocations or comments. The Agent has to read the code itself to determine this.

● Start from Scratch Every Time: Dependencies analyzed in the previous session are entirely discarded in the next. There is no persistence of accumulated knowledge.

● Limited scale: A TypeScript project with 200 files is fine, but for an enterprise-level monorepo with 50,000 files, agentic search may require 30+ rounds of tool calling and tens of thousands of tokens to piece together a global dependency graph. In practice, it is impossible to construct a complete global graph — only partial views relevant to the current job can be assembled.

● Unable to perform global analysis: Cannot answer "list all invocations across architecture levels" because the architecture levels themselves have not been modeled.

Paradigm 2: CodeIndex / Vector Index (Cursor, Windsurf, and Copilot School)
This is the mainstream technical approach of current AI IDEs. Taking Cursor as an example, its technical architecture has been extensively analyzed in public:

Code Repository  
  → Parse into AST with tree-sitter  
  → Segment by semantic unit (function, class, logic block)  
  → Generate vector embedding  
  → Store in Turbopuffer vector database  
  → Merkle Tree tracks changes for incremental synchronization

Cursor has achieved several elegant optimizations in engineering: it uses Merkle Tree root hash comparison to detect changes every 10 minutes and only re-embeds changed files; 92% codebase similarity among team members allows index reuse, reducing the initial indexing for new members from minutes to seconds; the index scope is controlled via .cursorignore.

Windsurf (Codeium) uses a similar retrieval-augmented generation architecture: 768-dimensional vector embedding + proprietary M-Query retrieval, but additionally overlays the Cascade context engine to track edit history, terminal commands, navigation patterns, and other session states. GitHub Copilot achieved sub-second semantic search indexing in March 2025.

The real value of CodeIndex is semantic search: the agent can find relevant code by describing intent in natural language without knowing the exact function name. This is something grep cannot do.

But CodeIndex has a fundamental limitation: vector similarity is text-level approximate matching, not structure-level relational reasoning.

● import pkg/a2a is a deterministic dependency in code, but in vector space it is merely a similarity signal of a text segment.

● Finding all modules that directly or indirectly depend on pkg/a2a requires graph traversal, not AISearch.

● Determining how many hops the impact of this interface change propagates along the invocation chain requires deterministic call relationships, not semantic similarity.

● Augment Code's evaluation shows that Cursor produces inconsistencies in cross-file refactoring across 50+ files: the first 30 files are modified correctly, but the last 20 contain faults due to context window overflow.

CodeIndex is essentially a smarter search engine: it helps agents find the correct snippets to insert into the context, but does not perform structured inference for agents.

Paradigm 3: Code Graph + Retrieval-Augmented Generation Hybrid (Qodo and Augment Code School)
Qodo and Augment Code represent the next evolutionary direction of CodeIndex: layering code structure graphs on top of vector indexes.

Qodo's technology stack is particularly rigorous:

● Self-developed Qodo-Embed-1 code embedding model (1.5B parameters surpassing 7B competitors on the CoIR benchmark), capturing syntax, variable dependencies, control flow, API usage, and other code-specific semantics through synthetic data training

● Client-side code graph building: functions, classes, modules and their call graphs, inheritance relationships, and cross-language links

● Server-side maintenance of vector database + design documents + architecture diagrams + PR/commit history

● AST-aware segment policy: recursively chunk AST edge zones and backfill key contexts such as import statements and class definitions

Augment Code 's Context Engine goes even further:

● Semantic index across repositories to understand how services connect and depend on each other

● Index beyond Code: commit history (why changes were made), codebase patterns, external documents, tickets, and even tribal knowledge

● Released Context Lineage in 2025 to index commit histories and diff summaries, enabling agents to understand the evolution of architectural decisions

● Open to any compatible agent via MCP protocol, with benchmarks showing 30–80% quality improvement

The key advancement of this school of thought is that code is not just text, but a structured graph. Augment, in particular, demonstrates the insight that understanding requires context, and context requires history.

However, even the most advanced code graph + retrieval-augmented generation hybrid solution still has several systemic borders:

● The graph scope is limited to the code domain: It knows that A invokes B, but not what alerts the service corresponding to B has triggered in the production environment. The code graph and the O&M graph are disconnected.

● Limited graph query capabilities: Graphs serving retrieval-augmented generation typically support neighbor lookup and short-path queries, but do not support arbitrary-depth graph traversal, pattern matching, or aggregation and analysis.

● IDE-local, not team-global: The index is attached to a developer's IDE instance. Structural insights analyzed by one person cannot be directly reused by another.

● Lack of a standardized timing dimension: Augment's Context Lineage has started incorporating commit history, but build logs, deployment logs, test logs, and event logs — these complete temporal memories are not yet in the graph.

Paradigm 4: CodeWiki / LLM Document (DeepWiki School)
DeepWiki (GitHub 15.7k stars, produced by the team behind Cognition AI / Devin) represents another approach: Code Repository → LLM → polished Wiki document. Simply replace github.com in the URL with deepwiki.com to see the automatically generated architecture diagrams, module documents, and function annotations.

This provides an excellent experience for developers to quickly understand unfamiliar projects. DeepWiki also supports controlling the generation scope through the .devin/wiki.json configuration file, and provides tool interfaces such as ask_question, read_wiki_structure, and read_wiki_contents via the MCP Server.

But documents are essentially linear narratives optimized for human reading:

● Hard to authenticate: Descriptions generated by LLMs may hallucinate, and in code understanding, an incorrect "A invokes B" is more dangerous than no information at all.

● Hard to traverse: Documents cannot answer graph traversal queries such as "list all functions that invoke X."

● Difficult to infer: Multi-hop analysis is not supported: if A is changed, following the calls relationship for 3 hops, which entry points are affected?

● Difficult to maintain: Changing a single line of code requires full regeneration. Although DeepWiki supports badge-triggered auto-refresh, each time it invokes a full LLM call, resulting in high cost and latency.

● Not programmable: The MCP interface essentially asks a document a question, rather than executing a query on the graph.

The relationship between CodeWiki and CodeIndex is similar to the relationship between materialized views and DPI engines in the database realm: documents are precomputed views that answer preset questions quickly, but cannot answer ad-hoc queries outside the view.

Paradigm 5: Code Knowledge Graph (Our Choice)
The five paradigms can be arranged along a single axis: from "stateless search" to "stateful inference".

If Agentic Search is each on-site survey, CodeIndex is surveying with a high-definition map, Code Graph + retrieval-augmented generation is a map annotated with highways and railways, and CodeWiki is a commissioned local chronicle: then what we want to build is a living GIS system: you can query the path between any two points, overlay real-time traffic data, annotate the traffic history of each road, continuously update as the terrain changes, and support storage analysis in any dimension.

The key difference is not better search, but a systematic combination of three dimensions:

1.Deterministic vs. Probabilistic: CodeIndex gives you the most likely relevant snippets (vector similarity). Code Graph gives you structural relationships parsed from the AST (but query capability is limited by the retrieval-augmented generation frame). We give you deterministic AST fetch + SPL/graph-match arbitrary query: confidence level 1.0 relationships + a Turing-complete query language.

2.Code domain vs cross-domain: From Agentic Search to Code Graph + retrieval-augmented generation, all solutions stop at the code domain. Which functions does this module invoke: answerable. How many alerts did the production service corresponding to this module have last week: unanswerable. UModel's EntitySetLink can connect code.module to ops.service, event.alert, and req.issue. The agent infers along the link without needing to jump out of the graph.

3.Snapshot vs timeline: CodeIndex is a snapshot index of the current code. Code Graph is starting to incorporate commit history. We provide a complete time dimension: commit_log, build_log, deploy_log, test_log, and incident_log. Each LogSet is associated with an EntitySet through DataLink. The agent not only knows what the current structure is, but also how it evolved to this point and how it performs in production.

From Personal Wiki to Code Wiki: One Paradigm, Different Certainty

The personal Wiki flow is: source data → LLM extracts entities and relationships → snap and normalization → UModel structure layer → Wiki pages. The entire extraction procedure depends entirely on the LLM, so each relationship is inherently uncertain: Are Zhang Cheng and Yuan Yi the same person? Is this article related to that project? Both require LLM judgment and correction by the snap layer.

There is one fundamental difference in the code realm: the structural relationships of code are deterministic.

import pkg/a2a imports pkg/a2a, and func (s *Server) HandleRequest() is a method of the Server class: these do not require LLM inference — AST parsing can determine them with a confidence level of 1.0.

This means that code wikis can introduce a model layer deterministic guarantee on top of the personal wiki paradigm:

Personal Wiki:   Source material → [LLM fetch] → Snap → UModel → Wiki Page  
                          ↑ Entirely dependent on LLM, confidence level 0.4–0.9  

Code Wiki:   Code Repository → [AST deterministic fetch] + [LLM semantics enhancement] → UModel → CLI query  
                          ↑ Structural relationships determined (1.0)   ↑ Summary/attribution supplement (0.6–0.9)

This layer of determinism is critical to the agent's reasoning: when the agent performs RCA, it needs to trust every hop on the invocation chain. If a calls relationship is guessed by the LLM, the entire reasoning chain becomes unreliable. Relationships fetched by AST are deterministic facts that the agent can trust unconditionally.

At the same time, the code wiki retains the LLM enhancement capabilities of the personal wiki: semantic layer information such as module summaries, document-code associations, and widget attributions is still generated by the LLM, annotated as INFERRED, and the agent can selectively accept it.

Entity + Log + Link: Not Just a Structure Graph
The core design of UModel in the observability realm is to describe the IT world with a graph composed of sets and links: EntitySet describes the current state of entities, LogSet describes timing management events, MetricSet describes measure indicators, and Link connects them into a network.

When we apply the same modeling methodology to the code realm, we get more than just a structure graph.

Entity: Current Code Structure
Five types of EntitySets describe the current state of the code and support the coexistence of multiple repositories through repo_id composite primary keys:

repo_id participates in the primary key calculation (Entity ID = md5(repo_id:pk_value)), so that modules with the same name in different repositories do not conflict, and a single graph can accommodate multiple projects simultaneously.

Six types of EntitySetLink describe structural relationships: contains, imports, calls, extends, describes, and belongs_to. Each relationship is annotated with confidence and extraction_method (EXTRACTED / INFERRED / AMBIGUOUS).

Log: The Change History of Code
This is a critical watershed between Code-WIKI and all pure graph tools.

In the observability realm, we look at not only the current status of a pod (Entity), but also its logs and metric trends. Code is the same: looking only at the structure without the history is like looking at a single screenshot.

Logs in the code realm go far beyond Git commits:

The value of logs lies in the associated query with entities:

● Who modified this module in the last week? →commit_log WHERE module_path = X AND time > now()-7d

● Have any new incidents occurred since the last deployment? →deploy_log JOIN incident_log ON time_window

● Has the build time increased after introducing this dependency? →build_log GROUP BY week, cross-referencing dependency change time in commit_log

Each LogSet is associated with the corresponding EntitySet through DataLink. The agent can navigate from an entity to a log, or trace back from a log to an entity.

Cross-Domain Association: Code Is Not an Island
Code never exists in isolation. It serves requirements, reaches production through CICD, generates observable data at runtime, and traces back to the code for troubleshooting when issues arise. In the current toolchain, each link is an island: requirements are in Jira, code is in Git, builds are in Jenkins, services run in K8s, and alerts are in the monitoring system.

When a production alert fires, how many systems must you jump through and how many pieces of info must you manually correlate to trace from the alert back to the code change?

The value of UModel is that all these entities can live in the same graph.

Technical Architecture: Dual-Track Fetch + Graph Build
Overall Pipeline

DETECT: Incremental Change Detection
A SHA256 content fingerprint is computed for each file and compared against the cache from the last build. For vibeops-agents (~2,375 Go files), an incremental build typically processes only dozens of changed files, reducing the time from minutes to seconds.

EXTRACT: AST + LLM Dual Track
AST track (tree-sitter): A PEG-based incremental resolver that supports 40+ languages. It uses tags.scm rules to consistently fetch definitions, references, structural relationships, import relationships, invocation relationships, and inheritance relationships across languages. All extraction results have a confidence level of 1.0.

Notably, CodeIndex solutions such as Cursor also use tree-sitter. However, they use tree-sitter for semantic text segmentation (splitting code into chunks suitable for embedding), whereas we use tree-sitter for structure extraction (fetching deterministic relationships such as definitions, references, invocations, and inheritance). The same resolver serves completely different goals: the former produces vectors, and the latter produces a graph.

LLM track: Module summaries (agent context injection segments, not human-readable documents), document-code associations, and widget attribution. Each is annotated with extraction_method: INFERRED + confidence level. Agents can select a trust threshold by scenario: RCA prefers high confidence levels, while exploration scenarios can be relaxed.

RESOLVE: Cross-file Symbol Parsing
Single-file AST cannot resolve cross-file references. RESOLVE handles the following:

● Go import github.com/org/repo/pkg/a2a→ module_path pkg/a2a

● Method receiver type (s *Server)→ attribution code.type pkg/server.Server

● Invoke s.HandleRequest()→pkg/server.Server.HandleRequest

● Interface implementation type Adapter struct implements Handler→ extends relationship

Deterministic parsing, no dependency on LLM.

BUILD: Graph Assembly + Architecture Discovery
Architecture discovery is not simple community detection: Louvain/Leiden discovers clusters, not architectures. Complete flow:

Step 1: Graph construction  
  Modules as edge zones, imports + calls + extends as directed edges  
  Edge weight: calls > imports > extends  

Step 2: Hierarchical analysis  
  Compute dependency directionality: A→B and B↛A → A is above B  
  Detect top-level entries with indegree = 0 and underlying infrastructure with outdegree = 0  

Step 3: Community detection  
  Leiden algorithm discovers functional clusters on directed graphs  
  Resolution parameter controls granularity (~150 modules → ~15 widgets)  

Step 4: Annotation and naming  
  Annotate hierarchy based on dependency direction: API/Gateway, Service/Business, Infrastructure/Utility  
  LLM naming and description, cross-validation with project documents

The output is a hierarchical, directional, named architecture view. The agent can use this to determine whether an invocation crosses architecture layers.

SYNC: Synchronize to UModel

Entity write: starops umodel post-logs → __entity logstore  
Topo write:  starops umodel post-logs → __topo logstore  
Schema synchronization: starops umodel sync (register EntitySet/Link definitions)

The UModel backend is based on the Simple Log Service storage engine and inherits capabilities such as high-throughput writes, second-level query, graph-match graph traversal, SQL aggregation, and full-text index.

SERVE: Engineering Details of the Query
Key patterns explored in practice:

Two-step query: graph-match returns entity_id without business fields. All graph traversal queries first traverse the topology to obtain the ID set, then pull business fields in batches:

Step 1: .topo | graph-match (n1:code@code.module {__entity_id__: '<id>'})  
              -[e]->(n2) project n1, e, n2  

Step 2: .entity with(domain='code', name='code.module', ids=['id1','id2',...])

Aggregation via direct Simple Log Service (SLS) query: Statistical queries such as hot spot analysis directly run SQL against the __topo Logstore:

SELECT dest_entity_id, count(1) as import_count

FROM log WHERE relation_type = 'imports'

GROUP BY dest_entity_id

ORDER BY import_count DESC LIMIT 20

At the current multi-repository scale (~11,000 entities, ~19,000 edges, including the vibeops-agents and starops-cli projects), the end-to-end latency of a single query is in the hundreds of milliseconds.

Agent Interaction Layer: Command-Line Interface (CLI) + Skill
CLI Design
The agent's reasoning is progressive: search first, see the results, and then decide the next step. The CLI's search→context→impact naturally matches this pattern and supports batch execution and MPS queue combinations.

code-wiki query <subcommand>     # graph query  
  ├── search <keyword>       # entity search  
  ├── context <name>         # full context of a symbol  
  ├── impact <path>          # change impact analysis  
  ├── callers / callees      # invocation chain  
  ├── deps / rdeps           # dependencies / reverse dependencies  

code-wiki check <subcommand>     # administration check  
  ├── arch                   # architecture violation scan  
  └── hotspots               # coupling hot spots  

code-wiki ingest             # build/update graph  
code-wiki status             # health check

Subcommands are organized by agent intent. The agent does not need to know whether the underlying implementation is graph-match or Simple Log Service SQL: use impact to view the impact scope.

Output Format: Optimized for the Agent Context Window
The default --format brief output is optimized for the agent's token budget:

$ code-wiki query context pkg/a2a  

Module: pkg/a2a  
  LOC: 1,247 | Language: Go | Component: a2a-protocol  
  Summary: A2A protocol implementation for agent-to-agent communication  

Types (17): TaskStore(struct), A2AServer(struct), AgentCard(struct), ...  
Functions (52): HandleA2ARequest[entry], StartA2AServer[entry], ...  
Reverse dependencies (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  
Component crossings: → api, → scheduler

The output of a query context is < 500 tokens. Use --format json when full data is required.

Skill: Scenario-based User Guide
Agent Skills with the command-line interface (CLI) are organized by scenario. Agents do not need to learn Structured Process Language syntax:

## RCA: From alerting to code  
code-wiki query search <keyword>       # Locate module  
code-wiki query context <module>      # Understand structure  
code-wiki query callers <function>    # Trace invocation chain  

## Development: Evaluate impact before changing code  
code-wiki query impact <module>       # Impact scope  
code-wiki query deps / rdeps          # Dependencies  

## Administration: Regular checks  
code-wiki check arch                  # Architecture violations  
code-wiki check hotspots              # Hot spot analysis

Case Study
Case 1: Agent Independently Completes Change Impact Assessment
We assign a job to a sub-agent:

Assess the impact scope of modifying the pkg/a2a module in the vibeops-agents project, including affected modules and entry points, upstream and downstream dependencies, and architecture threats.

The sub-agent completed the following inference only by code-wiki** command-line interface (CLI) and Skill**, without reading any source code files:

$ code-wiki query search a2a  
Found 9 entities matching a2a:  
  [module] pkg/a2a                    LOC:1247  A2A protocol core  
  [module] pkg/a2a/adapter            LOC:834   A2A adapter implementations  
  [module] pkg/a2a/taskstore          LOC:567   A2A task state management  
  ...  

$ code-wiki query context pkg/a2a  
Module: pkg/a2a  
  Types (17): TaskStore, A2AServer, AgentCard, ...  
  Functions (52): HandleA2ARequest[entry], StartA2AServer[entry], ...  
  Reverse dependencies (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  

$ code-wiki query impact pkg/a2a  
Impact of pkg/a2a:  
  Direct dependents (9): pkg/api/handler, pkg/server, cmd/vibeops-agents, ...  
  Entry points affected: HandleA2ARequest, StartA2AServer, main  
  Component crossings: a2a-protocol → api, a2a-protocol → server  

$ code-wiki check hotspots  
Top hotspots by import count:  
  1. pkg/a2a/adapter    (48 imports)  ← High coupling warning  
  2. pkg/util/logger     (35 imports)  
  3. pkg/config          (32 imports)

Agent final output:

=== A2A Protocol Modification Impact Assessment ===  
  
Scope of Impact:  
- Directly dependent modules: 9 (pkg/api/handler, pkg/server, etc.)  
- Entry points affected: 3 (HandleA2ARequest, StartA2AServer, main)  
- Cross-component borders: 2 (a2a-protocol → api, a2a-protocol → server)  
  
Threat points:  
- pkg/a2a/adapter is imported by 48 modules, the highest coupling degree in the entire project  
- bus factor = 1, consider adding a reviewer  
  
Suggested execution order:  
1. Modify the pkg/a2a core protocol definition first  
2. Synchronize the interface implementation of pkg/a2a/adapter  
3. Modify the invocation method of pkg/api/handler  
4. Update the initialization logic of pkg/server  
5. Regression testing on the cmd/vibeops-agents entry point

5 commands in total, each taking 1–3 seconds, with a total time of less than 15 seconds. The Agent did not read any source code files and completed a full impact assessment solely through graph queries.

Case 2: RCA: From Alerting to Code Root Cause
Production alerting: service-vibeops error_rate > 5%

# 1. Locate the code module from the O&M entity  
$ code-wiki query context pkg/server  
Module: pkg/server  
  Functions (23): StartServer[entry], handleRequest, applyMiddleware, ...  
  Dependencies (12): pkg/a2a, pkg/config, pkg/auth, ...  
  
# 2. Trace the invocation chain to locate the potentially faulty downstream  
$ code-wiki query callees pkg/server.handleRequest  
Callees of handleRequest:  
  pkg/auth.ValidateToken          [component: auth]  
  pkg/a2a.HandleA2ARequest        [component: a2a-protocol]  
  pkg/scheduler.DispatchTask      [component: scheduler]  
  
# 3. Check commit_log and find that the a2a module was changed 2 hours ago  
#    author=xxx, message=refactor adapter interface  
  
# 4. Confirm the impact of the change  
$ code-wiki query impact pkg/a2a  
Impact of pkg/a2a:  
  Direct dependents (9): pkg/api/handler, pkg/server, ...  
  Entry points affected: HandleA2ARequest, StartA2AServer  
  
# → Root cause: The a2a interface refactoring affected the server invocation chain. Check interface compatibility.

Case 3: Architecture Administration: Detecting Architecture Decay

# 1. Scan for architecture violations  
$ code-wiki check arch  
Architecture violations:  
  pkg/util/logger calls pkg/api/handler.GetRequestID  
    [utility → api] The utility layer should not invoke the api layer  
  pkg/config calls pkg/scheduler.GetDefaultConfig  
    [infra → service] The infrastructure layer should not depend on the business layer  
  
# 2. Identify coupling hot spots  
$ code-wiki check hotspots  
Top hotspots:  
  1. pkg/a2a/adapter      48 imports  [HIGH]  
  2. pkg/util/logger       35 imports  [NORMAL]  
  3. pkg/scheduler/queue   28 imports  [MEDIUM]  
  
# 3. Analyze the highly coupled module in depth  
$ code-wiki query rdeps pkg/a2a/adapter  
Reverse dependencies (48):  
  pkg/api/* (12 modules), pkg/server/* (8 modules), pkg/scheduler/* (6 modules), ...  
  
# Agent suggests splitting into adapter/protocol, adapter/transform, and adapter/routing

Outlook
Comprehensive Digital Evaluation
We plan to build a standardized code comprehension evaluation benchmark covering core scenarios such as impact analysis, invocation chain tracing, architecture violation detection, and RCA root cause localization. On real codebases of varying scales, we will compare the performance of three paradigms — Model + Bash (Agentic Search), Model + CodeWiki (LLM document), and Model + UModel (knowledge graph) — across dimensions including accuracy, recall rate, number of inference steps, and token consumption.

Use SWE-bench-style quantization evaluation to make the capability borders of each paradigm measurable and reproducible. Based on this, optimize the overall technical architecture based on benchmark fractions, including iterative upgrades to related skills and the command-line interface (CLI).

Agent Self-Maintenance
Agents are not just graph consumers, they can also be maintainers:

● After a code schema evolution, the associated LLM-inferred relationships are marked for reevaluation

● Regularly inspect orphaned entities, missing relationships, and expired data

● On top of the above capabilities, a verification and quality assessment system is also needed to make self-maintenance controllable.

Architecture Guard Gate
Integrated into the CI flow, automatically run on PR:

codecode-wiki ingest --incremental        # Incremental graph update  
code-wiki check arch                  # Architecture violation check  
code-wiki query impact <changed_files> # Change impact analysis

From Observable to Understandable
From modeling observable data to modeling code knowledge, from describing running systems with Entity + Log to describing code systems with Entity + Log: UModel is evolving from observing IT systems to understanding the code and procedures that build them.

When agents truly understand the structure, history, and production performance of code simultaneously, genuinely AI-native software engineering becomes possible.