ObservabilityGuy

Posted on Jun 11

From Black Box to Transparent: Alibaba Cloud Agent Observability and Audit Data Collection in Practice

#agents #observability #beginners

This article introduces Alibaba Cloud's LoongSuite solution for comprehensive AI agent observability and audit data collection using extended OpenTelemetry GenAI semantic conventions.

I. Introduction

In 2025, AI agents are moving from the lab to large-scale production. From code assistants used by developers daily to intelligent customer service in enterprise service scenarios, to multi-agent collaboration systems of ever-increasing complexity, AI agents are reshaping software development and business operations at an unprecedented pace.

However, once agents are actually running, a critical problem emerges: the actual runtime behavior of AI agents is difficult to observe, trace, and govern.

A coding agent autonomously and without authorization modifies core configuration files overnight, with no way to know what changed or why. An intelligent customer service agent autonomously issues a "cancel order" instruction, yet the decision logic, tool calling chain, and token resource consumption cannot be reviewed. A multi-agent collaborative job fails midway, and the failure node and root cause are difficult to pinpoint.

These issues point to a common requirement: AI agents need comprehensive observability. Moreover, this observability cannot remain at the shallow statistical dimension of "request success/failure" — it must deeply cover AI agent-specific runtime aspects such as LLM invocation, tool execution, multi-round inference, and memory retrieval.

Drawing on the OpenTelemetry (OTel) community standard and its own long-standing observability practice, Alibaba Cloud has developed a complete data collection solution that covers three forms of agents. Building on the OTel GenAI semantic conventions, Alibaba Cloud has also released the LoongSuite GenAI semantic conventions for observability. This article walks through the design, technical implementation, and usage of the solution.

II. Agent Form Classification and Observability Challenges

The AI agent market is thriving and highly diverse. The runtime models, deployment environments, and use cases of different agent types vary significantly, and their observability and audit needs differ accordingly. We classify mainstream AI agents on the market into three categories:

2.1 Three Major Forms of Agent

2.2 Three Core Challenges
No matter what form is adopted, AI agents will encounter three common problems after large-scale use:

The execution process is a black box. An agent's execution involves LLM calls, tool execution, multi-round reasoning, and memory retrieval. Traditional metrics, logs, and traces cannot effectively describe this new computing paradigm. For example, for an agent task that contains 10 rounds of ReAct reasoning, a traditional solution sees only 10 independent HTTP requests and cannot reconstruct the complete, hierarchical, ordered decision-making process.
The behavior trajectory is difficult to trace. Agents operate with broad, autonomous permissions: they can read and write local files, run system commands, and call third-party APIs. Without dedicated audit capabilities, none of these operations can be traced, which poses high risks in enterprise security and compliance scenarios.
Cost is hard to quantify. LLM token consumption is the main cost source for agents, and multiple rounds of iteration and tool calls multiply that consumption quickly. Without fine-grained cost attribution by agent, user, and task, enterprises cannot control budgets or evaluate return on investment.

III. A Differentiated Collection Approach: Adapting to Agents' Native Runtime Forms

Core design principle: Adapt the data collection capability to the native running mode of the AI Agent instead of forcing the Agent to adapt to the data collection tools.

3.1 Coding Agent: LoongSuite Pilot, a Lightweight Client-Side Data Collector
Coding agents run on the developer's local machine, where all core behaviors — code edits, file creation, terminal command execution — happen in the local environment, completely invisible to traditional server-side agents. To address this, we built LoongSuite Pilot, a client-side data collection platform purpose-built for coding agents.

Core Advantages

One-time deployment, full coverage. Pilot is not a solution exclusive to a single agent, but a unified platform. It currently supports five mainstream coding agents: Claude Code, Codex, Cursor, Qoder, and QoderWork. Developers only need to install it once to automatically collect data from all code assistants in use, with no repeated configuration required.
Silent background execution with zero disruption. Pilot runs as a local daemon process in the background, automatically detecting installed coding agents on the device and deploying capabilities. Developers do not need to modify agent configurations or change usage habits at any point. All behaviors, including LLM invocations, tool execution, and code modifications, are seamlessly recorded.
Resumable collection for stable and reliable data. A built-in breakpoint-resumable collection mechanism handles unstable scenarios such as network fluctuations on local devices, device restarts, and terminal shutdowns. After a process is abnormally interrupted and restarted, no data duplication or data loss occurs, ensuring data integrity.
Flexible collection granularity that balances observability and data security. Different teams have different data security requirements. Pilot supports flexible configuration of collection granularity by agent type. For complete audit needs, detailed info such as message content and tool parameters can be collected. In data-sensitive scenarios, only metadata (model name, token consumption, duration, etc.) is reported, achieving a precise balance between observability requirements and data security.
Plugin architecture, quickly compatible with new agents. Pilot uses a plugin architecture and provides out-of-the-box collection base classes for different agent data formats, such as hook logs, IDE snapshots, SQLite databases, and session files. Integrating a new coding agent requires implementing only 2-3 abstract methods, so you can keep up with ecosystem iterations quickly.

Supported Coding Agents and Coverage

3.2 Personal General-Purpose Assistant: One-Line Command for Full Observability and Audit

Personal general-purpose assistants usually run as standalone services, providing end users with dialogue and task-execution capabilities. For this type of agent, we provide a dedicated plugin that enables full tracing with a single command.

Design philosophy

Take OpenClaw as an example. Although its built-in diagnostics-otel extension can output metrics and some traces, it adopts an event-driven architecture: a span is created independently for each event, with no parent-child relationships between spans and no trace context propagation. In essence, the output is a set of standalone data points. The LoongSuite OpenClaw plug-in, by contrast, is designed for complete distributed tracing: all spans share the same traceId and are connected into a call tree through explicit parent-child relationships.

Span Semantic Model

Spans of each type are connected into a complete trace tree through parent-child relationships. For a single request, O&M personnel can view the number of LLM calls, token consumption, the list of tool calls, the slowest nodes, and fault information.
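
To make the trace-tree idea concrete, here is a minimal sketch in plain Python. The `Span` dataclass is a simplified stand-in for an exported span, not the OTel SDK class, and the sample span names are invented for illustration. It links spans through their parent IDs and then derives the per-request summary described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    """Simplified stand-in for an exported span (not the OTel SDK class)."""
    span_id: str
    parent_id: Optional[str]
    kind: str  # ENTRY, AGENT, STEP, LLM, or TOOL
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link spans through parent_id and return the single root span."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is None:
            roots.append(s)
        else:
            parent.children.append(s)
    if len(roots) != 1:
        raise ValueError(f"expected one root span, found {len(roots)}")
    return roots[0]


def walk(span: Span) -> Iterator[Span]:
    """Depth-first traversal of the trace tree."""
    yield span
    for child in span.children:
        yield from walk(child)


def summarize(root: Span) -> Dict[str, object]:
    """Per-request summary: LLM call count, token total, tools used."""
    llm_calls, tokens, tools = 0, 0, []
    for s in walk(root):
        if s.kind == "LLM":
            llm_calls += 1
            tokens += int(s.attributes.get("gen_ai.usage.input_tokens", 0))
            tokens += int(s.attributes.get("gen_ai.usage.output_tokens", 0))
        elif s.kind == "TOOL":
            tools.append(s.name)
    return {"llm_calls": llm_calls, "total_tokens": tokens, "tools": tools}


spans = [
    Span("1", None, "ENTRY", "chat"),
    Span("2", "1", "AGENT", "assistant"),
    Span("3", "2", "STEP", "round-1"),
    Span("4", "3", "LLM", "chat model",
         {"gen_ai.usage.input_tokens": 120, "gen_ai.usage.output_tokens": 30}),
    Span("5", "3", "TOOL", "read_file"),
]
summary = summarize(build_tree(spans))
print(summary)
```

With flat, unrelated spans, `build_tree` would find several roots and raise, which is exactly the situation a per-event span model leaves you in.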

Essential differences from built-in observability

Compared with the built-in observability capabilities of OpenClaw, LoongSuite plug-ins are different in two aspects:

Trace integrity. Built-in observability is usually flat: events are independent and not correlated with each other. Our plug-in relies on the OTel context propagation mechanism to ensure that ENTRY → AGENT → STEP → LLM / TOOL forms a complete call tree, which restores the full picture of a request.

Data richness. Built-in observability often only records basic metrics such as model usage, while our plug-ins fully record fields such as gen_ai.input.messages, gen_ai.output.messages, gen_ai.system.instructions, gen_ai.tool.call.arguments, and gen_ai.tool.call.result to meet the needs of in-depth audit and troubleshooting.

The same plug-in mechanism already covers personal general-purpose assistants such as Hermes Agent and QwenPaw.

3.3 High-Code and Low-Code Framework Agents: Zero-Code Instrumentation with the LoongSuite Python Agent
For agent applications built on frameworks such as LangChain, AgentScope, and Dify, the runtime behaves like a traditional Python application. We provide the LoongSuite Python Agent (deeply customized from OpenTelemetry Python Contrib), which achieves zero-code automatic instrumentation with a single command.

Quick start

# 1. Install the LoongSuite Python Agent
pip install loongsuite-distro
# 2. Auto-detect and install the required instrumentation libraries
loongsuite-bootstrap
# 3. Start with one command; probes are injected automatically
loongsuite-instrument \
  --traces_exporter otlp \
  --service_name my-agent-app \
  python my_agent_app.py

loongsuite-bootstrap automatically scans the current environment for installed frameworks (such as langchain, dashscope, and mcp) and installs the corresponding instrumentation packages, so developers do not need to select and install them manually.

Framework Coverage

The LoongSuite Python Agent currently includes 16 instrumentation libraries that cover the mainstream AI agent development frameworks:

Automatically Recognized Span Types

The probe automatically detects and generates multiple GenAI span types, covering the entire agent lifecycle:

ENTRY: Request entry
AGENT: Agent execution unit
STEP: ReAct reasoning-action iteration step
LLM: LLM invocation, including request parameters, token consumption, and input/output messages
TOOL: Tool call, including tool name, parameters, and result
MCP: MCP protocol invocation
CHAIN: Chained invocation orchestration
RETRIEVER: Retrieval operations
EMBEDDING: Embedding operations
RERANKER: Reranking operations
WORKFLOW: Workflow orchestration

IV. Observability and Audit Results

After enabling the collection capabilities described above, users get observability views along the following dimensions. Take Claude Code as an example: to enable agent observability, log in to the CloudMonitor 2.0 console, click the corresponding card in the access center, and follow the steps to complete installation and onboarding with a single command.

4.1 End-to-End Agent Call Chain View
The complete execution process of the agent is presented in the form of a trace tree, from the user request entry (ENTRY) to the agent decision (AGENT), inference step (STEP), LLM call (LLM), and tool execution (TOOL). The hierarchical relationship is clear at a glance. For complex tasks with multiple rounds of ReAct, you can use Step Span to quickly locate which iteration has a problem, and then go to the LLM or Tool Span in the round to analyze the root cause.

Troubleshooting pattern: When an agent executes a 10-round ReAct process, first use the Step spans to identify the round in which the problem occurred, and then analyze the specific LLM or tool span within that round. This top-down method greatly improves fault-locating efficiency for complex agents.
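
The top-down pattern can be sketched in a few lines of Python. The span records below are plain dicts with illustrative field names (`parent`, `round`, `status`), which are assumptions for this example and not the exported schema. The code first finds the earliest failing step, then drills into that step's failing children.

```python
# Each span is a plain dict; the field names are illustrative assumptions.
trace = [
    {"id": "s1", "parent": "agent", "kind": "STEP", "round": 1, "status": "OK"},
    {"id": "s2", "parent": "agent", "kind": "STEP", "round": 2, "status": "ERROR"},
    {"id": "l1", "parent": "s1", "kind": "LLM", "name": "chat", "status": "OK"},
    {"id": "l2", "parent": "s2", "kind": "LLM", "name": "chat", "status": "OK"},
    {"id": "t2", "parent": "s2", "kind": "TOOL", "name": "run_shell", "status": "ERROR"},
]


def first_failed_step(spans):
    """Level 1: return the earliest STEP span that ended in error, if any."""
    failed = [s for s in spans if s["kind"] == "STEP" and s["status"] == "ERROR"]
    return min(failed, key=lambda s: s["round"]) if failed else None


def failed_children(spans, step):
    """Level 2: return the LLM/TOOL spans under that step that failed."""
    return [s for s in spans if s["parent"] == step["id"] and s["status"] == "ERROR"]


step = first_failed_step(trace)
culprits = failed_children(trace, step) if step else []
print(step["round"] if step else None, [c["name"] for c in culprits])
```

Here the search narrows from five spans to one tool call in round 2 without reading any span that belongs to a healthy round.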

4.2 Token Usage and Cost Tracking
Based on gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.usage.total_tokens, as well as the cost fields extended by Alibaba Cloud (input_cost, output_cost, and total_cost), you can:

View token usage details for a single request
Aggregate cost by agent, user, or time
Use the cache token fields (cache_read.input_tokens, cache_creation.input_tokens) to evaluate cache policy effectiveness
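
As a rough illustration of cost aggregation, the sketch below groups LLM span records by agent and user. The per-1,000-token prices, the `agent` and `user` grouping keys, and the shortened token field names are all assumptions for this example. It also assumes that cache-read tokens are counted as part of input tokens, which varies by provider.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices per 1,000 tokens; substitute your model's real rates.
PRICE_PER_1K = {"input": Decimal("0.002"), "output": Decimal("0.006")}

# Flattened LLM span records. "agent" and "user" are assumed grouping keys.
llm_spans = [
    {"agent": "coder", "user": "alice", "input_tokens": 1000,
     "output_tokens": 500, "cache_read_input_tokens": 400},
    {"agent": "coder", "user": "alice", "input_tokens": 2000,
     "output_tokens": 500, "cache_read_input_tokens": 1400},
    {"agent": "support", "user": "bob", "input_tokens": 500,
     "output_tokens": 100, "cache_read_input_tokens": 0},
]


def aggregate(spans):
    """Sum tokens per (agent, user), then derive cost and cache hit ratio."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "cache_read": 0})
    for s in spans:
        bucket = totals[(s["agent"], s["user"])]
        bucket["input"] += s["input_tokens"]
        bucket["output"] += s["output_tokens"]
        bucket["cache_read"] += s["cache_read_input_tokens"]

    report = {}
    for key, t in totals.items():
        cost = (t["input"] * PRICE_PER_1K["input"]
                + t["output"] * PRICE_PER_1K["output"]) / 1000
        report[key] = {
            "total_tokens": t["input"] + t["output"],
            "total_cost": cost,
            # Share of input tokens served from cache (assumed to be a subset).
            "cache_hit_ratio": t["cache_read"] / t["input"] if t["input"] else 0.0,
        }
    return report


report = aggregate(llm_spans)
for key, row in report.items():
    print(key, row)
```

`Decimal` keeps the cost arithmetic exact, which matters once many small per-request amounts are summed into a budget figure.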

4.3 Session and Multi-Turn Conversation Tracking
The gen_ai.session.id, gen_ai.turn.id, and gen_ai.step.id attributes form a three-level identification system that enables:

Full conversation traceability across multiple rounds of conversation

Step-level fine-grained analysis in a single-round dialogue

Session path analysis and user behavior insights
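
A minimal sketch of how the three identifiers can be used to regroup flat span records into a session → turn → step hierarchy. The attribute names come from the text above. The `kind` key and the sample values are assumptions for illustration.

```python
# Flat span records carrying the three identifiers; "kind" is an assumed key.
spans = [
    {"gen_ai.session.id": "sess-1", "gen_ai.turn.id": "turn-1", "gen_ai.step.id": "step-1", "kind": "LLM"},
    {"gen_ai.session.id": "sess-1", "gen_ai.turn.id": "turn-1", "gen_ai.step.id": "step-1", "kind": "TOOL"},
    {"gen_ai.session.id": "sess-1", "gen_ai.turn.id": "turn-1", "gen_ai.step.id": "step-2", "kind": "LLM"},
    {"gen_ai.session.id": "sess-1", "gen_ai.turn.id": "turn-2", "gen_ai.step.id": "step-1", "kind": "LLM"},
    {"gen_ai.session.id": "sess-2", "gen_ai.turn.id": "turn-1", "gen_ai.step.id": "step-1", "kind": "LLM"},
]


def group_spans(records):
    """Build {session: {turn: {step: [span kinds in arrival order]}}}."""
    tree = {}
    for r in records:
        turns = tree.setdefault(r["gen_ai.session.id"], {})
        steps = turns.setdefault(r["gen_ai.turn.id"], {})
        steps.setdefault(r["gen_ai.step.id"], []).append(r["kind"])
    return tree


tree = group_spans(spans)
for session_id, turns in tree.items():
    print(session_id, "turns:", len(turns))
```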

4.4 Tool Call Audit
You can record the tools that are called by the agent, the parameters that are specified, the results that are returned, and the duration. For the Coding Agent, this means that every file read or write and every command execution is documented. For MCP protocol calls, complete request-response auditing is also provided.

Behavior Analysis Dashboard

The top count card divides tool calls into dimensions such as command execution, file reading and writing, search, web browsing, and MCP calls by behavior type, and marks the categories with abnormally high call volume with striking red or orange colors to provide a quick snapshot of the overall behavior composition. The right side displays the number of active sessions and the number of users at the same time, which is convenient for correlating the behavior popularity with the usage scale. The session statistics table below is expanded by session and records the number of calls in each session in each dimension of behavior. This allows you to locate the sessions and users in which high-frequency operations are concentrated.

Tool Call Distribution

The tool call distribution page presents the tool usage structure from two perspectives. The pie chart on the left shows the proportion of each type among all tool calls (such as Read, Write, Bash, and TodoWrite) to help the team understand which tool capabilities the agent relies on most. The pie chart on the right shows the distribution of MCP tool calls separately, revealing which external capabilities are frequently called in cross-system integration. The trend comparison chart below shows how the number of calls for each tool type changes over time, making it easy to identify phased changes in call patterns. For example, a surge in Bash calls on a certain day may indicate batch script tasks or abnormal behavior.

Security Audit Overview

The Overview page compresses the security situation of AI agents into a screen-readable risk snapshot based on the multi-dimensional high-risk operation count within a specified time window. The funnel on the left side gradually converges from full sessions to sessions with security risks. This visually shows the proportion of risk surfaces. On the right side, metrics such as high-risk command execution, outbound web requests, outbound command-line requests, sensitive file access, and prompt injection are displayed side by side. With the comparison data, the security team can quickly determine whether the current risk level is abnormal without in-depth details.

Particularly noteworthy is the count of high-risk operations that follow a prompt injection event. Ordinary high-risk operations may stem from legitimate task requirements, whereas high-risk behavior triggered by injection is a strong threat signal: it means the injected malicious instructions have actually driven the agent to act. Even allowing for false positives, such signals should trigger manual review at the highest priority rather than wait for further confirmation. The "number of tool-calling sessions following prompt injection" is therefore the highest-confidence Indicator of Compromise (IoC) in the entire overview. Three such sessions often deserve higher priority than hundreds of ordinary high-risk commands.
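
The "tool-calling sessions following prompt injection" metric can be sketched as a simple ordering check over audit events. The event shape (`session`, `ts`, `type`) is an assumption for this example, and how an injection is detected in the first place is outside its scope.

```python
# Audit events; the field names and type labels are illustrative assumptions.
events = [
    {"session": "sess-a", "ts": 1, "type": "tool_call"},
    {"session": "sess-a", "ts": 5, "type": "prompt_injection"},
    {"session": "sess-a", "ts": 7, "type": "tool_call"},
    {"session": "sess-b", "ts": 2, "type": "tool_call"},
    {"session": "sess-b", "ts": 9, "type": "prompt_injection"},
    {"session": "sess-c", "ts": 3, "type": "tool_call"},
]


def sessions_with_tool_calls_after_injection(evts):
    """Return sessions where a tool call happened after the first injection."""
    first_injection = {}
    for e in evts:
        if e["type"] == "prompt_injection":
            sid = e["session"]
            first_injection[sid] = min(e["ts"], first_injection.get(sid, e["ts"]))

    flagged = set()
    for e in evts:
        if e["type"] == "tool_call":
            start = first_injection.get(e["session"])
            if start is not None and e["ts"] > start:
                flagged.add(e["session"])
    return sorted(flagged)


flagged = sessions_with_tool_calls_after_injection(events)
print(flagged)
```

Only sess-a is flagged: sess-b saw an injection but made no tool call afterwards, and sess-c had no injection at all.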

High-Risk Session Tracing

A two-stage drill-down capability is provided below. The upper layer is a high-risk session risk score table, which aggregates the risk counts of each dimension (injection hits, high-risk operations, sensitive file accesses, and outbound requests) by session and sorts sessions by overall risk score, so that the sessions most in need of manual intervention appear first. The security team does not need to screen logs one by one. Instead, it starts tracing directly from the highest-risk session, greatly shortening the time from discovery to response.

The lower layer is a high-risk event summary table, which drills down to the granularity of individual events: specific time, user, session, event type, tool name involved, threat type, and complete context content. This gives security analysts the original evidence required for a final determination.
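
One way to picture the upper-layer score table is a weighted sum of per-session counts followed by a sort. The weights below are hypothetical and chosen only for illustration. The article does not state how the actual score is computed.

```python
# Hypothetical weights per risk dimension; tune these to your own policy.
WEIGHTS = {
    "injection_hits": 10,
    "high_risk_ops": 3,
    "sensitive_file_access": 2,
    "outbound": 1,
}


def risk_score(counts):
    """Weighted sum of the per-dimension risk counts for one session."""
    return sum(weight * counts.get(dim, 0) for dim, weight in WEIGHTS.items())


def rank_sessions(per_session):
    """Return session IDs ordered from highest to lowest risk score."""
    return sorted(per_session, key=lambda sid: risk_score(per_session[sid]), reverse=True)


per_session = {
    "s1": {"high_risk_ops": 4},
    "s2": {"injection_hits": 1, "high_risk_ops": 1, "outbound": 2},
    "s3": {"sensitive_file_access": 1},
}
ranking = rank_sessions(per_session)
print([(sid, risk_score(per_session[sid])) for sid in ranking])
```

Weighting injection hits heavily reflects the point made earlier: a single injection-driven session can outrank a session with several ordinary high-risk operations.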

V. Deep Extensions Based on the OTel GenAI Semantic Conventions

The data capabilities of the Alibaba Cloud AI agent observability system are built on the self-developed LoongSuite GenAI Observability Semantic Conventions. This specification builds on the community OTel GenAI standard and fills the semantic gaps found in real business scenarios.

5.1 Why Extend Beyond Community Standards
As early as the beginning of 2024, OpenTelemetry started driving the development of GenAI semantic conventions, aiming to establish a unified observability data language. The community standard has laid an important foundation:

gen_ai.operation.name: Standardized operation types (chat, embeddings, execute_tool, etc.)
gen_ai.span.kind: Differentiates span types such as LLM, CHAIN, AGENT, TOOL, and RETRIEVER
gen_ai.request.model / gen_ai.response.model: Model identity
gen_ai.usage.input_tokens / output_tokens / total_tokens: Token usage
gen_ai.input.messages / gen_ai.output.messages: Input and output messages
gen_ai.response.finish_reasons: Model stop reason

However, community standards inherently need to balance broad applicability with long-term stability, resulting in a relatively cautious pace of evolution. The current OTel GenAI semantic conventions are still in Development status, and many new concepts and scenarios are still being absorbed and converging.

In practice at Alibaba and Ant Group, we encountered many more complex and granular real-world scenarios. For example, a seemingly simple scenario of "ordering milk tea with Qwen" actually involves cross-domain coordination among multiple business systems, including Qwen Agent, Flash Sale Agent, Amap Agent, and Alipay Agent. These scenarios place higher demands on semantic expressiveness.

To this end, based on the OTel GenAI community standard and drawing from extensive internal hands-on experience, we released the LoongSuite GenAI Observability Semantic Conventions. In 2026, the specification was officially open-sourced as a vendor extension standard for OTel GenAI, with plans to gradually contribute improvements upstream to the community.

5.2 Selected Core Extensions
Extension 1: Entry Span and Step Span — Making Complex Agent Call Chains Readable

Problem background: When an agent executes a long-running job, a single trace may contain hundreds or even thousands of spans. The native standard cannot distinguish business levels, making call chains cluttered and difficult to analyze.

Semantic Modeling:

Entry Span (gen_ai.span.kind = ENTRY): Created at the entry point of an agent call. It captures the original input and output exchanged between the user and the model, which together form the dialogue history. This ensures that downstream processing can obtain the original user request without contamination from the system prompt or framework prompts.
Step Span (gen_ai.operation.name = react): Represents one ReAct iteration of the agent as its own level in the hierarchy. Each iteration completes a cycle of "reflection → tool invocation → model invocation" and is identified by gen_ai.react.round. This round-by-round span structure makes the trajectory of each loop clear at a glance. The convention has been implemented in multiple scenarios such as OpenClaw, QwenPaw, and Hermes Agent.

Extension 2: Skill Semantics — Making Business Function Domains Observable

Background: In scenarios such as e-commerce shopping assistants, commands are routed to the corresponding Skill after the agent understands the intent. Existing semantic conventions lack an abstraction of the business function aggregation layer of Skill.

Semantic Modeling: The gen_ai.skill.* attribute family is added:

At the current stage, these attributes are attached to the execute_tool span so that they can be adopted quickly. In parallel, we have implemented a standalone invoke_skill span and submitted a proposal to the OTel community (#3540).

Downstream value: The observability platform can aggregate and analyze data by functional domain to quickly identify which Skill has the highest error rate, compare whether the latency of a new Skill version has degraded after launch, and measure the proportion of Skill execution time spent on LLM calls.
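
The per-Skill analysis can be sketched as a group-by over tool span records. In this example the `skill` key is a stand-in for whichever gen_ai.skill.* attribute identifies the Skill. The skill names, durations, and the other field names are likewise invented for illustration.

```python
from collections import defaultdict

# Tool span records; "skill" stands in for a gen_ai.skill.* identifier.
records = [
    {"skill": "order_lookup", "duration_ms": 120, "error": False},
    {"skill": "order_lookup", "duration_ms": 180, "error": True},
    {"skill": "refund", "duration_ms": 300, "error": False},
    {"skill": "refund", "duration_ms": 500, "error": False},
]


def skill_stats(rows):
    """Compute error rate and average latency per skill."""
    acc = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for r in rows:
        a = acc[r["skill"]]
        a["calls"] += 1
        a["errors"] += 1 if r["error"] else 0
        a["total_ms"] += r["duration_ms"]
    return {
        name: {
            "error_rate": a["errors"] / a["calls"],
            "avg_ms": a["total_ms"] / a["calls"],
        }
        for name, a in acc.items()
    }


stats = skill_stats(records)
worst = max(stats, key=lambda name: stats[name]["error_rate"])
print(worst, stats[worst])
```

Running the same function on records from two releases and comparing `avg_ms` per skill would answer the version-to-version latency question in the same way.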

5.3 Engineering Implementation: GenAI Utils
The value of semantic conventions lies not only in documents but also in engineering implementation. We implemented GenAI Utils in the probe as the engineering capability layer for the LoongSuite SemConv:

Instrumentation layer only extracts data: Each framework instrumentation library intercepts framework calls through hooks or monkey patching and fills the data into the corresponding Invocation data object.
GenAI Utils unifies telemetry output: All span creation, attribute setting, metrics recording, event emission, and context management are handled by the ExtendedTelemetryHandler.
Specification updates happen in one place: When LoongSuite SemConv adds fields or adjusts its structure, only GenAI Utils needs to change, and all downstream instrumentation libraries pick up the change automatically.

Supported Invocation types include LLMInvocation, InvokeAgentInvocation, CreateAgentInvocation, ExecuteToolInvocation, EmbeddingInvocation, RetrieveInvocation, RerankInvocation, and MemoryInvocation, covering the entire lifecycle of GenAI.
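
The layering described above can be illustrated with a small, self-contained sketch. `LLMInvocationSketch` and `TelemetryHandlerSketch` are hypothetical stand-ins written for this example. They do not reproduce the real LLMInvocation or ExtendedTelemetryHandler APIs, whose signatures are not shown in this article. The point is only the division of labor: the instrumentation fills a data object, and one handler owns the mapping to attribute names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LLMInvocationSketch:
    """Hypothetical stand-in for an Invocation data object."""
    request_model: str
    input_tokens: int = 0
    output_tokens: int = 0


class TelemetryHandlerSketch:
    """Hypothetical stand-in: the only place that knows attribute names."""

    def __init__(self) -> None:
        self.emitted: List[Dict[str, object]] = []

    def stop_llm(self, inv: LLMInvocationSketch) -> None:
        # A real handler would create a span here; this sketch records a dict.
        self.emitted.append({
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": inv.request_model,
            "gen_ai.usage.input_tokens": inv.input_tokens,
            "gen_ai.usage.output_tokens": inv.output_tokens,
        })


def instrumented_call(handler: TelemetryHandlerSketch, model: str, prompt: str) -> str:
    """What a framework instrumentation does: extract data, nothing more."""
    inv = LLMInvocationSketch(request_model=model)
    reply = prompt.upper()  # stand-in for the wrapped framework call
    # Token counts are faked with word counts purely for the example.
    inv.input_tokens = len(prompt.split())
    inv.output_tokens = len(reply.split())
    handler.stop_llm(inv)
    return reply


handler = TelemetryHandlerSketch()
reply = instrumented_call(handler, "demo-model", "hello world")
print(reply, handler.emitted[0])
```

If an attribute is renamed in the convention, only `stop_llm` changes in this sketch, and `instrumented_call` stays untouched, which mirrors the single-point-of-update goal.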

GenAI Utils is available for Python, Node.js, and Go, and a Java version will be released soon. The Python and Node.js versions are already open source, and the others will be open-sourced over time.

VI. Summary

The Alibaba Cloud Agent observability and audit solution is applicable to the following scenarios:

The popularity of AI agents has greatly improved production and office efficiency, and it has also raised new requirements for observability, auditability, and governance. Unlike traditional microservices and web applications, AI agents combine new operation modes such as LLM calls, tool execution, and multi-turn reasoning, and they need dedicated data collection and semantic standards.

The Alibaba Cloud LoongSuite solution provides full coverage for the following types of mainstream agents:

LoongSuite Pilot removes the black box around locally running coding agents such as Claude Code, Cursor, Codex, Qoder, and QoderWork.
Dedicated plug-ins (OpenClaw, Hermes Agent, QwenPaw) give personal general-purpose assistants full tracing capabilities.
The LoongSuite Python Agent, which is open source and uses 16 framework instrumentation libraries, allows agent applications developed based on frameworks such as LangChain, AgentScope, Dify, and MCP to implement zero-code access.

More importantly, the LoongSuite GenAI Observability Semantic Conventions, which build on the OTel GenAI semantic conventions, are open source. Key semantic extensions such as the Entry span, the Step span, and Skill semantics fill the semantic gaps of the community standard in real business scenarios, and the GenAI Utils engineering layer ensures a unified implementation of the standard and efficient iteration.

The ultimate goal of unified semantic conventions is not to produce a document, but to enable all users and vendors who adopt the specification to see, analyze, govern, and evolve rapidly growing GenAI applications.
