ObservabilityGuy

Posted on Jun 5

Alibaba & Ant Group LoongSuite GenAI Observability Semantics Specification: From Unified Data Language to Large-scale Implementation

#observability #ai #beginners

This article introduces LoongSuite GenAI SemConv, a unified observability specification extending OpenTelemetry with enhanced semantics for AI agents, skills, and token-level inference.

Background
With the rapid development of AI, especially generative AI (GenAI), AI agent systems have introduced many new core concepts, such as models, prompts, tokens, tool calling, agents, memory, and sessions. These concepts have become the objects that algorithm engineers, O&M engineers, and observability platform users care about most. They need to be collected, displayed, and consumed in a standardized manner, in the same way as HTTP requests and database calls in traditional systems. This allows system maintainers to clearly understand the invocation flow and troubleshoot issues efficiently.

Against this background, OpenTelemetry (OTel) began working on GenAI semantic conventions in early 2024. The goal is to establish a unified data collection specification, Semantic Conventions (SemConv), for these new objects, addressing problems in this domain such as the lack of collection standards for observability data and inconsistent metric definitions.

SemConv Positioning and Value
Many people who are new to OTel consider its data collection tools, such as auto-instrumentation and the SDKs for languages like Java, Go, and Python, to be the core value of the community.

However, once you get to know the community well, you will find that these collection capabilities are more like "tactics". They serve the real "philosophy" of OTel, which is to establish a unified language for observability data through SemConv. **OTel SemConv is a set of observability data collection standards jointly designed and continuously evolved by dozens of top observability vendors and hundreds of domain experts around the world.** Over the past few years, in conversations with core maintainers and co-founders of the community at multiple KubeCon conferences, we learned that they regard SemConv as the soul of OTel. Improving it step by step and moving it towards Stable is the community's most important work.

A unified observability SemConv can achieve the following effects:

Unified data language to resolve inconsistent metric definitions

Take GenAI semantics as an example. GenAI scenarios naturally span models, frameworks, and platforms. Without a unified semantic specification, different teams record information such as "model name," "input length," "token count," and "response content" in their own ways, so field names and metric definitions cannot be aligned. The core value of OTel GenAI SemConv lies in providing standardized fields for these common concepts, such as gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens.

Once these key fields are standardized, different businesses, different infrastructures, and different observation backends can share the same analysis method. This truly achieves "explaining the same category of problems with the same set of data." This is also the most basic and important value of semantics specifications.
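As a small illustration of what "the same set of data" means in practice, the sketch below maps two teams' ad-hoc records onto the standardized attribute names mentioned above. The ad-hoc field names (`provider`, `llm_name`, `prompt_tok`, and so on) and the helper `to_semconv` are hypothetical; only the `gen_ai.*` keys come from the specification.

```python
# Hypothetical per-team aliases for the standardized GenAI attribute names.
ALIASES = {
    "gen_ai.system": ("provider", "vendor"),
    "gen_ai.request.model": ("model", "llm_name"),
    "gen_ai.usage.input_tokens": ("prompt_tok", "tokens_in"),
}


def to_semconv(record: dict) -> dict:
    """Rename ad-hoc fields to gen_ai.* attributes; unknown fields are dropped."""
    normalized = {}
    for standard_key, aliases in ALIASES.items():
        for alias in aliases:
            if alias in record:
                normalized[standard_key] = record[alias]
                break
    return normalized


# Two teams recording the same facts under different names (made-up data).
team_a = {"provider": "dashscope", "model": "qwen-max", "prompt_tok": 42}
team_b = {"vendor": "dashscope", "llm_name": "qwen-max", "tokens_in": 17}

spans = [to_semconv(team_a), to_semconv(team_b)]

# After normalization, one query with one key works for both teams.
total_input_tokens = sum(s["gen_ai.usage.input_tokens"] for s in spans)
print(total_input_tokens)
```

In a real deployment the instrumentation would emit the standard names directly, so no mapping step like this would be needed; the sketch only shows why aligned names make shared analysis possible.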

Support the unified governance of performance, cost, quality, and security

The goal of observability is not only troubleshooting but also the continuous governance of performance, efficiency, security, and output behavior. In GenAI scenarios, for example, teams can track performance, cost, and security issues much more easily once a unified SemConv has standardized key information such as model parameters, response metadata, and token usage.

For large enterprises, this means that the following practical demands can be resolved based on a unified standard:

● Technical troubleshooting: You can view the complete cross-agent trace by Trace ID and locate problems within minutes, such as abnormal invocation latency of a model used by a particular business.

● Business analysis: Outcome data is comparable across businesses and can be used directly for product decisions. This greatly improves efficiency for roles such as BI, product, and data science when they perform cross-business analysis.

● Evaluation: Real user trajectories are accumulated continuously to build evaluation datasets automatically, especially for the end-to-end evaluation of multi-agent collaboration scenarios.

● Compliance: A unified audit trail meets the strict security filing requirements.

Without unified semantics, these problems can only be analyzed locally within a single system, and group-level governance capabilities cannot be formed.
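To make the governance point concrete, here is a minimal sketch of a cross-team token usage report. It assumes spans have already been exported as dictionaries keyed by `gen_ai.*` attribute names; the `team` field, the sample values, and the function `tokens_by_model` are made up for illustration. `gen_ai.usage.output_tokens` is the OTel counterpart of the input token attribute mentioned earlier.

```python
from collections import defaultdict

# Made-up exported spans from two teams that share the same attribute names.
spans = [
    {"team": "search", "gen_ai.request.model": "qwen-max",
     "gen_ai.usage.input_tokens": 120, "gen_ai.usage.output_tokens": 30},
    {"team": "support", "gen_ai.request.model": "qwen-max",
     "gen_ai.usage.input_tokens": 80, "gen_ai.usage.output_tokens": 20},
    {"team": "support", "gen_ai.request.model": "qwen-plus",
     "gen_ai.usage.input_tokens": 50, "gen_ai.usage.output_tokens": 10},
]


def tokens_by_model(spans):
    """Sum input and output tokens per model, regardless of which team emitted the span."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        model = span["gen_ai.request.model"]
        totals[model]["input"] += span["gen_ai.usage.input_tokens"]
        totals[model]["output"] += span["gen_ai.usage.output_tokens"]
    return dict(totals)


usage = tokens_by_model(spans)
for model, counts in usage.items():
    print(model, counts)
```

Because every team reports under the same keys, the same few lines serve cost analysis at the group level without per-team adapters.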

Reduce integration costs and promote infrastructure reuse

One of the design goals of OpenTelemetry (OTel) is to let telemetry data share the same collection and governance pipeline through components such as standard protocols, semantic specifications, SDKs, automatic instrumentation, and the Collector. In GenAI scenarios, the value of a unified semantic specification is particularly evident: once fields, span structures, event models, and context propagation methods are clearly defined, non-intrusive instrumentation, SDK wrappers, platform analysis, dashboards, and alert policies can all be reused.

This means that businesses do not need to decide "what fields to collect" from scratch every time. Instead, they can integrate capabilities directly on top of existing specifications, which reduces overall construction costs.

Introduction to LoongSuite GenAI SemConv
Background
OTel is the current de facto standard in the observability industry and started discussing and designing GenAI semantic specifications in early 2024. However, the overall update pace has been relatively slow, because early staffing was limited and a community standard emphasizes broad applicability and long-term stability. In contrast, Alibaba Group runs a large number of Large Language Model (LLM) applications internally and has encountered many concrete problems in real scenarios. It therefore needs to abstract these problems into a unified standard.

2025: The observability teams of Alibaba Cloud, Alibaba Holding, and Ant Group jointly started semantic modeling, on top of OTel GenAI semantics, for content in internal scenarios that OTel had not yet covered, and drove the implementation and adoption of internal observability collection tools based on it.

2026: After discussions with the main maintainers of GenAI in the OTel community, and because the content is extensive and iterates quickly, the maintainers suggested that the results first be open sourced under the Alibaba LoongSuite observability brand as a vendor enhancement standard for OTel GenAI SemConv. They will be contributed to OTel upstream gradually at an appropriate time.

Content and Implementation
Currently, this specification has been implemented in multiple core scenarios within the group, forming full-stack observability capabilities from the agent layer to the infrastructure layer. The following are some of the enhancements in LoongSuite GenAI SemConv compared to OTel GenAI SemConv:

New Entry/Step Span
Problem Background
In our practice with AI agents, we found that when an agent executes long-running tasks, its execution logic becomes increasingly complex. It contains multiple rounds of tool calls and model invocations, so a single trace can contain hundreds or thousands of spans. Displayed in one flat chain, these spans are very long and make it difficult to follow the invocation trajectory. To solve this problem, we introduced the following two key designs:

Entry Span: A span created at the entry point of the agent invocation. It records the original inputs and outputs exchanged between the model and the user to form the dialogue history. This ensures that downstream tasks process data that is not polluted by the system prompt or the framework prompt, and that the original user request can always be retrieved.
Step Span: A step is the hierarchical representation of one ReAct round of the agent. In each round, the agent completes the loop of "reflection → tool calling → model invocation". Troubleshooting usually follows a top-down approach: first look at the overall picture, then narrow down. For example, when an agent runs 10 ReAct rounds, you first locate which round has a problem, and then analyze which specific step in that round went wrong. This round-by-round span structure clearly displays the agent's actions, reflections, and execution results in each round, so the trajectory of every loop is clear at a glance.

Semantics Modeling
The definitions of the newly added Entry and Step Span types are as follows:

Implementation Effect
Currently, this semantics specification has been implemented in multiple Agent scenarios, including OpenClaw, QwenPaw, and Hermes Agent. The following is the effect after the semantics specification is implemented and integrated in the OpenClaw scenario:
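The top-down troubleshooting flow described for Step Spans can be sketched in plain Python. This is a toy model, not the specification: the `Span` class, the kind labels ("ENTRY", "STEP", "LLM", "TOOL"), and the span names other than `execute_tool` and `inference` are placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    name: str
    kind: str  # "ENTRY", "STEP", "LLM" or "TOOL" (illustrative labels only)
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def has_error(span: Span) -> bool:
    """A span is considered failed if it or any descendant failed."""
    return span.error or any(has_error(child) for child in span.children)


def locate_failure(entry: Span) -> Optional[Tuple[int, str]]:
    """Top-down search: first find the failing round, then the failing child in it."""
    for round_no, step in enumerate(entry.children, start=1):
        if has_error(step):
            culprit = next((c for c in step.children if has_error(c)), step)
            return round_no, culprit.name
    return None


def build_step(round_no: int, failing_tool: bool = False) -> Span:
    """One ReAct round: a model invocation followed by a tool call."""
    return Span(f"step {round_no}", "STEP", children=[
        Span("inference", "LLM"),
        Span("execute_tool", "TOOL", error=failing_tool),
    ])


# Ten rounds under one entry span; the tool call in round 7 fails.
entry = Span("entry", "ENTRY",
             children=[build_step(i, failing_tool=(i == 7)) for i in range(1, 11)])

result = locate_failure(entry)
print(result)  # (7, 'execute_tool')
```

Without the Step layer, the same trace would be a flat list of 20 sibling spans, and finding "round 7" would require counting them by hand.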

New Skill Semantics
Problem Background
In agent scenarios such as e-commerce shopping assistants, once the AI agent understands the intent of a user instruction, it routes the instruction to the corresponding Skill for execution. A Skill is the smallest reusable unit of business functionality. Internally, it orchestrates a group of LLM invocations and tool calls to complete a specific task, such as searching for products, adding items to the shopping cart, or requesting a refund.

The existing OTel GenAI semantic conventions cover span types such as Agent, LLM, and Tool, but lack an abstraction for the Skill layer, where business functionality is aggregated. A Skill is neither a single tool call nor a complete agent, but an orchestration unit between the two. The lack of Skill observability leads to three core pain points:

● No attribution to a feature domain: When performance fluctuates, you can only see a pile of execute_tool and inference spans, and you cannot quickly determine which feature domain is at fault.

● No Skill health metrics: Metrics such as P99 latency, success rate, and invocation frequency are missing at the Skill granularity.

● Confusing traces when multiple Skills run concurrently: In the trace tree, you cannot tell which Skill an LLM or tool span belongs to.

Semantics Modeling
To collect Skill information, we added a group of gen_ai.skill.* attributes to LoongSuite GenAI SemConv that identify a Skill and its version:

At the current stage, these attributes are attached to the existing execute_tool span, which allows quick adoption without introducing new span types.

At the same time, based on the group's business needs, we implemented an independent invoke_skill span and submitted a proposal to the OTel community. It covers the complete lifecycle of a Skill from loading to the end of execution and supports end-to-end analysis by feature domain.

Implementation Effect
With the Skill attributes, the observability platform can aggregate and analyze by feature domain: quickly find "which Skill has the highest error rate", check "whether latency regressed after a new Skill version was released", and measure "what share of the total Skill duration is spent in LLM invocations".
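A minimal sketch of such an aggregation follows. The concrete attribute names `gen_ai.skill.name` and `gen_ai.skill.version` are assumptions within the `gen_ai.skill.*` group (the specification is the authority on exact names), and the `duration_ms` and `error` fields, the sample data, and `skill_health` are hypothetical.

```python
import math
from collections import defaultdict

# Made-up execute_tool spans carrying assumed gen_ai.skill.* attributes.
spans = [
    {"gen_ai.skill.name": "search_products", "gen_ai.skill.version": "1.0", "duration_ms": 100, "error": False},
    {"gen_ai.skill.name": "search_products", "gen_ai.skill.version": "1.0", "duration_ms": 120, "error": False},
    {"gen_ai.skill.name": "search_products", "gen_ai.skill.version": "2.0", "duration_ms": 400, "error": False},
    {"gen_ai.skill.name": "search_products", "gen_ai.skill.version": "2.0", "duration_ms": 900, "error": True},
    {"gen_ai.skill.name": "request_refund", "gen_ai.skill.version": "1.0", "duration_ms": 80, "error": False},
]


def skill_health(spans):
    """Per (skill, version): call count, error rate, and nearest-rank P99 latency."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[(span["gen_ai.skill.name"], span["gen_ai.skill.version"])].append(span)

    report = {}
    for key, items in grouped.items():
        durations = sorted(item["duration_ms"] for item in items)
        rank = max(1, math.ceil(0.99 * len(durations)))  # nearest-rank percentile
        errors = sum(1 for item in items if item["error"])
        report[key] = {
            "calls": len(items),
            "error_rate": errors / len(items),
            "p99_ms": durations[rank - 1],
        }
    return report


report = skill_health(spans)
worst = max(report, key=lambda key: report[key]["error_rate"])
print(worst, report[worst])
```

Grouping by version as well as name is what makes the "did the new version regress" comparison a simple lookup.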

In addition, the same set of gen_ai.skill semantic conventions can cover various frameworks, such as OpenClaw, LangChain, and Spring AI. The following is the instrumentation effect in the OpenClaw scenario:

New Token-level Inference Observation
Problem Background
In the first half of 2025, the Ant observability team built a full-link observability system around the Ant inference service. It covers the core components of the inference service and provides multi-language, multi-protocol distributed tracing from the client to the engine side. As part of this work, Ant collaborated with the Alibaba Cloud team to contribute basic engine-level observability traces to the community's three major inference engines, vLLM, SGLang, and TensorRT-LLM, forming a de facto observability trace standard at the Ant and Alibaba Group level. The entire observability system is an important stability foundation for the Ant inference service.

However, as the business grew rapidly, the pressure on the inference service intensified, and a large number of hard-to-diagnose problems related to the inference engine emerged. Request-level engine traces could no longer locate problems at a deeper level. We studied the underlying principles of the inference engine in depth, combined them with actual production cases, and summarized the following problems:

Performance abnormality: A single slow request is often slow because certain tokens are slow to generate, and slow token generation is most likely caused by interference from other concurrent requests.
Precision abnormality: Precision problems such as repetition, irrelevant answers, and garbled characters often start at a certain token, and subsequent tokens keep going wrong under its influence. The essence of the problem therefore lies in the token generation process.

It follows that locating and isolating inference request problems must be supported by token-level observability data.

Therefore, in the second half of 2025, the Ant observability team took the lead in building the industry's first observability product that covers multiple inference engines and supports token-level deep tracing, taking observability from the macro request level down to the micro token level. It not only tracks whether a single request succeeded, but also observes in depth:

The generation duration and sub-stages of each token.
The mutual impact of concurrent requests within the same inference instance when slow tokens are generated.
The Top-K candidate distribution behind each generated token, which helps pinpoint accuracy issues.

The core value of this work is that, for the first time, it breaks down many formerly "black box" processes inside the inference engine to token granularity, creating a transparent, explainable, and attributable white-box system.

Semantics Modeling
How the inference engine works, in brief: The inference engine is essentially a system that runs an endless loop of iterations. In each iteration, it selects a group of requests based on resource conditions and the scheduling policy to form a batch, which is the execution target of that iteration. After the iteration completes, each selected request has usually generated one token. The engine then enters the next iteration and repeats the same process of selecting requests, forming a batch, and executing it.
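The loop described above can be sketched as a toy scheduler. This is a deliberately simplified model: the only "resource" is a hypothetical `max_batch_size`, every request generates exactly one token per iteration, and the loop ends when no work is left instead of running forever. Real engines such as vLLM or SGLang schedule on token budgets and memory, which is not modeled here.

```python
from collections import deque


def run_engine(requests, max_batch_size):
    """requests maps request id -> number of tokens to generate.

    Returns the list of batches, one per iteration.
    """
    waiting = deque(requests.items())
    running = {}  # request id -> tokens still to generate
    iterations = []

    while waiting or running:
        # Scheduling: admit waiting requests while the batch has room.
        while waiting and len(running) < max_batch_size:
            request_id, tokens_left = waiting.popleft()
            running[request_id] = tokens_left

        batch = list(running)
        iterations.append(batch)

        # Execution: every request in the batch produces one token.
        for request_id in batch:
            running[request_id] -= 1
            if running[request_id] == 0:
                del running[request_id]  # finished, frees a slot

    return iterations


iterations = run_engine({"A": 3, "B": 1, "C": 2}, max_batch_size=2)
for number, batch in enumerate(iterations, start=1):
    print(number, batch)
```

Request C has to wait one iteration for a free slot, which is exactly the kind of cross-request effect that request-level traces cannot show.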

Token performance data collection: At the token granularity of each request, we collect the UNIX timestamps at which the token enters and exits the iteration. From these two timestamps, we can derive the scheduling (waiting) time, the actual running time, and the total user-perceived duration of each token. In addition, the request that each token belongs to runs inside a batch. The total number of requests in the batch, and especially the total number of tokens, characterizes the load of the batch, which in turn determines how long token generation takes. Therefore, we define the following attributes that characterize performance data at the token granularity:
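As an illustration of how the two timestamps yield the three durations (this is a sketch, not the specification's attribute set), the code below assumes that the reference point for each token is the exit time of the previous token, or the request arrival time for the first token. The names `TokenTiming` and `derive_timings` and the millisecond values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TokenTiming:
    index: int
    wait_ms: int   # time waiting to be scheduled into an iteration
    run_ms: int    # time spent inside the iteration
    total_ms: int  # gap perceived by the user since the previous token


def derive_timings(arrival_ms: int, stamps: List[Tuple[int, int]]) -> List[TokenTiming]:
    """stamps holds one (enter_ms, exit_ms) pair per generated token, in order."""
    timings = []
    previous_exit = arrival_ms  # assumption: the first token is measured from arrival
    for index, (enter_ms, exit_ms) in enumerate(stamps):
        timings.append(TokenTiming(
            index=index,
            wait_ms=enter_ms - previous_exit,
            run_ms=exit_ms - enter_ms,
            total_ms=exit_ms - previous_exit,
        ))
        previous_exit = exit_ms
    return timings


# Made-up timestamps: the third token waits 330 ms before it is scheduled.
timings = derive_timings(1000, [(1010, 1050), (1050, 1070), (1400, 1420)])
slowest = max(timings, key=lambda timing: timing.total_ms)
print(slowest)
```

Splitting the total into waiting and running time is what distinguishes "the engine was busy with something else" from "this iteration itself was slow".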

Token accuracy data collection: At the token granularity of each request, we collect the probability distribution of the Top-K candidate tokens for each generated token. This distribution can be used to judge the quality of the model output. For a poor-quality model, the top candidate tokens are unlikely to match expectations. If the candidates match expectations but the selected token is not in the Top-K, the issue points to the sampling parameters specified by the user, such as temperature. Therefore, we define the following attributes for candidate token probabilities:
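The reasoning in this paragraph can be written as a small decision rule. This is a hedged sketch: it assumes a reviewer supplies the token they expected at that position, and the function name `diagnose_token`, the return labels, and the sample distributions are all hypothetical.

```python
def diagnose_token(selected, top_k, expected):
    """Classify one generated token.

    selected: the token the engine actually emitted
    top_k:    list of (token, probability) candidates, highest probability first
    expected: the token a reviewer considers correct at this position
    """
    candidates = [token for token, _ in top_k]
    if expected not in candidates:
        # Even the best candidates miss the expectation: suspect the model itself.
        return "model_output"
    if selected not in candidates:
        # Candidates look right, but sampling picked something outside the Top-K.
        return "sampling_parameters"
    return "within_top_k"


healthy = [("fever", 0.6), ("cough", 0.3), ("pain", 0.05)]
off_topic = [("class", 0.7), ("def", 0.2), ("import", 0.05)]

print(diagnose_token("fever", healthy, expected="fever"))
print(diagnose_token("banana", healthy, expected="fever"))
print(diagnose_token("class", off_topic, expected="fever"))
```

In practice the judgement of what was "expected" is made by a person reading the token analysis page; the sketch only makes the two failure branches explicit.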

Implementation Results
Based on the GenAI specifications designed above, we collect and output standard data on the three major engines. On top of this standard data, users get a consistent set of features. Ultimately, we built an "engine microscope" product that provides deep observation of the inference engine at the engine concurrency level and the token level.

● Engine token analysis: Like switching to a high-power microscope, you focus on a single request and observe the duration of each step in its token generation, as well as the probability distribution of the top candidate tokens, to pinpoint the root cause of latency and accuracy issues.

● Engine concurrency profiling: Like using a wide-angle lens, you see the concurrency, competition, and cooperation among all requests in the engine, and quickly detect resource contention and bottlenecks.

The token-level performance data from engine token analysis reveals which tokens are slow. Engine concurrency profiling then explains why they are slow. In addition, the token-level probability distribution data reveals whether, for an abnormal token, the model output itself is normal or the sampling parameters are set unreasonably. After release, the product went through the year-end sales promotions and helped the engine, SRE, and business teams pinpoint multiple stability issues, speeding up issue isolation by a factor of 10 and providing optimization suggestions. Some typical cases are selected below to illustrate the product features and business value.

Case 1: Slow token localization and quick detection of cross-Request resource interference
In production, you often see a specific request breach a threshold, for example on TPOT (Time Per Output Token), which measures how fast tokens are output. Users perceive this as stuttering output. The following case shows how token analysis and engine concurrency profiling help isolate and pinpoint the issue.

After we obtain the TraceId of the abnormal request, we open the token analysis page shown in the following figure. The 125th token took 6.8 s, far more than expected, which ultimately pushed the TPOT up to 54.77 ms.

Clicking Engine Concurrency Analysis in the upper-right corner of Token Analysis opens the concurrency profiling page of the corresponding engine instance, where you can search for the abnormal request by time or TraceId. It is Request 2 in the following figure. Request 1 spent more than 6 s generating its first token (the prefill phase, shown as the bright green block), which interrupted Request 2 while it was decoding its 125th token (the yellow block). This is consistent with Token Analysis. In summary, the root cause is that the prefill of requests from other tenants interrupted the decode phase of the current request. A possible solution is prefill-decode (PD) separation, which prevents the prefill and decode of different requests from affecting each other.
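The first half of this workflow, spotting the slow token inside an otherwise normal request, can be approximated in a few lines. The sketch below uses made-up latencies rather than the numbers from this case, computes TPOT as the mean latency of decode tokens (one common definition), and flags tokens that exceed a hypothetical multiple of the median.

```python
from statistics import median


def tpot_ms(decode_latencies_ms):
    """Mean time per output token over the decode phase."""
    return sum(decode_latencies_ms) / len(decode_latencies_ms)


def slow_tokens(decode_latencies_ms, factor=10):
    """Indexes of tokens slower than factor x the median token latency."""
    baseline = median(decode_latencies_ms)
    return [index for index, latency in enumerate(decode_latencies_ms)
            if latency > factor * baseline]


# Ten decode tokens at 20 ms each, except one that stalls for 2 s.
latencies = [20] * 10
latencies[5] = 2000

tpot = tpot_ms(latencies)
slow = slow_tokens(latencies)
print(tpot, slow)  # 218.0 [5]
```

A single stalled token is enough to push the average well above the per-token norm, which is why the average alone cannot tell you where to look and per-token data can.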

Case 2: Token-level observation to accurately locate the root cause of irrelevant answers
The following is a typical "irrelevant answer" case: the user asks a medical question, but the Large Language Model (LLM) replies with a LeetCode solution.

Opening the Token Analysis page of the abnormal trace, shown in the following figure, we see at a glance that the first token is "begin_of_sentence". This is a special token, abbreviated as BOS, that separates two completely unrelated pieces of text. In other words, once BOS appears, the subsequent answer has nothing to do with the preceding prompt, so the answer is naturally irrelevant. BOS should therefore never appear in an answer, and the problem narrows down to why it appeared. In this case, "begin_of_sentence" is not shown in the reply to the user, the engine log, or the gateway log, where it appears only as an empty string. Without Token Analysis, locating the problem would be much more complicated. Further investigation showed that the BOS output is a bad case of the LLM itself. The solution is to adjust the model or wait for a later model version that fixes it.
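A rough sketch of why token-level data exposes this problem while text logs hide it: the check below scans the generated token sequence for special tokens and, separately, renders the text a user would see. The token strings, the `SPECIAL_TOKENS` set, and both helper functions are hypothetical; actual special token spellings depend on the model's tokenizer.

```python
# Hypothetical special token names; real spellings are tokenizer-specific.
SPECIAL_TOKENS = {"begin_of_sentence", "end_of_sentence"}


def find_special_tokens(tokens):
    """Return (position, token) for every special token in the generated output."""
    return [(index, token) for index, token in enumerate(tokens)
            if token in SPECIAL_TOKENS]


def visible_text(tokens):
    """What ends up in replies and logs: special tokens render as empty strings."""
    return "".join("" if token in SPECIAL_TOKENS else token for token in tokens)


generated = ["begin_of_sentence", "class", " Solution", ":"]

hits = find_special_tokens(generated)
text = visible_text(generated)
print(hits)
print(repr(text))
```

The rendered text contains no trace of the special token, so only the per-token record reveals that generation restarted from an unrelated context.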

Use GenAI Utils to quickly implement LoongSuite GenAI SemConv
Background
The preceding sections described the semantic modeling of LoongSuite GenAI SemConv in detail across dimensions such as Agent, Skill, and token-level inference. However, developers of the instrumentation libraries that implement LoongSuite GenAI SemConv face a common engineering challenge:

Each GenAI framework instrumentation library needs to implement a complete set of telemetry collection logic (creating spans, attaching semantic attributes, recording metrics, sending events, and managing context propagation), and this logic is highly repetitive across frameworks. More importantly, when the semantic specification is upgraded, for example by adding fields or adjusting the span structure, the upgrade cost multiplies with the number of libraries if each one maintains its own implementation.

Take an agent framework instrumentation as an example. Without a common utility layer, the developer has to do the following by hand: create the invoke_agent span and set its SpanKind, attach dozens of attributes such as gen_ai.agent.name, gen_ai.agent.id, and gen_ai.usage.input_tokens one by one, decide whether to collect message content based on configuration, handle exceptions and set the error status, and record duration and token usage metrics. This boilerplate code is nearly identical in every instrumentation library.

To solve this problem, we implemented GenAI Utils in the probe. As the engineering layer of LoongSuite GenAI SemConv, it wraps the complexity of the semantic specification in concise APIs, so that instrumentation library developers only need to focus on "what data to fetch from the framework" rather than "how to output telemetry data according to the specification". We currently provide the following GenAI Utils implementations:

The corresponding implementation for LoongSuite Python is LoongSuite-utils-genai.
The corresponding implementation for LoongSuite JS is LoongSuite-utils-genai.

Architecture Design
The overall architecture of GenAI Utils follows the design principle of "layered decoupling and unified convergence":

**Core design concepts:**
The instrumentation layer only extracts data: Each framework instrumentation library intercepts framework calls through hooks or monkey-patching and populates the data into the corresponding Invocation data object, without touching the OTel API directly.

GenAI Utils is the single exit for telemetry output: All span creation, attribute attachment, metrics recording, event sending, and context management are done internally by ExtendedTelemetryHandler.

A specification upgrade requires only one change: When LoongSuite GenAI SemConv adds fields or adjusts structure, only the Span Utils and Metrics modules in GenAI Utils need to be modified, and all downstream instrumentation libraries pick up the change automatically.
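The division of labor between the two layers can be illustrated with a self-contained toy. None of the classes below are part of GenAI Utils: `AgentInvocation`, `SketchHandler`, `FakeAgent`, and `instrument` are stand-ins invented for this sketch, and whitespace word count stands in for a real token count.

```python
import functools
from dataclasses import dataclass


@dataclass
class AgentInvocation:
    """Stand-in for an Invocation data object: plain data, no telemetry logic."""
    agent_name: str = ""
    input_tokens: int = 0


class SketchHandler:
    """Stand-in for the telemetry handler: the only place that knows attribute names."""

    def __init__(self):
        self.exported = []

    def emit(self, invocation: AgentInvocation) -> None:
        # A specification change (new or renamed attribute) would be made here once.
        self.exported.append({
            "name": "invoke_agent",
            "gen_ai.agent.name": invocation.agent_name,
            "gen_ai.usage.input_tokens": invocation.input_tokens,
        })


class FakeAgent:
    """Hypothetical framework class that is being instrumented."""
    name = "ShoppingAssistant"

    def run(self, prompt: str) -> dict:
        return {"text": "ok", "prompt_tokens": len(prompt.split())}


def instrument(cls, handler: SketchHandler) -> None:
    """Monkey-patch cls.run: extract data from the framework, then hand it over."""
    original = cls.run

    @functools.wraps(original)
    def wrapper(self, prompt):
        result = original(self, prompt)
        handler.emit(AgentInvocation(agent_name=self.name,
                                     input_tokens=result["prompt_tokens"]))
        return result

    cls.run = wrapper


handler = SketchHandler()
instrument(FakeAgent, handler)
FakeAgent().run("Recommend a laptop for me")
print(handler.exported)
```

The wrapper never mentions an attribute name, so renaming an attribute touches only `SketchHandler.emit`, which mirrors the "one change per specification upgrade" idea.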

API Usage
GenAI Utils provides a corresponding Invocation data class and context manager method for each GenAI operation covered by LoongSuite GenAI SemConv. This forms a unified "populate data + hand over to the handler" programming model. The following uses the Python GenAI Utils library as an example:

Step 1: Obtain a Handler singleton

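from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler

# tracer_provider and logger_provider are the OTel providers already configured by the application.
handler = get_extended_telemetry_handler(
    tracer_provider=tracer_provider,
    logger_provider=logger_provider,
)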

ExtendedTelemetryHandler inherits from the upstream OpenTelemetry (OTel) TelemetryHandler, which handles basic Large Language Model (LLM) operations, and extends it with the operation types added by LoongSuite, such as Agent, Tool, Embedding, Retrieve, Rerank, and Memory. It also integrates asynchronous processing for multimodal content. This inheritance design avoids conflicts when synchronizing with the upstream community code.

Step 2: Select the corresponding Invocation data class, and populate the business data
GenAI Utils defines the corresponding Invocation data class for each operation. Instrumentation library developers only need to populate it with the data fetched from the framework:

Step 3: Use Context Manager to complete telemetry outputs
The following uses a typical agent framework instrumentation as an example to show how GenAI Utils quickly implements complete observability collection:

from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai.extended_types import (
    InvokeAgentInvocation, ExecuteToolInvocation
)
from opentelemetry.util.genai.types import InputMessage, OutputMessage, Text

handler = get_extended_telemetry_handler()

# ========== Agent invocation ==========
with handler.invoke_agent() as invocation:
    invocation.provider = "dashscope"
    invocation.request_model = "qwen-max"
    invocation.agent_name = "ShoppingAssistant"
    invocation.agent_id = "agent-001"
    invocation.input_messages = [
        InputMessage(role="user", parts=[Text(content="Recommend a laptop for me")])
    ]
    # ... Actually invoke the Agent framework ...  
    invocation.output_messages = [
        OutputMessage(
            role="assistant",
            parts=[Text(content="I will search for you. Please wait a moment...")],
            finish_reason="tool_calls"
        )
    ]
    invocation.input_tokens = 42
    invocation.output_tokens = 18

# ========== Tool execution ==========
with handler.execute_tool() as invocation:
    invocation.tool_name = "search_products"
    invocation.tool_call_arguments = {"query": "laptop", "category": "electronics"}
    # ... Actually execute the tool ...  
    invocation.tool_call_result = {"products": [{"name": "MacBook Pro", "price": 12999}]}

In the preceding code, the developer does not call any OpenTelemetry (OTel) API directly. There is no need to create a span manually, set SpanKind, attach the gen_ai.agent.name attribute, or record duration metrics. ExtendedTelemetryHandler does all of this automatically when the context manager is entered and exited. If an exception is thrown during the invocation, the handler automatically catches it and sets the error.type attribute and an error status on the span. For detailed usage, see the Related Links.
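The enter and exit behavior described here can be mimicked with the standard library alone. The following is a sketch of the pattern, not the implementation of ExtendedTelemetryHandler: `ToolInvocation`, `finished_spans`, and this `execute_tool` function are hypothetical, the "span" is a plain dictionary, and re-raising the exception after recording it is an assumption of the sketch.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class ToolInvocation:
    tool_name: str = ""
    attributes: dict = field(default_factory=dict)


finished_spans = []  # stands in for an exporter


@contextmanager
def execute_tool():
    """On enter: start timing. On exit: record status, duration, and any error type."""
    invocation = ToolInvocation()
    start = time.monotonic()
    status = "OK"
    try:
        yield invocation
    except Exception as exc:
        status = "ERROR"
        invocation.attributes["error.type"] = type(exc).__qualname__
        raise  # assumption: the caller still sees the original exception
    finally:
        finished_spans.append({
            "name": "execute_tool",
            "gen_ai.tool.name": invocation.tool_name,
            "status": status,
            "duration_s": time.monotonic() - start,
            **invocation.attributes,
        })


try:
    with execute_tool() as invocation:
        invocation.tool_name = "search_products"
        raise TimeoutError("backend timed out")
except TimeoutError:
    pass

print(finished_spans[0]["status"], finished_spans[0]["error.type"])
```

The caller only sets fields on the invocation object; timing, status, and error attributes are all handled in one place on exit.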

Currently supported instrumentation
Based on GenAI Utils, LoongSuite Python Agent has implemented instrumentation for the following GenAI frameworks and model services, covering mainstream GenAI ecosystems in China and internationally:

The core telemetry logic of all these instrumentation libraries reuses GenAI Utils. When LoongSuite GenAI SemConv adds new semantics or adjusts the specification, you only need to upgrade the opentelemetry-util-genai package for all downstream instrumentation libraries to pick up the change.

Conclusion: From unified fields to unified infrastructure
Observability in the GenAI era has evolved from "adding log instrumentation for model invocations" to "establishing unified semantics for the full trace of prompts, inference, retrieval, tools, and agents". OTel has set a standardized direction for this and promotes GenAI observability capabilities through semantic specifications and instrumentation libraries.

The significance of Alibaba and Ant Group co-building the GenAI observability semantic specification lies in further engineering, platformizing, and scaling this standardized direction. On the one hand, unified semantics reduce the integration cost for businesses. On the other hand, unified data drives the reuse of the observability platform, analysis services, and governance capabilities. The ultimate goal is not to "produce a specification document", but to enable all vendors and users of this specification to make rapidly growing GenAI applications truly visible, analyzable, governable, and evolvable.

Community
The release of LoongSuite GenAI SemConv is just a beginning. In the future, we will continue working in the following areas:

More agile: Respond quickly to the needs of the domestic AI ecosystem and keep extending the plugin matrix.
More efficient: Provide more comprehensive multimodal processing, more span and metric types, and updated semantic specifications through LoongSuite GenAI Utils.
End-to-end: Track AI invocations and microservice invocations in a unified way, making full-trace observability across multiple agents possible.
Collaboration with upstream: Discuss specifications and implementations in regular meetings with upstream maintainers, synchronize with upstream regularly, and contribute downstream practices back to the OpenTelemetry community.

If you are building an AI application and care about observability, you are welcome to try it, provide feedback, and contribute. To discuss LoongSuite GenAI SemConv and the corresponding probe implementations, you can join the following DingTalk group:

Related Links
[1] LoongSuite GenAI SemConv
https://github.com/alibaba/loongsuite-semantic-conventions-genai