Alibaba Cloud Open-Sources LoongSuite: Crafting a High-Performance, Low-Cost Observability Data Collection Suite for the AI Era

The Evolution of AI Agent Technical Architecture Reshapes Software Engineering Practices
In the field of AI Agent development, the evolution of technical architecture is reshaping software engineering practice. Developers can improve code-generation efficiency with intelligent programming assistants such as Cursor, TONGYI Lingma, and Claude Code, and they can also build complete intelligent systems with dedicated AI Agent development frameworks. The ecosystem is developing along several dimensions. Implementation approaches range from high-code solutions that require deep programming to low-code platforms assembled by dragging and dropping visual components. In terms of tech stacks, tools such as Spring AI Alibaba in the Java ecosystem and Dify and AgentScope in the Python ecosystem form a cross-language support system, with Python dominating thanks to its rich AI library ecosystem. Technological evolution has also spawned new development paradigms: AutoGen's multi-agent dialogue framework and LangChain's modular component system are lowering the technical barriers to agent development.

We summarize the core capability system of an agent into four key components. The perception layer integrates multimodal interaction capabilities, including natural language processing (NLP), speech recognition, and video stream analysis. The decision center, built around large models, schedules model calls uniformly through AI gateways (such as Higress) while handling traffic control and security protection. The memory mechanism stores user interaction history and provides context-association capabilities. For tool integration, the MCP protocol has gradually standardized tool usage: tools serve as the communication channel between AI agents and the digital world built up in the traditional Internet era, and the emergence of MCP marketplaces enables centralized management and discovery of MCP tools, making it easier to connect agents with tools efficiently. Finally, when a task exceeds the capability boundary of a single agent, multi-agent systems collaborate through the A2A protocol; this distributed intelligent architecture can handle more complex task scenarios.
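
To make the tool-integration idea concrete, here is a minimal sketch of how an agent process might discover and invoke an MCP tool using the open-source MCP Python SDK. The server script name and the "get_weather" tool are hypothetical examples; this illustrates the protocol flow only and is not a LoongSuite-specific API.

```python
# Minimal sketch of an agent calling an MCP tool via the MCP Python SDK.
# "weather_server.py" and the "get_weather" tool are hypothetical examples.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server_params = StdioServerParameters(command="python", args=["weather_server.py"])

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()                 # MCP handshake
            tools = await session.list_tools()         # discover available tools
            print([tool.name for tool in tools.tools])
            result = await session.call_tool("get_weather", arguments={"city": "Hangzhou"})
            print(result.content)                      # tool output returned to the agent

if __name__ == "__main__":
    asyncio.run(main())
```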


(Panoramic View of AI Toolchain)

As the development toolchain matures, AI agents still need to be deployed once they are built. Diverse architectural patterns arise from the differing requirements of agent execution environments: desktop agents for individual users (such as Cherry Studio and DeepChat) can extend their runtime to the cloud through sandbox environments, while enterprise-facing agents run in cloud-native environments with resource isolation, and serverless architectures (such as Function Compute) provide them with elastically scalable infrastructure. During operation, AI agents rely on middleware for several common capabilities: dynamic prompt management and MCP registries implemented through Nacos, Higress as a unified proxy for AI models and MCP servers, asynchronous task queues backed by RocketMQ, and state storage provided by Redis. Together, these components form the technical foundation for agent operations. Meanwhile, building the security system faces the dual challenges of data compliance and system protection. At the data-governance level, filtering mechanisms and audit trails for sensitive information must be established; for security vulnerabilities in the MCP protocol, measures such as sandbox isolation and tool signature authentication can build up a defense system. The observability platform collects key information such as agent and model calls, token consumption, and performance metrics, providing data support for system optimization and threat detection.

Observability: A Key Cornerstone for the Technological Development of AI Agents
As mentioned earlier, the development of AI agents has broken through the boundaries of traditional software engineering: their non-deterministic decision-making and dynamic execution processes place unprecedented demands on observability. A single agent involves multimodal data processing, large model inference, and toolchain calls, and the complexity of these processes is growing exponentially. This complexity is reflected not only at the technical architecture level but also in core operations and maintenance (O&M) concerns such as system stability assurance, cost control, and compliance auditing.

The autonomous decision-making of AI agents distinguishes them from traditional software applications, involving complex interactions such as multimodal data processing, large model reasoning, and tool invocation. When such non-linear workflows run in real business scenarios, an exception at any step can trigger a chain reaction. Moreover, when an agent engages in many rounds of interaction with a model, the intermediate steps can consume a surprisingly large number of tokens, and the agent may even fall into an endless loop, forming the so-called "token black hole". Without a trace analysis mechanism, it is difficult for developers to pinpoint the root cause of service exceptions; building end-to-end observability provides a solid basis for these decisions.

The iterative upgrade of an AI agent must preserve service continuity, which requires a sound regression testing and evaluation system. Each prompt or model change may cause unforeseen side effects, so every time an AI agent is modified and released, the results of its execution need to be evaluated, the equivalent of "regression testing" for the agent. By collecting observability data during execution, enterprises can build an automated evaluation framework that quantifies the impact of new versions on service quality and avoids the risk of uncontrolled version iteration.

With the continuous development of generative AI, observability is evolving from an O&M tool into a core component of AI application architecture. It is precisely because of this trend that the GenAI semantic conventions promoted by the OpenTelemetry community are establishing cross-framework, cross-vendor standardized data specifications. Against this background, Alibaba Cloud has officially open-sourced the LoongSuite observability collection suite. Aligned with the technological trends of the AI era, it helps more enterprises efficiently establish observability systems on standardized data models in a high-performance, low-cost manner.
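
For a concrete sense of what those conventions standardize, the sketch below annotates a model-call span with a few attribute names from the OpenTelemetry GenAI semantic conventions. The conventions are still incubating, so attribute names may evolve, and the model name and token counts here are purely illustrative.

```python
# Minimal sketch: record an LLM call as an OpenTelemetry span carrying
# GenAI semantic-convention attributes. The convention is still incubating,
# so attribute names may change; the values below are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("demo.genai")

def traced_chat_call(model: str, prompt: str) -> None:
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        # ... invoke the model with `prompt` here ...
        span.set_attribute("gen_ai.usage.input_tokens", 128)    # illustrative counts
        span.set_attribute("gen_ai.usage.output_tokens", 512)

traced_chat_call("qwen-plus", "Summarize today's alerts")
```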

LoongSuite: Creating a High-Performance, Low-Cost Observability Data Collection Suite for the AI Era
LoongSuite (pronounced /lʊŋ swiːt/, "loong-sweet") is the core carrier of the next-generation observability technology ecosystem. Its data collection engine combines host-level probes with process-level instrumentation: process-level probes collect fine-grained observability data inside applications, while the host-level probe handles efficient, flexible data processing and reporting. Out-of-process data collection is additionally achieved through technologies such as eBPF.


(LoongSuite Technology Application Architecture)

At the process level, LoongSuite provides enterprise-grade observability for mainstream programming languages such as Java, Go, and Python. Through deep adaptation to each language's features, the collector automatically captures function call chains, parameter transmission paths, and resource consumption, collecting accurate runtime state without modifying business code. This non-intrusive design is especially suitable for environments with frequent dynamic updates, preserving the integrity of observed data while avoiding interference with core business logic. When dealing with complex workflows, the system automatically correlates distributed tracing contexts and constructs a complete execution-path topology.

As the core data collection engine, LoongCollector unifies the processing of multi-dimensional observability data. From raw data collection to structured conversion to intelligent routing and distribution, the entire pipeline is flexibly orchestrated through a modular architecture. This allows observability data to feed open-source analysis platforms for self-managed governance, or to integrate seamlessly with managed services to build a cloud-native observability system. In terms of ecosystem building, Alibaba Cloud is deeply involved in international open-source standards, and the core components are compatible with mainstream standards such as OpenTelemetry. The following sections introduce each component in turn.

LoongCollector
As a new-generation observability data collector, LoongCollector provides high-performance, high-stability data collection and preprocessing for cloud-native intelligent computing services through deep performance optimization and architectural innovation. Its advantages are especially pronounced in AI scenarios.

First, LoongCollector offers multi-dimensional observability data collection, supporting unified collection, processing, and transmission of Logs, Metrics, Traces, Events, and Profiles in an all-in-one observability management architecture. It integrates real-time log collection, Prometheus metric scraping, and eBPF technology, enabling non-intrusive monitoring without modifying system code. It efficiently acquires a wide range of performance metrics and is particularly suited to the integrated observability needs of large-scale distributed training and inference tasks.
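
On the metrics side, an application only needs to expose a standard Prometheus endpoint for a collector deployed in agent mode to scrape. Below is a minimal sketch using the prometheus_client library; the port and metric names are illustrative and not LoongCollector-specific configuration.

```python
# Minimal sketch: expose a standard Prometheus /metrics endpoint that a
# collector (such as LoongCollector in agent mode) could scrape.
# The port and metric names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    while True:
        with LATENCY.time():
            time.sleep(random.uniform(0.05, 0.2))  # stand-in for real inference work
        REQUESTS.inc()
```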

Second, LoongCollector excels in performance and stability. It adopts technologies such as event-driven architecture, time-slice scheduling, and lock-free design, ensuring low resource consumption and high throughput even in scenarios with high concurrency and large-scale data collection. Meanwhile, its high-low watermark feedback queue mechanism and persistent caching capabilities equip it with robust traffic control and fault tolerance, ensuring no data loss, uninterrupted collection, and stable service. This fully meets the stringent requirements for stability, continuity, and reliability during AI training.

Furthermore, in AI scenarios, LoongCollector supports multiple deployment modes, including agent mode and cluster mode, and can flexibly adapt to the elastic requirements of distributed training and inference tasks. It features capabilities such as automatic container context discovery, Kubernetes (K8s) metadata association, and multi-tenant isolation, ensuring efficient and secure data collection in complex cloud-native environments. At the same time, through the configuration management service ConfigServer, it enables centralized management and control of large-scale agents as well as dynamic delivery of configurations, which significantly enhances operational efficiency and system controllability.

In addition, LoongCollector provides unified processing of multi-dimensional observability data. Raw data collection, structured conversion, filtering and aggregation, and routing and distribution are all orchestrated flexibly as modules and can be extended on demand. LoongCollector is driven by a dual engine of the SPL query language and multi-language plug-ins, and it ships a rich set of built-in data-processing operators to cover diverse, high-throughput preprocessing scenarios.

In summary, with its comprehensive data collection capabilities, excellent performance, flexible deployment methods, and powerful programmability, LoongCollector has become the core infrastructure for building observability systems in AI scenarios, helping enterprises achieve efficient and stable O&M of intelligent computing services.

LoongSuite Python Agent
LoongSuite Python Agent is built on the OpenTelemetry Python Agent. Because the OTel community is still finalizing the GenAI semantic conventions, support for many AI frameworks has not yet landed upstream; at present only the OpenAI plug-in supports observability data collection, which falls well short of covering the AI frameworks popular in China. LoongSuite Python Agent, as an up-to-date implementation of the OTel GenAI semantic conventions, adds plug-ins for frameworks popular in China while adhering to the open-source semantic conventions. For instance, early support has been provided for AI programming frameworks such as AgentScope and Agno, which are widely used in China. Support for more plug-ins, including Dify, LangChain, and MCP Client, will be open-sourced progressively, and these plug-ins will be contributed back to the OTel community. The Python agent makes it easy to collect data such as detailed call information and time consumption while an AI agent invokes models and tools. With the help of the OTel project, this data can be reported to any backend over the standard OTLP protocol and displayed through a visual interface.
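
Because the agent builds on the OpenTelemetry Python SDK and exports over OTLP, the reporting path resembles a standard OTel setup. Here is a minimal sketch using the upstream opentelemetry-sdk and OTLP exporter packages; the endpoint and service name are illustrative, and in practice the agent wires this up automatically rather than requiring manual code.

```python
# Minimal sketch of the OTLP export path the agent relies on: spans produced by
# auto-instrumentation are handed to a TracerProvider and shipped to any
# OTLP-compatible backend. The endpoint and service name below are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "my-agent-app"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("llm-call"):
    pass  # application / framework code runs here and is traced automatically
```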

LoongSuite Go Agent
LoongSuite Go Agent provides non-intrusive observability for AI agents written in Go through compile-time instrumentation. By hooking deeply into the Go compilation process, it implants monitoring logic during the Abstract Syntax Tree (AST) analysis phase, injecting observability capabilities without modifying source code. Using a compiler-enhancement mechanism and a predefined instrumentation rule engine, it automatically injects logic such as span creation and token-consumption accounting at compile time. It ships with full support for mainstream development frameworks: covering more than twenty core modules, including HTTP, gRPC, and database connections, it spans everything from basic communication protocols and middleware interactions to microservice governance and data persistence, automatically capturing key metrics such as request latency distribution, service call topology, and resource contention. This out-of-the-box design significantly lowers the deployment threshold of the observability system, letting developers focus on business logic rather than infrastructure configuration. LoongSuite Go Agent accurately captures the input and output characteristics of large model calls, token consumption patterns, and the trajectories of multi-round interactions, providing a data basis for optimizing resource utilization. Currently supported AI Agent development frameworks include LangChainGo and MCP Server; support for Eino, Ollama, and other frameworks will be released progressively.

LoongSuite Java Agent
Based on the OpenTelemetry Java Instrumentation project, LoongSuite Java Agent provides a full-link observability solution for Java applications through bytecode enhancement. By dynamically modifying Java bytecode, it connects applications to the observability system for distributed tracing, metric collection, and log correlation without manual changes to business code. With very low performance overhead, it provides fine-grained runtime data collection, covering observability needs from traditional monolithic applications to cloud-native microservices. From basic development frameworks such as Servlet, Spring, and Dubbo, to middleware such as Redis, Kafka, and MySQL, and on to the JVM's own performance metrics, it covers automatic instrumentation for more than fifty common components, capturing key data such as call-chain topology, method execution time, exception stacks, and resource consumption. This plug-and-play design greatly lowers the technical threshold for adopting observability, giving developers a comprehensive view of system operation without attending to instrumentation details. For high-concurrency scenarios, its built-in sampling policies and data aggregation mechanisms effectively control data volume while preserving observation accuracy, meeting the high-availability requirements of production environments. The agent already runs at scale in production on the large-model platform Model Studio, and the data-collection optimizations for large-model scenarios accumulated there will be released to the open-source repositories. In addition, automatic instrumentation is being added for common model-access SDKs such as OpenAI and DashScope, and the community is welcome to contribute more plug-in implementations.

LoongSuite and Spring AI Alibaba Jointly Build the AI Application Ecosystem
Spring AI integrates the Spring ecosystem with large model capabilities, providing abstractions over LLMs and easy-to-use APIs for Java. It also fully embraces the OpenTelemetry standard in its observability design and provides native observability for key calls. Spring AI Alibaba is an AI Agent development framework built by Alibaba on top of Spring AI. It deeply integrates the capabilities of Model Studio, providing visual, tool-driven operational capabilities such as a console and Graph orchestration, along with a variety of out-of-the-box pre-implemented agents. The core goal of Spring AI is to let developers quickly integrate and use AI capabilities the Spring way; accordingly, as elsewhere in Spring, observability is built into the framework as a first-class component.

In terms of observability, Spring AI provides the following key capabilities:

● Automatic Tracing: Spring AI automatically traces all critical paths involving LLM calls, prompt creation, and streaming response processing, generating spans that conform to the OpenTelemetry standard.

● Context Propagation: Spring AI supports the automatic injection and extraction of Trace ID and Span ID in the call chain, ensuring seamless connection with the call links of upstream and downstream services.

● Metric Export: Spring AI has built-in capabilities for collecting and exporting key performance metrics such as request latency, token usage, and model response length.

● Log Association: Spring AI injects the current Span context into logs through MDC or structured logging mechanisms, facilitating full-stack analysis during troubleshooting.

These capabilities enable Spring AI to deliver complete tracing, monitoring, and log linkage without additional development when connecting to an observability system. To further improve observability coverage and reduce integration costs, Spring AI Alibaba supports deployment together with LoongSuite Java Agent. The Java agent non-intrusively enhances the bytecode of running JVM applications, enabling automatic instrumentation of common components such as the Spring framework, database access, and HTTP requests.

Future Plans for the LoongSuite Project
Faced with the multitude of AI Agent frameworks, LoongSuite plans to evolve along several lines:

● Framework coverage: provide comprehensive observability data collection for mainstream AI Agent frameworks, including the Python ecosystem (the low-code platform Dify and high-code frameworks such as AgentScope, Agno, and OpenAI Agent), the Java ecosystem (Spring AI Alibaba and its derivative low-code/no-code agent JManus), and the Go ecosystem (Eino and LangChainGo). Developers interested in the community are welcome to join and contribute support for more frameworks.

● Protocol coverage: as agents come to use large numbers of tools and multi-agent collaboration becomes the norm, LoongSuite will eliminate the observation blind spots around MCP and multi-agent communication, address the MCP token black hole, and provide observability coverage of the MCP and A2A protocols.

● Evaluation: an agent's behavior needs to be fully evaluated during the testing and online operation phases, and evaluation capabilities are becoming an indispensable part of the AI Agent lifecycle. By integrating with projects such as Spring AI Alibaba and AgentScope, an open-source observability tracing and evaluation console will be released, achieving full-lifecycle coverage of AI Agents from collection and storage to evaluation.

● End-to-end coverage: LoongSuite will connect the entire chain from edge-side agents to the interior of models, enabling complete analysis and rapid diagnosis of AI Agent call chains.

● Profiling: LoongCollector supports profiling of CPU and GPU scenarios through eBPF, and LoongSuite will work with the SysOM community to launch profiling for AI scenarios.
Open Source Community Participation & Contribution
As a leading global cloud service provider, Alibaba Cloud has long been committed to the forefront of open-source observability technology. We are deeply involved in the OpenTelemetry (OTel) community and firmly support the construction of an open technology ecosystem and the development of global technical standards. Over the past few years, Alibaba Cloud has actively promoted technology sharing and code contribution in the OpenTelemetry community and has been deeply involved in key areas such as Semantic Conventions (observability standards and specifications), Java Instrumentation (Java probes), Go Instrumentation (Go probes), and Profiling (performance analysis). To date, we have contributed more than 1,000 PR reviews and had more than 400 pull requests merged in the community. Along the way, we have cultivated 3 Maintainers, 5 Approvers, 1 Triager, and 8 Members, injecting strong momentum into the community's technological evolution and ecosystem development.

Beyond code contributions, Alibaba Cloud also embodies the spirit of sharing and cooperation at the heart of open-source culture and actively promotes the development of new technologies and ideas. For example, we have shared our technical results at global industry conferences such as KubeCon and OTel Community Day, and we initiated friendly exchange sessions for the Asia-Pacific region within the community, effectively promoting cross-regional technical exchange and in-depth cooperation. Join us in the OTel community and LoongSuite. The LoongSuite open-source code repositories are listed below; contributions are welcome:

LoongCollector: https://github.com/alibaba/loongcollector
LoongSuite Python Agent: https://github.com/alibaba/loongsuite-python-agent
LoongSuite Go Agent: https://github.com/alibaba/loongsuite-go-agent
LoongSuite Java Agent: https://github.com/alibaba/loongsuite-java-agent
