Introduction
Focusing on LoongCollector, the core component of the LoongSuite ecosystem, this article analyzes the component's technological breakthroughs in intelligent computing services, covering multi-tenant collection isolation, GPU cluster performance tracking, and event-driven data pipeline design. Through zero-intrusion collection, intelligent preprocessing, and adaptive scaling mechanisms, LoongCollector builds a full-stack observability infrastructure for cloud-native AI scenarios and redefines the capability boundaries of observability in high-concurrency, strongly heterogeneous environments.
1. LoongCollector in LoongSuite Ecosystem
In the AI Agent technology system, observability serves as a core cornerstone. By collecting key data such as model call links, resource consumption, and system performance indicators in real time, observability provides the decision-making basis for intelligent agents in performance optimization, security risk control, and fault location. It not only supports visual monitoring of core processes such as dynamic Prompt management and task queue scheduling, but also serves as key infrastructure for ensuring the trustworthy operation and continuous evolution of AI systems by tracking the flow of sensitive data and abnormal behaviors.
LoongSuite (/lʊŋ swiːt/, "Loong-sweet") is a high-performance, low-cost observability collection suite open-sourced by Alibaba Cloud for the AI era. It aims to help more enterprises establish an observability system efficiently and cost-effectively, built on standardized data models.
The LoongSuite boasts the following unique advantages:
Zero-intrusion collection: By combining process-level instrumentation and host-level probes, it can capture full-link data without modifying the code.
Full-stack support: It covers mainstream programming languages such as Java, Go, and Python, and is adapted to cloud-native AI scenarios.
Ecosystem compatibility: It is deeply compatible with international standards such as OpenTelemetry and supports open-source or cloud-hosted analysis platforms.
LoongCollector serves as the "heart" of LoongSuite and its core data collection engine, with three key capabilities:
Unified collection capability for multi-dimensional data. LoongCollector is essentially an all-in-one architecture collector that supports the collection of all data types, including Logs, Metrics, Traces, Events, and Profiles. It also enables out-of-process data collection through non-intrusive technologies such as eBPF, reducing interference with business operations.
Ultimate performance and stability. LoongCollector adopts time-slice scheduling and lock-free design to achieve low resource consumption and high throughput in high-concurrency scenarios. Additionally, it features high-low watermark feedback queues and persistent caching to ensure no data loss and stable service operation without fluctuations.
Flexible deployment and intelligent routing. LoongCollector acts as a data transmission bridge between other observability components and downstream storage. Boasting support for flexible deployment and orchestration, it carries out unified processing and structural transformation of multi-dimensional raw observability data from diverse in-cluster data sources or other agents, before finally conducting intelligent routing and unified distribution.
The position of LoongCollector within the LoongSuite components can be summarized as shown in the following diagram:
2. Core Advantages of LoongCollector
LoongCollector plays a crucial role within the LoongSuite components precisely because of a number of core advantages.
LoongCollector is a data collector that integrates exceptional performance, superior stability, and flexible programmability, designed specifically for building next-generation observability pipelines. The vision is to build an industry-leading "Unified Observability Agent" and "End-to-End Observability Pipeline".
LoongCollector originates from the iLogtail project, an open-source initiative by Alibaba Cloud Observability Team. Building on iLogtail's robust log collection and processing capabilities, LoongCollector has undergone comprehensive functional upgrades and expansions. It has gradually expanded from the original single log scenario to an integrated entity encompassing observable data collection, local computing, and service discovery. Endowed with features such as extensive data access, superior performance, robust reliability, programmability, manageability, cloud-native support, and multi-tenant isolation, LoongCollector can well adapt to the demand scenarios of observable collection and preprocessing for intelligent computing services.
Telemetry Data, Boundless Possibilities
LoongCollector adheres to the all-in-one design philosophy and aims to enable a single agent to handle all collection tasks, including the collection, processing, routing, and transmission of Logs, Metrics, Traces, Events, and Profiles. LoongCollector emphasizes the enhancement of Prometheus metric scraping capabilities, deeply integrates Extended Berkeley Packet Filter (eBPF) technology to achieve non-intrusive collection, and provides native metric collection functions, thus realizing a true OneAgent.
Adhering to the principles of openness and open source, LoongCollector actively embraces open-source standards, including OpenTelemetry and Prometheus. Meanwhile, it supports connectivity with a wide range of open-source ecosystems such as OpenTelemetry Flusher, ClickHouse Flusher, and Kafka Flusher. As an observability infrastructure, LoongCollector continuously enhances its compatibility in heterogeneous environments and actively strives to achieve comprehensive and in-depth support for mainstream operating system (OS) environments.
The capability to handle Kubernetes (K8s) collection scenarios has always been a core strength of LoongCollector. As is well known in the observability field, K8s metadata (such as Namespace, Pod, Container, and Labels) often plays a crucial role in observability data analysis. LoongCollector interacts with the underlying definitions of Pods through the standard CRI API to obtain various metadata information in K8s, thereby achieving non-intrusive K8s metadata AutoTagging capability during collection.
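As a minimal illustration of the AutoTagging step, the Python sketch below attaches pod metadata (as obtained from the container runtime) to each collected record. This is purely conceptual: the real collector is implemented in C++/Go, and the tag field names used here are assumptions, not LoongCollector's actual schema.

```python
def auto_tag(record, pod_meta):
    """Conceptual sketch of K8s metadata AutoTagging: enrich a collected
    record with pod metadata discovered via the runtime/CRI API.
    Tag names (_namespace_, _pod_name_, _container_name_) are illustrative."""
    tagged = dict(record)  # leave the original record untouched
    tagged.update({
        "_namespace_": pod_meta["namespace"],
        "_pod_name_": pod_meta["pod"],
        "_container_name_": pod_meta["container"],
    })
    return tagged
```

The enrichment happens at collection time, so downstream analysis can filter and aggregate by workload without any change to the application's log format.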
Reliable Performance, Unmatched Excellence
LoongCollector always prioritizes the pursuit of ultimate collection performance and superior reliability, firmly believing that these form the foundation for practicing the concept of long-termism. This is mainly reflected in the relentless refinement of performance, resource consumption, and stability.
Continuous performance breakthroughs: LoongCollector adopts a single-threaded event-driven approach, combined with time-slice scheduling, lock-free technology, and zero-copy data stream processing, enabling it to achieve high performance while maintaining extremely low resource consumption.
Meticulous memory management: It uses Memory Arena technology to reduce the number of memory allocations and employs Zero Copy technology for core data streams to minimize invalid in-memory copies.
High-low watermark feedback queue: Traffic backpressure control mechanism with At-Least-Once semantic guarantee.
Pipeline multi-tenant isolation: Different data flows are isolated from each other with a priority scheduling mechanism, along with a multi-target transmission and throttling mechanism for network anomalies.
Persistent caching: Withstands short-term environmental anomalies and ensures no data loss.
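The high-low watermark feedback idea above can be sketched as a small Python model. This is illustrative only, not LoongCollector's actual C++ implementation: the queue stops accepting input once its depth reaches the high watermark (so upstream producers retain the data and retry, preserving at-least-once semantics), and resumes once consumption drains it back to the low watermark.

```python
import collections

class WatermarkQueue:
    """Illustrative high-low watermark feedback queue (not the real C++ code)."""

    def __init__(self, high=8, low=2):
        self.high = high        # stop accepting input at this depth
        self.low = low          # resume accepting input at this depth
        self.items = collections.deque()
        self.accepting = True   # backpressure signal to upstream producers

    def push(self, item):
        if not self.accepting:
            return False        # upstream keeps the item and retries later
        self.items.append(item)
        if len(self.items) >= self.high:
            self.accepting = False   # high watermark hit: apply backpressure
        return True

    def pop(self):
        item = self.items.popleft()
        if not self.accepting and len(self.items) <= self.low:
            self.accepting = True    # low watermark reached: release backpressure
        return item
```

The gap between the two watermarks prevents the accept/reject signal from flapping under bursty traffic.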
Programming Pipeline, Unrivaled Prowess
LoongCollector builds a comprehensive programmable system empowered by the dual engines of SPL and multi-language Plug-in, providing robust data preprocessing capabilities. Different engines can be interconnected, and the expected computing capabilities can be achieved through flexible combinations.
Developers have the flexibility to choose a programmable engine based on their needs. If execution efficiency is a priority, native plug-ins can be chosen. If comprehensive operators and the need to handle complex data are valued, the SPL engine is an option. If low-threshold customizations are emphasized, extended plug-ins can be selected, with programming done in Golang.
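For a sense of how these engines combine, here is a hedged pipeline sketch in the iLogtail/LoongCollector YAML style. The plugin names (`input_file`, `processor_parse_json_native`, `processor_spl`, `flusher_sls`) and the SPL expression are illustrative and should be verified against the current LoongCollector documentation.

```yaml
# Illustrative pipeline sketch; verify plugin names and fields against the docs.
enable: true
inputs:
  - Type: input_file
    FilePaths:
      - /var/log/app/*.log
processors:
  # Native plugin: the fastest path for simple structured parsing.
  - Type: processor_parse_json_native
  # Alternatively, an SPL processor for richer operators on complex data:
  # - Type: processor_spl
  #   Script: "* | parse-json content | where status = '500'"
flushers:
  - Type: flusher_sls
    Project: my_project      # placeholder
    Logstore: my_logstore    # placeholder
```

Swapping one processor for another changes the engine without touching the input or flusher stages, which is what makes the pipeline composable.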
Configuration Management, Effortless Control
In the complex production environment of distributed intelligent computing services, it is a serious challenge to manage the configuration access of thousands of nodes. This especially highlights the lack of a set of unified and efficient control specifications in the industry. To address this issue, the LoongCollector community has designed and implemented a detailed agent management and control protocol. This protocol aims to provide a standardized and interoperable framework for agents of different origins and architectures, thereby facilitating the automation of configuration management.
The ConfigServer service, implemented based on this management and control protocol, can manage any agent that complies with the protocol, significantly enhancing the uniformity, real-time performance, and traceability of configuration policies in large-scale distributed systems. As a management and control service for observable agents, ConfigServer supports the following functions:
It uniformly manages data-collecting agents in the form of agent groups.
It remotely configures the collection settings of data-collecting agents in batches.
It monitors the running status of data-collecting agents and aggregates alarm information.
Industry Comparison
In the field of observability, Fluent Bit, OpenTelemetry Collector, and Vector are all highly regarded observability data collectors. Fluent Bit is lightweight and robust, known for its performance. OpenTelemetry Collector, backed by CNCF, has built a rich ecosystem around OpenTelemetry concepts. Vector, supported by Datadog, offers a new option for data processing through the combination of Observability Pipelines and VRL. LoongCollector, in turn, builds on its log-collection roots and provides more comprehensive OneAgent collection capabilities by continuously improving metric and trace collection. It differentiates its core capabilities through performance, stability, pipeline flexibility, and programmability, while also providing large-scale collection and configuration management with powerful control capabilities. For more information, see the following table. Green sections denote advantages.
3. Observability Requirements, Adjustments, and Practices in Intelligent Computing Scenarios
Observability Requirements and Challenges of Intelligent Computing Services
As established earlier, container services based on cloud-native architecture have gradually become the foundational infrastructure supporting AI intelligent computing. With the rapid development of AI task scale, especially the number of parameters of large models jumping from the billion to the trillion level, the rapid expansion of training scale not only triggers a significant increase in cluster costs, but also poses challenges to system stability. Typical challenges include:
GPUs are expensive and have a high rate of defective cards. Automatically detecting faulty GPUs to enable self-healing of tasks and environments is a fundamental requirement for ensuring uninterrupted AI production tasks.
Questions such as how to adjust model parameters and why tasks run slowly all require more transparent observability capabilities to aid in model performance optimization and parameter tuning.
In production environments, such as during inference tasks, it is crucial to ensure the environmental stability of clusters and AI business tasks.
To effectively address these challenges, building an observability data-driven cloud-native intelligent computing service architecture becomes an urgent task. Corresponding to the hierarchical architecture of the intelligent computing service system, the observable system is also divided into three levels: observability of IaaS layer cloud resources, observability of CaaS layer containers, and observability of PaaS layer model training/inference.
When building an observability system for intelligent computing services, a key requirement is how to adapt to the characteristics of the cloud-native AI infrastructure for effective observability data collection and preprocessing. The main challenges in this process include, but are not limited to:
Heterogeneous Attributes
Resource heterogeneity: Heterogeneity in computing, storage, network, and other resources, with high requirements for data richness and timeliness, such as GPU, CPU, RDMA, and CPFS.
Data heterogeneity: Observability metrics and logs generated from cluster components to model applications vary widely and are diverse in types.
Large Cluster Size
Distributed training involves collaborative work across multiple nodes, requiring consistency of the observable data.
Multi-cluster/cross-region training demands stability to address the risks of network instability.
In large-scale multi-cluster scenarios, the controllability of data collection tasks must be ensured.
Elasticity
Frequent addition and removal of workloads, uncertain lifecycles, and large traffic bursts.
Distributed training features complex models, large training parameters/data, and an expansion speed of 10K Pods per minute.
In the process of distributed training, factors such as node failure tolerance and resource changes require elasticity to ensure the continuity of training.
Distributed inference (tidal business) is scaled in or out as traffic changes.
Multi-tenancy
Isolation of observable data collection.
Priority scheduling of collection tasks.
Therefore, there is an urgent need for a robust observability pipeline adapted to cloud-native intelligent computing services. This pipeline should integrate observable data collection and preprocessing, featuring comprehensive data collection capabilities, flexible data processing, strong elasticity, high performance, low resource overhead, stability and reliability, support for multi-tenancy, strong management and control capabilities, and user-friendliness.
LoongCollector is precisely such a pipeline, combining observability data collection and preprocessing with strong elastic scaling, high performance, low overhead, user-friendly multi-tenancy management, and reliability. The rest of this article explains how LoongCollector addresses these challenges.
Practice of LoongCollector in Intelligent Computing Services
As a high-performance observability data collection and preprocessing pipeline, LoongCollector works in the following modes in intelligent computing clusters:
Agent Mode: As An Agent
LoongCollector runs as an agent on the nodes of the intelligent computing cluster, with each LoongCollector instance dedicated to collecting multi-dimensional observable data of the node it resides on.
It makes full use of local computing resources to realize real-time processing at the data source, reducing the latency and network traffic caused by data transmission, and improving the timeliness of data processing.
LoongCollector features adaptive capabilities to dynamically scale with nodes, ensuring seamless elastic scaling of observable data collection and processing capabilities as the cluster scale evolves.
Cluster Mode: As A Service
LoongCollector is deployed on one or more core data processing nodes, with multi-replica deployment and support for scaling in/out. It is used to receive data from in-system agents or open-source protocols and perform operations such as conversion and aggregation.
As a centralized service, it facilitates grasping the context of the entire system, strengthens the capability of correlation analysis for cluster metadata, and lays a foundation for an in-depth understanding of system status and data flow.
As a centralized service hub, LoongCollector provides capabilities for cluster data scraping and processing, such as Prometheus metric scraping.
Distributed Metric Collection
Given the complexity and diversity of the intelligent computing service system architecture, it is necessary to monitor a variety of key performance indicators. These metrics range from the infrastructure level to the application level, and provide external data interfaces in the form of Prometheus Exporter. For example, there are Node Exporter for computing node resources, NVIDIA DCGM Exporter for GPU devices, kube-state-metrics for clusters, and TensorFlow Exporter and PyTorch Exporter for training frameworks.
LoongCollector natively supports the capability to directly scrape various metrics exposed by Prometheus Exporters, adopting a Master-Slave multi-replica collection mode.
The Master function is carried by the LoongCollector Operator, which provides the Target Allocator capabilities based on service discovery results to implement capabilities such as Worker load balancing, horizontal scaling, and smooth upgrades.
Worker nodes are hosted by LoongCollector, which uses the Pipeline architecture to capture and process metrics based on the results of Target Allocator.
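Conceptually, the Target Allocator shards service-discovery results across workers. The Python sketch below shows one simple hash-based sharding scheme; it is an assumption for illustration, not the algorithm the LoongCollector Operator actually uses.

```python
import hashlib

def assign_targets(targets, workers):
    """Illustrative sharding of Prometheus scrape targets across workers.
    Hash-based assignment keeps each target pinned to one worker, so no
    metric series is scraped twice."""
    shards = {w: [] for w in workers}
    for t in sorted(targets):  # sort for deterministic allocation
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        shards[workers[h % len(workers)]].append(t)
    return shards
```

When a worker is added or removed, the allocator recomputes the shards, which is how horizontal scaling and smooth upgrades stay consistent.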
Based on the collected metrics, intelligent computing services can monitor GPU usage and detect faulty cards through a visual dashboard. For high-throughput scenarios, you can quickly locate bottlenecks in multi-cluster, multi-GPU AI training and improve the utilization of resources such as GPUs.
Distributed Log Collection
In the scenario of log collection using intelligent computing clusters, LoongCollector provides flexible deployment methods according to business requirements.
DaemonSet mode: A LoongCollector is deployed on each node in the cluster, responsible for collecting logs of all containers on that node. Its features include simple operation and maintenance, low resource consumption, and flexible configuration methods. However, it has weak isolation.
Sidecar mode: A LoongCollector container runs alongside the business container in each Pod, used to collect logs generated by the business container in that Pod. Its features include superior multi-tenant isolation and high performance. But it consumes more resources.
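A minimal DaemonSet manifest for the first mode might look like the following sketch. The image name, namespace, and mount paths are placeholders; the runtime-socket mount anticipates the container auto-discovery mechanism described below.

```yaml
# Illustrative DaemonSet sketch; image and paths are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: loongcollector
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: loongcollector
  template:
    metadata:
      labels:
        app: loongcollector
    spec:
      containers:
        - name: loongcollector
          image: example-registry/loongcollector:latest  # hypothetical image
          volumeMounts:
            - name: runtime-sock
              mountPath: /run/containerd/containerd.sock  # runtime socket for discovery
              readOnly: true
      volumes:
        - name: runtime-sock
          hostPath:
            path: /run/containerd/containerd.sock
```

One such Pod per node collects for all containers on that node, which is why this mode has low resource consumption but weaker isolation than Sidecar.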
Whether it is distributed training or inference service deployment, they both have strong elasticity characteristics. LoongCollector is well adapted to the requirements of elasticity and multi-tenancy:
Container auto-discovery: Obtain container context information by accessing the socket of the container runtime (Docker Engine/containerd) located on the host machine.
Container-level information: Container name, ID, mount point, environment variable, and Label.
K8s-level information: Pod, namespace, and Label.
Container filtering and isolation: Based on container context information, it provides the capability of filtering containers for collection, which not only ensures the isolation of collection but also reduces unnecessary resource waste.
Metadata association: Based on container context information and container environment variables, it provides the capability to enrich K8s metadata in logs.
Collection path discovery
Standard output: It can automatically identify the standard output format and log path of different runtimes based on container metadata, without the need for additional manual configuration.
In-container files: For overlay and overlay2 storage drivers, the collection path is automatically concatenated according to the log type and container runtime.
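The path concatenation for overlay/overlay2 can be pictured with this small sketch. It is illustrative only: the real logic also resolves the container's storage driver and mount points via the runtime API.

```python
import os

def container_log_path(merged_dir, in_container_path):
    """Illustrative mapping of an in-container file path to a host path for
    the overlay/overlay2 storage drivers: join the container's merged
    directory with the path as seen inside the container."""
    return os.path.join(merged_dir, in_container_path.lstrip("/"))
```

With this mapping, an agent on the host can tail files inside containers without any agent running inside them.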
Data processing solution: container context association and enrichment
In intelligent computing service scenarios, to maximize the value of data, it is usually necessary to efficiently collect and transmit cluster logs, distributed training logs, and inference service logs to the backend log analysis platform, and on this basis, implement a series of data enhancement strategies. For distributed training logs, it is necessary to associate container context, which should include container ID, Pod name, namespace, and node information, to ensure the tracking and optimization of AI training tasks across containers. For inference services, in order to conduct more efficient analysis on access traffic and other aspects, it is necessary to perform field standardization processing on logs while also associating the container context.
LoongCollector, with its powerful computing capability, can associate K8s metadata with the accessed data. Meanwhile, by utilizing SPL and multi-language plug-in computing engines, it provides flexible data processing and orchestration capabilities, facilitating the handling of various complex formats.
The following are some typical log processing scenarios:
Multi-line log splitting for distributed training: Training anomalies often involve call stack information, which is usually presented in multi-line form.
Container context association for distributed training and inference services: Facilitating the tracking of abnormal training tasks and online services.
Log context sequential viewing: Logs collected into the log analysis system retain their original order.
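Start-pattern-based splitting, the usual technique behind the multi-line scenario above, can be sketched as follows. The start regex is an assumed example; real pipelines make it configurable per collection config.

```python
import re

# Assumed start-of-record pattern: a line opening with a date stamp.
START = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def split_multiline(lines):
    """Illustrative start-pattern-based multi-line splitting: a stack trace
    stays attached to the log line that produced it."""
    records, current = [], []
    for line in lines:
        if START.match(line) and current:
            records.append("\n".join(current))  # flush the previous record
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))      # flush the final record
    return records
```

Without this step, each continuation line of a Python or Java traceback would be indexed as a separate, meaningless log entry.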
eBPF collection
In the distributed training framework, multiple computing nodes work together to accelerate the model training process. However, in actual operations, the overall system performance may become unstable due to the impact of various factors, such as network latency, bandwidth limitations, and computing resource bottlenecks. These factors may lead to fluctuations or even a decline in training efficiency. LoongCollector implements non-intrusive network monitoring in intelligent computing services through eBPF technology. By capturing and analyzing traffic in real-time, it identifies the cluster network topology, enabling rapid detection of anomalies and thereby improving the efficiency of the entire model training process.
Self-observability of Pipeline
LoongCollector serves as the infrastructure for observable data collection, and the importance of its stability is self-evident. However, the operating environment is complex and variable. Therefore, emphasis has been placed on building observability for its operating status, facilitating the timely detection of collection anomalies or bottlenecks in large-scale clusters.
Monitoring of overall operating status, including CPU, memory, and startup time.
All collection pipelines have complete metrics, enabling statistics and comparison of different collection configurations at dimensions such as Project and Logstore.
All plug-ins have their metrics, which makes it possible to construct a complete pipeline topology map. The status of each plug-in can be clearly observed.
4. Exploring the Future of Data Collection
In the future, LoongCollector will continue to evolve around long-termism, strengthening its core competitiveness to meet the needs of the rapidly developing AI era.
We will further optimize performance and enhance stability through C++-level optimization, framework improvements, memory control, and caching.
By further enhancing capabilities such as Prometheus scraping, deep eBPF integration, and host metric collection, we will move closer to a true all-in-one agent.
We will also implement a series of optimizations to make LoongCollector more automated and intelligent, so that it can better serve the AI era.
LoongCollector will not only be a tool, but also a cornerstone for building intelligent computing infrastructure. You can try it out on GitHub, participate in LoongCollector and other projects of LoongSuite, and work together to illuminate the future of AI with observability.