Overview
As enterprises expand their global businesses, a key challenge emerges. It involves efficiently, economically, and reliably collecting logs from overseas applications and infrastructure into Alibaba Cloud Simple Log Service (SLS) for analysis and monitoring.
This article focuses on the application of Alibaba Cloud's high-performance log collection agents (iLogtail and LoongCollector) in overseas scenarios. It explains how to design optimal network access links for different deployment environments, including on-premises data centers, other cloud platforms, and Alibaba Cloud. We recommend LoongCollector because it is more reliable, especially in multi-target transmission scenarios. The article analyzes a variety of network solutions in detail, including direct Internet connection, Global Accelerator (GA) optimization, the Alibaba Cloud internal network, leased lines, CEN, and VPN access.
In addition, this article highlights cost optimization strategies, such as using CloudLens for SLS to diagnose usage and migrating from public links to private networks to reduce costs. It also describes two core multi-target transmission configurations: agent dual-write for cross-region disaster recovery (data redundancy), and multi-region or multi-target log distribution (on-demand routing).
This article aims to provide comprehensive practical guidance and configuration reference for enterprises to build a stable, low-cost, and highly available global log system.
1. Background and Challenges
With the expansion of global business, log data is the cornerstone of observability, troubleshooting, and compliance auditing. The unified collection and analysis of log data has thus become crucial. However, enterprises generally face the following key challenges when connecting logs distributed around the world (including local data centers, third-party cloud service providers, and Alibaba Cloud overseas regions) to Alibaba Cloud SLS. These challenges are directly related to access link selection, overall cost control, and system high availability:
Link quality and stability challenges
The quality of overseas public network links varies, and high latency, jitter, and packet loss are common. These problems affect the timeliness and integrity of log transmission and challenge the reliability and efficiency of access links. Selecting and optimizing the network path is therefore the first problem to solve to ensure stable data transmission.
Cost control pressure
The outbound traffic fees generated by transmitting large volumes of logs over the Internet are one of the main costs. In multi-cloud and global deployments in particular, reducing data transmission costs has become an urgent need. Poor link selection or a lack of optimization strategies can cause costs to spiral out of control.
High availability and disaster recovery requirements
As critical infrastructure, a log system must be highly available. A single collection link or a single target SLS project is a single point of failure. To ensure business continuity, multi-target transmission is needed to achieve cross-region and cross-zone disaster recovery, so that log data stays intact and services remain uninterrupted in various failure scenarios.
Multiple environments and compliance complexity
Log sources are distributed across on-premises data centers, multi-cloud environments, and different regions of Alibaba Cloud. This increases the complexity of agent deployment, configuration, and management in a unified manner. In addition, cross-border data transmission must meet local data security and compliance requirements, such as data localization. This adds additional constraints to link design and data processing.
2. Core Data Collection Tools: Advantages of iLogtail and LoongCollector
To address these challenges, choosing an appropriate collection agent is crucial. iLogtail and LoongCollector, the agents recommended by Alibaba Cloud, both have powerful basic capabilities, but LoongCollector further enhances reliability and adds some advanced features.
Core advantages shared by both:
● Lightweight and efficient: With a C++ core, they have low resource usage (CPU, memory) and minimal impact on business server performance.
● All-purpose collection: They support multiple data sources such as file logs, container logs (stdout, stderr, and files), Syslog, and HTTP.
● Powerful processing (processor plug-ins): They support preprocessing such as parsing (JSON, Regex, and delimiters), filtering, and desensitization of data on the agent, thereby reducing invalid data transmission from the source, optimizing costs, and improving backend processing efficiency.
● Transmission compression: The built-in LZ4 compression algorithm can significantly reduce network traffic.
● Highly reliable transmission: They are equipped with mechanisms such as local disk caching, resumable transmission, retry upon failure, and traffic shaping. These mechanisms effectively handle network fluctuations and ensure that data is "neither lost nor duplicated."
● Flexible output and multi-target: They support sending data to SLS, and a single agent instance can be configured to send the same piece of data to multiple target SLS endpoints simultaneously (for dual-write disaster recovery or migration).
● Cloud-native and ecosystem integration: They are deeply integrated with Alibaba Cloud ECS, container services such as ACK and ASK, and support convenient deployment (DaemonSet, Sidecar) and configuration management (CRD) in Kubernetes environments.
Reasons to prioritize LoongCollector:
Enhanced reliability - network anomaly isolation:
LoongCollector has higher reliability than iLogtail. A significant advantage is the support for a mechanism that isolates network anomalies on the sender side.
When the agent is configured to send logs to SLS endpoints in multiple regions (such as Singapore and Hangzhou) and a network anomaly such as a timeout or interruption occurs in one target region (for example, Singapore), LoongCollector isolates the transmission link to that abnormal region.
This means that a failure to send data to Singapore does not block or affect data transmission to Hangzhou or other normal regions. This significantly improves the overall stability and timeliness of data transmission in multi-target output scenarios, and prevents the problem where "a single point of failure affects the entire system".
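To make the isolation idea concrete, the following is a minimal Python sketch, not LoongCollector's actual implementation. It assumes one send queue per destination region and a hypothetical `send_fn` that raises `ConnectionError` on timeouts. A failing region keeps its own backlog for a later retry while the other regions keep draining.

```python
from collections import deque


class IsolatedSender:
    """Hypothetical sketch: one queue per destination region, so a failing
    region only backs up its own queue instead of blocking the others."""

    def __init__(self, regions, send_fn):
        self.queues = {region: deque() for region in regions}
        self.isolated = set()   # regions currently treated as abnormal
        self.send_fn = send_fn  # send_fn(region, batch); raises ConnectionError on failure

    def enqueue(self, batch):
        # Multi-target output: the same batch is queued once per destination.
        for queue in self.queues.values():
            queue.append(batch)

    def flush_once(self):
        """Try to drain every queue once and return the regions that failed."""
        failed = []
        for region, queue in self.queues.items():
            while queue:
                try:
                    self.send_fn(region, queue[0])
                except ConnectionError:
                    # Keep the batch for a later retry and move on to the next region.
                    self.isolated.add(region)
                    failed.append(region)
                    break
                queue.popleft()
            else:
                # Queue fully drained: the region is considered healthy again.
                self.isolated.discard(region)
        return failed


def fake_send(region, batch):
    # Simulate a network anomaly that affects only the Singapore endpoint.
    if region == "ap-southeast-1":
        raise ConnectionError("timeout")


sender = IsolatedSender(["ap-southeast-1", "cn-hangzhou"], fake_send)
sender.enqueue("batch-1")
sender.enqueue("batch-2")
print(sender.flush_once())                   # ['ap-southeast-1']
print(len(sender.queues["cn-hangzhou"]))     # 0: Hangzhou was not blocked
print(len(sender.queues["ap-southeast-1"]))  # 2: Singapore keeps its backlog
```

With a single shared queue, the Singapore timeout would sit at the head of the queue and stall Hangzhou as well; separate per-destination queues remove that head-of-line blocking.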
Conclusion: Although iLogtail remains a powerful agent, users with high reliability requirements, especially in complex scenarios such as multi-region log transmission, are strongly recommended to choose LoongCollector for its better stability and network fault tolerance. LoongCollector is currently in canary release. If you have more questions, please contact us by submitting a ticket.
Recommended versions:
iLogtail: v2.1.7 or later
LoongCollector: v3.0.9 or later
3. Link Design: Build an Efficient and Stable Data Tunnel
Depending on the business deployment location, network conditions, cost budget, and requirements for latency, stability, and security, you can choose different access link solutions:
Solution 1: Directly connect to public endpoints
Architecture: Overseas server (agent) -> Internet -> SLS Internet endpoint (target region)
Advantages: Simplest configuration. No need to purchase additional Alibaba Cloud networking services.
Disadvantages: High cost (Internet traffic fees apply), poor network quality (high and unstable cross-border latency), and security that relies only on HTTPS encryption over the public Internet.
Applicable scenarios: Scenarios where network fluctuations and latency are acceptable.
Solution 2: Optimize Internet access with Global Accelerator (GA)
Architecture: Overseas server (agent) -> Internet -> nearest Alibaba Cloud POP (GA acceleration endpoint) -> Alibaba Cloud backbone network -> SLS endpoint (target region, especially in the Chinese mainland)
Advantages: It significantly improves cross-border public network transmission quality, reduces latency and packet loss rate, and is relatively simple to deploy (enable transmission acceleration in the SLS console and modify the data endpoint of the agent to the global acceleration endpoint).
Disadvantages: Higher cost, because GA traffic fees are added. However, by reducing retransmissions at the source and improving transmission efficiency, it may indirectly lower the total cost.
Applicable scenarios: Long-distance log transmission (especially cross-border transmission to the Chinese mainland) and services with high requirements for the real-time performance and stability of logs.
For configurations, see Manage Transmission Acceleration and How to Enable Network Transmission Acceleration Service.
Solution 3: Private network access for services in the same region on Alibaba Cloud (optimal cost and performance)
Architecture: Overseas Alibaba Cloud server (agent) -> Alibaba Cloud VPC -> SLS private endpoint (same region)
Advantages: Best network quality (low latency and high stability), highest security (data travels only within the Alibaba Cloud private network), and lowest cost (no Internet traffic fees within the same region).
Disadvantages: The business server and the target SLS project must be deployed in the VPC of the same Alibaba Cloud region.
Applicable scenarios: This solution is strongly recommended for all services deployed on Alibaba Cloud that need to collect logs to SLS in the same region.
Solution 4: In hybrid cloud or multi-cloud scenarios, connect to an intranet through a leased line or a Cloud Enterprise Network (CEN)
Architecture: Overseas server (agent in local data center or other cloud VPCs) -> Express Connect circuit, VPN, or CEN -> Alibaba Cloud VPC -> SLS private endpoint (target region)
Advantages: Private network isolation provides security, and network quality and stability are better than those of the public Internet and GA (especially with leased lines).
Disadvantages: Customers need to use features such as Cloud Enterprise Network (CEN) and Express Connect to connect networks. This costs the most (including leased line fees, CEN instance fees, and bandwidth plan fees or traffic fees). The configuration and deployment are relatively complex and take a long time.
Applicable scenarios: Local data centers (IDCs) or multi-cloud environments that need to establish high-quality private network connections with Alibaba Cloud. Core services with massive log volumes and extremely high requirements for real-time performance, stability, and security. Scenarios with sufficient budgets.
Cross-region internal access requires submitting a ticket.
Solution comparison summary:
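The "Applicable scenarios" entries of Solutions 1 to 4 can be condensed into a simple decision order. The following Python function is a hypothetical sketch of that order under the assumption that the three yes/no questions capture the main constraints; it is not an official selection tool, and real decisions also weigh compliance requirements and budget details.

```python
def choose_link(same_region_on_alibaba_cloud: bool,
                private_link_budget: bool,
                needs_stable_long_distance: bool) -> str:
    """Hypothetical helper encoding the applicable scenarios of Solutions 1-4."""
    if same_region_on_alibaba_cloud:
        # Solution 3: best quality and lowest cost when agent and project share a region.
        return "Solution 3: private endpoint in the same region"
    if private_link_budget:
        # Solution 4: IDC or other clouds with strict requirements and sufficient budget.
        return "Solution 4: Express Connect, VPN, or CEN to a private endpoint"
    if needs_stable_long_distance:
        # Solution 2: long-distance or cross-border transmission that needs stability.
        return "Solution 2: GA acceleration endpoint"
    # Solution 1: simplest option when fluctuations and latency are acceptable.
    return "Solution 1: public endpoint"


print(choose_link(False, False, True))  # Solution 2: GA acceleration endpoint
```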
4. Cost Optimization Strategies: Save Every Penny
In addition to selecting appropriate network links, the following strategies can further optimize costs:
4.1 Use CloudLens for SLS to Diagnose Log Volume and Internet Traffic Exceptions
As your business grows, the volume of log data written to SLS increases. However, unexpected surges in log volume or Internet traffic sometimes occur. These are often caused by applications printing too much redundant information or by imprecise collection configurations (for example, unnecessary debug logs are collected, logs are collected repeatedly, or the collection path is so broad that irrelevant log files are collected). Such situations not only let unimportant logs consume a large amount of resources, but also significantly increase network transmission and storage costs.
To effectively monitor and diagnose these issues, Alibaba Cloud Simple Log Service (SLS) provides CloudLens for SLS. This service provides you with a centralized view that clearly displays the core resource consumption metrics of all projects in your account, including:
Log write volume:
The amount of real-time and historical data written to each project.
Outbound Internet traffic:
The traffic generated by writing data through the public endpoints.
GA traffic:
If the GA optimization link is used, relevant traffic will be displayed here.
By analyzing the reports provided by CloudLens (see View CloudLens Data Reports), you can quickly locate the projects or specific periods whose traffic usage is abnormal (excessive or sudden increase). This helps you detect and troubleshoot unexpected log writing behaviors in a timely manner. For example:
Identify which service or type of log is causing the traffic peak.
Evaluate whether the collection configuration needs to be optimized (such as adjusting the collection path).
Determine whether there is a log printing problem at the application level.
Based on these insights, you can take targeted optimization measures to control costs and improve resource usage efficiency. If you need further assistance while analyzing CloudLens data or troubleshooting specific issues, feel free to contact Alibaba Cloud technical support by submitting a ticket at any time.
References: How to Use CloudLens for SLS to Help You Analyze Resource Usage and View CloudLens for SLS Data Reports
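If you take the daily values of a project's write volume or outbound Internet traffic from a CloudLens report, a simple rule can flag sudden jumps for investigation. The following Python sketch assumes the values are available as a plain list; the threshold (3x the median of the previous 7 days) and the sample numbers are illustrative assumptions, not CloudLens defaults.

```python
from statistics import median


def find_traffic_spikes(daily_gb, window=7, factor=3.0):
    """Return indexes of days whose traffic exceeds `factor` times the
    median of the previous `window` days (a simple, hypothetical rule)."""
    spikes = []
    for i in range(window, len(daily_gb)):
        baseline = median(daily_gb[i - window:i])
        if baseline > 0 and daily_gb[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Illustrative daily outbound Internet traffic (GB) for one project.
traffic = [10, 10, 10, 10, 10, 10, 10, 11, 45, 12]
print(find_traffic_spikes(traffic))  # [8]
```

A median baseline is used instead of a mean so that a single earlier spike does not inflate the threshold for the following days.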
4.2 Data Compression to Reduce Network Bandwidth Consumption
Log data, especially raw text logs, often contains a large amount of redundant information. Direct transmission over the network consumes considerable bandwidth resources. Especially in scenarios where logs are transmitted across regions, across countries, or over the Internet, the high cost of network bandwidth is an important part of the total cost of ownership (TCO) of the log system. To mitigate this issue, both LoongCollector and iLogtail, the data collection agents of Alibaba Cloud Simple Log Service (SLS), support client-side compression before data transmission.
Core compression technology: LZ4 algorithm
High speed and efficiency: The agents mainly use LZ4, a high-speed lossless compression algorithm. Its core advantage is very fast compression and decompression with relatively low CPU usage on the client side (that is, the server where logs are generated). This makes it well suited to high-throughput, low-latency real-time data streams such as log collection, because it avoids a significant impact on the performance of business applications.
Enabled by default: LZ4 compression is enabled by default, so users benefit from the bandwidth savings without additional configuration.
Compression ratio: The compression ratio usually reaches 5:1 to 10:1, depending on how repetitive the log content is.
Effect monitoring: Users can view relevant metrics such as write traffic compression ratio of log projects through CloudLens for SLS provided by Alibaba Cloud. This provides an intuitive way to quantify the actual network bandwidth savings from compression.
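To see what the quoted 5:1 to 10:1 range means for network traffic, the following Python sketch does the arithmetic for a hypothetical 1,000 GB of raw logs per day. It computes traffic volume only; it does not model any pricing.

```python
def compressed_traffic(raw_gb, ratio):
    """Traffic on the wire after client-side compression at `ratio`:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return raw_gb / ratio


raw_per_day = 1000.0  # hypothetical raw log volume per day, in GB
for ratio in (5, 10):  # the 5:1 to 10:1 range quoted above
    sent = compressed_traffic(raw_per_day, ratio)
    print(f"{ratio}:1 -> {sent:.0f} GB sent, {raw_per_day - sent:.0f} GB saved")
# 5:1 -> 200 GB sent, 800 GB saved
# 10:1 -> 100 GB sent, 900 GB saved
```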
4.3 Optimized Log Reporting: Filter Non-essential Data to Reduce Costs and Improve Efficiency
In many scenarios, not all generated logs have the same analytical value. For example, DEBUG-level logs may be unnecessary in production, and frequently printed health check logs may add little to core business analysis. Uploading such low-value or redundant logs to SLS needlessly increases network transmission, indexing, and storage costs, and can make critical information harder to find.
iLogtail and LoongCollector provide powerful processor plug-ins, which allow you to preprocess and filter collected logs on the agent. This way, unnecessary log entries are discarded before transmission.
You can use the following methods to filter logs on the agent based on your business requirements:
Processor filter plug-in:
You can use the native filter plug-in or extended filter plug-in provided by iLogtail or LoongCollector to configure matching rules to filter unnecessary logs.
SPL-based filtering:
For some complex filtering logic, you can use SPL to write processing logic to combine parsing and filtering.
References:
Native Plug-in: Filter Processing
Extended Plug-in: Filter Logs
SPL-based Regular Expression Parsing and Filtering
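The following Python sketch illustrates the include/exclude semantics that regex-based filtering typically follows: a log is kept only if every include rule matches and no exclude rule matches. It is not the plug-in's code or configuration syntax (see the references above for the actual parameters); the field names `level` and `msg` and the sample logs are hypothetical.

```python
import re


def keep_log(log, include=None, exclude=None):
    """Return True if the log should be uploaded.

    include: {field: regex} rules that must all match.
    exclude: {field: regex} rules; any match drops the log."""
    for key, pattern in (include or {}).items():
        if not re.search(pattern, log.get(key, "")):
            return False
    for key, pattern in (exclude or {}).items():
        if re.search(pattern, log.get(key, "")):
            return False
    return True


logs = [
    {"level": "DEBUG", "msg": "cache miss"},
    {"level": "INFO", "msg": "GET /healthz 200"},
    {"level": "ERROR", "msg": "payment timeout"},
]
rules = {"level": "^DEBUG$", "msg": "/healthz"}  # drop DEBUG and health checks
kept = [log for log in logs if keep_log(log, exclude=rules)]
print(kept)  # [{'level': 'ERROR', 'msg': 'payment timeout'}]
```

Dropping entries on the agent like this means they never consume network bandwidth, indexing, or storage.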
4.4 Smooth Migration: Switch Service Log Collection from the Internet to a Private Network
If you used public Internet access in the past and now want to switch to a private network for lower cost and better performance, you usually need a smooth transition to avoid monitoring interruptions. This typically involves configuring the agent for dual-write during the migration period. This section describes two typical scenarios.
Scenario 1 (Log localization): Originally, logs from all regions were centrally collected in Project A, and now it is necessary to migrate the logs in Region B to Project B in the same region. During the migration, the agent in Region B writes to both Project A (cross-Internet or GA) and Project B (intranet in the same region) to ensure monitoring continuity. After Project B runs stably, stop writing data to Project A.
Scenario 2 (Business region migration): Migrate your workloads from Region A to Region B. During the migration, some services are still in A, and some are already in B. To maintain a unified monitoring view (for example, in Project B), the agent of Region A needs to configure dual-write: one copy is written to Project A (intranet in the same region), and the other copy is written to Project B (cross-Internet or GA). After the service is completely migrated to B, stop writing to Project A (or vice versa, depending on the final monitoring center).
For specific steps, please refer to [Best Practices] Are You Still Transmitting Data Across Borders? Compliance Governance for Cross-border Data Transmission - Local Migration of Logtail Data Without Service Interruption
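The dual-write transition in Scenario 1 can be modeled as a sequence of phases, each with the set of projects that the Region B agent writes to. The following Python sketch, with illustrative phase names, checks the property that makes the migration smooth: every two consecutive phases share at least one destination, so some project always receives the logs while you switch dashboards and alerts.

```python
# Hypothetical model of Scenario 1: destinations of the Region B agent per phase.
PHASES = [
    ("before migration", {"Project A"}),
    ("migration (dual-write)", {"Project A", "Project B"}),
    ("after migration", {"Project B"}),
]


def is_smooth(phases):
    """True if every pair of consecutive phases shares a destination project."""
    return all(current & following
               for (_, current), (_, following) in zip(phases, phases[1:]))


print(is_smooth(PHASES))                  # True
print(is_smooth([PHASES[0], PHASES[2]]))  # False: a direct cut-over has no overlap
```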
5. Multi-region Log Distribution: Send Different Logs to Different Destinations
In addition to sending the same data to multiple regions for migration or disaster recovery (the dual-write scenarios described in Section 4.4 and Section 6), there is another common requirement:
Different logs are sent to different destinations: Logs of different types or different sources from the same server or Kubernetes nodes are sent to SLS in different regions, respectively, for processing and analysis.
For example:
Business machine group in Singapore
System logs (such as /var/log/messages) are centrally collected into Project A (located in the Shanghai region) for unified operation and maintenance monitoring.
Business application logs (such as app.log) are collected into Project B (located in the Singapore region) in the region where the business is situated.
iLogtail and LoongCollector fully support this scenario. The core of the implementation is to configure the same agent instance with endpoints for multiple regions. The specific steps are as follows.
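In the agent, this routing is not a single table: each project has its own collection configuration (for example, one for /var/log/messages in Project A and one for app.log in Project B), and the agent, as a member of both machine groups, applies both. The following Python sketch only models the resulting file-to-destination mapping for the Singapore example; the pattern `*/app.log` and the sample paths are illustrative assumptions.

```python
from fnmatch import fnmatch

# Hypothetical routing table: (file path pattern, destination project, project region)
ROUTES = [
    ("/var/log/messages", "Project A", "cn-shanghai"),
    ("*/app.log", "Project B", "ap-southeast-1"),
]


def route(path):
    """Return (project, region) for a log file, or None if no rule collects it."""
    for pattern, project, region in ROUTES:
        if fnmatch(path, pattern):
            return project, region
    return None


print(route("/var/log/messages"))       # ('Project A', 'cn-shanghai')
print(route("/opt/shop/logs/app.log"))  # ('Project B', 'ap-southeast-1')
print(route("/tmp/scratch.txt"))        # None
```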
ECS
Suppose an ECS instance in Singapore needs to write logs to both the Shanghai and Singapore regions simultaneously.
On the SLS console, create Project A (located in the Shanghai region) and Project B (located in the Singapore region).
Create machine groups under Project A and Project B respectively.
On the server where the agent is installed, configure the ilogtail_config.json file of iLogtail or LoongCollector to support dual write.
The following are configuration samples of ilogtail_config.json for iLogtail and LoongCollector. LoongCollector is also compatible with the iLogtail configuration format.
ilogtail_config.json configuration sample of iLogtail
```json
{
    ...
    "config_server_address": "http://logtail.ap-southeast-1-intranet.log.aliyuncs.com",
    "config_server_address_list": [
        "http://logtail.cn-shanghai.log.aliyuncs.com"
    ],
    "data_server_list": [
        {
            "cluster": "ap-southeast-1",
            "endpoint": "ap-southeast-1-intranet.log.aliyuncs.com"
        },
        {
            "cluster": "cn-shanghai",
            "endpoint": "cn-shanghai.log.aliyuncs.com" // If GA is required, replace this with log-global.aliyuncs.com
        }
    ],
    ...
}
```
ilogtail_config.json configuration sample of LoongCollector
```json
{
    ...
    "primary_region": "ap-southeast-1",
    "config_servers": [
        "http://logtail.ap-southeast-1-intranet.log.aliyuncs.com",
        "http://logtail.cn-shanghai.log.aliyuncs.com"
    ],
    "data_servers": [
        {
            "region": "ap-southeast-1",
            "endpoint_list": [
                "ap-southeast-1-intranet.log.aliyuncs.com"
            ]
        },
        {
            "region": "cn-shanghai",
            "endpoint_list": [
                "cn-shanghai.log.aliyuncs.com" // If GA is required, replace this with log-global.aliyuncs.com
            ]
        }
    ],
    ...
}
```
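When more regions are involved, writing the region fragment by hand is error-prone. The following Python sketch generates the `primary_region`, `config_servers`, and `data_servers` fields of the LoongCollector sample. It assumes that the endpoint naming patterns shown in the sample apply to other regions as well (verify them against the SLS endpoint list) and that the agent runs on Alibaba Cloud in `local_region`, so intranet endpoints are reachable. The output is only a fragment; it must be merged into the complete ilogtail_config.json, whose other fields are elided as "..." above.

```python
import json


def multi_region_fragment(local_region, remote_regions, use_ga=False):
    """Build the LoongCollector primary_region/config_servers/data_servers
    fragment, following the endpoint patterns in the sample above."""
    config_servers = [f"http://logtail.{local_region}-intranet.log.aliyuncs.com"]
    data_servers = [{"region": local_region,
                     "endpoint_list": [f"{local_region}-intranet.log.aliyuncs.com"]}]
    for region in remote_regions:
        config_servers.append(f"http://logtail.{region}.log.aliyuncs.com")
        # Cross-region data goes over the Internet, or through GA if enabled.
        endpoint = "log-global.aliyuncs.com" if use_ga else f"{region}.log.aliyuncs.com"
        data_servers.append({"region": region, "endpoint_list": [endpoint]})
    return {"primary_region": local_region,
            "config_servers": config_servers,
            "data_servers": data_servers}


print(json.dumps(multi_region_fragment("ap-southeast-1", ["cn-shanghai"]), indent=2))
```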
Restart iLogtail or LoongCollector.
ACK
Suppose an ACK cluster in Singapore needs to write logs to both the Shanghai and Singapore regions simultaneously.
On the SLS console, create Project A (located in the Shanghai region) and Project B (located in the Singapore region).
Create machine groups under Project A and Project B respectively.
In the ACK cluster, configure the ilogtail_config.json file of iLogtail or LoongCollector to support dual write.
For configuration samples of ilogtail_config.json for iLogtail or LoongCollector, see the ECS section above.
a) Create a ConfigMap.
b) Modify the deployment of logtail-ds or loongcollector-ds.
Mount the configuration file as a ConfigMap into logtail-ds or loongcollector-ds.
Modify ALIYUN_LOGTAIL_CONFIG to point to the mounted file path.
c) Restart logtail-ds or loongcollector-ds.
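Steps a) and b) can be sketched as Kubernetes objects built in Python and saved as JSON, which `kubectl apply -f` accepts. The ConfigMap name, the namespace `kube-system`, the volume name, and the mount directory are hypothetical; use the namespace in which logtail-ds or loongcollector-ds actually runs. Only `ALIYUN_LOGTAIL_CONFIG` comes from the steps above. The `volumes` entry belongs under the DaemonSet's `spec.template.spec`, and the `volumeMounts` and `env` entries belong under the agent container.

```python
import json


def build_configmap(agent_config, name="logtail-multi-region-config", namespace="kube-system"):
    """Wrap an ilogtail_config.json dict in a Kubernetes ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap data values must be strings, so the config is serialized.
        "data": {"ilogtail_config.json": json.dumps(agent_config, indent=2)},
    }


def daemonset_changes(configmap_name, mount_dir="/etc/multi-region"):
    """Pod-spec pieces for step b): a ConfigMap volume, its mount,
    and ALIYUN_LOGTAIL_CONFIG pointing at the mounted file."""
    return {
        "volumes": [{"name": "multi-region-config",
                     "configMap": {"name": configmap_name}}],
        "volumeMounts": [{"name": "multi-region-config",
                          "mountPath": mount_dir, "readOnly": True}],
        "env": [{"name": "ALIYUN_LOGTAIL_CONFIG",
                 "value": f"{mount_dir}/ilogtail_config.json"}],
    }


# Placeholder: in practice, start from the complete agent configuration.
agent_config = {"primary_region": "ap-southeast-1"}
manifest = build_configmap(agent_config)
print(json.dumps(manifest, indent=2))  # save as configmap.json, then kubectl apply -f
print(json.dumps(daemonset_changes(manifest["metadata"]["name"]), indent=2))
```

Mounting the ConfigMap into a dedicated directory avoids hiding other files that already exist in the agent's default configuration directory.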
How to Verify the Success of Multi-region Collection
- Observe the heartbeat status of the machine groups in Project A and Project B, and confirm that the heartbeat of both machine groups is functioning properly and that the machines in the two groups are consistent.
- Check if data is reported under Logstore A of Project A and Logstore B of Project B.
Note: CloudLens for SLS is recommended for ongoing monitoring of collection status, through reports such as:
Logtail Overall Status
Logtail File Collection Monitoring
Logtail Exception Monitoring
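The two checks above (healthy heartbeats and identical membership in both machine groups) can be sketched in Python, assuming you have copied each machine group's IPs and heartbeat status from the console into two dictionaries. The IP addresses are hypothetical, and the status strings "OK" and "FAIL" are placeholders rather than guaranteed console labels.

```python
def check_machine_groups(group_a, group_b):
    """Compare two machine groups given as {machine_ip: heartbeat_status}.

    Returns only the problems found: machines whose heartbeat is not OK,
    and machines that appear in only one of the two groups."""
    problems = {
        "heartbeat_not_ok_a": sorted(ip for ip, s in group_a.items() if s != "OK"),
        "heartbeat_not_ok_b": sorted(ip for ip, s in group_b.items() if s != "OK"),
        "only_in_a": sorted(set(group_a) - set(group_b)),
        "only_in_b": sorted(set(group_b) - set(group_a)),
    }
    return {key: value for key, value in problems.items() if value}


project_a_group = {"10.0.0.1": "OK", "10.0.0.2": "OK"}
project_b_group = {"10.0.0.1": "OK", "10.0.0.2": "FAIL"}
print(check_machine_groups(project_a_group, project_b_group))
# {'heartbeat_not_ok_b': ['10.0.0.2']}
```

An empty result means both groups contain the same machines and every heartbeat is healthy.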
6. Cross-region Disaster Recovery: Configure Dual-write Collection
Dual-write collection is a high-availability strategy: through configuration, the iLogtail or LoongCollector agent on the client sends the same log data in real time and in parallel to two independent SLS projects located in different regions (or different zones of the same region). If SLS in one region or zone fails, you can switch to the project in the other available location, which already holds the same data.
To set it up, first configure a multi-region collection environment as described in Section 5. Then, in the collection configuration of your disaster recovery project, enable the option that allows a file to be collected multiple times, so that the same log files are also written to the disaster recovery project.
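Because both projects receive the same data, switching during an outage is a read-side decision: queries, dashboards, and alerts point at the first available project. The following Python sketch assumes a hypothetical `is_available` health check (for example, a probe you maintain yourself) and hypothetical project names; it does not call any SLS API.

```python
def pick_project(projects_by_priority, is_available):
    """Return the first available project, in priority order."""
    for project in projects_by_priority:
        if is_available(project):
            return project
    raise RuntimeError("no SLS project is available")


projects = ["project-a-shanghai", "project-b-singapore"]  # hypothetical names
unavailable = {"project-a-shanghai"}  # simulate an outage of the primary
print(pick_project(projects, lambda p: p not in unavailable))  # project-b-singapore
```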
7. Summary of Implementation Recommendations and Best Practices
Evaluation first: Based on the business deployment environment, log volume, real-time or stability requirements, security compliance needs, and cost budget, comprehensively evaluate and select the most appropriate network link solution.
Trade-off between cost and performance: A balance between cost, performance, security, and complexity is needed because there is no one-size-fits-all solution. Private networks in the same region are preferred.
Comprehensive monitoring and alerting: Use CloudLens to monitor overall resource consumption. Configure SLS alerts to monitor write success rates and latency. Monitor the agent status (CPU, memory, error logs, and network link quality).
More reliable agent (LoongCollector): Both iLogtail and LoongCollector are inherently highly reliable (with caching and retry capabilities), while LoongCollector offers even higher reliability than iLogtail (featuring functions such as isolation of abnormal network transmission). LoongCollector is in canary release.
LoongCollector dual-write mechanism: Logs are sent to multiple destinations in parallel to implement cross-region disaster recovery. Because the backup destination receives the same logs, no data is lost when the primary destination fails. Temporary dual-write also enables smooth migration (for example, when switching from a public network to a private network) without service interruption.
Further questions: If you have any more questions, please consult us through tickets.