DEV Community

ObservabilityGuy

Fast and Cost-effective: The New Version of SLS LogReduce, an Intelligent Engine That Discovers Patterns from Massive Logs

This article introduces the new version of Alibaba Cloud SLS LogReduce, an intelligent log analysis engine that discovers patterns from massive logs in real time with zero index overhead.

Logs record the execution path of every request, every exception, and every line of code. However, when the daily log volume expands from tens of thousands to hundreds of millions, traditional keyword search and manual filtering become inadequate. The new version of LogReduce is designed to solve this dilemma: it automatically discovers log categories and extracts log templates from massive logs, freeing engineers from "needle in a haystack" troubleshooting.

1.Why Intelligent LogReduce Is Needed
1.1 Cognitive Dilemma in "Log Floods"
With distributed systems becoming increasingly complex, a typical microservices application may contain dozens or even hundreds of service components, each continuously generating logs. A medium-sized internet application can easily generate a log volume of several TB per day. Facing such massive data, traditional log analysis methods face severe challenges:

Information overload: When an alert fires, the engineer opens the log system and faces an overwhelming stream of logs. Which information is critical? Which is noise? Making these judgments relies entirely on experience.

Keyword dependency: Traditional methods rely on preset keywords (such as ERROR and Exception) to filter logs. The problem is that unexpected abnormal patterns may be completely missed.

Context fragmentation: Even if suspicious logs are found, understanding their meanings still requires a large amount of context information. The same type of issue may appear thousands of times in slightly different forms, making it difficult to summarize manually.

1.2 Evolution of SLS LogReduce
Alibaba Cloud Simple Log Service (SLS) is a cloud-native observability and analysis platform for logs. It provides users with one-stop services such as log collection, storage, query, and analysis. As one of the core capabilities of log analysis, SLS launched the LogReduce feature (hereinafter referred to as the "old version LogReduce") early in its development to help users automatically extract patterns from massive logs.

The old version LogReduce adopts a "LogReduce at ingestion" architecture: it pre-calculates the clustering index when logs are ingested and maps each log to the corresponding pattern. The advantage of this method is that the clustering is comprehensive, but it also brings additional index storage costs, which can become a burden in large-scale log scenarios.

The "new version LogReduce" introduced in this topic is an architectural upgrade to the old version and adopts a new "LogReduce at query" design. It no longer requires pre-establishing clustering indexes. Instead, it calculates log patterns in real time when a User initiates a query, thereby achieving zero additional index Traffic, more flexible Analysis capabilities, and better cost-efficiency.

1.3 From "Viewing Logs" to "Understanding Logs"
The core idea of the new version LogReduce is: Let machines automatically discover patterns in logs.

LogReduce is based on a key insight: although a system may generate a huge volume of logs, they often originate from a limited number of log output statements. The logs generated by each log output statement share the same format and can be represented by the same "log template."

For example, the following three logs:

Got exception while serving block-123 to /10.251.203.149
Got exception while serving block-456 to /10.251.203.150
Got exception while serving block-789 to /10.251.203.151

Can be summarized into one template:

Got exception while serving <BLOCK_ID> to /<IP>

Here, <BLOCK_ID> and <IP> are the variable parts that change with each log. The rest are constants that remain unchanged within the same class of logs.

Through this abstraction, thousands of logs that would otherwise need to be reviewed one by one are compressed into a few log categories. Engineers can first locate issues at the log template level and then drill down to view specific log samples. This is exactly the cognitive upgrade brought by LogReduce.

2.Core Design Concepts
2.1 Zero Index Traffic: Lightweight Cost Advantage
Compared with the old version LogReduce, the biggest architectural advantage of the new version is zero additional index traffic.

The old version LogReduce needs to pre-calculate clustering indexes during data ingestion, which means that each log generates additional index storage costs. For a large LogStore, this cost can be considerable.

The new version LogReduce adopts a completely different policy: it calculates log templates in real time during queries, based on existing field indexes. This "LogReduce at query" method avoids the storage overhead caused by pre-indexing and allows the clustering result to reflect the latest log data instantly.


2.2 Intelligent Sampling: Balancing Precision and Performance
When the log volume is particularly large (for example, tens of millions of logs within a time window), analyzing the entire dataset is neither practical nor necessary. The new version of LogReduce has a built-in intelligent sampling policy:

// Sampling policy: When the Log Volume exceeds the threshold, automatic downsampling is performed
const sampleQuery = logCount > 50000 
 ? `|sample-method='bernoulli' ${getSampleNumber(logCount, 50000)}`
 : ''

The sampling algorithm uses Bernoulli sampling to ensure that each log record has an equal probability of being selected, which keeps the sample representative. In the model building phase, the system samples up to 50,000 log records for pattern discovery. In the result matching phase, the system samples up to 200,000 log records for pattern matching and statistics.

This stratified sampling design allows the system to maintain response times within seconds when processing massive amounts of data, without significantly impacting clustering effectiveness.
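The getSampleNumber helper referenced in the snippet above is not shown in the article. A minimal sketch of what it could compute, assuming the Bernoulli sample method takes a keep-percentage, might look like this:

```typescript
// Hypothetical sketch of the getSampleNumber helper: derive the Bernoulli
// keep-percentage needed to draw roughly `target` rows out of `logCount`
// rows. The real helper's signature and behavior are assumptions.
function getSampleNumber(logCount: number, target: number): number {
  // Each row is kept independently with probability target/logCount,
  // expressed as a percentage and clamped to (0, 100].
  const percent = Math.min(100, (target / logCount) * 100);
  // Keep two decimal places so very large logCounts don't round to 0.
  return Math.max(0.01, Math.round(percent * 100) / 100);
}
```

For example, with 5 million logs and a 50,000-row target this sketch returns 1, i.e. keep roughly 1% of rows.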

2.3 Intelligent Variable Detection: Beyond Simple Pattern Matching
One of the core challenges of LogReduce is to accurately distinguish between "variable parts" and "constant parts" in logs. The new version of LogReduce uses a more intelligent variable detection algorithm that can handle various complex scenarios:

Numeric variables: Automatically detects numeric patterns such as numbers, IP addresses, and port numbers, and supports range statistics.

Enumeration variables: For variables with limited value sets (such as status codes and service names), the system automatically calculates the Top N value distribution.

Composite variables: For complex variable patterns (such as UUIDs and Trace IDs), the system intelligently detects their boundaries.

// Variable summary statistics
| extend var_summary = summary_log_variables(variables_arr, '{"topk": 10}')

The variable summary (var_summary) not only records value samples of variables but also contains variable type inference (range / enum / gauge) and distribution statistics, laying a foundation for subsequent in-depth analysis.
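As an illustration of the classification described above (not the SLS implementation), the following sketch decides between an enum-style summary and a numeric range. summarizeVariable is a hypothetical helper:

```typescript
// Illustrative sketch of summarizing a variable's observed values as either
// an enum distribution or a numeric range. This mirrors the idea of
// var_summary but is not the SLS algorithm.
type VarSummary =
  | { kind: 'enum'; topk: Array<[string, number]> }
  | { kind: 'range'; min: number; max: number };

function summarizeVariable(values: string[], topk = 10): VarSummary {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);

  // Many distinct, purely numeric values → summarize as a range.
  const numeric = values.every((v) => /^-?\d+(\.\d+)?$/.test(v));
  if (numeric && counts.size > topk) {
    const nums = values.map(Number);
    return { kind: 'range', min: Math.min(...nums), max: Math.max(...nums) };
  }

  // Otherwise report the Top-N value distribution.
  const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topk);
  return { kind: 'enum', topk: top };
}
```

Status codes with a handful of distinct values come back as an enum with counts; a latency field with many distinct numeric values comes back as a min/max range.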

3.Technical Implementation Highlights
3.1 SPL Operator-driven Clustering Pipeline
The core computation logic of the new version of LogReduce is implemented using SPL (SLS Processing Language), forming a complete clustering pipeline:

3.1.1 Phase 1: Model Building

*
| stats content_arr = array_agg("Content")
| extend ret = get_log_patterns(
    content_arr,
    ARRAY['separator list'],
    cast(null as array(varchar)),
    cast(null as array(varchar)),
    '{"threshold": 3, "tolerance": 0.1, "maxDigitRatio": 0.1}'
  )
| extend model_id = ret.model_id

get_log_patterns is the core pattern extraction operator. It accepts a set of log contents and automatically discovers log templates by clustering them. The algorithm parameters include:

• threshold: The minimum support value for detecting whether a token at a specific position is a variable. The larger the threshold, the less likely the token is determined to be a variable.

• tolerance: The tolerance for variable detection. The smaller the value, the more likely frequently appearing tokens are determined to be constants. We recommend using the default value.

• maxDigitRatio: The maximum ratio threshold of numeric characters.
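To make the role of these parameters concrete, here is an illustrative sketch of position-wise variable detection over logs that tokenize to the same shape. The parameter names mirror the SPL options, but the logic is a deliberate simplification, not the actual get_log_patterns algorithm:

```typescript
// Simplified position-wise variable detection: a token position is treated
// as a variable when it takes many distinct values or is digit-heavy.
// Parameter names echo the SPL options above; the logic is illustrative only.
function detectVariablePositions(
  logs: string[][],     // logs pre-split into tokens, all the same length
  threshold = 3,        // min distinct values before a position is a variable
  maxDigitRatio = 0.1,  // digit-heavy positions are treated as variables
): boolean[] {
  const width = logs[0].length;
  return Array.from({ length: width }, (_, i) => {
    const tokens = logs.map((t) => t[i]);
    const distinct = new Set(tokens).size;
    const digitHeavy =
      tokens.filter((t) => /\d/.test(t)).length / tokens.length;
    return distinct >= threshold || digitHeavy > maxDigitRatio;
  });
}
```

Applied to the three "Got exception" sample logs from Section 1.3, the block-ID and IP positions come out as variables and every other position as a constant.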

3.1.2 Phase 2: Pattern Matching

*
| extend ret = match_log_patterns('${modelId}', "Content")
| extend pattern_id = ret.pattern_id,
    pattern = ret.pattern,
    pattern_regexp = ret.regexp,
    variables = ret.variables
| stats event_num = count(1), hist = histogram(time_bucket_id)
    by pattern_id

match_log_patterns matches each log record against the discovered patterns and extracts the following information:

• pattern_id: The ID of the pattern.

• pattern: The log template.

• pattern_regexp: The regular expression of the pattern.

• variables: The specific values of the variable parts.
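Conceptually, this matching phase can be pictured as trying each discovered pattern's regular expression in turn. The sketch below is illustrative; the field names merely mirror the operator's output:

```typescript
// Illustrative matching step: return the first pattern whose regexp matches
// the log content, along with the captured variable values. Not the SLS
// implementation, just the idea behind match_log_patterns' output.
interface MatchResult {
  pattern_id: number;
  variables: string[];
}

function matchLog(content: string, patterns: RegExp[]): MatchResult | null {
  for (let id = 0; id < patterns.length; id++) {
    const m = content.match(patterns[id]);
    if (m) return { pattern_id: id, variables: m.slice(1) };
  }
  return null; // no pattern matched this log line
}
```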

3.1.3 Phase 3: Comparative Analysis (Optional)

| extend ret = merge_log_patterns('${modelId1}', '${modelId2}')
| extend model_id = ret.model_id

For comparative analysis scenarios, merge_log_patterns can merge the clustering models of two time ranges, thereby comparing them in a unified pattern space to detect new, disappeared, or changed log patterns.
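The comparison that merge_log_patterns enables can be illustrated with a small sketch that classifies patterns from two time ranges as new, disappeared, changed, or stable. The changeRatio threshold is a hypothetical choice, not an SLS parameter:

```typescript
// Illustrative diff over pattern counts from two time ranges, matching the
// "new / disappeared / changed" categories described above.
interface PatternDiff {
  pattern: string;
  before: number;
  after: number;
  status: 'new' | 'disappeared' | 'changed' | 'stable';
}

function diffPatterns(
  before: Map<string, number>,
  after: Map<string, number>,
  changeRatio = 2, // flag patterns whose count grew or shrank by this factor
): PatternDiff[] {
  const keys = new Set([...before.keys(), ...after.keys()]);
  return [...keys].map((pattern) => {
    const b = before.get(pattern) ?? 0;
    const a = after.get(pattern) ?? 0;
    let status: PatternDiff['status'] = 'stable';
    if (b === 0) status = 'new';
    else if (a === 0) status = 'disappeared';
    else if (a / b >= changeRatio || b / a >= changeRatio) status = 'changed';
    return { pattern, before: b, after: a, status };
  });
}
```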

3.2 Frontend Rendering: High-performance Big Data Display
In terms of frontend implementation, the core challenge faced by the LogReduce widget is: How to efficiently render and interact with a large number of clustering results?

3.2.1 Virtual Scrolling and Paging
Clustering results may contain hundreds or even thousands of log patterns. The system uses pagination, rendering only 15 records per page, combined with virtual scrolling technology to ensure the interface remains smooth:

// Paging logic
const [currentPage, setCurrentPage] = useState<number>(1)
const pageSize = 15
const pagedResult = useMemo(() => {
  const startIndex = (currentPage - 1) * pageSize
  return filteredResult.slice(startIndex, startIndex + pageSize)
}, [filteredResult, currentPage])

3.2.2 Interaction Design for Highlighted Variables
The variables in the log template need to be highlighted and allow users to view the variable distribution on click. The system implements a dedicated highlight widget, which can:

• Parse template strings and detect variable placeholders.

• Generate an independent clickable area for each variable.

• Display the distribution statistics of a variable after it is clicked (enumeration types display Top N values; numeric types display a range distribution).
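A minimal sketch of the first two steps, assuming variable placeholders look like <NAME>, could split a template string into plain-text and variable segments that the UI renders as clickable spans:

```typescript
// Split a log template such as "Got exception while serving <BLOCK_ID> to
// /<IP>" into text and variable segments. The <NAME> placeholder syntax is
// an assumption based on the examples earlier in the article.
type Segment = { kind: 'text' | 'variable'; value: string };

function parseTemplate(template: string): Segment[] {
  const segments: Segment[] = [];
  const placeholder = /<([A-Z_]+)>/g;
  let last = 0;
  for (const m of template.matchAll(placeholder)) {
    if (m.index! > last) {
      segments.push({ kind: 'text', value: template.slice(last, m.index) });
    }
    segments.push({ kind: 'variable', value: m[1] });
    last = m.index! + m[0].length;
  }
  if (last < template.length) {
    segments.push({ kind: 'text', value: template.slice(last) });
  }
  return segments;
}
```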

3.2.3 Dual Column Chart in Comparative View

In comparative analysis mode, each log pattern needs to display the data distribution of two time ranges simultaneously. The system uses a dual-color column chart to meet this requirement:

• Dark columns: Log count in the current time range (experiment group).

• Light columns: Log count in the comparative time range (comparison group).

Through visual comparison, users can intuitively discover:

• New log patterns (present in the experiment group, absent in the comparison group).

• Disappeared log patterns (absent in the experiment group, present in the comparison group).

• Log patterns with significant quantity changes.

3.3 Reverse Regular Expression Lookup: Bridging the Last Mile from Analysis to Query
After a problem pattern is discovered on the LogReduce page, how can you view all logs of this category?

The new version of LogReduce solves this problem through regular expressions. Each log template automatically generates a corresponding regular expression (pattern_regexp). Users can copy this regular expression and use the regexp_like operator to perform a precise query:

* | SELECT * FROM log WHERE regexp_like(Content, 'Copied regular expression')

This design seamlessly connects cluster analysis with raw log queries: after discovering a problem pattern, users can immediately drill down to view the specific log details.
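For illustration, a template can be turned into such a regular expression by escaping the literal text and replacing each placeholder with a wildcard capture group. The real pattern_regexp emitted by SLS may use tighter, per-type expressions; this is an assumption:

```typescript
// Convert a log template into a matching regular expression: escape regex
// metacharacters in the literal text, then replace each <NAME> placeholder
// with a non-greedy capture group. Illustrative only.
function templateToRegexp(template: string): RegExp {
  const escaped = template.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/<[A-Z_]+>/g, '(.+?)') + '$');
}
```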

4.Typical Scenarios
4.1 Scenario 1: Quickly Locate Fault Logs
An e-commerce platform received a high volume of alerts during a promotional activity. The O&M engineer opens the LogReduce page:

1.Set the time range to the 10 minutes after the alerts started.
2.Filter out normal INFO-level logs in the search statement: * and not LEVEL: INFO.
3.View the clustering results and discover a new pattern: Got exception while serving <*> to /<*>: Connection timeout.
4.Click the pattern to view the variable distribution, and discover that the IP values are concentrated in the 10.251.xxx.xxx network segment.
5.Determine that the issue is likely a network problem in that segment, and immediately begin troubleshooting.

The entire process takes less than 5 minutes, whereas traditional keyword search may require trying multiple keyword combinations and take several times longer.

4.2 Scenario 2: Post-Release Comparative Analysis
The development team published a new version and needs to evaluate its impact on log patterns:

1.Set the current time range to the hour after the release.

2.Set the comparison time range to the hour before the release.

3.View the comparison results, and pay attention to the following situations:

  • Newly appearing error log patterns
  • Disappearing log patterns (some problems may have been fixed)
  • Patterns with significantly changed quantity

4.For suspicious new patterns, click to view log samples for further analysis.

4.3 Scenario 3: Multi-module Group Analysis
When a Logstore contains logs from multiple modules, you can use the group clustering feature:

1.Select the aggregation field as Component or ServiceName.
2.The system will first group by module and then perform clustering independently within each group.
3.By using the group view, you can quickly detect which module generated abnormal logs.

This layered analysis method is particularly suitable for log analysis in large-scale systems, avoiding mutual interference between logs of different modules.

5.Thoughts on Algorithm Design
5.1 Why Choose "clustering at query time"?
When designing the new version of LogReduce, we faced a key architectural decision: pre-compute clustering indexes at write time, or compute in real time at query time?

We eventually chose the latter, mainly based on the following considerations:

Flexibility: The pre-computation method requires defining clustering fields and parameters in advance, which are difficult to change once configured. Computing at query time allows users to dynamically select clustering fields, filter conditions, and time ranges, providing greater flexibility.

Cost-effectiveness: Not all logs require cluster analysis. The pre-computation method processes all logs uniformly, incurring unnecessary costs. Computing at query time is "pay-as-you-go," consuming resources only when analysis is truly needed.

Algorithm evolution: Clustering algorithms are an area of continuous optimization. Computing at query time allows us to upgrade the algorithm at any time; new analyses automatically benefit from the latest improvements without reprocessing historical data.

5.2 The Art of Sampling: How to Balance Efficiency and Precision
Sampling is one of the key designs of the new version of LogReduce. A natural concern is: Will sampling miss important log patterns?

Our policy is "phased sampling":

Pattern discovery phase: 50,000 logs are sampled to discover patterns. Because the number of log patterns is usually far smaller than the number of logs (this is the basic assumption of LogReduce), a sample of 50,000 logs is usually sufficient to discover the vast majority of patterns.

Pattern matching phase: 200,000 logs are sampled for statistics. The sampling in this phase mainly affects the precision of the count statistics, rather than pattern discovery.

Variable statistics phase: For each pattern, the Top 10 variable values are retained. This is sufficient for users to understand the distribution features of the variables.

Practice has shown that this stratified sampling policy can provide sufficiently accurate clustering results in the vast majority of scenarios, while maintaining second-level query responses.
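The claim can be sanity-checked with a back-of-envelope calculation: under Bernoulli sampling of n logs, a pattern occurring with relative frequency p is missed entirely with probability roughly (1 - p)^n:

```typescript
// Probability that a pattern with relative frequency p is absent from a
// Bernoulli sample of n logs: every one of its occurrences must be dropped.
function missProbability(p: number, n: number): number {
  return Math.pow(1 - p, n);
}
```

Even a rare pattern covering only 0.01% of logs is missed by a 50,000-log sample with probability about e^-5, i.e. under 1%, which supports the claim that sampling rarely hides an entire pattern.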

6.Summary and Outlook
The new version of LogReduce represents a paradigm shift in the realm of log analysis: from passive keyword search to active pattern discovery; from manual, line-by-line troubleshooting to intelligent categorization.

Its core value lies in:

1.Improved efficiency: Compressing millions of logs into hundreds of log categories allows engineers to quickly grasp the full picture of logs.

2.Enhanced insight: Automatically detecting newly appearing or disappearing log patterns, and discovering changes that are difficult to detect manually.

3.Cost optimization: The design of zero extra index traffic ensures that cluster analysis is no longer a cost burden.

4.Flexible analysis: Supports various analysis modes, such as comparative analysis and group clustering, to meet the needs of different scenarios.

Looking ahead, LogReduce has more possibilities:

• Integration with outlier detection: Automatically detects log patterns with sudden increases or decreases in quantity to provide early warnings for potential issues.

• Integration with LLMs: Uses large language models to understand log semantics, assist in analyzing log templates, and provide more intelligent pattern classification and problem diagnosis.

• Integration with UModel: Associating entities with LogSets allows users to view LogReduce results and build a more complete observability knowledge graph.

We believe that as these capabilities continue to evolve, log analysis will transform from a tedious operational task into an intelligent system insight tool.
