How Tencent Cloud CLS Optimized Lucene for Time-Series Log Search

#lucene #logging #database #performance

Log search looks like text search until every query includes a time range. That time predicate changes the problem. In a high-volume log platform, timestamps are high-cardinality values, and scanning timestamp ranges can dominate query latency.

The source article describes Tencent Cloud CLS's time-series search engine, built on top of Lucene and accepted by VLDB 2022 under the paper title TencentCLS: The Cloud Log Service with High Query Performances.

According to the source article, the time-series search engine achieved nearly 40x improvement over a traditional search engine in massive log retrieval. It also reports 38x improvement for head queries, 24x for tail queries, and 7.6x for histogram queries in the paper-related experiment tables.

Why timestamp range search is hard in Lucene-style indexes

The source article starts with a typical log record:

[2021-09-28 10:10:39T1234] [ip=192.168.1.1]
XXXXXXXX

A log platform indexes the timestamp, attributes such as ip, and tokenized text. A typical query specifies a time range:

Select *
from xxxx_index
where ip = xxxx
  and timestmap >= 2021-09-28 xxxx
  and timestmap <= 2021-09-29 xxxx;

Lucene is strong at text search, but the article points out that timestamp range search is a high-cardinality numeric range problem. If a timestamp is stored at millisecond precision, one day contains:

24 * 60 * 60 * 1000 = 86,400,000

possible timestamp values. At microsecond precision, the possible values are another 1000x larger.

In an inverted index, a timestamp key maps to a posting list:

timestamp -> [docid1, docid2]

For an exact timestamp lookup, the search complexity is efficient. The source article describes normal search as O(log(n)). But a one-day timestamp range may require scanning a massive number of timestamp terms. The article describes the high-cardinality range search complexity as O(n), where n is the number of index terms.

The source gives a concrete scale example: in a 10-billion-log index, the observed timestamp-index data can be around 30GB. Reading that at 100MB/s would take about 300 seconds just to load the index data.

Optimization 1: order logs by timestamp

The central design shift is to organize log data by timestamp order. In the old layout, timestamps are unordered, so the engine must handle many timestamp index terms for a range query. In the time-ordered layout, a time range can be reduced to endpoint handling.

The source article states that this reduces the timestamp terms handled from hundreds of thousands or hundreds of millions down to two endpoints.

Optimization 2: add a secondary index for disk access

Simple binary search works well in memory, but the source article notes that it causes scattered disk access when applied to ordered column data. The solution is a secondary index that reduces disk access from dozens of operations to three.

Optimization 3: make reverse search fast

The source article says the original underlying iterators only supported one-way iteration. That is a problem for reverse chronological search: if the target data sits at the tail of a timestamp-ordered sequence, a one-way iterator must traverse all previous data first.

CLS solves this with a reverse binary-search algorithm built on top of the one-way iterator. The article reports that iteration count drops from tens of thousands or hundreds of thousands to dozens, and the complexity changes from O(n) to O(logn * logn).

Optimization 4: compute histograms from bucket boundaries

Histogram is one of the most common log-analysis operations. The source article says the original system computed histograms by reading timestamps back for every matching log, producing tens of thousands or hundreds of thousands of back-table lookups.

The optimized approach uses bucket boundaries to determine log-ID ranges. Instead of fetching timestamps for every matched log, the engine performs a few index accesses to find boundaries, then assigns internal points by comparing them with the bucket limits. The secondary index is also used here to reduce disk access for boundary lookups.

Reported performance results

The source article reports several performance results:

Test context	Source-reported result
Paper experiment, head query	38x improvement
Paper experiment, tail query	24x improvement
Paper experiment, histogram query	7.6x improvement
Offline prototype test on 8 million rows, 100 concurrent requests	50x response improvement, `1.059s` vs `56.9s`
Concurrency under sub-second response target	90+ vs 4, a 20x improvement
Online testing with writes present	Core operations were more than 10x faster
Cold-data scenario	Core operation response speed improved by 10x+

The article also notes that IO jitter had to be optimized before online testing, because a 2-3 second long-tail jitter is less visible when the original query takes more than 10 seconds, but severely distorts results when the optimized query runs in hundreds of milliseconds.

Comparison with a Lucene-based cloud log service

The source article compares CLS with another cloud log service in a one-billion-row scenario. It explains the difference through timestamp granularity:

System design	Timestamp index implication from the source
Minute-level index	One day has `24 * 60 = 1440` index terms
CLS microsecond-level index	One day can theoretically have `24 * 60 * 60 * 1000 * 1000 = 86,400,000,000` timestamp values

The source states that CLS previously used millisecond timestamps and moved to microsecond timestamps after the new index went online. The time-series index is the reason CLS can support high-cardinality timestamp retrieval while maintaining performance.

Engineering takeaways

Log search is not only text search; the time-range predicate can dominate the query plan.
High-cardinality timestamp fields are expensive when a range query must scan many index terms.
Ordering logs by timestamp changes range search from many-term processing to endpoint processing.
Secondary indexes matter when a theoretically efficient binary search would otherwise produce scattered disk reads.
Reverse chronological queries and histograms need specialized handling, because they are common in real log troubleshooting.
The source article's reported gains come from combining data layout, secondary indexing, reverse access, histogram boundary lookup, and IO jitter optimization.