ObservabilityGuy
Eliminating Query Timeouts: A Real-world Practice of SLS Materialized View Tuning in High Concurrency Scenarios

#ai

This article draws on real business scenarios and result data to review how we optimized several slow queries that frequently timed out, bringing them down to responses within seconds.
Backend and monitoring developers have likely experienced this anxiety: once log data volume reaches a certain size, previously smooth queries start to fail or hang. The monitoring service floods you with alarms, or stakeholders need data urgently, but the log API you invoke stalls and eventually returns a request timeout.

Recently, we worked with a power user (a large business team) to apply Simple Log Service (SLS) materialized views to their core log scenarios, and compared performance before and after the feature was enabled in the production environment. Whether measured by hard performance data or by actual user experience, the difference is significant.


Case 1: No more timeouts under high-concurrency SDK load
This is a very typical automated monitoring scenario. The user's monitoring service invokes the log API at high frequency via the SDK to pull invocation latency data between services.

The difficulty lies in "high concurrency + dynamic conditions." The monitoring program sends a large number of requests in a short time, and the query conditions change from request to request: it queries columnx:"abc" in one second and columnx:"abd" in the next. This usage pattern puts significant pressure on the backend. Before optimization, the average query took 4,100 ms, creating a vicious loop: slow queries -> thread pool backlog -> concurrent requests competing for resources -> eventually widespread timeouts.

SQL after removing business semantics:

query | select
  column1, column2, column3,
  (timestamp - timestamp % 3600) as time_slot,
  count(*) as cnt,
  avg(metric_val) as avg_lat
  from log
  group by column1, column2, column3, time_slot

After using materialized views: the query duration plummeted to 46 ms, roughly an 89-fold performance improvement. More importantly, no matter how high the SDK concurrency is or how the query conditions change, only the pre-computed result needs to be read, so the response time stays stable and the timeouts under high concurrency are gone.
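The intuition behind this speedup can be shown in a few lines. The toy sketch below (made-up data, not the SLS engine) applies the same `timestamp - timestamp % 3600` hourly bucketing as the SQL above, pre-aggregating count and sum per key and hour so that a query becomes a small lookup instead of a raw-log scan:

```python
# Toy sketch of the pre-aggregation behind Case 1 (hypothetical data, not SLS internals).
from collections import defaultdict

raw_logs = [
    # (timestamp, column1, metric_val)
    (1700000100, "svc-a", 120.0),
    (1700000200, "svc-a", 80.0),
    (1700003700, "svc-a", 200.0),  # falls into the next hourly bucket
]

# Build the "materialized view": count and running sum per (key, hour_slot).
view = defaultdict(lambda: [0, 0.0])  # (column1, slot) -> [cnt, sum]
for ts, col1, val in raw_logs:
    slot = ts - ts % 3600  # same bucketing as the SQL's time_slot
    entry = view[(col1, slot)]
    entry[0] += 1
    entry[1] += val

# Answering the query is now a cheap lookup over a few pre-computed rows.
for (col1, slot), (cnt, total) in sorted(view.items()):
    print(col1, slot, cnt, total / cnt)  # column1, time_slot, cnt, avg_lat
```

Because the view holds one row per key and hour rather than one row per log line, its size (and therefore query latency) no longer grows with raw traffic, which is why the response time stays flat under SDK concurrency.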

Case 2: Taming the performance killer: distinct count operations
Anyone who has worked with data knows that count(distinct) is a notoriously resource-intensive operation, especially at large data volumes.

User SQL:

query | select 
  project_id, 
  count(1) as event_cnt, 
  count(distinct hash_val) as issue_cnt
  from log
  group by project_id

This SQL counts distinct issue signatures (represented by the hash value), and it struggles when the data volume is large.

Before optimization: this query took an average of 16.8 s. If the time range is extended slightly (for example, to view the trend over the past month), or peak traffic is a bit heavier, the query often fails.

After optimization: accelerated by materialized views, the query time dropped to 2.2 s, roughly an 8-fold performance improvement, transforming this feature from "frequently unusable" to "reliably available."
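Part of what makes count(distinct) hard is that it resists naive pre-aggregation: per-bucket distinct counts cannot simply be summed, because the same hash_val may appear in several buckets. A view therefore has to keep the distinct sets themselves (or a sketch such as HyperLogLog) and merge them at query time. A minimal illustration, not a description of SLS internals:

```python
# Why per-bucket distinct counts don't add up (illustrative sketch, not SLS internals).
bucket_a = {"h1", "h2", "h3"}   # distinct hash_val values seen in hour A
bucket_b = {"h2", "h3", "h4"}   # distinct hash_val values seen in hour B

naive = len(bucket_a) + len(bucket_b)   # overcounts: "h2" and "h3" appear in both
merged = len(bucket_a | bucket_b)       # correct issue_cnt: union, then count
print(naive, merged)
```

This merge cost is exactly the expensive part that the materialized view shifts from query time to ingestion time.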

Case 3: Comparative Analysis, from "54 s timeout" to "second-level response"
This scenario saw the largest performance improvement in this round of optimization. The user needs to view the comparative change in operation-log read latency (comparing data from 1 day ago, 3 days ago, and 7 days ago).

User SQL:

type:read |
  select
    time,
    diff[1] as day1,
    diff[2] as day2,
    diff[3] as day3,
    diff[4] as day7
  from (
    select
      time,
      ts_compare(avg_latency, 86400, 172800, 604800) as diff
    from (
      select
        avg(latency) as avg_latency,
        date_trunc('hour', __time__) as time
      from log
      group by time )
    group by time order by time )

This SQL involves ts_compare and multiple layers of nested subqueries. When the query time range is large, the computational load is very high.

Before optimization: duration was 54.3 s. If the backend service jittered even slightly, the user's request timed out, rendering the feature essentially unusable.

After optimization: duration is 958 ms. From a wait of nearly one minute, it dropped to under one second, roughly a 56-fold performance improvement. This experience change from "unqueryable" to "near-instant results" is the most tangible one for the O&M personnel waiting on the data.
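Conceptually, the ts_compare step pairs each point in a time series with the value of the same series a fixed number of seconds earlier. The sketch below illustrates that idea on made-up hourly averages, using a simplified helper (this is not the SLS function's actual return format, which also includes ratios):

```python
# Rough illustration of the idea behind ts_compare (made-up numbers, simplified output).
hourly_avg = {        # hour-truncated timestamp -> avg read latency
    0:      10.0,
    86400:  12.0,     # one day later
    172800: 9.0,      # two days later
}

def compare(series, ts, offset):
    """Return (current value, value `offset` seconds earlier), or None if missing."""
    if ts - offset in series:
        return series[ts], series[ts - offset]
    return None

print(compare(hourly_avg, 172800, 86400))   # current hour vs. 1 day ago
print(compare(hourly_avg, 172800, 172800))  # current hour vs. 2 days ago
```

On raw logs, every one of these lookups requires re-aggregating a full day's (or week's) data first; on a materialized view of hourly averages, each is a single-row read, which is where the 54 s to 958 ms drop comes from.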

A cost-benefit analysis

The ROI (Return on Investment) of this optimization is very high:

● High utilization: in a single day, these views served a cumulative 10,223 queries.

● Extremely low cost: you might worry that storing a copy of the results is expensive. In fact, the added storage cost is less than 0.1% of the raw-log storage fee, which is negligible.

Summary
Based on this practical experience, we have summarized the three scenarios best suited to SLS materialized views. If your business fits any of the following situations, enable materialized views:

● Tackling intractable slow queries: if your SQL contains heavy deduplication statistics (count distinct), high-precision percentile calculations (approx_percentile), or long-time-range analytics like Case 3, these operations are hard to finish within a few seconds at large raw data volumes no matter how you optimize, and they often time out outright. Materialized views pre-process these computationally expensive tasks, turning timeouts into second-level responses.

● Scenarios requiring highly responsive user interfaces: avoiding timeouts is not enough. For data products that face users directly, or core dashboards that executives view every day, 10 seconds and 1 second are completely different experiences. If your target is to make dashboard operations as smooth as working with a local spreadsheet, pre-computation is essential.

● A safeguard against high-concurrency failures: this is the most easily overlooked point. A single query may be tolerable, but once a failure occurs, dozens of people refresh the dashboard at the same time, on top of hundreds of concurrent requests from automated inspection scripts (via the SDK), and it is easy to hit resource bottlenecks on the server. The essence of a materialized view is to turn expensive on-the-fly computation into low-latency table lookups. At critical moments, this is the cornerstone that keeps the system from being overwhelmed.
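The "on-the-fly computation vs. table lookup" trade under concurrency can be made concrete with rough back-of-the-envelope arithmetic (hypothetical numbers, not measured SLS figures):

```python
# Back-of-the-envelope comparison (hypothetical numbers, not SLS measurements).
raw_rows = 10_000_000   # rows scanned by one on-the-fly aggregation
requests = 200          # concurrent dashboard refreshes + SDK inspection scripts

scan_cost = raw_rows * requests   # without a view: every request re-scans raw logs
view_cost = raw_rows + requests   # with a view: one refresh pass + N cheap lookups
print(scan_cost // view_cost)     # rough work-reduction factor
```

The key property is that the scan cost grows multiplicatively with concurrency while the view cost grows only additively, which is why pre-computation holds up precisely when a failure triggers a burst of simultaneous queries.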
A picture is worth a thousand words. We have condensed the core performance metrics and best-fit scenarios of this practice into the following infographic, which we hope serves as a reference for your own performance optimization.
