Monitor CDN Performance with Real-Time CLS Log Analysis

A CDN is a performance layer, but its logs are also an operations dataset. Every request can reveal latency, cache behavior, response code, client distribution, traffic volume, and download speed. The source article explains how Tencent Cloud CDN logs can be delivered into Tencent Cloud CLS and analyzed in real time.

The original problem is familiar: CDN providers expose basic metrics such as request count and bandwidth, but these defaults are not enough for customized troubleshooting, so teams often download raw CDN logs for offline analysis. The source article names two drawbacks of that approach: it adds operations and development cost, and the data is not truly real time. Delays of more than half an hour are common in offline workflows.

The CDN-to-CLS path is designed for interactive analysis:

  • one-click log delivery;
  • analysis of very large log volumes within seconds;
  • real-time dashboard visualization;
  • one-minute real-time alerting.

CDN log fields that matter

The source article lists the CDN log schema. The key fields are:

Field          CLS type  Meaning
app_id         long      Tencent Cloud account APPID.
client_ip      text      Client IP address.
file_size      long      File size.
hit            text      Cache HIT or MISS. Edge-node and parent-node hits are both marked as HIT.
host           text      Domain name.
http_code      long      HTTP status code.
isp            text      Carrier or ISP.
method         text      HTTP method.
param          text      URL parameters.
proto          text      HTTP protocol identifier.
prov           text      Carrier province.
referer        text      HTTP referer.
request_range  text      Range request parameter.
request_time   long      Response time in milliseconds, from the node receiving the request to completing response delivery to the client.
request_port   long      Client-to-CDN-node connection port, or - if unavailable.
rsp_size       long      Response size in bytes.
time           long      Request time as a UNIX timestamp in seconds.
ua             text      User-Agent.
url            text      Request path.
uuid           text      Unique request identifier.
version        long      CDN real-time log version.

Scenario 1: alert when CDN latency exceeds a threshold

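The source recommends percentiles over simple averages or individual samples. An average can hide a small but important set of slow requests, while individual samples are too noisy. The example computes average latency, P50, and P99 in five-minute buckets. LIMIT 1440 caps the result at 1440 buckets; note that one day contains 288 five-minute buckets (1440 is the number of minutes in a day), so the selected time range, not the limit, determines how much history is shown.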

* |
SELECT
  avg(request_time) AS l,
  approx_percentile(request_time, 0.5) AS p50,
  approx_percentile(request_time, 0.99) AS p99,
  time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') AS time
GROUP BY time
ORDER BY time DESC
LIMIT 1440

The resulting chart compares average latency, P50, and P99 over time. Its operational value is that P99 reveals the long-tail experience even when the average line looks acceptable.
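To make the long-tail argument concrete, here is a minimal Python sketch that runs locally on made-up data, not in CLS. The sample values and the helper nearest_rank_percentile are hypothetical, and the helper computes an exact nearest-rank percentile rather than the approximation that approx_percentile returns.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Exact nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests (20 ms) and 2 slow ones (3000 ms).
request_times_ms = [20] * 98 + [3000] * 2

avg_ms = mean(request_times_ms)
p50_ms = nearest_rank_percentile(request_times_ms, 50)
p99_ms = nearest_rank_percentile(request_times_ms, 99)

print(f"avg={avg_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this sample the average stays below a 100 ms threshold while P99 is far above it, which is the situation a percentile-based alert is meant to catch.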

The alert condition in the source is based on P99 latency greater than 100 ms:

* |
SELECT
  approx_percentile(request_time, 0.99) AS p99

In the alert-condition configuration, the rule computes p99 from request_time and triggers when the configured condition, such as P99 greater than 100 ms, is met.

The multidimensional analysis settings control what the alert message displays. The source says it should include the affected host, url, and client_ip, so developers can quickly determine which domain, path, and client segment are involved.

Once the alert fires, the key information can be delivered immediately through channels such as WeChat, Enterprise WeChat, or SMS.

Scenario 2: alert when resource access errors spike

The source's second alert scenario is error-count growth. If page-access errors suddenly increase, the backend server may be failing or the service may be overloaded.

The source compares the latest one-minute error count with the previous one-minute count. Latest minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time DESC
  LIMIT 1
)

Previous minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time ASC
  LIMIT 1
)

The trigger expression from the source is:

$2.errct - $1.errct > 100

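The alert policy compares the two query results. The expression treats $2.errct as the latest minute's error count and $1.errct as the previous minute's, so the alert fires when errors grow by more than 100 from one minute to the next. Assuming $N refers to the N-th query in the policy, this requires the previous-minute query to be configured first and the latest-minute query second; if the queries are numbered in the order listed above, the operands must be swapped.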
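The comparison logic can be checked offline with a small Python sketch. It is an illustration on made-up records, not CLS code: the function names and sample data are hypothetical, and records are plain dicts carrying the time and http_code fields from the schema.

```python
from collections import Counter


def error_counts_per_minute(records):
    """Count records with http_code >= 400, keyed by minute start (UNIX seconds)."""
    counts = Counter()
    for rec in records:
        if rec["http_code"] >= 400:
            counts[rec["time"] - rec["time"] % 60] += 1
    return counts


def error_growth(records):
    """Latest-minute count minus previous-minute count, or None without two minutes."""
    counts = error_counts_per_minute(records)
    minutes = sorted(counts, reverse=True)[:2]  # like ORDER BY time DESC LIMIT 2
    if len(minutes) < 2:
        return None
    latest, previous = minutes
    return counts[latest] - counts[previous]


# Hypothetical data: 5 errors in minute 0, then 120 errors and 300 successes in minute 1.
sample_records = (
    [{"time": 10, "http_code": 502}] * 5
    + [{"time": 70, "http_code": 502}] * 120
    + [{"time": 75, "http_code": 200}] * 300
)

growth = error_growth(sample_records)
print(growth, growth > 100)
```

As in the grouped SQL, a minute with no errors produces no row at all, so the "previous" value is the previous minute that had at least one error, which is not necessarily the immediately preceding minute.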

Build CDN quality and performance dashboards

The source article then turns CDN logs into dashboard metrics.

Health score

Health is defined as the percentage of requests whose http_code is below 500:

* |
SELECT
  round(
    sum(CASE WHEN http_code < 500 THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "health"

A value at or near 100 means that all or nearly all requests in the selected time range returned HTTP status codes below 500. Under this definition, 4xx responses count as healthy.
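The same ratio, sketched in Python on hypothetical status codes, shows how the definition behaves. The function name health_percent and the sample are assumptions made for illustration.

```python
def health_percent(http_codes):
    """Percentage of requests with a status code below 500, rounded to one decimal."""
    if not http_codes:
        return None  # no traffic in the window, so no score
    healthy = sum(1 for code in http_codes if code < 500)
    return round(healthy / len(http_codes) * 100, 1)


# Hypothetical window: 990 successes, 6 client errors, 4 server errors.
codes = [200] * 990 + [404] * 6 + [502] * 4

health = health_percent(codes)
print(health)
```

Only the four 5xx responses lower the score here; the 404s do not, which is worth remembering when the health panel and the error-growth alert seem to disagree.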

Cache hit rate

Cache hit rate is calculated over responses with status codes below 400 (the http_code < 400 filter before the pipe):

http_code < 400 |
SELECT
  round(
    sum(CASE WHEN hit = 'hit' THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "cache hit rate"

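This panel shows whether traffic is being served from CDN cache or falling back to origin. The query compares against the lowercase literal 'hit'; since the field table describes the values as HIT or MISS, confirm the exact casing in your logs, because a case mismatch could report a hit rate of 0.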
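A short Python sketch of the same calculation makes the filtering step explicit. The function name cache_hit_rate and the sample records are hypothetical, and the lowercase 'hit' literal simply mirrors the query.

```python
def cache_hit_rate(records):
    """Hit percentage among records with http_code < 400, rounded to one decimal."""
    eligible = [rec for rec in records if rec["http_code"] < 400]
    if not eligible:
        return None
    hits = sum(1 for rec in eligible if rec["hit"] == "hit")
    return round(hits / len(eligible) * 100, 1)


# Hypothetical sample: the 404 misses are filtered out before the ratio is taken.
sample_records = (
    [{"http_code": 200, "hit": "hit"}] * 80
    + [{"http_code": 200, "hit": "miss"}] * 20
    + [{"http_code": 404, "hit": "miss"}] * 50
)

hit_rate = cache_hit_rate(sample_records)
print(hit_rate)
```

Without the status filter the same sample would show a noticeably lower rate, because the 404 misses would enter the denominator.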

Average download speed

Average download speed is total downloaded data divided by total request time:

* |
SELECT
  sum(rsp_size / 1024.0) / sum(request_time / 1000.0) AS "average download speed (KB/s)"

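The query converts rsp_size from bytes to KB and request_time from milliseconds to seconds, then divides total volume by total time. The result is therefore weighted by request duration rather than being a mean of per-request speeds.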
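The unit conversion and the sum-over-sum shape can be verified with a few lines of Python. This is a local sketch on two hypothetical requests; the function name is an assumption, and the zero-time guard is an addition that the query itself does not have.

```python
def average_download_speed_kb_s(records):
    """Total KB divided by total seconds, mirroring the sum/sum form of the query."""
    total_kb = sum(rec["rsp_size"] / 1024.0 for rec in records)
    total_s = sum(rec["request_time"] / 1000.0 for rec in records)
    if total_s == 0:
        return None  # guard added in this sketch only
    return total_kb / total_s


# Hypothetical requests: 1024 KB in 1 s and 100 KB in 1 s.
sample_records = [
    {"rsp_size": 1048576, "request_time": 1000},
    {"rsp_size": 102400, "request_time": 1000},
]

speed = average_download_speed_kb_s(sample_records)
print(speed)
```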

ISP-level download analytics

The source uses ip_to_provider(client_ip) to map client IPs to carriers:

* |
SELECT
  ip_to_provider(client_ip) AS isp,
  sum(rsp_size) * 1.0 / (sum(request_time) + 1) AS "download speed (KB/s)",
  sum(rsp_size / 1024.0 / 1024.0) AS "total download volume (MB)",
  count(*) AS c
GROUP BY isp
ORDER BY c DESC
LIMIT 10

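For each ISP, the result shows request count, total downloaded traffic, and computed download speed, which helps compare CDN quality across carriers. The speed expression divides bytes by milliseconds, which equals decimal kB/s (1 byte/ms = 1000 bytes/s), and the + 1 in the denominator guards against division by zero.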
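The grouping can be sketched in Python as follows. This is an offline illustration with hypothetical carrier names and records; it groups on the log's own isp field as a stand-in, because ip_to_provider is a CLS function with no local equivalent here.

```python
from collections import defaultdict


def per_isp_stats(records, top_n=10):
    """Per-ISP speed (bytes per ms), volume in MB, and request count, busiest first."""
    totals = defaultdict(lambda: {"bytes": 0, "ms": 0, "requests": 0})
    for rec in records:
        entry = totals[rec["isp"]]
        entry["bytes"] += rec["rsp_size"]
        entry["ms"] += rec["request_time"]
        entry["requests"] += 1
    rows = [
        {
            "isp": isp,
            # Same shape as sum(rsp_size) * 1.0 / (sum(request_time) + 1).
            "bytes_per_ms": entry["bytes"] * 1.0 / (entry["ms"] + 1),
            "volume_mb": entry["bytes"] / 1024.0 / 1024.0,
            "requests": entry["requests"],
        }
        for isp, entry in totals.items()
    ]
    rows.sort(key=lambda row: row["requests"], reverse=True)  # ORDER BY c DESC
    return rows[:top_n]


sample_records = [
    {"isp": "carrier_a", "rsp_size": 1048576, "request_time": 999},
    {"isp": "carrier_a", "rsp_size": 1048576, "request_time": 1000},
    {"isp": "carrier_b", "rsp_size": 512000, "request_time": 499},
]

rows = per_isp_stats(sample_records)
for row in rows:
    print(row)
```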

Latency distribution buckets

The source groups requests into custom latency windows:

* |
SELECT
  CASE
    WHEN request_time < 5000 THEN '~5s'
    WHEN request_time < 6000 THEN '5s~6s'
    WHEN request_time < 7000 THEN '6s~7s'
    WHEN request_time < 8000 THEN '7~8s'
    WHEN request_time < 10000 THEN '8~10s'
    WHEN request_time < 15000 THEN '10~15s'
    ELSE '15s~'
  END AS latency,
  count(*) AS count
GROUP BY latency

Instead of a single average, the panel shows how many requests fall into each duration range.
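The CASE expression is an ordered list of upper bounds, which a Python sketch can mirror directly. The labels are copied from the query; the function name and the sample latencies are hypothetical.

```python
from collections import Counter

# (exclusive upper bound in ms, label), checked in order like the WHEN clauses.
BUCKETS = [
    (5000, "~5s"),
    (6000, "5s~6s"),
    (7000, "6s~7s"),
    (8000, "7~8s"),
    (10000, "8~10s"),
    (15000, "10~15s"),
]


def latency_bucket(request_time_ms):
    """Return the label of the first bucket whose upper bound exceeds the value."""
    for upper_ms, label in BUCKETS:
        if request_time_ms < upper_ms:
            return label
    return "15s~"


# Hypothetical latencies in milliseconds, including boundary values.
samples_ms = [120, 4999, 5000, 7500, 9000, 15000, 20000]

distribution = Counter(latency_bucket(ms) for ms in samples_ms)
print(distribution)
```

Because each bound is exclusive, a request of exactly 5000 ms lands in '5s~6s', and everything faster than five seconds shares the single '~5s' bucket.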

Practical monitoring plan

Start with three layers:

  1. Latency alerting: use P99 request latency and include affected host, url, and client_ip in the alert message.
  2. Error-growth alerting: compare the latest one-minute http_code >= 400 count with the previous minute.
  3. Performance dashboards: track health, cache hit rate, average download speed, ISP-level performance, and latency distribution.

This source-backed setup turns CDN access logs into an operations console: first alert on the abnormal condition, then use the same CLS dataset to explain which domain, path, ISP, client segment, or cache behavior is responsible.
