Tencent Cloud - Cloud Log Service

Posted on Jun 10

Monitor CDN Performance with Real-Time CLS Log Analysis

#cdn #logging #observability #sql

A CDN is a performance layer, but its logs are also an operations dataset. Every request can reveal latency, cache behavior, response code, client distribution, traffic volume, and download speed. The source article explains how Tencent Cloud CDN logs can be delivered into Tencent Cloud CLS and analyzed in real time.

The original problem is familiar: CDN providers expose basic metrics such as request count and bandwidth, but these default metrics are not enough for customized troubleshooting. Teams often download raw CDN logs for offline analysis. According to the source article, that approach has two drawbacks: it adds operations and development cost, and the data is not truly real time. Delays of more than half an hour are common in offline workflows.

The CDN-to-CLS path is designed for interactive analysis:

one-click log delivery;
second-level analysis for very large log volumes;
real-time dashboard visualization;
one-minute real-time alerting.

CDN log fields that matter

The source article lists the CDN log schema. The key fields are:

Field	CLS type	Meaning
`app_id`	long	Tencent Cloud account APPID.
`client_ip`	text	Client IP address.
`file_size`	long	File size.
`hit`	text	Cache HIT or MISS. Edge-node and parent-node hits are both marked as HIT.
`host`	text	Domain name.
`http_code`	long	HTTP status code.
`isp`	text	Carrier or ISP.
`method`	text	HTTP method.
`param`	text	URL parameters.
`proto`	text	HTTP protocol identifier.
`prov`	text	Carrier province.
`referer`	text	HTTP referer.
`request_range`	text	Range request parameter.
`request_time`	long	Response time in milliseconds, from node receiving the request to completing response delivery to the client.
`request_port`	long	Client-to-CDN-node connection port, or `-` if unavailable.
`rsp_size`	long	Response bytes.
`time`	long	Request time as a UNIX timestamp in seconds.
`ua`	text	User-Agent.
`url`	text	Request path.
`uuid`	text	Unique request identifier.
`version`	long	CDN real-time log version.

Scenario 1: alert when CDN latency exceeds a threshold

The source recommends percentiles instead of simple averages or individual samples. Averages can hide a small but important set of slow requests, while individual samples are too noisy. The example computes average latency, P50, and P99 per five-minute bucket. Note that LIMIT 1440 caps the result at 1440 buckets: that matches a full day only at one-minute granularity, whereas at the '5m' interval used here a single day needs just 288 buckets.

* |
SELECT
  avg(request_time) AS l,
  approx_percentile(request_time, 0.5) AS p50,
  approx_percentile(request_time, 0.99) AS p99,
  time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') AS time
GROUP BY time
ORDER BY time DESC
LIMIT 1440

The resulting chart plots average latency, P50, and P99 over time. Its operational value is that P99 reveals the long-tail experience even when the average line looks acceptable.
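To see why the percentile lines matter, the following Python sketch computes the same three statistics offline. The latency samples are made up, and the nearest-rank `percentile` helper is an illustrative assumption: the `approx_percentile` aggregate in the query is approximate and may return slightly different values on real data.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least
    a fraction q of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical request_time values in milliseconds:
# 97 fast requests and 3 very slow ones.
latencies_ms = [20] * 97 + [2000] * 3

avg = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 0.5)
p99 = percentile(latencies_ms, 0.99)
print(f"avg={avg:.1f} ms, p50={p50} ms, p99={p99} ms")
```

With these samples the average stays below 100 ms and the median sits at 20 ms, while P99 lands on the 2000 ms tail. An alert on the average would stay silent here; an alert on P99 would not.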

The alert condition in the source is based on P99 latency greater than 100 ms:

* |
SELECT
  approx_percentile(request_time, 0.99) AS p99

In the alert-condition configuration, the rule computes p99 from request_time and triggers when the configured condition is met, for example when P99 exceeds 100 ms.

The alert's multidimensional analysis settings control what context is attached to the notification. The source says the alert message should display the affected host, url, and client_ip so that developers can quickly determine which domain, path, and client segment are involved.

Once the alert fires, the key information can be delivered immediately through channels such as WeChat, Enterprise WeChat, or SMS.

Scenario 2: alert when resource access errors spike

The source's second alert scenario is error-count growth. If page-access errors suddenly increase, the backend server may be failing or the service may be overloaded.

The source compares the latest one-minute error count with the previous one-minute count. Latest minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time DESC
  LIMIT 1
)

Previous minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time ASC
  LIMIT 1
)

The trigger expression from the source is:

$2.errct - $1.errct > 100

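The alert policy compares the two query results, referring to them by position as $1 and $2. In this expression $2.errct is the latest minute's error count and $1.errct is the previous minute's, so the alert fires when the minute-over-minute increase exceeds the threshold (100 here). That numbering implies the previous-minute query is configured first and the latest-minute query second, which is the reverse of the order in which they are listed above; if the queries are entered in the listed order, the subtraction must be swapped to $1.errct - $2.errct.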
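The comparison behind the trigger expression can be sketched in a few lines of Python. This is only an offline illustration under stated assumptions: the function name `error_growth_alert`, the minute labels, and the counts are hypothetical, and the real evaluation happens inside the CLS alert policy.

```python
def error_growth_alert(errct_by_minute, threshold=100):
    """Return True when the newest minute's error count exceeds the
    previous row's count by more than `threshold`.

    errct_by_minute maps a sortable minute label (e.g. 'HH:MM') to the
    number of requests with http_code >= 400 in that minute.
    """
    if len(errct_by_minute) < 2:
        return False  # not enough history to compare
    # Mirror ORDER BY time DESC LIMIT 2: keep the two newest rows.
    latest, previous = sorted(errct_by_minute, reverse=True)[:2]
    return errct_by_minute[latest] - errct_by_minute[previous] > threshold

counts = {"12:00": 40, "12:01": 45, "12:02": 180}
spike = error_growth_alert(counts)                        # 180 - 45 = 135
steady = error_growth_alert({"12:00": 40, "12:01": 45})   # 45 - 40 = 5
print(spike, steady)
```

One caveat the sketch makes visible: a minute with no errors produces no row at all, so the "previous" row is the most recent earlier minute that had errors, which may be more than one minute old.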

Build CDN quality and performance dashboards

The source article then turns CDN logs into dashboard metrics.

Health score

Health is defined as the percentage of requests whose http_code is below 500:

* |
SELECT
  round(
    sum(CASE WHEN http_code < 500 THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "health"

A value at or near 100 means that all or nearly all requests in the selected time range returned HTTP status codes below 500.

Cache hit rate

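Cache hit rate is calculated only over responses whose http_code is below 400; the search condition before the pipe filters out error responses before the ratio is computed. The query matches the lowercase literal 'hit', so adjust the literal if your logs store the value in a different case.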

http_code < 400 |
SELECT
  round(
    sum(CASE WHEN hit = 'hit' THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "cache hit rate"

This panel helps operators see whether traffic is being served from CDN cache or falling back to origin paths.
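Both ratio panels (health and cache hit rate) follow the same pattern: count the rows that satisfy a condition, divide by the rows in scope, and round to one decimal place. The Python sketch below reproduces that arithmetic on a handful of made-up records; the `percent` helper and the sample data are assumptions for illustration only.

```python
def percent(numerator, denominator, digits=1):
    """Percentage rounded to `digits` places; None when there is no data."""
    if denominator == 0:
        return None
    return round(numerator / denominator * 100, digits)

# Hypothetical log records with only the fields the two panels need.
logs = [
    {"http_code": 200, "hit": "hit"},
    {"http_code": 200, "hit": "miss"},
    {"http_code": 304, "hit": "hit"},
    {"http_code": 404, "hit": "miss"},
    {"http_code": 502, "hit": "miss"},
]

# Health: share of ALL requests with http_code < 500.
health = percent(sum(r["http_code"] < 500 for r in logs), len(logs))

# Cache hit rate: only requests with http_code < 400 are in scope.
in_scope = [r for r in logs if r["http_code"] < 400]
hit_rate = percent(sum(r["hit"] == "hit" for r in in_scope), len(in_scope))
print(health, hit_rate)
```

The difference in denominators is the point to notice: the 404 and 502 responses count against health only if they are 5xx, and neither enters the cache hit rate at all.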

Average download speed

Average download speed is total downloaded data divided by total request time:

* |
SELECT
  sum(rsp_size / 1024.0) / sum(request_time / 1000.0) AS "average download speed (KB/s)"

The query converts rsp_size from bytes to KB and request_time from milliseconds to seconds, so the result is in KB/s.
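Because the query divides one sum by another, it is a time-weighted figure rather than the mean of per-request speeds. The following Python sketch, using two made-up requests and a hypothetical `average_download_speed` helper, shows how far the two can diverge.

```python
def average_download_speed(requests):
    """Ratio of sums, as in the query: total KB divided by total seconds.

    `requests` is a list of (rsp_size in bytes, request_time in ms) pairs.
    """
    total_kb = sum(size / 1024.0 for size, _ in requests)
    total_s = sum(ms / 1000.0 for _, ms in requests)
    if total_s == 0:
        return None  # no measurable transfer time
    return total_kb / total_s

requests = [
    (1_024_000, 500),   # 1000 KB in 0.5 s -> 2000 KB/s
    (10_240, 2_000),    # 10 KB in 2 s     -> 5 KB/s
]

overall = average_download_speed(requests)
per_request = [(size / 1024.0) / (ms / 1000.0) for size, ms in requests]
mean_of_speeds = sum(per_request) / len(per_request)
print(overall, mean_of_speeds)
```

The ratio of sums gives 404 KB/s, while the mean of the individual speeds gives 1002.5 KB/s. Long, slow transfers pull the panel value down in proportion to the time they occupy, which is usually the behavior an operator wants from a throughput figure.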

ISP-level download analytics

The source uses ip_to_provider(client_ip) to map client IPs to carriers:

* |
SELECT
  ip_to_provider(client_ip) AS isp,
  sum(rsp_size) * 1.0 / (sum(request_time) + 1) AS "download speed (KB/s)",
  sum(rsp_size / 1024.0 / 1024.0) AS "total download volume (MB)",
  count(*) AS c
GROUP BY isp
ORDER BY c DESC
LIMIT 10

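For each ISP, the query returns the request count, total downloaded traffic, and computed download speed, which helps compare CDN quality across carriers. The speed column divides bytes by milliseconds, so it equals KB/s only when 1 KB is taken as 1000 bytes (the previous query uses 1024); the + 1 in the denominator guards against division by zero.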
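The grouping logic of the ISP query can be mirrored offline as a rough sketch. In the Python below, the records are invented and the `isp` key simply stands in for the result of `ip_to_provider(client_ip)`, which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical records; "isp" stands in for ip_to_provider(client_ip).
logs = [
    {"isp": "ISP-A", "rsp_size": 2_000_000, "request_time": 999},
    {"isp": "ISP-A", "rsp_size": 1_000_000, "request_time": 500},
    {"isp": "ISP-B", "rsp_size": 500_000, "request_time": 999},
]

# GROUP BY isp: accumulate bytes, milliseconds, and request count.
totals = defaultdict(lambda: {"bytes": 0, "ms": 0, "c": 0})
for record in logs:
    t = totals[record["isp"]]
    t["bytes"] += record["rsp_size"]
    t["ms"] += record["request_time"]
    t["c"] += 1

rows = []
for isp, t in totals.items():
    rows.append({
        "isp": isp,
        # Bytes per millisecond, i.e. thousands of bytes per second;
        # the + 1 avoids division by zero, as in the query.
        "speed": t["bytes"] * 1.0 / (t["ms"] + 1),
        "volume_mb": t["bytes"] / 1024.0 / 1024.0,
        "c": t["c"],
    })

# ORDER BY c DESC LIMIT 10
top = sorted(rows, key=lambda row: row["c"], reverse=True)[:10]
for row in top:
    print(row)
```

Sorting by request count, as the query does, keeps the busiest carriers on the panel; sorting by speed instead would surface the slowest carriers but could be dominated by ISPs with very few requests.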

Latency distribution buckets

The source groups requests into custom latency windows:

* |
SELECT
  CASE
    WHEN request_time < 5000 THEN '~5s'
    WHEN request_time < 6000 THEN '5s~6s'
    WHEN request_time < 7000 THEN '6s~7s'
    WHEN request_time < 8000 THEN '7~8s'
    WHEN request_time < 10000 THEN '8~10s'
    WHEN request_time < 15000 THEN '10~15s'
    ELSE '15s~'
  END AS latency,
  count(*) AS count
GROUP BY latency

Instead of a single average, the panel shows how many requests fall into each duration range.
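The CASE expression is an ordered list of upper bounds: the first branch whose bound exceeds request_time wins. A minimal Python sketch of the same bucketing follows; the labels are copied from the query, while the `bucket` helper and the sample latencies are hypothetical.

```python
from collections import Counter

# (exclusive upper bound in ms, label), checked in order like the CASE branches.
BUCKETS = [
    (5000, "~5s"),
    (6000, "5s~6s"),
    (7000, "6s~7s"),
    (8000, "7~8s"),
    (10000, "8~10s"),
    (15000, "10~15s"),
]

def bucket(request_time_ms):
    """Return the label of the first bucket whose bound exceeds the latency."""
    for upper, label in BUCKETS:
        if request_time_ms < upper:
            return label
    return "15s~"  # ELSE branch

# Hypothetical request_time samples in milliseconds.
samples = [120, 4999, 5000, 7500, 9000, 15000, 30000]
histogram = Counter(bucket(ms) for ms in samples)
print(dict(histogram))
```

Each bound is exclusive, so a request of exactly 5000 ms falls into '5s~6s' and one of exactly 15000 ms falls into '15s~'. Keeping the bounds in a single ordered list makes it straightforward to adjust the windows to the latency range of a particular workload.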

Practical monitoring plan

Start with three layers:

Latency alerting: use P99 request latency and include affected host, url, and client_ip in the alert message.
Error-growth alerting: compare the latest one-minute http_code >= 400 count with the previous minute.
Performance dashboards: track health, cache hit rate, average download speed, ISP-level performance, and latency distribution.

This setup turns CDN access logs into an operations console: alert on the abnormal condition first, then use the same CLS dataset to explain which domain, path, ISP, client segment, or cache behavior is responsible.

DEV Community