DEV Community: Tencent Cloud -Cloud Log Service

Troubleshooting Kubernetes Events with TKE and Tencent Cloud CLS

Tencent Cloud -Cloud Log Service — Mon, 15 Jun 2026 11:06:29 +0000

Cluster problems rarely appear from nowhere. Before a service outage becomes visible, Kubernetes often records smaller state changes: node pressure, Pod scheduling, Pod eviction, and cluster autoscaler decisions.

Tencent Kubernetes Engine can send those Events into Tencent Cloud CLS, where they become searchable logs and dashboard data. This gives operators a central way to answer what changed, when it changed, which object was involved, and which component reported it.

What an Event tells you

Kubernetes Events describe state transitions. The useful fields are:

Field	What to look for
`Type`	`Normal`, `Warning`, or a custom type.
`Involved Object`	Pod, Deployment, Node, or another Kubernetes object.
`Source`	Component such as Scheduler or Kubelet.
`Reason`	Short reason enum.
`Message`	Detailed explanation.
`Count`	How many times it happened.

The core flow is: Kubernetes emits a state-change record, CLS stores it as a log event, and the operator filters by object, component, reason, message, count, and timestamp.
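The filtering step in that flow can be sketched offline. The following is a minimal illustration, not the CLS query engine: the `Event` class, its attribute names, and the sample records are all assumptions made up for this example, loosely mirroring the fields in the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Illustrative subset of the Event fields listed in the table above."""
    type: str        # "Normal" or "Warning"
    kind: str        # involved object kind, e.g. "Pod" or "Node"
    name: str        # involved object name
    component: str   # source component, e.g. "kubelet"
    reason: str
    message: str
    count: int


def warnings_for(events, kind, name):
    """Return Warning events for one object, most repeated first."""
    hits = [e for e in events
            if e.type == "Warning" and e.kind == kind and e.name == name]
    return sorted(hits, key=lambda e: e.count, reverse=True)


# Hypothetical sample records, not real cluster output.
events = [
    Event("Normal", "Pod", "web-1", "default-scheduler", "Scheduled",
          "assigned to node-a", 1),
    Event("Warning", "Node", "node-a", "kubelet", "FreeDiskSpaceFailed",
          "failed to garbage collect required amount of images", 2),
    Event("Warning", "Node", "node-a", "kubelet", "EvictionThresholdMet",
          "Attempting to reclaim ephemeral-storage", 4),
    Event("Warning", "Node", "node-b", "kubelet", "NodeNotReady",
          "node status is now NodeNotReady", 1),
]

for e in warnings_for(events, "Node", "node-a"):
    print(e.count, e.component, e.reason, e.message)
```

Sorting by `count` is only one reasonable choice; sorting by timestamp gives a timeline instead of a "noisiest first" view.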

Open Event Search

In TKE, go to Cluster Operations -> Event Search. CLS provides collection, storage, search, analysis, and dashboards for the event stream.

Use the overview when you need warning distribution, affected object types, and event trends. Use global search when you already know the component or object name and need a row-level timeline.

Runbook 1: an abnormal node

Filter by the abnormal node name in the event overview. In this example, the result included a node disk-space warning.

The timeline showed that on 2020-11-25, node 172.16.18.13 became abnormal because disk space was insufficient. Kubelet then tried to evict Pods from the node to reclaim disk space.

That sequence gives you a clean next step: check node disk usage, eviction thresholds, and workload placement before treating it as a generic application failure.

Runbook 2: autoscaler expansion

For node pool autoscaling, query the autoscaler component:

event.source.component:"cluster-autoscaler"

Display these fields:

event.reason
event.message
event.involvedObject.name

Sort by log time descending. The result should work like a compact ledger of autoscaler decisions: workload object, reason, message, and the timestamp of each scaling step.

The event stream showed scale-out around 2020-11-25 20:35:45, triggered by three nginx Pods:

nginx-5dbf784b68-tq8rd
nginx-5dbf784b68-fpvbx
nginx-5dbf784b68-v9jv5

Three nodes were added. Later scale-out did not continue because the node pool had reached its maximum node count.
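The filter-and-sort step in this runbook can be mimicked on exported records. This is a rough sketch under assumptions: the nested dictionary shape follows the field names used in the query above, but the `time` key, the object names, and the sample messages are hypothetical and do not reproduce the events from this example.

```python
from datetime import datetime


def autoscaler_ledger(records):
    """Keep cluster-autoscaler events and order them newest first."""
    rows = [r for r in records
            if r["event"]["source"]["component"] == "cluster-autoscaler"]
    rows.sort(key=lambda r: datetime.fromisoformat(r["time"]), reverse=True)
    return [(r["time"],
             r["event"]["involvedObject"]["name"],
             r["event"]["reason"],
             r["event"]["message"]) for r in rows]


# Hypothetical exported records; "time" is an assumed key name.
records = [
    {"time": "2024-01-01T10:00:00",
     "event": {"source": {"component": "cluster-autoscaler"},
               "reason": "TriggeredScaleUp",
               "message": "pod triggered scale-up",
               "involvedObject": {"name": "web-a"}}},
    {"time": "2024-01-01T10:02:00",
     "event": {"source": {"component": "kubelet"},
               "reason": "Pulled",
               "message": "image pulled",
               "involvedObject": {"name": "web-a"}}},
    {"time": "2024-01-01T10:05:00",
     "event": {"source": {"component": "cluster-autoscaler"},
               "reason": "NotTriggerScaleUp",
               "message": "pod didn't trigger scale-up",
               "involvedObject": {"name": "web-b"}}},
]

ledger = autoscaler_ledger(records)
for row in ledger:
    print(*row, sep=" | ")
```

Reading the ledger from the bottom up gives the chronological story: what triggered the first scale-up and which later Pods could not trigger another one.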

Checklist

Use Events to understand state changes, not only current state.
Start with overview dashboards, then filter by object name.
For node issues, inspect reason, message, source component, and count.
For autoscaling, query cluster-autoscaler and reconstruct the event timeline.
Use metrics and logs after Events point you to the right object and time window.

FAQ

Why not use only `kubectl describe`?

`kubectl describe` is useful for inspecting one object. CLS is better when you need searchable history, dashboards, and cross-object analysis.

What is the fastest autoscaler query?

Start with `event.source.component:"cluster-autoscaler"` and sort by log time descending.

Manage Cloud Product Logs from an Architecture View with CLS and Cloud Advisor

Tencent Cloud -Cloud Log Service — Thu, 11 Jun 2026 07:42:42 +0000

In a complex cloud architecture, log troubleshooting usually starts with a resource map: which services are connected, where traffic flows, and which components have logs enabled. Tencent Cloud CLS and Cloud Advisor bring multi-product log management into the Cloud Advisor architecture view.

The integration combines three capabilities:

unified cloud-product log access management;
real-time cloud-product log search and analysis;
out-of-the-box operational dashboards for cloud products.

The key idea is that logs are no longer managed only from a separate log console. Operators can inspect log status, search logs, and open dashboards from the same architecture view they use to understand cloud-resource relationships.

The Cloud Advisor architecture interface keeps the resource topology visible while log status and summary metrics appear on the side. This creates a global operations view: resource topology on the left, log visibility and operational indicators on the right.

Capability 1: unified log access management

Cloud Advisor can show whether cloud-product instances have log delivery enabled, and it can support batch enabling or disabling logs.

The operation path is:

enter the Cloud Advisor architecture view;
click the log-service plugin;
choose the cloud product;
open Access Management;
review the current log-delivery status of product instances.

The resource topology stays on the left, while a table-style management panel appears on the right. Operators can understand both context and configuration status at once.
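As a small illustration of what reviewing delivery status amounts to, the sketch below groups instances that still lack log delivery by product. It does not call any Cloud Advisor or CLS API; the row layout, key names, and instance IDs are hypothetical, as if copied by hand from the management table.

```python
from collections import defaultdict


def missing_log_delivery(instances):
    """Group instance IDs that have log delivery disabled, keyed by product."""
    missing = defaultdict(list)
    for inst in instances:
        if not inst["log_delivery_enabled"]:
            missing[inst["product"]].append(inst["instance_id"])
    return dict(missing)


# Hypothetical inventory rows; all names are made up for this example.
inventory = [
    {"product": "CLB", "instance_id": "lb-example-1", "log_delivery_enabled": True},
    {"product": "CLB", "instance_id": "lb-example-2", "log_delivery_enabled": False},
    {"product": "COS", "instance_id": "bucket-example", "log_delivery_enabled": False},
]

print(missing_log_delivery(inventory))
```

The output is the candidate list for a batch-enable operation.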

Capability 2: query product logs from the architecture view

The integration also supports direct log search. Operators can query cloud-product logs by key fields and time range, then use the results to locate failures, trace access behavior, or monitor runtime status.

Open the log-search module from the log-service plugin, choose the cloud product, enter a query in the search box, and execute analysis. The chart and log list remain tied to the selected resource context, which reduces the need to switch between architecture diagrams and log consoles.

Capability 3: open out-of-the-box dashboards

Cloud Advisor can expose dedicated dashboards for cloud products. These dashboards can show performance monitoring, usage trends, anomaly detection, and other analysis results without extra manual configuration.

The dashboard view places summary cards and circular charts alongside the architecture view. After choosing Log Service -> Cloud Product -> Dashboard, operators can inspect product-specific log analysis without building the dashboard from scratch.

Supported products and log types

Nine cloud products are currently available through Cloud Advisor log access and management:

Product	Log type
Content Delivery Network CDN	Domain access logs
Cloud Load Balancer CLB	Load-balancer access logs
Object Storage COS	Bucket access logs
Tencent Kubernetes Engine TKE	Container business logs, cluster audit logs, and cluster event logs
Elastic MapReduce	Component runtime logs
TencentDB	Slow logs and error logs
Video on Demand VOD	Access logs
Cloud File Storage CFS	Audit logs
Web Application Firewall WAF	Access logs and attack logs

CLS also supports one-click collection and fast analysis for more than 60 cloud-product logs.

Troubleshooting workflow

A practical troubleshooting workflow looks like this:

Open Cloud Advisor to inspect the cloud architecture.
Use the CLS log-service plugin to check which resources have log delivery enabled.
Batch enable logs for missing resources when needed.
Search logs directly from the architecture view using product fields and time filters.
Open the prebuilt product dashboard to review performance, usage, and anomaly patterns.
Use the resource topology to connect log findings back to upstream and downstream dependencies.

Why this improves cloud operations

The integration is valuable because it joins three layers that are often separated:

the architecture view, which explains relationships;
the log status view, which explains whether evidence is being collected;
the log analysis view, which explains what happened.

For platform teams, that means faster global inspection, fewer console switches, and a clearer path from resource topology to evidence-level troubleshooting. The integration creates one-stop cloud-product log control and analysis, with future expansion planned for more product integrations and prebuilt log-alert capabilities.

How Beike Migrated a Large-Scale Observability Platform to CLS

Tencent Cloud -Cloud Log Service — Thu, 11 Jun 2026 07:20:12 +0000

Beike operates at a scale where observability is not a dashboard convenience. It is an operations requirement. Beike migrated from self-built operations systems to a new cloud-based observability platform with Tencent Cloud CLS.

The migration problem had three parts:

Original constraint	Detail
Low data linkage	Logs, monitoring, tracing, and other observability data existed in many old systems, with limited connection between systems.
Performance pressure	During daily settlement, write volume could increase by more than 10x. Large business lines already wrote more than 10 billion records per day, and broad queries often timed out.
Data was hard to use	Self-built systems lacked systematic display, consistent formatting, aggregation functions such as IP geolocation, and convenient sharing for dashboard results.

The goal was to build a unified, high-performance, reliable observability platform without intrusive changes to business logic.

The platform diagram presents a full-stack observability architecture. At the top are data sources such as logs, tracing, metrics, business data, and cloud products. In the middle are data collection, data processing, storage, analysis, dashboards, and alerting. On the output side, the platform supports data sharing, operational dashboards, and AI analysis. This is not only a log search migration; it is a unification of operations data.

Data ingestion: reduce delay while keeping existing collection logic

The first pain point was write delay. During settlement peaks, delayed reporting was unacceptable because teams needed same-day data for verification and incident response.

The first assumption was that expanding cloud resources would solve the delay, but the effect was limited. Further analysis by the CLS and Beike teams found that the bottleneck was mainly in the rdkafka component used by the Fluentd Kafka output. Tuning rdkafka alone could no longer satisfy Beike's scale.

CLS then developed a Fluentd Output plugin, published to the community. Data-reporting delay dropped from more than ten minutes to within one minute.

Peak write throughput reaches around 300 GB/min. This is the scale context for the ingestion redesign: the platform needed to absorb traffic bursts rather than only handle average write volume.

Multi-source ingestion without replacing every collector

Beike's environment included Prometheus-based metrics, SkyWalking-based tracing, and mixed ES/Loki-style log systems for network, business, security, and other logs. Most environments had already moved to containers, and Fluentd was widely used for log collection, but each business department had its own collection logic.

The easiest migration path was to keep the existing collection method and change the target endpoint where possible.

The architecture uses five ingestion lanes:

business logs collected by Fluentd are written through the Kafka protocol;
security logs collected by Winlogbeat are written through the Kafka protocol;
tracing data from SkyWalking is written through an API path;
TKE audit logs collected by LogListener are ingested through an agent path;
metrics written through SDKs are ingested as cloud-product log data.

This explains why the migration was low-intrusion: teams could preserve much of the existing collection stack while moving storage, search, and analysis into CLS.

Beike also configured traffic-change alerts for key business modules so traffic shifts could be detected before they escalated into incidents that are harder to resolve.

Data processing: structure raw logs before storage

Beike had many business departments, which meant log formats were inconsistent. A central parser would not be enough; different business lines needed configurable parsing rules.

The CLS data-processing canvas supports visual processing before logs are stored in a topic. In this example, business logs are first split by delimiters and then fields are extracted with regular expressions. The displayed data is simulated.
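The two-stage idea (split by delimiter, then extract with a regular expression) can be shown in plain Python. This sketch does not use the CLS processing functions; the log layout, the `|` delimiter, and the key=value pattern are assumptions chosen for illustration.

```python
import re

# Assumed sample layout: "timestamp|level|free-text message".
LINE = "2024-01-01T10:00:00Z|ERROR|order=A1001 cost_ms=532 status=timeout"

# Hypothetical pattern for key=value pairs inside the message part.
PAIR = re.compile(r"(\w+)=(\S+)")


def parse_line(line, delimiter="|"):
    """Split on the delimiter first, then extract fields with a regex."""
    timestamp, level, message = line.split(delimiter, 2)
    record = {"timestamp": timestamp, "level": level}
    # findall returns (key, value) tuples, which dict.update accepts directly.
    record.update(PAIR.findall(message))
    return record


print(parse_line(LINE))
```

Splitting first keeps the regular expression small: it only has to understand the message part, so each business line can swap in its own pattern without touching the outer layout.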

Data analysis: make massive logs searchable and cheaper to retain

Two related problems appear at this scale: some logs must be stored for a long time due to compliance, while full-volume aggregation over very large datasets hurts analysis efficiency.

The solution combines:

Hybrid storage: short-term hot data supports analysis, and long-term cold data can move to low-frequency storage and still remain queryable.
Scheduled SQL: complex raw logs are aggregated into business-level metrics and saved for long-term monitoring.

For Beike security logs, Windows event logs from employee office environments were collected into CLS. The security team configured more than one thousand SQL rules to aggregate by rule name, alert level, and host name. Scheduled SQL summarized results every minute, reducing complex logs into the indicators the business cared about.
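The per-minute roll-up can be pictured as a grouped count. The sketch below is plain Python rather than CLS Scheduled SQL, and the record keys (`time`, `rule`, `level`, `host`) and sample values are hypothetical stand-ins for rule name, alert level, and host name.

```python
from collections import Counter
from datetime import datetime


def aggregate_per_minute(events):
    """Count events per (minute, rule name, alert level, host name)."""
    counts = Counter()
    for e in events:
        minute = datetime.fromisoformat(e["time"]).replace(second=0, microsecond=0)
        counts[(minute.isoformat(), e["rule"], e["level"], e["host"])] += 1
    return counts


# Hypothetical security events, not real data.
events = [
    {"time": "2024-01-01T10:00:05", "rule": "failed-logon", "level": "high", "host": "pc-01"},
    {"time": "2024-01-01T10:00:40", "rule": "failed-logon", "level": "high", "host": "pc-01"},
    {"time": "2024-01-01T10:00:20", "rule": "new-service", "level": "medium", "host": "pc-02"},
    {"time": "2024-01-01T10:01:10", "rule": "failed-logon", "level": "high", "host": "pc-01"},
]

summary = aggregate_per_minute(events)
for key, n in sorted(summary.items()):
    print(key, n)
```

Four raw records collapse into three summary rows here; at production volume the ratio is what makes long-term retention of the aggregated metrics affordable.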

After switching to CLS, real-time retrieval over more than 50 billion log records averaged only 10 seconds, and retrieval efficiency improved by 6x+ compared with the original system.

The operational view combines cards, charts, and log records: high-level indicators for scanability, charts for trend review, and raw records for drill-down.

Result display: dashboards and DataSight sharing

Before migration, Beike used open-source display components such as Grafana. Those systems had fixed presentation forms, required complex configuration, and were not convenient for sharing inside domestic office workflows.

After data was collected into CLS, Beike could configure multiple dashboards in the product console and share them to PC or mobile through the independent DataSight console.

This dashboard contains multi-dimensional charts such as traffic trend, distribution, and summary indicators. The displayed data is simulated, but the workflow is the real point: business teams can monitor operations through reusable dashboards instead of repeated ad hoc searches.

The summary visual reinforces the platform's role across real-time network dashboards, operations dashboards, multi-end sharing, and reporting. It connects the technical migration to daily operations usage.

Access control and smooth user migration

Beike already had more than one thousand independent R&D users in its internal operations platform, with permission boundaries by business area. Creating Tencent Cloud accounts for everyone was unrealistic.

CLS DataSight solved this through an embedded, independent console:

it can be embedded into the existing internal system;
it supports internal and external network access modes;
it provides an independent login entry and customizable account-password login;
it can connect to the user's LDAP system and inherit existing permission logic.

Reported results

The migration outcomes are:

more than one thousand business sections were connected to CLS in one person-day;
old and new systems switched smoothly without changing user habits;
under 10x peak write traffic, reporting delay dropped from more than ten minutes to minute-level latency;
overall business efficiency improved by 20x;
retrieval over tens of billions of logs moved from minute-level to second-level;
retrieval efficiency improved by 6x+;
dashboards and traffic-change alerts made operations more visible and proactive.

Reusable migration pattern

The Beike case suggests a practical sequence for large observability migrations:

identify whether the true bottleneck is storage, query, collector output, or parsing;
preserve existing collection protocols where possible;
route logs, tracing, audit, and metrics into one analysis platform;
structure logs before storage through visual processing rules;
use scheduled SQL to turn massive raw logs into long-lived metrics;
separate hot and cold storage to balance cost and query requirements;
expose dashboards through an access model that matches the organization's existing identity system.

Detect Malicious Source IPs in CLS Logs with Tencent Security Intelligence

Tencent Cloud -Cloud Log Service — Thu, 11 Jun 2026 06:49:12 +0000

Access logs often contain the earliest evidence of attacks. The problem is that an IP address by itself is not enough. Operators need to know whether that source has been associated with attacks, exploitation, web attacks, brute force, or other malicious behavior.

Threat IP Detection in Tencent Cloud CLS, jointly released with Tencent Security Keen Lab, is based on Tencent Security threat intelligence from https://tix.qq.com/. CLS analyzes source IPs in access logs, identifies malicious IPs, and links the result back to business access logs so teams can assess and block risk.

Intelligence source and detection scope

The intelligence library contains 300 million+ security intelligence records and processes more than 3 trillion threat-data records per day.

After the feature is enabled, CLS automatically analyzes IPs in logs and identifies malicious categories including:

Threat category	Meaning
Network attack	Attacks against information systems, infrastructure, computer networks, or personal devices.
Exploit	Abuse of software vulnerabilities to access or damage a system without authorization.
Web attack	Examples include XSS, CSRF, and SQL injection.
Brute force	Attempts to gain account access through repeated password or credential guessing.

When a malicious IP is detected, the system provides threat level, threat classification tags, and related access logs in the current business system.

The detection dashboard turns threat intelligence into operational context. The visible layout combines summary counts, trend charts, a distribution chart, and a table of detected IPs. Instead of sending operators to a separate intelligence system first, the CLS view starts from business logs and then enriches suspicious sources.

The threat profile provides verdict, threat tags, sample records, geographic information, ASN, operator, visit count, and associated samples. The IP is marked as malicious and displays multiple labels such as malicious sample or bot-related risk. In an investigation workflow, this helps decide whether to block, rate-limit, or keep monitoring that IP.

Blocking example with CLB

Cloud Load Balancer provides a clear blocking example. After identifying a malicious IP, operators can bind or update a security group to deny that IP.

The CLB control plane supports binding a security group to the load balancer path. After the detection result identifies a risky source, attach a security policy and add the malicious IP to a deny rule.

Applicable log scenarios

Threat IP Detection can analyze several cloud-product access-log sources:

CLB access logs;
COS access logs;
CDN access logs;
EdgeOne access logs;
cloud-native API Gateway logs;
and other access-log sources.

Four usage scenarios are especially relevant:

Scenario	How the detection helps
Cloud-service access security	Detect malicious IP access to CLB, COS, CDN, EdgeOne, API Gateway, and similar services.
Web application security	Discover malicious IPs visiting websites.
API security	Identify abusive IP requests and reduce API misuse.
Security audit	Analyze internal traffic and operation logs for abnormal behavior.

Enable Threat IP Detection in CLS

To enable the feature, log in to the CLS console, open the cloud product center, and click Tencent Security | Threat IP Detection.

The configuration dialog asks for the log topic and the IP field to analyze. The minimal setup is:

choose the CLS log topic that contains the access logs;
select the field that stores the source IP;
confirm the configuration;
review detected malicious IPs and linked access logs;
configure an alert policy if teams need proactive notification.

Why this is useful in operations

The capability has three operational advantages:

Real-time detection: logs do not need preprocessing before analysis.
Proactive alerting: alert policies can notify users when a malicious IP is found.
Security collaboration: results can work with security groups, firewalls, WAF, and similar controls.

In practice, the strongest workflow is closed loop: detect a malicious source from logs, inspect its threat-intelligence profile, review which business endpoints it touched, trigger an alert when needed, and block or mitigate through the relevant security product.
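The "link back to business logs" step of that loop can be sketched locally. This is not how the CLS feature is implemented; it only illustrates the join between a list of flagged IPs and access-log records. The field names `client_ip` and `path` are assumptions, and the addresses come from the reserved documentation ranges.

```python
import ipaddress
from collections import Counter


def hits_by_malicious_ip(access_logs, malicious_ips, ip_field="client_ip"):
    """Count requests per endpoint for source IPs on the malicious list."""
    bad = {ipaddress.ip_address(ip) for ip in malicious_ips}
    hits = {}
    for entry in access_logs:
        try:
            src = ipaddress.ip_address(entry[ip_field])
        except (KeyError, ValueError):
            continue  # skip records without a parseable source IP
        if src in bad:
            hits.setdefault(str(src), Counter())[entry.get("path", "-")] += 1
    return hits


# Hypothetical access-log records and flagged IP list.
logs = [
    {"client_ip": "203.0.113.7", "path": "/login"},
    {"client_ip": "203.0.113.7", "path": "/login"},
    {"client_ip": "203.0.113.7", "path": "/admin"},
    {"client_ip": "198.51.100.23", "path": "/"},
    {"client_ip": "not-an-ip", "path": "/"},
]
flagged = ["203.0.113.7"]

result = hits_by_malicious_ip(logs, flagged)
for ip, paths in result.items():
    print(ip, dict(paths))
```

The per-endpoint counts are what inform the next decision: a flagged source hammering a login path suggests blocking, while a single hit on a static page may only justify monitoring.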

Deliver Tencent Cloud CLS Logs to DLC for Spark-Based Analysis

Tencent Cloud -Cloud Log Service — Thu, 11 Jun 2026 03:27:39 +0000

Tencent Cloud CLS already supports log delivery to CKafka and COS. Another delivery target is now available: DLC, Tencent Cloud Data Lake Compute. With this path, logs stored in CLS can be delivered directly into DLC so teams can process and analyze them with Spark.

A CLS log topic can feed three downstream delivery paths: Data Lake Compute DLC, Message Queue CKafka, and Object Storage COS. DLC is the big-data analysis target to choose when the next step is Spark processing, streaming analysis, machine learning, or graph-style computation.

Why deliver logs to DLC?

DLC provides two advantages compared with traditional SQL-only processing:

Real-time stream processing: Spark Streaming can be used for real-time analysis.
Advanced Spark libraries: Spark includes MLlib for machine learning and GraphX for graph computation. Graph algorithms can support workloads such as relationship analysis in social-network data.

This makes the CLS-to-DLC path useful when logs are no longer just operational evidence. They become an input dataset for large-scale analysis pipelines.

Step 1: open Deliver to DLC from the CLS log topic

From the CLS log topic page, open Deliver to DLC in the left navigation. This starts the delivery-task configuration.

Step 2: choose the DLC database and table

Choose the region, DLC database, and target table. This creates the destination binding between the CLS topic and the DLC table.

Step 3: map CLS fields to DLC table fields

Field mapping is the most important operational step. Multiple data types are supported. If a CLS log field and a DLC table field use the same name, mapping can be automatic. If field names differ, manually enter the CLS log-field name and map it to the DLC field.

In practical terms:

use automatic mapping for same-name fields;
use manual mapping for renamed fields;
review data types before confirming the task;
use the DLC data-type documentation when a field requires type alignment.

The DLC data-type documentation is available at https://cloud.tencent.com/document/product/1342/96174.
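The mapping rules above (automatic for same-name fields, manual for renamed ones) can be expressed as a small planning helper. This is a sketch for reasoning about a mapping before entering it in the console; the function, field names, and column names are hypothetical and no CLS or DLC API is involved.

```python
def build_field_mapping(cls_fields, dlc_columns, manual=None):
    """Map each DLC column to a CLS field.

    Manual entries win; otherwise a same-name CLS field is used.
    Columns with neither are returned separately for review.
    """
    manual = manual or {}
    mapping, unmapped = {}, []
    for column in dlc_columns:
        if column in manual:
            mapping[column] = manual[column]
        elif column in cls_fields:
            mapping[column] = column
        else:
            unmapped.append(column)
    return mapping, unmapped


# Hypothetical field and column names.
cls_fields = {"status", "client_ip", "request_time"}
dlc_columns = ["status", "src_ip", "latency_ms", "region"]
manual = {"src_ip": "client_ip", "latency_ms": "request_time"}

mapping, unmapped = build_field_mapping(cls_fields, dlc_columns, manual)
print(mapping)
print("needs review:", unmapped)
```

Listing unmapped columns explicitly makes it harder to confirm a task in which a target column silently stays empty.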

Step 4: configure partition-field mapping

Partition-field mapping supports three options:

Partition strategy	Behavior
Time partition	Use the CLS log time field for partition mapping.
Other field partition	Select the corresponding log field and map it to a DLC partition field.
No partition mapping	Disable the partition-mapping switch when partition mapping is not required.

After the field and partition configuration is complete, click Confirm to create the delivery task.
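The three partition options can be compared with a short sketch. Everything here is an assumption for illustration: the strategy labels, the `log_time` key holding a Unix timestamp in seconds, and the `%Y%m%d` partition format are not taken from the DLC documentation.

```python
from datetime import datetime, timezone


def partition_value(record, strategy, field=None, time_format="%Y%m%d"):
    """Return the partition value for a record, or None when partitioning is off."""
    if strategy == "time":
        # Assumes log_time is a Unix timestamp in seconds.
        ts = datetime.fromtimestamp(record["log_time"], tz=timezone.utc)
        return ts.strftime(time_format)
    if strategy == "field":
        return str(record[field])
    if strategy == "none":
        return None
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical record; 1704103200 is 2024-01-01 10:00:00 UTC.
record = {"log_time": 1704103200, "region": "ap-guangzhou", "status": 200}

print(partition_value(record, "time"))
print(partition_value(record, "field", field="region"))
print(partition_value(record, "none"))
```

Time partitions suit queries bounded by date range, while a field partition suits queries that always filter on one dimension such as region.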

When this pattern is useful

Use CLS-to-DLC delivery when:

log data must feed Spark jobs;
real-time stream processing is needed with Spark Streaming;
teams want to run MLlib-based analysis on operational logs;
logs need to join a broader data-lake workflow;
graph processing, such as relationship analysis, is part of the downstream workload.

For lighter asynchronous processing or event streaming, CKafka may still be the better target. For archiving or object-based retention, COS remains the natural delivery target. The value of the DLC path is that the log stream becomes directly available to a Spark-oriented analysis environment.

Source note: Splunk delivery preview

A future Deliver to Splunk capability is planned for early June. Splunk will become another destination for log management and analysis, giving teams more choices for downstream log processing.

What an Intelligent Observability Maturity Model Means for Cloud Operations

Tencent Cloud -Cloud Log Service — Thu, 11 Jun 2026 02:39:16 +0000

Cloud observability is becoming harder because cloud systems are no longer static. Microservices, dynamic topology, cross-team dependencies, and rapidly growing telemetry volume all make traditional operations less predictable. Intelligent technologies, including large models, can help process large-scale observability data and accelerate incident discovery and resolution.

At the Cloud AI Compute Ignite Forum of the Global Digital Economy Conference, the Cloud Computing Intelligent Observability Capability Maturity Model standard was officially released. The standard is led by the China Academy of Information and Communications Technology, initiated by China Mobile Cloud, and approved by the CCSA TC1 WG5 cloud computing working group.

This launch defines the overall development direction for cloud operations. Intelligent observability is positioned as a complete capability model, rather than merely a set of tools.

What the standard tries to define

The standard defines key concepts, assessment dimensions, capability levels, and implementation paths for intelligent observability in cloud environments. Its goal is to guide organizations that want to apply intelligent methods to improve cloud-system observability.

The standard covers two major areas:

Area	Scope
Observability capability	Platform planning, resource design, correlation analysis, data standardization, alert-effectiveness design, data security, observed-object design, metric and threshold design, process design, daily operations, visualization, data validation, and data management.
Intelligent capability	Intelligent data analysis, log analysis, intelligent alert baseline, alert convergence, anomaly detection, trend prediction, root-cause analysis, intelligent optimization suggestions, natural-language interaction, tool calling, memory management, and self-reflection.

The standard contains 6 capability domains, 24 capability items, and more than 200 capability indicators.

The model separates capabilities into two layers. The upper layer is intelligent capability. One side focuses on scenario applications: intelligent data analysis, log analysis, alert baselines, alert convergence, anomaly detection, trend prediction, root-cause analysis, and optimization recommendations. The other side is an "observability agent" (literally, "observability intelligence body"): natural-language interaction, tool calling, memory management, and self-reflection.

The lower layer is observability capability. It begins with planning and design, then moves into daily operations and data management. The data-management section explicitly includes collection, storage, and processing. On the right side, the model ties everything to continuous operations optimization, including platform operations, alert operations, and standardized IT-process operations.

Why this matters to platform teams

The model suggests a practical maturity path:

first make telemetry reliable and standardized;
then make data searchable, visual, and alertable;
then apply intelligent analysis to logs, anomalies, baselines, trends, and root causes;
finally connect the platform to continuous operational improvement.

That order matters. Large-model-based troubleshooting is much less useful when the underlying log, metric, tracing, alerting, and data-governance layers are inconsistent.

Where CLS fits in the maturity model

Tencent Cloud CLS is one of the core participating products in the standard work. CLS representatives joined multiple discussions with experts from China Mobile Cloud, ZTE, and other cloud vendors and companies.

The CLS capability map connects the maturity model to a concrete platform architecture. On the left, data comes from endpoints, online and offline systems, open-source ecosystems, applications, and cloud-product ecosystems. The diagram includes sources such as iOS, Android, webpages, Windows, servers, IDC, Tencent Cloud, AWS, Beats, Log4j, Kubernetes, VictoriaMetrics, Logstash, Fluentd, Logback, OpenTelemetry, syslog, MySQL, Windows events, CVM, TKE, SCF, EKS, CDN, CLB, COS, Oceanus, TDMQ, and cloud development services.

In the center, CLS provides collection and ingestion through LogListener, Kafka protocol, Prometheus protocol, API, and SDK. It then supports dashboards, charts, alert customization, alert suppression, alert grouping, data processing with 90+ functions, CQL/KQL-compatible search, SQL analysis with 300+ functions, correlation analysis, PromQL, low-frequency log storage, standard log storage, scheduled SQL, and metric storage.

Outputs include visualization through DataSight and Grafana; alert channels such as Enterprise WeChat, DingTalk, Feishu, WeChat, email, SMS, custom callbacks, and phone calls; consumption through SCF, Oceanus, Kafka, Spark, Hive, Flink, ClickHouse, and Elasticsearch; and delivery to COS and CKafka.

User examples

Three customer examples show how this capability set is used:

NIO used CLS security monitoring capabilities for millisecond-level security monitoring, tagging, desensitization, and an overall log-data security observability platform.
Beike used CLS search and analysis capabilities to build a new unified observability platform and improve overall business efficiency.
Lebo used the CLS collection ecosystem for multi-terminal one-stop data collection and reporting, improving full-link observability and user-experience optimization.

Practical takeaway

For cloud teams, the maturity model is useful because it converts "make observability intelligent" into a capability checklist. A mature platform should not only collect logs and metrics. It should standardize data, support analysis and visualization, provide alert governance, preserve data securely, connect to downstream processing systems, and gradually add intelligent analysis such as anomaly detection, root-cause analysis, tool calling, and natural-language operations.

AI Agent Observability with OpenClaw: Sessions, Tool Calls, Latency, Errors, and Token Cost in Tencent Cloud CLS

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 13:14:23 +0000

AI agents are difficult to operate when their behavior is spread across sessions, token usage, operations, model activity, queues, logs, and security-sensitive actions. A cost spike, slow response, repeated failed operation, or risky command is hard to explain unless each agent session can be connected to cost, latency, errors, operation records, and raw logs.

This guide explains how to use OpenClaw Usage Insights with Tencent Cloud Log Service (CLS) to monitor AI agent cost, operations, sessions, security risks, and log evidence. It focuses on the signals, dashboards, onboarding path, and troubleshooting workflow that help developers and operators understand what happened inside an OpenClaw agent system and where CLS fits as a managed log service for search, analysis, dashboards, and operational visibility.

When to use this pattern

Use this pattern when an OpenClaw-based AI agent system needs more than basic application logs. Typical signs include:

token usage or cost increases without a clear session or operation owner;
agent sessions are difficult to reconstruct after a user complaint;
operations become slower, fail more often, or create queue backlog;
operators need to compare cost, latency, errors, sessions, and model usage over time;
security-sensitive commands or file access need audit records;
dashboards show an anomaly, but engineers still need raw logs for root cause analysis.

OpenClaw Usage Insights is built on Tencent Cloud Log Service (CLS). After OpenClaw runtime data is connected to CLS, the system provides prebuilt views for cost governance, operations monitoring, session management, session detail analysis, security audit, and raw log search.

AI agent observability signals to collect

Before reviewing dashboards, make sure the logs can connect agent behavior to cost, operations, sessions, and security. The exact field names can follow your application schema, but each event should preserve enough context for CLS log search and dashboard analysis.

Signal category	What it helps explain	Useful fields or dimensions
Session context	Which session produced an interaction and how the session evolved.	session identifier, server instance, start time, end time, message count, average turns
Cost and token usage	Which sessions, messages, or usage patterns drive token spend.	total cost, total token usage, average session cost, single-message cost, cost distribution
Operation activity	What the agent or platform did during execution.	operation name, command, status, duration, tool invocation count, card distribution
Latency and reliability	Where execution becomes slow or unstable.	queue backlog, response degradation, execution latency, P95 latency, error growth
Session detail	What happened inside one conversation or task.	session content, per-turn detail, token usage, problem checks, prompt optimization clues
Security and risk	Whether the agent performed sensitive or high-risk actions.	high-risk session count, high-risk command execution, sensitive-file access
Raw log context	How engineers verify the original event behind a dashboard trend.	timestamp, instance, filter condition, query statement, raw log content, statistical result

The important design point is traceability. A dashboard can tell you that cost increased or latency degraded; the log context should let you filter back to the related instance, session, operation, command, or event record.

OpenClaw Usage Insights and Tencent Cloud CLS workflow

The onboarding flow has three prerequisites:

OpenClaw is installed and running.
Tencent Cloud CLS is activated.
A Tencent Cloud API key is available, including SecretId and SecretKey.

After the prerequisites are ready, operators open the OpenClaw entry in the CLS Application Center and connect the machines where OpenClaw is running. The workflow supports two deployment paths:

Deployment path	How it works	When to use it
Tencent Cloud CVM or Lighthouse	Select uncollected server instances, enter `SecretId` and `SecretKey`, then let the console complete the installation.	Use this when OpenClaw runs on Tencent Cloud-hosted machines.
Self-managed server	Select the region, enter the API credentials, copy the generated command, and run it on the target server.	Use this when OpenClaw runs outside Tencent Cloud infrastructure.

After connection, the access-management list becomes the operational inventory. It shows which OpenClaw machines are connected and available for dashboards and log search. From there, operators can select a server instance and open the prebuilt dashboard set.

Cost monitoring for AI agent sessions

Token cost is one of the first signals teams notice, but total cost alone is not enough for troubleshooting. An OpenClaw operator needs to know whether spend is global, concentrated in a few sessions, caused by a specific interaction pattern, or related to a small group of messages.

The cost governance dashboard helps break the problem down:

Cost view	What to check	Why it matters
Total cost	Overall spend trend for the selected OpenClaw instance.	Confirms whether cost is actually increasing in the observed time range.
Total token usage	Token consumption trend and total token volume.	Separates token growth from other operational symptoms.
Average session cost	Typical cost per session.	Helps identify whether normal sessions became more expensive.
Cost distribution	Cost by session, message, or visible usage dimension.	Finds high-cost sessions or interaction patterns that deserve inspection.
Single-message cost	Cost at a more granular interaction level.	Helps narrow a session-level spike to a specific turn or message.

A practical investigation usually starts with a cost trend, then moves to high-cost sessions, then opens session detail or raw log search to verify what actually happened.
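
The drill-down described above can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's implementation: the record layout and the field names `session_id`, `message_id`, `cost`, and `tokens` are hypothetical and should be replaced with whatever your own log schema uses.

```python
from collections import defaultdict

# Hypothetical usage records; field names are illustrative, not an OpenClaw schema.
records = [
    {"session_id": "s-1", "message_id": "m-1", "cost": 0.02, "tokens": 900},
    {"session_id": "s-1", "message_id": "m-2", "cost": 0.41, "tokens": 18000},
    {"session_id": "s-2", "message_id": "m-3", "cost": 0.03, "tokens": 1200},
    {"session_id": "s-3", "message_id": "m-4", "cost": 0.01, "tokens": 500},
]


def cost_by_session(rows):
    """Sum message-level cost into one total per session."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["session_id"]] += row["cost"]
    return dict(totals)


def top_message(rows, session_id):
    """Return the most expensive message inside one session."""
    in_session = [r for r in rows if r["session_id"] == session_id]
    return max(in_session, key=lambda r: r["cost"])


session_totals = cost_by_session(records)
# Step 1: which session drives the spend?
top_session = max(session_totals, key=session_totals.get)
# Step 2: is the typical session expensive, or only the outlier?
average_session_cost = sum(session_totals.values()) / len(session_totals)
# Step 3: which single message explains the session-level spike?
costliest = top_message(records, top_session)

print(top_session, costliest["message_id"], round(average_session_cost, 4))
```

The same three steps map onto the cost distribution, average session cost, and single-message cost views; the raw log search then confirms what the expensive message actually contained.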

Operations monitoring for latency, failures, and abnormal activity

AI agent reliability is not only about final answers. Operators also need to watch the runtime path: message processing, queue behavior, response time, execution latency, error growth, and repeated abnormal activity.

The operations monitoring dashboard is useful when the symptom is operational rather than financial:

queue backlog indicates that work is waiting longer than expected;
response degradation suggests that users may experience slower answers;
error growth points to instability in the agent workflow or runtime path;
P95 execution latency helps expose slow-tail behavior that average latency can hide;
card distribution, log series, and runtime metrics help operators compare behavior across time windows.

When latency or errors rise, the dashboard should be treated as the starting point. The next step is to filter the related raw logs by instance, session, time range, condition, or query statement so the team can inspect the original event records.

Session analysis for reconstructing agent behavior

Session management is the bridge between user-facing behavior and system-level signals. A session view helps answer questions such as:

How many sessions are active or historical in the selected scope?
How many turns does a typical session contain?
Which sessions contain frequent tool invocations or unusual interaction patterns?
Which channels or models are involved in the observed usage?
Which session should be opened when investigating cost, latency, errors, or risky actions?

The session detail dashboard adds a more focused troubleshooting layer. Operators can open a session from the session overview by selecting a session identifier or session content row. They can also open the session-detail dashboard directly and filter by server instance and session ID.

For incident review, this matters because a single user complaint or abnormal cost event is rarely explained by one aggregate chart. The session detail view lets teams reconstruct the interaction path, inspect per-turn details, review token usage, check problem indicators, and identify prompt optimization clues.

Security audit for risky operations

AI agent systems can execute commands, touch files, and perform actions that need review. The security audit view focuses on security-sensitive behavior rather than normal product usage.

Use the security audit dashboard to check:

high-risk sessions that need review;
high-risk command execution;
sensitive-file access;
whether a risky action can be connected back to a session or operation;
whether the original log record supports the dashboard-level security signal.

This is especially useful when a team needs an audit trail. The goal is not only to count risky events, but to connect each event to enough context for review: which session it appeared in, what operation or command was involved, and what raw log evidence is available.

Raw log search for root cause analysis

Dashboards are good for trends and outliers. Raw log search is where the team verifies the actual event.

Inside the OpenClaw application page in CLS, operators can open Log Search, select a server instance, add filter conditions, or use AI-assisted query statement generation. The result keeps raw logs and statistical analysis together, which supports a practical investigation loop:

Notice a cost, latency, error, session, or security anomaly in a dashboard.
Identify the related instance, session, time range, operation, command, or condition.
Open log search and filter for the relevant records.
Compare raw events with the dashboard trend.
Decide whether the issue is a cost pattern, runtime failure, slow operation, risky action, or session-specific behavior.

This log-first evidence path is what makes the dashboards actionable. Without raw records, a chart can show that something changed, but it cannot prove why.

Troubleshooting flow: from symptom to source log

Use the dashboards and log search together instead of treating any single view as the final answer.

Symptom	First check	Next step in CLS
Token cost increases	Review total cost, total token usage, average session cost, and high-cost sessions.	Filter logs by instance, session, message, time range, or visible cost dimension.
Agent responses become slow	Check queue backlog, response degradation, and P95 execution latency.	Compare operation records and raw logs in the affected time window.
Errors increase	Review error growth and related runtime metrics.	Search raw logs for the related condition, status, or event records.
A user reports an abnormal session	Open session management, then drill into the related session detail.	Reconstruct the session in order and inspect per-turn cost, operations, and problem checks.
A risky action appears	Check security audit records for high-risk sessions, commands, or sensitive-file access.	Inspect the linked session or log records to verify the event context.
Dashboard trend is unclear	Identify the instance and time range behind the trend.	Use log search with conditions or AI-assisted query statements to inspect raw records.

Common pitfalls

Looking only at total cost without breaking it down by session, message, or usage pattern.
Treating a dashboard trend as the final answer without checking raw logs.
Connecting OpenClaw machines but not confirming that the access-management list shows the expected instances.
Reviewing session volume without drilling into session details for abnormal behavior.
Counting risky operations without preserving enough context for audit review.
Ignoring P95 latency and queue backlog when users report slow responses.

FAQ

What should I check when AI agent token cost suddenly increases?

Start with total cost and total token usage, then review average session cost and cost distribution. If a small number of sessions or messages account for the increase, open session detail and raw log search to verify what happened.

How can I trace what happened inside one OpenClaw agent session?

Use session management to locate the relevant session, then open the session detail view by session identifier or by filtering for the server instance and session ID. Review the interaction path, per-turn details, token usage, problem checks, and related log records.

What logs are useful for AI agent observability?

Useful logs connect session context, token usage, cost, operations, latency, errors, risky commands, sensitive-file access, and raw event records. The exact schema can vary, but the records should let operators move from a dashboard trend back to the original event.

Why do dashboards still need raw log search?

Dashboards summarize cost, operations, sessions, and security signals. Raw log search provides the evidence layer. When investigating cost spikes, latency degradation, error growth, or risky actions, raw logs help verify the cause behind the trend.

When should I use Tencent Cloud Log Service for OpenClaw monitoring?

Use Tencent Cloud Log Service (CLS) when OpenClaw operations need searchable logs, cost governance dashboards, runtime monitoring, session analysis, security audit views, and raw-log troubleshooting in one managed log service.

How do I investigate slow or failed OpenClaw operations?

Start with operations monitoring. Check queue backlog, response degradation, execution latency, P95 latency, and error growth. Then use CLS log search to inspect the affected instance and time range so the team can review the original records.

Final checklist

Before relying on OpenClaw Usage Insights for production monitoring, verify that:

OpenClaw is running on the target machine;
Tencent Cloud CLS has been activated;
SecretId and SecretKey are available for onboarding;
Tencent Cloud-hosted or self-managed servers are connected through the correct path;
the access-management list shows the expected OpenClaw instances;
cost, operations, session, session detail, and security audit dashboards are populated;
raw log search can filter the records needed for investigation;
cost, latency, error, session, and security reviews can move from dashboard trend to raw event evidence.

For AI agent operations, the most useful observability system is not only a set of charts. It is a path from symptom to session, from session to operation, and from operation to raw log evidence. OpenClaw Usage Insights and Tencent Cloud CLS provide that path for teams that need cost control, runtime monitoring, session reconstruction, security audit, and practical troubleshooting.

Collect Logs from a Self-Managed Kubernetes Cluster into Tencent Cloud CLS

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 08:36:08 +0000

Self-managed Kubernetes clusters do not automatically inherit the console-driven log collection experience of managed TKE clusters. The source article explains how Tencent Cloud CLS can collect logs from a self-managed Kubernetes cluster by using a Kubernetes CRD named LogConfig and three components: Log-Provisioner, Log-Agent, and LogListener.

For TKE users, the source points to the TKE log-collection document and console path. This article focuses on the self-managed Kubernetes path.

Prerequisites

The source article lists five prerequisites:

Kubernetes cluster version 1.10 or later;
CLS enabled, with a logset and log topic already created;
the CLS topic ID, topicId;
the region endpoint for the log topic, CLS_HOST;
Tencent Cloud API credentials required for CLS-side authentication: TmpSecretId and TmpSecretKey.

How the collection architecture works

The Kubernetes deployment includes one custom resource and three runtime components.

Component	Role from the source article
`LogConfig`	Defines where logs are collected from, how they are parsed, and which CLS log topic receives them.
`Log-Provisioner`	Synchronizes the log collection configuration defined in `LogConfig` to the CLS side.
`Log-Agent`	Watches `LogConfig` and container changes on nodes, then calculates the real host-machine path of container log files.
`LogListener`	Collects matching log files from the host path, parses them, and uploads them to CLS.

Deployment flow

The source article uses this sequence:

Define the LogConfig resource type with a CRD.
Define a LogConfig object.
Create the LogConfig object.
Configure the CLS authentication ConfigMap.
Deploy Log-Provisioner.
Deploy Log-Agent and LogListener.
Search the collected logs in the CLS console.

Step 1: define the LogConfig CRD

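Using /usr/local/ on the master node as the example path, change into that directory so the downloaded files land where the later kubectl commands expect them:

cd /usr/local/
wget https://mirrors.tencent.com/install/cls/k8s/CRD.yaml
kubectl create -f /usr/local/CRD.yaml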

Step 2: define the LogConfig object

Download the sample declaration:

wget https://mirrors.tencent.com/install/cls/k8s/LogConfig.yaml

The source explains that LogConfig.yaml has two main parts:

Section	Purpose
`clsDetail`	Defines the log parsing format and target CLS `topicId`.
`inputDetail`	Defines the log source: where the logs are collected from.

Replace clsDetail.topicId with the real topic ID created in CLS.
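
To see how the two sections fit into one object, here is a small Python sketch that assembles a LogConfig manifest and prints it as JSON, which kubectl also accepts. It is only an illustration under stated assumptions: the helper `build_log_config` and the object name `example-stdout` are hypothetical, and the `json_log` and `container_stdout` values are taken from the format and source examples later in this article.

```python
import json


def build_log_config(name, topic_id, cls_detail_extra, input_detail):
    """Combine the parsing side (clsDetail) and the source side (inputDetail)."""
    cls_detail = {"topicId": topic_id}
    cls_detail.update(cls_detail_extra)
    return {
        "apiVersion": "cls.cloud.tencent.com/v1",
        "kind": "LogConfig",
        "metadata": {"name": name},  # hypothetical object name
        "spec": {"clsDetail": cls_detail, "inputDetail": input_detail},
    }


manifest = build_log_config(
    name="example-stdout",
    topic_id="xxxxxx-xx-xx-xx-xxxxxxxx",  # placeholder, replace with the real topic ID
    cls_detail_extra={"logType": "json_log"},
    input_detail={
        "type": "container_stdout",
        "containerStdout": {"namespace": "default", "allContainers": True},
    },
)

print(json.dumps(manifest, indent=2))
```

Keeping the two halves as separate arguments mirrors the split in LogConfig.yaml: the parsing format can change without touching the log source, and the reverse.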

Supported parsing formats

Single-line full text

Use this when one line is one complete log entry. CLS stores the line in __CONTENT__ and does not extract fields.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: minimalist_log

Example collected output:

__CONTENT__: Tue Jan 22 12:08:15 CST 2019 Installed: libjpeg-turbo-static-1.2.90-6.el7.x86_64

Multi-line full text

Use this for logs such as Java stack traces. The source uses a beginning-line regex so a timestamped line starts a new log event, and later stack-trace lines are appended to the current event.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: multiline_log
    extractRule:
      beginningRegex: '\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d{3}\s.+'
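
The grouping behavior can be approximated locally before deploying the rule. The Python sketch below is not LogListener's implementation; it only shows the idea that a line matching the beginning-line regex opens a new event and every other line is appended to the current one. The sample lines are invented, and treating a non-matching first line as the start of an event is an assumption of this sketch.

```python
import re

# Same pattern as beginningRegex in the LogConfig above.
BEGINNING_REGEX = re.compile(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d{3}\s.+")


def group_multiline(lines):
    """Merge physical lines into log events using the beginning-line rule."""
    events = []
    for line in lines:
        if BEGINNING_REGEX.fullmatch(line) or not events:
            # A timestamped line (or the very first line) starts a new event.
            events.append(line)
        else:
            # Continuation lines, such as stack frames, join the current event.
            events[-1] += "\n" + line
    return events


sample = [
    "2019-12-15 17:13:06,043 [main] ERROR com.example.Demo - request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Demo.run(Demo.java:42)",
    "2019-12-15 17:13:07,101 [main] INFO com.example.Demo - retrying",
]

events = group_multiline(sample)
for event in events:
    print(repr(event))
```

Running a few real log lines through a check like this is a cheap way to confirm that the regex matches the timestamp format your application actually writes.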

Single-line full regex

Use this when a complete single-line log should be parsed into multiple key-value fields.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: fullregex_log
    extractRule:
      logRegex: '(\S+)[^\[]+(\[[^:]+:\d+:\d+:\d+\s\S+)\s"(\w+)\s(\S+)\s([^"]+)"\s(\S+)\s(\d+)\s(\d+)\s(\d+)\s"([^"]+)"\s"([^"]+)"\s+(\S+)\s(\S+).*'
      beginningRegex: '(\S+)[^\[]+(\[[^:]+:\d+:\d+:\d+\s\S+)\s"(\w+)\s(\S+)\s([^"]+)"\s(\S+)\s(\d+)\s(\d+)\s(\d+)\s"([^"]+)"\s"([^"]+)"\s+(\S+)\s(\S+).*'
      keys:
        - remote_addr
        - time_local
        - request_method
        - request_url
        - http_protocol
        - http_host
        - status
        - request_length
        - body_bytes_sent
        - http_referer
        - http_user_agent
        - request_time
        - upstream_response_time
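
A quick way to check that the capture groups line up with the key list is to run the regex against one sample line. The Python sketch below uses the same pattern as logRegex and the same 13 keys; the access-log line is an invented example in the usual nginx style, and Python's `re` module is only a stand-in for the engine that LogListener uses.

```python
import re

LOG_REGEX = re.compile(
    r'(\S+)[^\[]+(\[[^:]+:\d+:\d+:\d+\s\S+)\s"(\w+)\s(\S+)\s([^"]+)"\s(\S+)'
    r'\s(\d+)\s(\d+)\s(\d+)\s"([^"]+)"\s"([^"]+)"\s+(\S+)\s(\S+).*'
)
KEYS = [
    "remote_addr", "time_local", "request_method", "request_url",
    "http_protocol", "http_host", "status", "request_length",
    "body_bytes_sent", "http_referer", "http_user_agent",
    "request_time", "upstream_response_time",
]


def parse(line):
    """Return a key-to-value dict, or None when the line does not match."""
    match = LOG_REGEX.match(line)
    if match is None:
        return None
    # Capture groups are positional, so the key order must follow the regex.
    return dict(zip(KEYS, match.groups()))


# Invented sample line for illustration.
line = (
    '10.135.46.111 - - [22/Jan/2019:19:19:30 +0800] "GET /my/course/1 HTTP/1.1" '
    '127.0.0.1 200 782 9703 "http://127.0.0.1/course/explore" '
    '"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0)" 0.354 0.354'
)

fields = parse(line)
print(fields["status"], fields["request_url"], fields["request_time"])
```

Because the mapping is purely positional, adding or removing a capture group without updating the key list shifts every later field by one.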

Multi-line full regex

Use this when one structured log event spans multiple lines and fields still need to be extracted.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: multiline_fullregex_log
    extractRule:
      beginningRegex: '\[\d+-\d+-\w+:\d+:\d+,\d+\]\s\[\w+\]\s.*'
      logRegex: '\[(\d+-\d+-\w+:\d+:\d+,\d+)\]\s\[(\w+)\]\s(.*)'
      keys:
        - time
        - level
        - msg

JSON logs

For JSON logs, CLS extracts first-level keys as fields.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: json_log

Delimiter logs

For delimiter logs, define the delimiter and the keys that map to each segment.

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
    logType: delimiter_log
    extractRule:
      delimiter: ':::'
      keys:
        - IP
        - time
        - request
        - host
        - status
        - length
        - bytes
        - referer

Supported Kubernetes log sources

The source article gives three source types.

Container stdout

Collect all container stdout logs in the default namespace:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  inputDetail:
    type: container_stdout
    containerStdout:
      namespace: default
      allContainers: true

Collect stdout from the ingress-gateway deployment in the production namespace:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  inputDetail:
    type: container_stdout
    containerStdout:
      allContainers: false
      workloads:
        - namespace: production
          name: ingress-gateway
          kind: deployment

Collect stdout from pods labeled k8s-app=nginx:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  inputDetail:
    type: container_stdout
    containerStdout:
      namespace: production
      allContainers: false
      includeLabels:
        k8s-app: nginx

Container files

Collect /data/nginx/log/access.log from the nginx container in the ingress-gateway deployment:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  clsDetail:
    topicId: xxxxxx-xx-xx-xx-xxxxxxxx
  inputDetail:
    type: container_file
    containerFile:
      namespace: production
      workload:
        name: ingress-gateway
        type: deployment
      container: nginx
      logPath: /data/nginx/log
      filePattern: access.log

Collect the same file path from pods with label k8s-app=ingress-gateway:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  inputDetail:
    type: container_file
    containerFile:
      namespace: production
      includeLabels:
        k8s-app: ingress-gateway
      container: nginx
      logPath: /data/nginx/log
      filePattern: access.log

Host files

Collect every .log file under /data on the host:

apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
spec:
  inputDetail:
    type: host_file
    hostFile:
      logPath: /data
      filePattern: '*.log'

Step 3: create the LogConfig object

After editing LogConfig.yaml, create the object:

kubectl create -f /usr/local/LogConfig.yaml

Step 4: configure CLS authentication

Download the sample ConfigMap:

wget https://mirrors.tencent.com/install/cls/k8s/ConfigMap.yaml

Set TmpSecretId and TmpSecretKey to the API key ID and API key value used for CLS authentication. Then create it:

kubectl create -f /usr/local/ConfigMap.yaml

Step 5: deploy Log-Provisioner

Log-Provisioner discovers and watches the log topic ID, collection rule, and file path in LogConfig, then synchronizes that configuration to CLS.

Download the declaration:

wget https://mirrors.tencent.com/install/cls/k8s/Log-Provisioner.yaml

Before applying the file, set the CLS_HOST environment variable in Log-Provisioner.yaml to the endpoint of the target CLS topic region. Then create it:

kubectl create -f /usr/local/Log-Provisioner.yaml

Step 6: deploy Log-Agent and LogListener

The source separates responsibilities:

Log-Agent pulls log-source information from LogConfig and calculates the absolute host path for container logs.
LogListener collects and parses files from that host path, then uploads them to CLS.

wget https://mirrors.tencent.com/install/cls/k8s/Log-Agent.yaml

If the host Docker root is not /var/lib/docker, update the volume mapping in Log-Agent.yaml before applying it. The source screenshot uses /data/docker as the example: when Docker data lives under /data/docker on the host, mount that path into the Log-Agent container so the agent can map container log files back to their real host locations.

kubectl create -f /usr/local/Log-Agent.yaml

Step 7: verify logs in the CLS console

After the CRD, the LogConfig object, the authentication ConfigMap, Log-Provisioner, Log-Agent, and LogListener are all deployed, open the CLS log search page for the target topic.

Once collection is working, the Kubernetes logs can be searched in the CLS console. The top area of the page shows a histogram of log volume over time; the lower area lists the matching raw log events.

Deployment checklist

Confirm Kubernetes version is 1.10 or later.
Create a CLS logset and log topic, then record topicId.
Find the right regional CLS_HOST.
Prepare TmpSecretId and TmpSecretKey.
Create the LogConfig CRD.
Choose a parsing format: single-line text, multi-line text, full regex, multi-line full regex, JSON, or delimiter.
Choose the log source: container stdout, container file, or host file.
Apply LogConfig, ConfigMap, Log-Provisioner, and Log-Agent.
If Docker is not rooted at /var/lib/docker, mount the actual Docker root path into the agent.
Verify collected logs in CLS search.

Monitor CDN Performance with Real-Time CLS Log Analysis

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 08:10:13 +0000

A CDN is a performance layer, but its logs are also an operations dataset. Every request can reveal latency, cache behavior, response code, client distribution, traffic volume, and download speed. The source article explains how Tencent Cloud CDN logs can be delivered into Tencent Cloud CLS and analyzed in real time.

The original problem is familiar: CDN providers expose basic metrics such as request count and bandwidth, but default metrics are not enough for customized troubleshooting. Teams often download raw CDN logs for offline analysis. The source article names two drawbacks of that approach: it adds operations and development cost, and the data is not truly real time. Delays of more than half an hour are common in offline workflows.

The CDN-to-CLS path is designed for interactive analysis:

one-click log delivery;
second-level analysis for very large log volumes;
real-time dashboard visualization;
one-minute real-time alerting.

CDN log fields that matter

The source article lists the CDN log schema. The key fields are:

Field	CLS type	Meaning
`app_id`	long	Tencent Cloud account APPID.
`client_ip`	text	Client IP address.
`file_size`	long	File size.
`hit`	text	Cache HIT or MISS. Edge-node and parent-node hits are both marked as HIT.
`host`	text	Domain name.
`http_code`	long	HTTP status code.
`isp`	text	Carrier or ISP.
`method`	text	HTTP method.
`param`	text	URL parameters.
`proto`	text	HTTP protocol identifier.
`prov`	text	Carrier province.
`referer`	text	HTTP referer.
`request_range`	text	Range request parameter.
`request_time`	long	Response time in milliseconds, from node receiving the request to completing response delivery to the client.
`request_port`	long	Client-to-CDN-node connection port, or `-` if unavailable.
`rsp_size`	long	Response bytes.
`time`	long	Request time as a UNIX timestamp in seconds.
`ua`	text	User-Agent.
`url`	text	Request path.
`uuid`	text	Unique request identifier.
`version`	long	CDN real-time log version.

Scenario 1: alert when CDN latency exceeds a threshold

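The source recommends percentiles instead of simple averages or individual samples. Averages can hide a small but important set of slow requests, while individual samples are too noisy. The example computes average latency, P50, and P99 in five-minute buckets. One day contains 288 five-minute buckets, so LIMIT 1440 is only an upper bound on the number of rows returned (1440 five-minute buckets would span five days); the time range selected for the query determines the actual window.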

* |
SELECT
  avg(request_time) AS l,
  approx_percentile(request_time, 0.5) AS p50,
  approx_percentile(request_time, 0.99) AS p99,
  time_series(__TIMESTAMP__, '5m', '%Y-%m-%d %H:%i:%s', '0') AS time
GROUP BY time
ORDER BY time DESC
LIMIT 1440

The chart in the source screenshot compares average latency, P50, and P99 over time. The operational value is that P99 reveals the long-tail experience even when the average line looks acceptable.
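
A small worked example makes the long-tail point concrete. The Python sketch below uses invented request_time samples and an exact nearest-rank percentile; CLS's approx_percentile is an approximation, so this is an illustration of the statistic rather than of the CLS function.

```python
def nearest_rank(values, percent):
    """Exact nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, kept in integers to avoid float rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented request_time samples in ms: 98 fast requests and 2 slow ones.
samples = [20] * 98 + [1500, 2500]

average = sum(samples) / len(samples)
p50 = nearest_rank(samples, 50)
p99 = nearest_rank(samples, 99)

print(average, p50, p99)
```

With these samples the average stays below a 100 ms threshold and the median is 20 ms, while P99 is 1500 ms. An alert on the average would stay silent even though 2% of requests took more than a second.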

The alert condition in the source is based on P99 latency greater than 100 ms:

* |
SELECT
  approx_percentile(request_time, 0.99) AS p99

The source screenshot shows the alert-condition configuration: the rule computes p99 from request_time and triggers when the configured condition, such as P99 greater than 100 ms, is met.

A second screenshot shows the multidimensional analysis settings. The source says the alert message should display the affected host, url, and client_ip, so developers can quickly determine which domain, path, and client segment are involved.

Once the alert fires, the key information can be delivered immediately through channels such as WeChat, Enterprise WeChat, or SMS.

Scenario 2: alert when resource access errors spike

The source's second alert scenario is error-count growth. If page-access errors suddenly increase, the backend server may be failing or the service may be overloaded.

The source compares the latest one-minute error count with the previous one-minute count. Latest minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time DESC
  LIMIT 1
)

Previous minute:

* |
SELECT *
FROM (
  SELECT *
  FROM (
    SELECT *
    FROM (
      SELECT
        date_trunc('minute', __TIMESTAMP__) AS time,
        count(*) AS errct
      WHERE http_code >= 400
      GROUP BY time
      ORDER BY time DESC
      LIMIT 2
    )
  )
  ORDER BY time ASC
  LIMIT 1
)

The trigger expression from the source is:

$2.errct - $1.errct > 100

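In the alert policy, the two query results are referenced by position. The expression as written treats $2.errct as the latest minute's error count and $1.errct as the previous minute's, so it assumes the previous-minute query is configured as the first query and the latest-minute query as the second. If the queries are added in the order shown above (latest minute first), swap the operands to $1.errct - $2.errct > 100. The alert fires when the increase is greater than the selected threshold.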
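
The comparison logic can be sketched in Python to make the behavior easier to reason about. The records below are invented and use the `time` (UNIX seconds) and `http_code` fields from the CDN schema; the SQL above buckets on __TIMESTAMP__ instead, so treat this as an analogy rather than a reproduction of the query.

```python
from collections import Counter


def error_counts_per_minute(records):
    """Count responses with http_code >= 400, keyed by the start of each minute."""
    counts = Counter()
    for rec in records:
        if rec["http_code"] >= 400:
            counts[rec["time"] - rec["time"] % 60] += 1
    return counts


def error_growth(records):
    """Latest-minute error count minus the previous bucket's error count."""
    counts = error_counts_per_minute(records)
    latest_two = sorted(counts, reverse=True)[:2]
    if len(latest_two) < 2:
        return None  # not enough buckets to compare
    latest, previous = latest_two
    return counts[latest] - counts[previous]


# Invented traffic: 5 errors in the first minute, 120 in the next.
records = (
    [{"time": 1700000040 + i, "http_code": 502} for i in range(5)]
    + [{"time": 1700000040 + i, "http_code": 200} for i in range(10)]
    + [{"time": 1700000100 + (i % 60), "http_code": 404} for i in range(120)]
)

growth = error_growth(records)
print(growth, growth > 100)
```

One detail carries over from the SQL: a minute with no errors produces no bucket at all, so "previous" means the previous minute that contained at least one error, which is not always the minute immediately before.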

Build CDN quality and performance dashboards

The source article then turns CDN logs into dashboard metrics.

Health score

Health is defined as the percentage of requests whose http_code is below 500:

* |
SELECT
  round(
    sum(CASE WHEN http_code < 500 THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "health"

The panel shows that percentage for the selected time range. A value at or near 100 means that all or nearly all requests returned HTTP status codes below 500.

Cache hit rate

Cache hit rate is calculated only over responses whose http_code is below 400:

http_code < 400 |
SELECT
  round(
    sum(CASE WHEN lower(hit) = 'hit' THEN 1.00 ELSE 0.00 END)
    / cast(count(*) AS double) * 100,
    1
  ) AS "cache hit rate"

This panel helps operators see whether traffic is being served from CDN cache or falling back to origin paths.
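
Both ratio panels, health and cache hit rate, reduce to a filtered count divided by a total. The Python sketch below reproduces that arithmetic on a handful of invented records. Comparing the `hit` value case-insensitively is an assumption made here because the field table describes the values as HIT or MISS.

```python
def health(records):
    """Percentage of requests whose http_code is below 500."""
    if not records:
        return None
    ok = sum(1 for r in records if r["http_code"] < 500)
    return round(ok / len(records) * 100, 1)


def cache_hit_rate(records):
    """Percentage of cache hits among requests whose http_code is below 400."""
    eligible = [r for r in records if r["http_code"] < 400]
    if not eligible:
        return None
    hits = sum(1 for r in eligible if r["hit"].lower() == "hit")
    return round(hits / len(eligible) * 100, 1)


# Invented sample: five cached 200s, one uncached 200, one 404, one 502.
records = (
    [{"http_code": 200, "hit": "hit"}] * 5
    + [{"http_code": 200, "hit": "miss"}]
    + [{"http_code": 404, "hit": "miss"}]
    + [{"http_code": 502, "hit": "miss"}]
)

print(health(records), cache_hit_rate(records))
```

Note that the two metrics use different denominators: health divides by all eight requests, while cache hit rate divides only by the six requests with a status code below 400.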

Average download speed

Average download speed is total downloaded data divided by total request time:

* |
SELECT
  sum(rsp_size / 1024.0) / sum(request_time / 1000.0) AS "average download speed (KB/s)"

The query converts rsp_size from bytes to KB and request_time from milliseconds to seconds before dividing, so the result is in KB per second.
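
The unit conversion and the choice of a ratio of sums can be checked with two invented requests. The Python sketch below mirrors the SQL expression; the per-request mean is included only for contrast and is not something the source computes.

```python
def average_download_speed_kb_s(records):
    """Total KB delivered divided by total seconds spent, as in the SQL above."""
    total_kb = sum(r["rsp_size"] / 1024.0 for r in records)
    total_seconds = sum(r["request_time"] / 1000.0 for r in records)
    if total_seconds == 0:
        return None
    return total_kb / total_seconds


def mean_of_per_request_speeds(records):
    """Alternative for contrast: average the speed of each request individually."""
    speeds = [
        (r["rsp_size"] / 1024.0) / (r["request_time"] / 1000.0) for r in records
    ]
    return sum(speeds) / len(speeds)


# Invented requests: rsp_size in bytes, request_time in milliseconds.
records = [
    {"rsp_size": 1048576, "request_time": 1000},  # 1024 KB in 1 s
    {"rsp_size": 1024, "request_time": 3000},     # 1 KB in 3 s
]

speed = average_download_speed_kb_s(records)
print(speed, round(mean_of_per_request_speeds(records), 2))
```

The ratio of sums gives 1025 KB over 4 s, or 256.25 KB/s, and weights each request by how long it took. Averaging per-request speeds gives about 512 KB/s for the same data, so the two definitions should not be mixed on one dashboard.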

ISP-level download analytics

The source uses ip_to_provider(client_ip) to map client IPs to carriers:

* |
SELECT
  ip_to_provider(client_ip) AS isp,
  sum(rsp_size) * 1.0 / (sum(request_time) + 1) AS "download speed (KB/s)",
  sum(rsp_size / 1024.0 / 1024.0) AS "total download volume (MB)",
  count(*) AS c
GROUP BY isp
ORDER BY c DESC
LIMIT 10

The result shows, for each of the ten ISPs with the most requests, the request count, total downloaded traffic, and computed download speed. This helps compare CDN quality across carriers.

Latency distribution buckets

The source groups requests into custom latency windows:

* |
SELECT
  CASE
    WHEN request_time < 5000 THEN '~5s'
    WHEN request_time < 6000 THEN '5s~6s'
    WHEN request_time < 7000 THEN '6s~7s'
    WHEN request_time < 8000 THEN '7~8s'
    WHEN request_time < 10000 THEN '8~10s'
    WHEN request_time < 15000 THEN '10~15s'
    ELSE '15s~'
  END AS latency,
  count(*) AS count
GROUP BY latency

Instead of a single average, the panel shows how many requests fall into each duration range.
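
The CASE expression is an ordered list of upper bounds, where the first matching bound wins. The Python sketch below expresses the same bucketing with the same labels and thresholds; the sample values are invented.

```python
from collections import Counter

# (exclusive upper bound in ms, label), checked in order like the CASE branches.
BUCKETS = [
    (5000, "~5s"),
    (6000, "5s~6s"),
    (7000, "6s~7s"),
    (8000, "7~8s"),
    (10000, "8~10s"),
    (15000, "10~15s"),
]


def latency_bucket(request_time_ms):
    """Return the label of the first bucket whose upper bound exceeds the value."""
    for upper, label in BUCKETS:
        if request_time_ms < upper:
            return label
    return "15s~"  # ELSE branch


# Invented request_time samples in milliseconds.
samples = [120, 4999, 5000, 7500, 9999, 15000, 20000]
distribution = Counter(latency_bucket(ms) for ms in samples)

print(dict(distribution))
```

Because every comparison is a strict less-than, a value that lands exactly on a boundary, such as 5000 ms, falls into the next bucket up.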

Practical monitoring plan

Start with three layers:

Latency alerting: use P99 request latency and include affected host, url, and client_ip in the alert message.
Error-growth alerting: compare the latest one-minute http_code >= 400 count with the previous minute.
Performance dashboards: track health, cache hit rate, average download speed, ISP-level performance, and latency distribution.

This source-backed setup turns CDN access logs into an operations console: first alert on the abnormal condition, then use the same CLS dataset to explain which domain, path, ISP, client segment, or cache behavior is responsible.

Analyze CLB Access Logs in Tencent Cloud CLS

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 07:37:22 +0000

Cloud Load Balancer access logs answer a question that ordinary application logs often cannot: what happened between the client, the load balancer, and the real server?

The source article focuses on Layer 7 CLB access logs and shows how to send them into Tencent Cloud CLS for search, SQL analysis, dashboards, and alerts. It is especially useful when a small number of requests fail under high QPS, when backend servers do not see a request, or when application-side response_time looks normal but users still report slow requests.

Common CLB troubleshooting questions

The original article groups CLB log use cases into troubleshooting and statistical analysis.

Area	Source-backed question	Why CLB logs help
Exception localization	Under high QPS, a few client requests fail and the real server does not receive them. Did the load balancer receive the requests?	CLB access logs show the load-balancer-side request record.
Latency diagnosis	End users report slow requests, while the real server's `response_time` is normal. Where was time spent?	CLB logs expose `request_time`, `upstream_response_time`, `upstream_connect_time`, and `upstream_header_time`.
Layer 7 incident scope	Internal Layer 7 requests fail during a time window. Which part of the path is abnormal?	Logs can be filtered by CLB VIP, listener port, server name, upstream address, status code, and request.
Protocol analysis	HTTP/2 is enabled, but teams need to know whether it is actually being used.	The `protocol_type` and `server_protocol` fields can be analyzed.
Traffic distribution	Core domains are distributed across different CLB instances. What is the request share by instance?	Queries can group by `server_addr`, `server_name`, `http_host`, or other dimensions.
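
For the latency-diagnosis row, a first-pass breakdown is to subtract the upstream portion from the total. The Python sketch below is only an illustration under assumptions: it assumes `request_time` and `upstream_response_time` are recorded in the same unit, that a value of `-` means the field was not recorded, and the sample record is invented.

```python
def to_number(value):
    """Convert a log field to float; treat missing or '-' values as unknown."""
    if value is None or value in ("", "-"):
        return None
    return float(value)


def latency_breakdown(record):
    """Split total request time into the upstream part and everything else."""
    total = to_number(record.get("request_time"))
    upstream = to_number(record.get("upstream_response_time"))
    if total is None or upstream is None:
        return None
    return {
        "total": total,
        "upstream": upstream,
        # Time not explained by the real server's response.
        "outside_upstream": round(total - upstream, 6),
    }


# Invented record: the real server answered quickly, yet the request was slow.
record = {"request_time": "2.350", "upstream_response_time": "0.120"}

breakdown = latency_breakdown(record)
print(breakdown)
```

A large `outside_upstream` value with a small `upstream` value matches the situation in the table: the real server looks healthy, so the investigation moves to the client side of the path.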

Onboarding path 1: enable logs for one Layer 7 instance

For a single CLB instance, the source article uses this flow:

Select the target Layer 7 CLB instance.
Click the edit icon.
Enable the access-log switch.
Select the target CLS logset and log topic.
If no suitable logset or topic exists, create one from the access-log page.
Submit the configuration.
Open the target log topic and edit the index.
When logs arrive, use automatic index configuration and enable statistics for analysis fields.

The Chinese UI in this screenshot shows the CLB instance detail area. The highlighted control is the edit entry for configuring Layer 7 access logs on that specific instance.

This modal translates to: Enable CLS log service for the current CLB instance. The operator turns on the switch and confirms the setting.

This configuration page asks where the access logs should be delivered. In English: choose the logset, choose the log topic, and save. If the target does not exist yet, the source article points operators to the access-log page to create it first.

After access logs start arriving, go to the CLS log topic and open index editing. Index configuration is required before the fields can be searched and analyzed efficiently.

The source recommends enabling statistics for all relevant fields. In the screenshot, the operator uses automatic field detection and enables the statistics switch so fields can be used in SQL aggregation and dashboard panels.

Onboarding path 2: batch access through the dedicated CLB logset

The source article also describes a batch onboarding path for creating a dedicated clblog logset.

Important source note: at the time the article was written, batch onboarding required CLB product allowlist access before the entry became visible.

Recommended topic design from the source:

separate topics by technical layer, such as HTTP layer, cache layer, or data layer;
or separate topics by business dimension, such as finance, main site, or order business;
remember that CLS can also act as a pipeline, so different topics may later be delivered to COS, CKafka, SCF, or other processing paths for archiving and downstream handling.

The highlighted Chinese label is Access Logs. This is the batch configuration entry that opens the dedicated CLB logset setup page.

The UI text means: the CLB logset name is fixed, so the operator mainly chooses retention and creates a log topic. The source says the topic should be named according to the real business grouping.

This screenshot shows selecting CLB instances for the new topic. In English: choose the target load balancers, add them to the topic, and save the batch relationship.

The final setup step is to save. The source article says the configuration takes about 5-10 minutes to become effective. After that, use the same index and statistics settings as the single-instance path.

CLB access-log field reference

The original article lists the CLB log variables. The most operationally useful fields are:

Field	Type	Meaning
`stgw_request_id`	text	Request ID.
`time_local`	text	Access time and timezone, such as `01/Jul/2019:11:11:00 +0800`.
`protocol_type`	text	Protocol type: HTTP, HTTPS, SPDY, HTTP2, WS, or WSS.
`server_addr`	text	CLB VIP.
`server_port`	long	CLB VPort, meaning listener port.
`server_name`	text	Rule `server_name`, the domain configured in the CLB listener.
`remote_addr`	text	Client IP address.
`remote_port`	long	Client port.
`status`	long	Status code returned from CLB to the client.
`upstream_addr`	text	Real server address.
`upstream_status`	text	Status code returned from the real server to CLB.
`request`	text	Request line, including method, path, and protocol.
`request_length`	long	Bytes received from the client.
`bytes_sent`	long	Bytes sent to the client.
`http_host`	text	Request domain from the HTTP `Host` header.
`http_user_agent`	text	User-Agent header.
`http_referer`	text	HTTP request source.
`request_time`	double	Total processing time from the first byte received from the client to the last byte sent back.
`upstream_response_time`	double	Time spent on the backend request, from connecting to the real server until the response is fully received.
`upstream_connect_time`	double	TCP connection time to the real server.
`upstream_header_time`	double	Time from connecting to the real server until the response header is fully received.
`tcpinfo_rtt`	long	TCP RTT.
`ssl_handshake_time`	double	SSL handshake time.
`ssl_cipher`	text	SSL cipher suite.
`ssl_protocol`	text	SSL protocol version.
`vip_vpcid`	long	VPC ID of the CLB VIP. For public CLB, the value is `-1`.
`uri`	text	Resource identifier.
`server_protocol`	text	CLB protocol.
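The four timing fields above can be combined to see where a slow request spent its time. The following is a minimal Python sketch, not code from the source article. It assumes that `upstream_connect_time`, `upstream_header_time`, and `upstream_response_time` all start counting at the same moment (when CLB begins connecting to the real server), as the field descriptions suggest, and that every field is present as a number. The function name and segment names are hypothetical.

```python
def latency_breakdown(record):
    """Split one CLB access-log record into latency segments, in seconds.

    Assumption: the three upstream timers share a start point, so their
    differences are meaningful. Records without a backend hop are not handled.
    """
    connect = record["upstream_connect_time"]
    header = record["upstream_header_time"]
    upstream = record["upstream_response_time"]
    total = record["request_time"]
    return {
        "backend_connect": connect,               # TCP connect to the real server
        "backend_first_header": header - connect, # waiting for the response header
        "backend_body": upstream - header,        # receiving the rest of the response
        "client_and_clb": total - upstream,       # everything outside the backend hop
    }


# Illustrative values only, not taken from the source article.
sample = {
    "request_time": 0.850,
    "upstream_response_time": 0.120,
    "upstream_connect_time": 0.002,
    "upstream_header_time": 0.110,
}
segments = latency_breakdown(sample)
slowest = max(segments, key=segments.get)
print(slowest, round(segments[slowest], 3))
```

In this made-up record the backend finished in 0.12 seconds, so most of the 0.85 seconds sits outside the real server. That matches the case described earlier, where the application-side timing looks normal but users still see slow requests.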

Search examples from the source

Find a specific URL request where request time is greater than a threshold:

request:"HEAD /aaa/ HTTP/1.1" AND request_time:>0.005

Find 4xx requests for a specific real server:

status:[400 TO 500} AND upstream_addr:"10.0.1.12:80"
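When these searches are generated by a script rather than typed by hand, a small helper keeps the quoting consistent. The sketch below only builds the two query strings shown above. The helper names are hypothetical, the escaping rule is an assumption rather than documented CLS behavior, and `is_4xx` assumes Lucene-style range semantics in which `[` includes the bound and `}` excludes it.

```python
def quote(value):
    """Wrap a value in double quotes (assumed escaping: backslash and quote)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def slow_request_query(request_line, min_seconds):
    """Search for one request line slower than min_seconds."""
    return f"request:{quote(request_line)} AND request_time:>{min_seconds}"


def status_4xx_query(upstream_addr):
    """Search for 4xx responses served through one real server."""
    # "}}" is an escaped "}" inside an f-string.
    return f"status:[400 TO 500}} AND upstream_addr:{quote(upstream_addr)}"


def is_4xx(status):
    """Local equivalent of status:[400 TO 500} under the assumed semantics."""
    return 400 <= status < 500


print(slow_request_query("HEAD /aaa/ HTTP/1.1", 0.005))
print(status_4xx_query("10.0.1.12:80"))
```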

Build dashboard panels from CLB logs

The source article gives three dashboard query patterns.

Time dashboard: average request duration by CLB instance

* |
SELECT
  HISTOGRAM(CAST(__TIMESTAMP__ AS TIMESTAMP), INTERVAL 1 MINUTE) AS dt,
  AVG(request_time) AS "average request duration per CLB instance",
  server_addr
GROUP BY dt, server_addr
ORDER BY dt

This panel is used to observe website response time in real time and identify which CLB instance is slowing down.

The screenshot shows a CLS analysis dashboard. In English, the top area is the time-series result, and the bottom area keeps log details available for drill-down after a spike is found.

The highlighted Chinese operation is Add to dashboard. After running an analysis query and selecting a chart type, the operator saves the chart into a reusable dashboard.

This screenshot is the time dashboard. It translates to: compare request_time by server_addr in one-minute buckets, so operators can quickly see whether latency is isolated to one CLB instance.
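To make the grouping in the time dashboard concrete, here is a rough Python equivalent of the aggregation, applied to records already exported from CLS. It is an illustration of what the query computes, not how CLS executes it. The `timestamp` key (epoch seconds) is a hypothetical stand-in for `__TIMESTAMP__`.

```python
from collections import defaultdict


def average_request_time(records, bucket_seconds=60):
    """Average request_time per (time bucket, server_addr)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [total seconds, request count]
    for rec in records:
        bucket = rec["timestamp"] - rec["timestamp"] % bucket_seconds
        key = (bucket, rec["server_addr"])
        sums[key][0] += rec["request_time"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Illustrative records only.
records = [
    {"timestamp": 120, "server_addr": "1.1.1.1", "request_time": 0.2},
    {"timestamp": 150, "server_addr": "1.1.1.1", "request_time": 0.4},
    {"timestamp": 185, "server_addr": "1.1.1.1", "request_time": 1.0},
    {"timestamp": 130, "server_addr": "2.2.2.2", "request_time": 0.1},
]
averages = average_request_time(records)
for (bucket, addr), avg in averages.items():
    print(bucket, addr, round(avg, 3))
```

Each output row corresponds to one point on the chart: one CLB VIP in one one-minute bucket.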

Capacity dashboard: request count by real server

* |
SELECT
  HISTOGRAM(CAST(__TIMESTAMP__ AS TIMESTAMP), INTERVAL 1 MINUTE) AS dt,
  COUNT(1) AS "requests per minute to each real server",
  upstream_addr
GROUP BY dt, upstream_addr
ORDER BY dt

This panel checks backend capacity distribution. If one upstream_addr receives an unexpected share of requests, the team can review CLB rules or backend health.

The screenshot shows multiple line charts. In English, each line represents a real server address and the count of requests it receives per minute.
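One way to turn "an unexpected share of requests" into a check is to compare each real server's share with an even split. The sketch below is an assumption-laden illustration: it presumes all backends carry equal weight, and the `tolerance` value and function names are hypothetical.

```python
from collections import Counter


def request_share(records):
    """Fraction of requests that each upstream_addr received."""
    counts = Counter(rec["upstream_addr"] for rec in records)
    total = sum(counts.values())
    return {addr: n / total for addr, n in counts.items()} if total else {}


def skewed_backends(shares, tolerance=0.5):
    """Backends whose share differs from an even split by more than tolerance.

    Assumption: equal backend weights. With weighted rules, compare against
    the configured weights instead.
    """
    if not shares:
        return []
    even = 1 / len(shares)
    return sorted(a for a, s in shares.items() if abs(s - even) > tolerance * even)


# Illustrative traffic: one backend takes 6 of 10 requests.
records = (
    [{"upstream_addr": "10.0.1.12:80"}] * 6
    + [{"upstream_addr": "10.0.1.13:80"}] * 2
    + [{"upstream_addr": "10.0.1.14:80"}] * 2
)
shares = request_share(records)
print(skewed_backends(shares))
```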

Status-code dashboard: request count by CLB status

* |
SELECT
  HISTOGRAM(CAST(__TIMESTAMP__ AS TIMESTAMP), INTERVAL 1 MINUTE) AS dt,
  COUNT(1) AS "requests per minute by status code",
  status
GROUP BY dt, status
ORDER BY dt

This panel tracks service health by status code. It is useful for separating client errors, backend errors, and normal traffic.

The screenshot combines bar and pie-style views. In English, it groups request volume by status so operators can see whether error codes are rising.
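The separation into client errors, backend errors, and normal traffic can be sketched as a simple classification over the `status` field. This is an illustrative Python sketch with hypothetical class names. Note that `status` is the code CLB returned to the client, so a 5xx here may originate from CLB itself; comparing it with `upstream_status` would be needed to attribute it to the real server.

```python
from collections import Counter


def status_class(status):
    """Map an HTTP status code to a coarse traffic class."""
    if 400 <= status < 500:
        return "client_error"
    if status >= 500:
        return "server_error"
    return "normal"


def class_counts(records):
    """Count records per traffic class."""
    return Counter(status_class(rec["status"]) for rec in records)


# Illustrative status codes only.
records = [{"status": s} for s in (200, 200, 301, 404, 499, 502)]
counts = class_counts(records)
error_ratio = (counts["client_error"] + counts["server_error"]) / len(records)
print(dict(counts), error_ratio)
```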

The dashboard screenshot shows the outcome of the previous steps: CLB access logs become operational panels rather than one-off search results.

Add real-time alerts from search-analysis results

The source article closes with real-time alerting. CLS can create alert rules from flexible search-analysis queries, attach alert policies, and notify teams through channels such as WeChat, Enterprise WeChat, or webhook.

This screenshot translates to: define the alert query, set the trigger condition, configure scheduling and notification, and route the alert to the selected receiver. For CLB logs, useful conditions include abnormal status-code count, high request_time, or unexpected changes in backend traffic distribution.
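The trigger conditions mentioned above can be prototyped offline before they are configured in CLS. The sketch below evaluates one window of exported records against two conditions. The thresholds, the nearest-rank percentile, and the function names are assumptions for illustration; they are not CLS alert syntax.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def evaluate_window(records, max_5xx=5, max_p95_seconds=1.0):
    """Return the reasons this window should alert (empty list means quiet)."""
    reasons = []
    errors = sum(1 for rec in records if rec["status"] >= 500)
    if errors > max_5xx:
        reasons.append(f"{errors} responses with status >= 500")
    if records:
        slow = p95([rec["request_time"] for rec in records])
        if slow > max_p95_seconds:
            reasons.append(f"p95 request_time {slow:.3f}s")
    return reasons


# Illustrative window: 18 healthy requests and 2 slow 502 responses.
healthy = [{"status": 200, "request_time": 0.1}] * 18
failing = [{"status": 502, "request_time": 2.0}] * 2
reasons = evaluate_window(healthy + failing, max_5xx=1)
print(reasons)
```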

Practical CLB log playbook

Enable CLB access logs either per instance or through batch onboarding.
Configure the CLS index and enable statistics for fields used in aggregation.
Start with search examples for URL latency and real-server 4xx requests.
Build dashboards around request_time, upstream_addr, and status.
Add alerts for latency spikes, backend error growth, and abnormal request distribution.
Keep the raw log view near charts so every spike can be investigated with the exact request context.

Unify Tencent Cloud CLS Alert Notifications with Observability Templates

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 07:20:26 +0000

Log alerts rarely live alone. In real operations, log monitoring, cloud product monitoring, application monitoring, and endpoint monitoring often need to notify the same teams. If every product maintains its own recipients, channels, rotations, and callbacks, alert operations become duplicated and easy to miss.

The source article introduces a new CLS capability: Tencent Cloud CLS alerts can now send notifications through Tencent Cloud Observability Platform notification templates. The practical change is simple but useful: CLS alert policies can reuse the same notification policy layer as cloud product monitoring, APM, and terminal performance monitoring.

Why unify alert notification templates?

The original article frames the problem as alert fragmentation. Separate notification settings across products can create three operational issues:

Problem	What happens in practice	What the new path changes
Repeated maintenance	Teams configure recipients and channels in several products	CLS can reuse Observability Platform templates
Scattered alerts	Log alerts and cloud-resource alerts are reviewed in different places	Notification strategy becomes more consistent across products
Missed escalation	A channel or rotation may be updated in one product but not another	Duty schedules, phone rotation, and callbacks can be managed centrally

Capability map from the source article

The source article highlights three capabilities.

Capability	Source-backed detail
Unified configuration	CLS alert policies can directly reuse Observability Platform notification templates. The same notification strategy can be shared with cloud product monitoring, APM, and endpoint performance monitoring alerts.
Multi-channel notification	Supported channels include SMS, email, phone calls, WeChat, Enterprise WeChat, DingTalk, Feishu, Slack, PagerDuty, Teams, and custom callback APIs.
Advanced alert handling	The Observability Platform can provide duty schedules, phone notification rotation, and alert-message delivery to SCF. The source also mentions future support for alert convergence.

Configure CLS to use an Observability Platform template

The usage flow is short:

In a CLS alert policy, choose Observability Platform notification template as the notification method.
Select an existing notification template that was already created in the Tencent Cloud Observability Platform.
If no suitable template exists, create a new template.
After alerts are delivered, review CLS alert history through Alert Governance in the Observability Platform.

The screenshot above shows the CLS alert-policy configuration page. The highlighted area is the notification method. Instead of configuring a standalone CLS-only receiver, the policy uses the Observability Platform template type.

This screenshot shows the template-selection step. In English, the operator is choosing a previously created notification template from the Observability Platform and applying it to the CLS alert policy.

If the existing list does not contain the right template, the dialog allows the operator to create one. The practical translation of this step is: define the receiver policy once, then attach it to CLS alerts.

The final screenshot shows the Observability Platform's Alert Governance area. The source article says CLS alert history can be reviewed there, which gives operators one place to trace notification events after alerts fire.

When this is the right pattern

Use this integration when:

log alerts and cloud-resource alerts should notify the same team;
a team already manages duty schedules or notification rotations in the Observability Platform;
CLS alerts need channels such as Enterprise WeChat, Slack, PagerDuty, Teams, or custom callbacks;
operations teams want alert history and governance to be reviewed in a central alert console.

Keep standalone CLS notification settings only when a log alert has an intentionally isolated audience or an independent callback path.

Operational checklist

Create or identify the Observability Platform notification template first.
In the CLS alert policy, set the notification method to the Observability Platform template option.
Select the existing template or create a new one from the policy configuration flow.
Test that the expected channel receives the alert.
Use Alert Governance to review historical CLS alert events after delivery.

FAQ

What problem does this solve for CLS alerting?

It reduces duplicated notification configuration and makes CLS alert delivery part of the same alert-management layer used by other Tencent Cloud observability products.

Which notification channels does the source article list?

The source lists SMS, email, phone calls, WeChat, Enterprise WeChat, DingTalk, Feishu, Slack, PagerDuty, Teams, and custom callback APIs.

Does this replace alert rules?

No. The source article describes notification delivery and governance. CLS still owns the log alert policy and alert condition; the Observability Platform template owns the notification route.

How Tencent Cloud CLS Optimized Lucene for Time-Series Log Search

Tencent Cloud -Cloud Log Service — Wed, 10 Jun 2026 07:08:51 +0000

Log search looks like text search until every query includes a time range. That time predicate changes the problem. In a high-volume log platform, timestamps are high-cardinality values, and scanning timestamp ranges can dominate query latency.

The source article describes Tencent Cloud CLS's time-series search engine, which is built on top of Lucene. The work was accepted at VLDB 2022 as the paper "TencentCLS: The Cloud Log Service with High Query Performances".

According to the source article, the time-series search engine achieved nearly 40x improvement over a traditional search engine in massive log retrieval. It also reports 38x improvement for head queries, 24x for tail queries, and 7.6x for histogram queries in the paper-related experiment tables.

Why timestamp range search is hard in Lucene-style indexes

The source article starts with a typical log record:

[2021-09-28 10:10:39T1234] [ip=192.168.1.1]
XXXXXXXX

A log platform indexes the timestamp, attributes such as ip, and tokenized text. A typical query specifies a time range:

SELECT *
FROM xxxx_index
WHERE ip = xxxx
  AND timestamp >= 2021-09-28 xxxx
  AND timestamp <= 2021-09-29 xxxx;

Lucene is strong at text search, but the article points out that timestamp range search is a high-cardinality numeric range problem. If a timestamp is stored at millisecond precision, one day contains:

24 * 60 * 60 * 1000 = 86,400,000

possible timestamp values. At microsecond precision, the possible values are another 1000x larger.
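The same arithmetic, written out in Python for each precision so the growth in possible distinct values is easy to compare:

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Possible distinct timestamp values in one day at each precision.
values_per_day = {
    "second": SECONDS_PER_DAY,
    "millisecond": SECONDS_PER_DAY * 1000,
    "microsecond": SECONDS_PER_DAY * 1000 * 1000,
}

for precision, count in values_per_day.items():
    print(f"{precision:>11}: {count:,}")
# →      second: 86,400
# → millisecond: 86,400,000
# → microsecond: 86,400,000,000
```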

In an inverted index, a timestamp key maps to a posting list:

timestamp -> [docid1, docid2]

An exact timestamp lookup is efficient: the source article describes normal search as O(log(n)). But a one-day timestamp range may require scanning a massive number of timestamp terms. The article describes the high-cardinality range search complexity as O(n), where n is the number of index terms.

The source gives a concrete scale example: in a 10-billion-log index, the observed timestamp-index data can be around 30GB. Reading that at 100MB/s would take about 300 seconds just to load the index data.
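A toy model shows why the range case is expensive. The sketch below is plain Python with a dictionary standing in for the term dictionary; it is not Lucene's data structure. An exact lookup touches one term, while a range lookup has to visit every distinct timestamp term inside the range. The final lines reproduce the load-time estimate, assuming decimal units (1 GB = 1000 MB).

```python
from collections import defaultdict


def build_index(timestamps):
    """Inverted index: timestamp term -> posting list of doc IDs."""
    index = defaultdict(list)
    for doc_id, ts in enumerate(timestamps):
        index[ts].append(doc_id)
    return index


def exact_lookup(index, ts):
    """One term, one posting list."""
    return index.get(ts, [])


def range_lookup(index, start, end):
    """Collect postings for every term in [start, end]; count the terms visited."""
    docs, terms_visited = [], 0
    for ts in sorted(index):
        if start <= ts <= end:
            terms_visited += 1
            docs.extend(index[ts])
    return sorted(docs), terms_visited


# Logs arrive in no particular timestamp order in this toy layout.
index = build_index([1005, 1001, 1003, 1001, 1009])
print(exact_lookup(index, 1001))
docs, terms_visited = range_lookup(index, 1001, 1005)
print(docs, terms_visited)

index_size_mb = 30 * 1000   # about 30 GB of timestamp-index data
read_mb_per_s = 100
load_seconds = index_size_mb / read_mb_per_s
print(load_seconds)
```

With millions of distinct timestamps per day, `terms_visited` grows with the width of the time range, which is the O(n) behavior the article describes.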

Optimization 1: order logs by timestamp

The central design shift is to organize log data by timestamp order. In the old layout, timestamps are unordered, so the engine must handle many timestamp index terms for a range query. In the time-ordered layout, a time range can be reduced to endpoint handling.

The source article states that this reduces the timestamp terms handled from hundreds of thousands or hundreds of millions down to two endpoints.
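The endpoint idea can be illustrated with a sorted list. In this minimal Python sketch, which assumes doc IDs are simply positions in timestamp order (an illustration, not the CLS storage format), a time range becomes two binary searches, and everything between the two positions matches.

```python
import bisect


def range_doc_ids(sorted_timestamps, start, end):
    """Doc IDs whose timestamp lies in [start, end], found from two endpoints."""
    first = bisect.bisect_left(sorted_timestamps, start)   # first position >= start
    last = bisect.bisect_right(sorted_timestamps, end)     # first position > end
    return range(first, last)


# The same five logs as a time-ordered sequence; position is the doc ID.
timestamps = [1001, 1001, 1003, 1005, 1009]
matches = range_doc_ids(timestamps, 1001, 1005)
print(list(matches))
```

No per-timestamp term is visited: the cost depends on the two lookups, not on how many distinct timestamps fall inside the range.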

Optimization 2: add a secondary index for disk access

Simple binary search works well in memory, but the source article notes that it causes scattered disk access when applied to ordered column data. The solution is a secondary index that reduces disk access from dozens of operations to three.
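The source article does not describe the layout of the secondary index, so the following Python sketch uses a common stand-in: a sparse index holding the first value of each on-disk block. It only illustrates why an extra index level cuts block reads. The class, the block size, and the read counts it produces belong to this toy model and are not CLS figures.

```python
import bisect


class BlockedColumn:
    """Sorted timestamps stored in fixed-size blocks; each block read is counted."""

    def __init__(self, sorted_values, block_size):
        self.block_size = block_size
        self.size = len(sorted_values)
        self.blocks = [sorted_values[i:i + block_size]
                       for i in range(0, self.size, block_size)]
        # Hypothetical secondary index: first value of every block, kept in memory.
        self.first_values = [block[0] for block in self.blocks]
        self.reads = 0

    def read_block(self, block_no):
        self.reads += 1  # stands in for one disk access
        return self.blocks[block_no]

    def lower_bound_plain(self, target):
        """Binary search over positions; every probe reads a block."""
        lo, hi = 0, self.size
        while lo < hi:
            mid = (lo + hi) // 2
            value = self.read_block(mid // self.block_size)[mid % self.block_size]
            if value < target:
                lo = mid + 1
            else:
                hi = mid
        return lo

    def lower_bound_indexed(self, target):
        """Choose the block from the in-memory index, then read only that block."""
        block_no = bisect.bisect_left(self.first_values, target) - 1
        if block_no < 0:
            return 0  # target is not greater than the very first value
        block = self.read_block(block_no)
        return block_no * self.block_size + bisect.bisect_left(block, target)


column = BlockedColumn(list(range(0, 2000, 2)), block_size=100)
plain_pos = column.lower_bound_plain(1501)
plain_reads = column.reads
column.reads = 0
indexed_pos = column.lower_bound_indexed(1501)
indexed_reads = column.reads
print(plain_pos, plain_reads, indexed_pos, indexed_reads)
```

Both searches return the same position, but the plain binary search touches a block on every probe, while the indexed version reads a single block after consulting the in-memory index.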

Optimization 3: make reverse search fast

The source article says the original underlying iterators only supported one-way iteration. That is a problem for reverse chronological search: if the target data sits at the tail of a timestamp-ordered sequence, a one-way iterator must traverse all previous data first.

CLS solves this with a reverse binary-search algorithm built on top of the one-way iterator. The article reports that iteration count drops from tens of thousands or hundreds of thousands to dozens, and the complexity changes from O(n) to O(log n * log n).
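The article does not give the algorithm itself, so the sketch below is one plausible reading, offered as an assumption: binary-search the doc-ID space, and for each probe open a fresh forward-only iterator and call `advance`. If each `advance` costs O(log n), for example through skip data, the total is O(log n * log n). `ForwardIterator` and `last_doc` are hypothetical names; `bisect` stands in for the skip structure.

```python
import bisect


class ForwardIterator:
    """Forward-only cursor over a sorted posting list (stand-in for a one-way iterator)."""

    def __init__(self, doc_ids):
        self._docs = doc_ids
        self._pos = 0

    def advance(self, target):
        """Move to the first doc ID >= target and return it, or None if exhausted."""
        # Searching from the current position means the cursor never moves backwards.
        self._pos = bisect.bisect_left(self._docs, target, lo=self._pos)
        return self._docs[self._pos] if self._pos < len(self._docs) else None


def last_doc(doc_ids, max_doc):
    """Find the largest doc ID using only fresh forward iterators and advance()."""
    lo, hi = 0, max_doc      # the answer, if any, lies in [lo, hi)
    best, probes = None, 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        hit = ForwardIterator(doc_ids).advance(mid)
        if hit is None:
            hi = mid         # nothing at or after mid: search the lower half
        else:
            best = hit       # a doc exists at or after mid: look beyond it
            lo = hit + 1
    return best, probes


postings = [3, 10, 57, 900]
best, probes = last_doc(postings, max_doc=1000)
print(best, probes)
```

A plain one-way scan would step through every posting to reach the tail; here the number of probes is bounded by the logarithm of the doc-ID space.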

Optimization 4: compute histograms from bucket boundaries

Histogram is one of the most common log-analysis operations. The source article says the original system computed histograms by reading timestamps back for every matching log, producing tens of thousands or hundreds of thousands of back-table lookups.

The optimized approach uses bucket boundaries to determine log-ID ranges. Instead of fetching timestamps for every matched log, the engine performs a few index accesses to find boundaries, then assigns internal points by comparing them with the bucket limits. The secondary index is also used here to reduce disk access for boundary lookups.
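The boundary approach can be sketched in a few lines of Python. The sketch assumes doc IDs are positions in timestamp order, as in a time-ordered layout, and that the matched doc IDs arrive sorted. Each bucket edge is resolved to a doc ID once, and matched docs are then assigned to buckets by comparing doc IDs, without reading any per-document timestamp. Function and variable names are hypothetical.

```python
import bisect


def histogram_by_boundaries(sorted_timestamps, matched_doc_ids, edges):
    """Count matched docs per bucket [edges[i], edges[i+1]) using boundary doc IDs."""
    # One lookup per edge: the first doc ID whose timestamp is >= that edge.
    boundary_ids = [bisect.bisect_left(sorted_timestamps, edge) for edge in edges]
    counts = []
    for lo_id, hi_id in zip(boundary_ids, boundary_ids[1:]):
        lo = bisect.bisect_left(matched_doc_ids, lo_id)
        hi = bisect.bisect_left(matched_doc_ids, hi_id)
        counts.append(hi - lo)   # matched docs with lo_id <= doc ID < hi_id
    return counts


# Position in this list is the doc ID; values are timestamps in seconds.
timestamps = [0, 10, 20, 61, 65, 119, 120, 130, 185]
matched = [1, 3, 4, 6, 8]        # doc IDs that satisfied the text query
edges = [0, 60, 120, 180]        # three one-minute buckets
counts = histogram_by_boundaries(timestamps, matched, edges)
print(counts)
# → [1, 2, 1]
```

The number of timestamp lookups equals the number of bucket edges, regardless of how many logs matched.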

Reported performance results

The source article reports several performance results:

Test context	Source-reported result
Paper experiment, head query	38x improvement
Paper experiment, tail query	24x improvement
Paper experiment, histogram query	7.6x improvement
Offline prototype test on 8 million rows, 100 concurrent requests	50x response improvement, `1.059s` vs `56.9s`
Concurrency under sub-second response target	90+ vs 4, a 20x improvement
Online testing with writes present	Core operations were more than 10x faster
Cold-data scenario	Core operation response speed improved by 10x+

The article also notes that IO jitter had to be optimized before online testing, because a 2-3 second long-tail jitter is less visible when the original query takes more than 10 seconds, but severely distorts results when the optimized query runs in hundreds of milliseconds.

Comparison with a Lucene-based cloud log service

The source article compares CLS with another cloud log service in a one-billion-row scenario. It explains the difference through timestamp granularity:

System design	Timestamp index implication from the source
Minute-level index	One day has `24 * 60 = 1440` index terms
CLS microsecond-level index	One day can theoretically have `24 * 60 * 60 * 1000 * 1000 = 86,400,000,000` timestamp values

The source states that CLS previously used millisecond timestamps and moved to microsecond timestamps after the new index went online. The time-series index is the reason CLS can support high-cardinality timestamp retrieval while maintaining performance.

Engineering takeaways

Log search is not only text search; the time-range predicate can dominate the query plan.
High-cardinality timestamp fields are expensive when a range query must scan many index terms.
Ordering logs by timestamp changes range search from many-term processing to endpoint processing.
Secondary indexes matter when a theoretically efficient binary search would otherwise produce scattered disk reads.
Reverse chronological queries and histograms need specialized handling, because they are common in real log troubleshooting.
The source article's reported gains come from combining data layout, secondary indexing, reverse access, histogram boundary lookup, and IO jitter optimization.
