Troubleshooting Kubernetes Events with TKE and Tencent Cloud CLS

Cluster problems rarely appear from nowhere. Before a service outage becomes visible, Kubernetes often records smaller state changes: node pressure, Pod scheduling, Pod eviction, and cluster autoscaler decisions.

Tencent Kubernetes Engine can send those Events into Tencent Cloud CLS, where they become searchable logs and dashboard data. This gives operators a central way to answer what changed, when it changed, which object was involved, and which component reported it.

What an Event tells you

Kubernetes Events describe state transitions. The useful fields are:

| Field | What to look for |
| --- | --- |
| Type | Normal, Warning, or a custom type. |
| Involved Object | Pod, Deployment, Node, or another Kubernetes object. |
| Source | Reporting component, such as the scheduler or kubelet. |
| Reason | Short, machine-readable reason string. |
| Message | Detailed, human-readable explanation. |
| Count | How many times the event occurred. |

The core flow is: Kubernetes emits a state-change record, CLS stores it as a log event, and the operator filters by object, component, reason, message, count, and timestamp.
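The filter step in that flow can be sketched locally in Python. This is a minimal sketch, assuming events have been exported as dictionaries shaped like Kubernetes core/v1 Event objects. The sample records and the `filter_events` helper are hypothetical; CLS does this filtering server-side with its own query syntax.

```python
from typing import Optional

# Hypothetical sample records shaped like Kubernetes core/v1 Event objects.
SAMPLE_EVENTS = [
    {
        "type": "Warning",
        "reason": "FailedScheduling",
        "message": "example message: no nodes available",
        "count": 4,
        "involvedObject": {"kind": "Pod", "name": "web-1"},
        "source": {"component": "default-scheduler"},
        "lastTimestamp": "2020-11-25T20:30:00Z",
    },
    {
        "type": "Normal",
        "reason": "Scheduled",
        "message": "example message: assigned web-2 to a node",
        "count": 1,
        "involvedObject": {"kind": "Pod", "name": "web-2"},
        "source": {"component": "default-scheduler"},
        "lastTimestamp": "2020-11-25T20:31:00Z",
    },
    {
        "type": "Warning",
        "reason": "EvictionThresholdMet",
        "message": "example message: attempting to reclaim disk space",
        "count": 2,
        "involvedObject": {"kind": "Node", "name": "node-a"},
        "source": {"component": "kubelet"},
        "lastTimestamp": "2020-11-25T20:29:00Z",
    },
]


def filter_events(
    events,
    event_type: Optional[str] = None,
    component: Optional[str] = None,
    object_name: Optional[str] = None,
):
    """Return matching events, oldest first. A criterion left as None is ignored."""
    matched = []
    for ev in events:
        if event_type is not None and ev.get("type") != event_type:
            continue
        if component is not None and ev.get("source", {}).get("component") != component:
            continue
        if object_name is not None and ev.get("involvedObject", {}).get("name") != object_name:
            continue
        matched.append(ev)
    # Same-format ISO 8601 UTC strings sort chronologically as plain text.
    return sorted(matched, key=lambda ev: ev.get("lastTimestamp", ""))
```

Calling `filter_events(SAMPLE_EVENTS, event_type="Warning")` returns the two warning records in time order, which is the same shape of answer the CLS search gives for a type filter.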

Open Event Search

In TKE, go to Cluster Operations -> Event Search. CLS provides collection, storage, search, analysis, and dashboards for the event stream.

Use the overview when you need warning distribution, affected object types, and event trends. Use global search when you already know the component or object name and need a row-level timeline.
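To make the overview idea concrete, here is a rough Python sketch of the kind of aggregation a warning-distribution panel performs. It assumes events are available as dictionaries with core/v1 Event field names; the sample data and the `summarize_warnings` helper are hypothetical and are not part of TKE or CLS.

```python
from collections import Counter

# Hypothetical records; only the fields used by the summary are included.
SAMPLE_EVENTS = [
    {"type": "Warning", "reason": "FailedScheduling", "involvedObject": {"kind": "Pod"}},
    {"type": "Warning", "reason": "FailedScheduling", "involvedObject": {"kind": "Pod"}},
    {"type": "Warning", "reason": "EvictionThresholdMet", "involvedObject": {"kind": "Node"}},
    {"type": "Normal", "reason": "Scheduled", "involvedObject": {"kind": "Pod"}},
]


def summarize_warnings(events):
    """Count Warning events by reason and by involved object kind."""
    warnings = [ev for ev in events if ev.get("type") == "Warning"]
    by_reason = Counter(ev.get("reason", "") for ev in warnings)
    by_kind = Counter(ev.get("involvedObject", {}).get("kind", "") for ev in warnings)
    return by_reason, by_kind


if __name__ == "__main__":
    by_reason, by_kind = summarize_warnings(SAMPLE_EVENTS)
    for reason, total in by_reason.most_common():
        print(f"{reason}: {total}")
    for kind, total in by_kind.most_common():
        print(f"{kind}: {total}")
```

The two counters correspond to the "warning distribution" and "affected object types" views; a row-level timeline needs the individual records instead.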

Runbook 1: an abnormal node

Filter by the abnormal node name in the event overview. In this example, the result included a node disk-space warning.

The timeline showed that on 2020-11-25, node 172.16.18.13 became abnormal because disk space was insufficient. Kubelet then tried to evict Pods from the node to reclaim disk space.

That sequence gives you a clean next step: check node disk usage, eviction thresholds, and workload placement before treating it as a generic application failure.
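The disk-usage check can be sketched in Python as a comparison against an eviction threshold. The 10% value below is the upstream kubelet default hard-eviction threshold for `nodefs.available`; whether a given TKE node uses that default is an assumption, as is treating `/` as the filesystem that matters. The helper names are hypothetical.

```python
import shutil

# Upstream kubelet default hard-eviction threshold for nodefs.available (10%).
# Assumption: the node uses this default; a node pool may override it.
NODEFS_AVAILABLE_THRESHOLD = 0.10


def available_fraction(total_bytes: int, available_bytes: int) -> float:
    """Return the fraction of the filesystem that is still available."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return available_bytes / total_bytes


def below_eviction_threshold(
    total_bytes: int,
    available_bytes: int,
    threshold: float = NODEFS_AVAILABLE_THRESHOLD,
) -> bool:
    """True when available space has dropped below the threshold."""
    return available_fraction(total_bytes, available_bytes) < threshold


if __name__ == "__main__":
    # Run on the node itself; "/" is an assumed mount point for the check.
    usage = shutil.disk_usage("/")
    fraction = available_fraction(usage.total, usage.free)
    print(f"available: {fraction:.1%}")
    print(f"below threshold: {below_eviction_threshold(usage.total, usage.free)}")
```

If the node reports available space under the configured threshold, the eviction Events are the expected kubelet response rather than an application fault.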

Runbook 2: autoscaler expansion

For node pool autoscaling, query the autoscaler component:

event.source.component:"cluster-autoscaler"

Display these fields:

  • event.reason
  • event.message
  • event.involvedObject.name

Sort by log time descending. The result should work like a compact ledger of autoscaler decisions: workload object, reason, message, and the timestamp of each scaling step.
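The ledger view can be approximated offline once results are exported. The following Python sketch assumes each exported record is a dictionary with a nested `event` object mirroring the `event.*` field names above, plus a `timestamp` string; that record layout, the timestamp format, the sample data, and the `autoscaler_ledger` helper are all assumptions for illustration, not a documented CLS export schema.

```python
from datetime import datetime

# Hypothetical exported records. The "timestamp" key and its format are assumed.
SAMPLE_RECORDS = [
    {
        "timestamp": "2020-01-01 10:00:00",
        "event": {
            "reason": "TriggeredScaleUp",
            "message": "example: pod triggered scale-up",
            "involvedObject": {"name": "app-pod-1"},
            "source": {"component": "cluster-autoscaler"},
        },
    },
    {
        "timestamp": "2020-01-01 10:05:00",
        "event": {
            "reason": "NotTriggerScaleUp",
            "message": "example: max node group size reached",
            "involvedObject": {"name": "app-pod-2"},
            "source": {"component": "cluster-autoscaler"},
        },
    },
    {
        "timestamp": "2020-01-01 10:02:00",
        "event": {
            "reason": "Scheduled",
            "message": "example: unrelated scheduler event",
            "involvedObject": {"name": "app-pod-1"},
            "source": {"component": "default-scheduler"},
        },
    },
]


def autoscaler_ledger(records):
    """Keep cluster-autoscaler events and return them newest first."""
    rows = []
    for rec in records:
        ev = rec.get("event", {})
        if ev.get("source", {}).get("component") != "cluster-autoscaler":
            continue
        rows.append(
            {
                "time": datetime.strptime(rec["timestamp"], "%Y-%m-%d %H:%M:%S"),
                "reason": ev.get("reason", ""),
                "message": ev.get("message", ""),
                "object": ev.get("involvedObject", {}).get("name", ""),
            }
        )
    rows.sort(key=lambda row: row["time"], reverse=True)
    return rows


if __name__ == "__main__":
    for row in autoscaler_ledger(SAMPLE_RECORDS):
        print(row["time"], row["reason"], row["object"], row["message"], sep=" | ")
```

Reading the rows from bottom to top gives the order in which the autoscaler made its decisions.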

The event stream showed scale-out around 2020-11-25 20:35:45, triggered by three nginx Pods:

  • nginx-5dbf784b68-tq8rd
  • nginx-5dbf784b68-fpvbx
  • nginx-5dbf784b68-v9jv5

Three nodes were added. Scale-out then stopped because the node pool had reached its maximum node count.

Checklist

  • Use Events to understand state changes, not only current state.
  • Start with overview dashboards, then filter by object name.
  • For node issues, inspect reason, message, source component, and count.
  • For autoscaling, query cluster-autoscaler and reconstruct the event timeline.
  • Use metrics and logs after Events point you to the right object and time window.
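For the last checklist item, one way to turn Event timestamps into a time window for metric and log queries is to pad the first and last event by a few minutes. This is a small Python sketch under stated assumptions: the `investigation_window` helper, the timestamp format, and the default 10-minute padding are hypothetical choices, not TKE or CLS behavior.

```python
from datetime import datetime, timedelta


def investigation_window(timestamps, padding_minutes: int = 10):
    """Return (start, end) covering all timestamps, padded on both sides."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in timestamps]
    padding = timedelta(minutes=padding_minutes)
    return min(parsed) - padding, max(parsed) + padding


if __name__ == "__main__":
    # Hypothetical event times taken from a filtered search result.
    start, end = investigation_window(
        ["2020-01-01 10:00:00", "2020-01-01 10:05:00"], padding_minutes=5
    )
    print(f"query metrics and logs from {start} to {end}")
```

The padding matters because the cause of an Event, such as disk usage climbing, usually starts before the Event itself is recorded.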

FAQ

Why not only use kubectl describe?

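kubectl describe is useful for one object, but it shows only the Events the API server still retains (one hour by default in upstream Kubernetes). CLS is better when you need searchable history, dashboards, and cross-object analysis.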

What is the fastest autoscaler query?

Start with event.source.component:"cluster-autoscaler" and sort by log time descending.
