Tencent Cloud -Cloud Log Service

Posted on Jun 8

Natural-Language Log Troubleshooting with WorkBuddy and Tencent Cloud CLS

#sre #logging #ai #devops

When an alert fires, engineers often repeat the same workflow: open the console, choose a region, find the log topic, write CQL or SQL, adjust the time range, inspect results, group errors, and then check surrounding context.

The original Tencent Cloud CLS article shows a different interface: WorkBuddy, together with the Tencent Cloud CLS assistant, turns natural-language troubleshooting requests into log searches, statistical analysis, context lookups, and collection pipeline diagnosis.

This post rewrites that workflow as a practical SRE runbook.

What this assistant is trying to replace

Natural-language log troubleshooting is not about replacing observability fundamentals. It is about shortening repetitive incident-response steps:

Engineer goal	Natural-language request	Underlying capability in the source article
Search error logs	Query error logs in the `default-topic` topic in `ap-guangzhou` at 6 PM on April 15	Calls `SearchLog` and uses CQL such as `level:ERROR`
Add filters	Show only timeout errors from `payment-service` in the last 30 minutes	Adds service, error type, and time range filters
Analyze error distribution	Count each error type and group by service	Builds SQL analysis grouped by service
Inspect context	Expand the context around this `DB_CONNECTION_TIMEOUT` log, 2 entries before and after	Calls `DescribeLogContext`
Diagnose collection	Check machine groups and collection configs for this topic	Queries machine groups, agents, configs, and bindings

Scenario 1: search error logs in about 30 seconds

The base request from the source article is:

Query error logs in the default-topic topic in the ap-guangzhou region at 6 PM on April 15.

The assistant calls the CLS SearchLog API and uses CQL to search for error logs:

level:ERROR

The example result includes time, service, message, error or status code, related information, latency, and source machine. The screenshot shows issues such as a payment callback failure in payment-service and upstream service unavailability in api-gateway.
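As a rough illustration of what the assistant has to assemble for this request, here is a minimal sketch that builds a SearchLog-style parameter set. It rests on several assumptions: the year (2025), the UTC+8 time zone, the one-hour window, and the topic ID are all made up for the example. The parameter names follow my reading of the SearchLog documentation and should be checked against the current API reference.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the alert time is expressed in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def build_search_request(topic_id, start, minutes=60, query="level:ERROR", limit=100):
    """Build a SearchLog-style parameter dict for a fixed time window."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    end = start + timedelta(minutes=minutes)
    return {
        "TopicId": topic_id,
        "From": int(start.timestamp() * 1000),  # window start, epoch milliseconds
        "To": int(end.timestamp() * 1000),      # window end, epoch milliseconds
        "Query": query,                         # CQL search condition
        "Limit": limit,
        "Sort": "desc",                         # newest logs first
    }


# Hypothetical topic ID and year; the source only says "6 PM on April 15".
request = build_search_request(
    "example-topic-id", datetime(2025, 4, 15, 18, 0, tzinfo=CST)
)
print(request)
```

The sketch only prepares parameters. Sending them requires a signed call through a Tencent Cloud SDK or the assistant itself.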

More specific requests can narrow the result:

Show only timeout errors from payment-service in the last 30 minutes.

Find all requests with statusCode 500 and sort them by time descending.

This is useful for first response after an alert, quick morning checks, and historical incident review.
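The narrowing requests above boil down to adding conditions to the search string. The sketch below shows one way to compose such a string, assuming the logs are indexed with fields named `level`, `service`, and `statusCode`. Those field names and the helper `build_cql` are hypothetical, not something the source article specifies.

```python
def build_cql(fields=None, keywords=()):
    """Join field filters and free-text keywords into one AND-ed search string."""
    parts = []
    for name, value in (fields or {}).items():
        text = str(value)
        # Quote values containing anything other than letters, digits, or underscores.
        if not text.replace("_", "").isalnum():
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{name}:{text}")
    parts.extend(keywords)
    # With no conditions at all, fall back to matching every log.
    return " AND ".join(parts) if parts else "*"


# "Show only timeout errors from payment-service"
print(build_cql({"level": "ERROR", "service": "payment-service"}, ["timeout"]))
# "Find all requests with statusCode 500"
print(build_cql({"statusCode": 500}))
```

The time range ("last 30 minutes") and the sort order ("time descending") would travel as separate request parameters rather than inside the search string.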

Scenario 2: analyze error distribution

An error list tells you what happened. During troubleshooting, the next questions are usually: which service has the most errors, which error code dominates, and how does the error rate change over time?

The source article uses this request:

Count each type of error and group the result by service.

The example groups log counts by services such as payment-service, api-gateway, order-service, and user-service. It also summarizes that the main issue is concentrated around the payment-service to api-gateway path.

Common analysis requests include:

Analysis goal	Example request
Error category	Group by error code and show the top 10
Time trend	Show the hourly error-rate curve for the last 24 hours
Multi-dimensional analysis	Group by service and error level
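To make the grouping requests concrete, here is a hedged sketch of how they might be turned into a CLS analysis statement, which pairs a search condition with SQL after a pipe character. The field names `service`, `level`, and `errorCode` are assumptions about the log schema, and `build_group_count` is a hypothetical helper.

```python
def build_group_count(search, group_by, top=None):
    """Return a 'search | SQL' statement that counts matching logs per group."""
    if not group_by:
        raise ValueError("group_by needs at least one field")
    for name in group_by:
        # Accept only plain identifiers so field names cannot inject SQL.
        if not name.isidentifier():
            raise ValueError(f"unexpected field name: {name!r}")
    columns = ", ".join(group_by)
    sql = (
        f"SELECT {columns}, COUNT(*) AS error_count "
        f"GROUP BY {columns} ORDER BY error_count DESC"
    )
    if top is not None:
        sql += f" LIMIT {int(top)}"
    return f"{search} | {sql}"


print(build_group_count("level:ERROR", ["service"]))            # errors per service
print(build_group_count("level:ERROR", ["errorCode"], top=10))  # top 10 error codes
print(build_group_count("*", ["service", "level"]))             # service x level
```

The time-trend request is left out on purpose, because bucketing by hour depends on time functions whose exact syntax should be taken from the CLS SQL reference.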

Scenario 3: inspect log context

A single log line rarely explains the whole failure chain. The source article uses this request:

Expand the context around this DB_CONNECTION_TIMEOUT log, 2 entries before and after.

The assistant calls DescribeLogContext and returns logs before and after the target log.

In the example, the context shows:

order-service took 5.2 seconds to process an order and retried twice.
payment-service hit a 30-second payment callback timeout.
api-gateway eventually returned a cascading 502.

This puts the error back into an event sequence, which is usually more helpful than reading the target line alone.
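A context lookup needs a pointer to the target log plus the number of neighbours on each side. The sketch below assembles such a request. The parameter names (`BTime`, `PkgId`, `PkgLogId`, `PrevLogs`, `NextLogs`) reflect my understanding of DescribeLogContext and should be verified against the API reference. Every value in the example is a placeholder.

```python
def build_context_request(topic_id, btime, pkg_id, pkg_log_id, before=2, after=2):
    """Build a DescribeLogContext-style parameter dict around one target log."""
    if before < 0 or after < 0:
        raise ValueError("before and after must be non-negative")
    return {
        "TopicId": topic_id,
        "BTime": btime,          # time of the target log, as a formatted string
        "PkgId": pkg_id,         # package identifier taken from the search hit
        "PkgLogId": pkg_log_id,  # index of the log inside that package
        "PrevLogs": before,      # entries to return before the target
        "NextLogs": after,       # entries to return after the target
    }


# Placeholder identifiers; real values come from the search result
# for the DB_CONNECTION_TIMEOUT entry.
context_request = build_context_request(
    "example-topic-id", "2025-04-15 18:03:12.000", "example-pkg-id", 1
)
print(context_request)
```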

Scenario 4: diagnose the collection pipeline

Sometimes the problem is not inside the logs. The problem is that logs are missing.

The source article uses this request:

Help me check the machine group and collection configuration for this topic.

The assistant then runs several checks:

API	Purpose
`DescribeMachineGroups`	Query machine groups
`DescribeMachines`	Check each machine's agent online status
`DescribeConfigs`	Inspect collection configs and bindings
`DescribeMachineGroupConfigs`	Confirm machine group and config binding

The example diagnosis includes:

Topic: default-topic
Collection configs: 2
Log paths: /data/log/**/1.log, /data/log/**/2.log
Log types: JSON log and minimalist log (single-line full text)
Machine group: default-machine-group
Agent version: 3.6.0
Status: offline
Suggested checks: the loglistener process, network connectivity to the CLS server, and the local logs at /var/log/loglistener/loglistener.log

This is the right workflow when logs stop arriving, collection configs do not take effect, or machine group bindings look suspicious.
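The last step of that diagnosis, turning agent status into suggested checks, can be sketched as plain logic over already-fetched records. The record shape (`Ip`, `Version`, `Status`) and the convention that `Status == 1` means online are assumptions for illustration. The IP address is invented.

```python
def offline_agents(machines):
    """Return the IPs of machines whose agent is not reported as online."""
    # Assumption: Status == 1 means online; any other value counts as offline.
    return [m.get("Ip", "unknown") for m in machines if m.get("Status") != 1]


def suggest_checks(machines):
    """Turn machine status records into a short list of follow-up checks."""
    if not machines:
        return ["Machine group is empty: confirm its IPs or labels."]
    offline = offline_agents(machines)
    if not offline:
        return ["All agents online: review log paths and config bindings next."]
    return [
        f"Agent offline on {ip}: check the loglistener process, network "
        "connectivity to CLS, and /var/log/loglistener/loglistener.log"
        for ip in offline
    ]


# Hypothetical record shaped after the example diagnosis (agent 3.6.0, offline).
machines = [{"Ip": "10.0.0.1", "Version": "3.6.0", "Status": 0}]
for line in suggest_checks(machines):
    print(line)
```

Keeping the decision logic separate from the API calls makes it easy to test against recorded responses.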

The four inputs that make natural-language log search more stable

The original article gives a useful four-part pattern:

Input	Meaning	Example
Region	Cloud region name or code	`ap-guangzhou`
Object	Topic name, service name, or log type	`payment-topic`
Time range	Last hour, today, last 7 days	Last 1 hour
Task	Search, analyze, inspect context, or diagnose	Search error logs

A complete request might look like this:

In the ap-guangzhou region, check error logs from payment-topic in the last hour, then count each error type.

The assistant can also start from a fuzzy request:

Check whether the log topic in Guangzhou has reported errors recently.

According to the source article, the assistant can fill in missing context, for example by listing topics in the region before checking recent errors.
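One way to picture the four-part pattern is as a small record with a check for what is still missing. This is only a sketch of the idea, not how the assistant is implemented. The class `LogRequest` and its field names are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LogRequest:
    """The four inputs from the table above; 'target' stands for the Object row."""
    region: Optional[str] = None
    target: Optional[str] = None
    time_range: Optional[str] = None
    task: Optional[str] = None

    def missing(self):
        """Names of the inputs that still need to be supplied or inferred."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


complete = LogRequest("ap-guangzhou", "payment-topic", "last 1 hour", "search error logs")
# The fuzzy request names a region and a task, but no topic or precise time range.
fuzzy = LogRequest(region="ap-guangzhou", task="search error logs")

print(complete.missing())
print(fuzzy.missing())
```

For the fuzzy request, the two missing inputs are exactly what the assistant would have to fill in, for example by listing topics in the region first.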

Setup workflow

First, open WorkBuddy, go to Skills, search for the Tencent Cloud CLS assistant, and install it.

Second, configure Tencent Cloud credentials. The source article gives a Zsh example for macOS and Linux:

echo 'export TENCENTCLOUD_SECRET_ID="your-secret-id"' >> ~/.zshrc
echo 'export TENCENTCLOUD_SECRET_KEY="your-secret-key"' >> ~/.zshrc
source ~/.zshrc

Third, test with a simple request:

Show my log topics in the Guangzhou region.

If the topic list appears, the assistant is ready for log troubleshooting tasks.
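If the test request fails, a quick first check is whether both credential variables are visible to the process. The snippet below is a small sketch of that check. The helper name `missing_credentials` is made up, while the variable names are the ones from the setup step.

```python
import os

REQUIRED = ("TENCENTCLOUD_SECRET_ID", "TENCENTCLOUD_SECRET_KEY")


def missing_credentials(env=None):
    """Return the required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials({"TENCENTCLOUD_SECRET_ID": "your-secret-id"}))
# → ['TENCENTCLOUD_SECRET_KEY']
```

Calling `missing_credentials()` without arguments inspects the current environment, which is useful after opening a new shell.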

Practical FAQ

Does natural-language troubleshooting remove the need for context?

No. The source article recommends providing at least region, object, time range, and task. More complete input helps the assistant generate a more accurate search or diagnosis.

What should I ask when logs are missing?

Start with the collection path: ask the assistant to check the topic's machine group, agent status, collection configuration, and binding relationship.

What manual actions can this replace?

It can reduce repetitive console actions such as switching region, selecting topics, writing query syntax, changing time ranges, checking context, and inspecting collection configuration.

What is the main value for SRE teams?

The value is speed. The source article frames the improvement as compressing a common troubleshooting loop from 30 minutes to 3 minutes by turning intent into API-backed search, analysis, context lookup, and collection diagnosis.

Final takeaway

Natural-language log troubleshooting works best when it maps directly to reliable platform APIs. In this Tencent Cloud CLS and WorkBuddy workflow, the assistant is useful because it connects user intent to SearchLog, SQL-style analysis, DescribeLogContext, machine group checks, agent status checks, and collection configuration inspection.
