When an alert fires, engineers often repeat the same workflow: open the console, choose a region, find the log topic, write CQL or SQL, adjust the time range, inspect results, group errors, and then check surrounding context.
The original Tencent Cloud CLS article describes a different interface: WorkBuddy with the Tencent Cloud CLS assistant, which turns natural-language troubleshooting requests into log searches, statistical analysis, context lookups, and collection pipeline diagnosis.
This post rewrites that workflow as a practical SRE runbook.
What this assistant is trying to replace
Natural-language log troubleshooting is not about replacing observability fundamentals. It is about shortening repetitive incident-response steps:
| Engineer goal | Natural-language request | Underlying capability in the source article |
|---|---|---|
| Search error logs | Query error logs in the default-topic topic in ap-guangzhou at 6 PM on April 15 | Calls `SearchLog` and uses CQL such as `level:ERROR` |
| Add filters | Show only timeout errors from payment-service in the last 30 minutes | Adds service, error type, and time range filters |
| Analyze error distribution | Count each error type and group by service | Builds SQL analysis grouped by service |
| Inspect context | Expand the context around this DB_CONNECTION_TIMEOUT log, 2 entries before and after | Calls `DescribeLogContext` |
| Diagnose collection | Check machine groups and collection configs for this topic | Queries machine groups, agents, configs, and bindings |
Scenario 1: search error logs in about 30 seconds
The base request from the source article is:
Query error logs in the default-topic topic in the ap-guangzhou region at 6 PM on April 15.
The assistant calls the CLS SearchLog API and uses CQL to search for error logs:
level:ERROR
The example result includes time, service, message, error or status code, related information, latency, and source machine. The screenshot shows issues such as a payment callback failure in payment-service and upstream service unavailability in api-gateway.
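As a rough illustration of what the base request could translate into, here is a minimal Python sketch that builds a SearchLog-style parameter dictionary. It makes several assumptions that are not in the source article: the year (2024), the reading of "6 PM" as the 18:00 to 19:00 window in UTC+8, the placeholder topic ID, and the parameter names (`TopicId`, `From`, `To`, `Query`, `Limit`, `Sort`), which should be verified against the CLS API reference before use. It does not call the API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: times in the request are read as China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def build_search_request(topic_id, year, month, day, hour,
                         query="level:ERROR", limit=100):
    """Build a SearchLog-style parameter dict covering one hour of local time."""
    start = datetime(year, month, day, hour, tzinfo=CST)
    end = start + timedelta(hours=1)
    return {
        "TopicId": topic_id,                     # hypothetical placeholder ID
        "From": int(start.timestamp() * 1000),   # window start, epoch milliseconds
        "To": int(end.timestamp() * 1000),       # window end, epoch milliseconds
        "Query": query,                          # CQL from the source article
        "Limit": limit,
        "Sort": "desc",                          # newest entries first
    }


# The year 2024 is an assumption; the source request does not state one.
request = build_search_request("example-topic-id", 2024, 4, 15, 18)
print(request)
```

Keeping the time-window arithmetic in one place makes it easier to see exactly which hour the assistant would be searching when a request only says "6 PM".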
More specific requests can narrow the result:
Show only timeout errors from payment-service in the last 30 minutes.
Find all requests with statusCode 500 and sort them by time descending.
This is useful for first response after an alert, quick morning checks, and historical incident review.
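The narrower requests above amount to adding field filters to the query. The sketch below shows one way such filters might be combined into a single CQL string. The field names `service` and `message` are hypothetical, since the source article does not list the topic's index fields; `level` and `statusCode` come from the examples above. The time range and sort order would be passed as separate search parameters rather than inside the query string.

```python
def build_cql(filters):
    """Join field:value pairs with AND, quoting values that are not purely alphanumeric."""
    parts = []
    for field, value in filters.items():
        text = str(value)
        if not text.isalnum():
            # Quote values such as payment-service so the hyphen is not parsed as syntax.
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{field}:{text}")
    return " AND ".join(parts)


# "Show only timeout errors from payment-service" (field names are assumptions)
timeout_query = build_cql(
    {"level": "ERROR", "service": "payment-service", "message": "timeout"}
)

# "Find all requests with statusCode 500"
status_query = build_cql({"statusCode": 500})

print(timeout_query)
print(status_query)
```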
Scenario 2: analyze error distribution
An error list tells you what happened. During troubleshooting, the next question is usually: which service has the most errors, which error code dominates, and how does the trend move over time?
The source article uses this request:
Count each type of error and group the result by service.
The example groups log counts by services such as payment-service, api-gateway, order-service, and user-service. It also summarizes that the main issue is concentrated around the payment-service to api-gateway path.
Common analysis requests include:
| Analysis goal | Example request |
|---|---|
| Error category | Group by error code and show the top 10 |
| Time trend | Show the hourly error-rate curve for the last 24 hours |
| Multi-dimensional analysis | Group by service and error level |
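Grouping requests like these could map onto a search condition followed by a SQL statement. The following sketch assumes the pipe-separated `search | SQL` form used for CLS analysis and assumes field names (`service`, `error_code`) that the source article does not confirm. It only builds the statement text.

```python
def build_analysis(search, group_by, top_n=None):
    """Build a 'search | SQL' statement that counts matching logs per group."""
    columns = ", ".join(group_by)
    sql = (
        f"SELECT {columns}, COUNT(*) AS error_count "
        f"GROUP BY {columns} ORDER BY error_count DESC"
    )
    if top_n is not None:
        sql += f" LIMIT {top_n}"
    return f"{search} | {sql}"


# "Count each type of error and group the result by service"
by_service = build_analysis("level:ERROR", ["service"])

# "Group by error code and show the top 10" (error_code is a hypothetical field)
top_codes = build_analysis("level:ERROR", ["error_code"], top_n=10)

# "Group by service and error level"
by_service_and_level = build_analysis("*", ["service", "level"])

print(by_service)
print(top_codes)
print(by_service_and_level)
```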
Scenario 3: inspect log context
A single log line rarely explains the whole failure chain. The source article uses this request:
Expand the context around this DB_CONNECTION_TIMEOUT log, 2 entries before and after.
The assistant calls DescribeLogContext and returns logs before and after the target log.
In the example, the context shows:
- `order-service` took 5.2 seconds to process an order and retried twice.
- `payment-service` hit a 30-second payment callback timeout.
- `api-gateway` eventually returned a cascading 502.
This puts the error back into an event sequence, which is usually more helpful than reading the target line alone.
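To make the "2 entries before and after" idea concrete, here is a small local sketch of the windowing logic. It is not how DescribeLogContext is implemented (the API resolves context server-side); it only mimics the behavior on an in-memory list. The sample log lines are invented for illustration and are not taken from the source article's screenshot.

```python
def context_window(logs, target_index, before=2, after=2):
    """Return the target entry plus up to `before`/`after` neighbors, clamped to the list."""
    start = max(0, target_index - before)
    end = min(len(logs), target_index + after + 1)
    return logs[start:end]


# Invented sample lines, ordered by time.
logs = [
    "order-service INFO order received",
    "order-service WARN slow order processing, retrying",
    "order-service WARN retry 2",
    "order-service ERROR DB_CONNECTION_TIMEOUT",
    "payment-service ERROR payment callback timeout",
    "api-gateway ERROR upstream returned 502",
    "api-gateway INFO health check ok",
]

target = next(i for i, line in enumerate(logs) if "DB_CONNECTION_TIMEOUT" in line)
window = context_window(logs, target)
for line in window:
    print(line)
```

Clamping at the list boundaries matters in practice: a target near the start or end of the available logs simply yields fewer neighbors instead of failing.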
Scenario 4: diagnose the collection pipeline
Sometimes the problem is not inside the logs. The problem is that logs are missing.
The source article uses this request:
Help me check the machine group and collection configuration for this topic.
The assistant then runs several checks:
| API | Purpose |
|---|---|
| `DescribeMachineGroups` | Query machine groups |
| `DescribeMachines` | Check each machine's agent online status |
| `DescribeConfigs` | Inspect collection configs and bindings |
| `DescribeMachineGroupConfigs` | Confirm machine group and config binding |
The example diagnosis includes:
- Topic: `default-topic`
- Collection configs: 2
- Log paths: `/data/log/**/1.log`, `/data/log/**/2.log`
- Log types: JSON log and minimalist log
- Machine group: `default-machine-group`
- Agent version: `3.6.0`
- Status: offline
- Suggested checks: `loglistener` process, network connectivity to the CLS server side, and local logs at `/var/log/loglistener/loglistener.log`
This is the right workflow when logs stop arriving, collection configs do not take effect, or machine group bindings look suspicious.
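The order of checks in this scenario can be sketched as a simple checklist function. The input shapes below (plain lists and dicts, the `ip`, `status`, and `config` keys, the config names, and the IP address) are hypothetical and do not reflect the real API response formats; the function only illustrates how the four results might be combined into findings.

```python
def diagnose_collection(machine_groups, machines, configs, bindings):
    """Return human-readable findings, checked in the same order as the API table."""
    findings = []
    if not machine_groups:
        findings.append("No machine group found for the topic.")
    offline = [m["ip"] for m in machines if m["status"] != "online"]
    if offline:
        findings.append(
            f"Agent offline on: {', '.join(offline)}. "
            "Check the loglistener process and network connectivity."
        )
    if not configs:
        findings.append("No collection config found for the topic.")
    bound = {b["config"] for b in bindings}
    unbound = [c for c in configs if c not in bound]
    if unbound:
        findings.append(f"Configs not bound to a machine group: {', '.join(unbound)}.")
    return findings or ["No collection problem detected."]


# Hypothetical inputs loosely modeled on the example diagnosis above.
findings = diagnose_collection(
    machine_groups=["default-machine-group"],
    machines=[{"ip": "10.0.0.1", "status": "offline"}],
    configs=["json-config", "minimalist-config"],
    bindings=[
        {"config": "json-config", "group": "default-machine-group"},
        {"config": "minimalist-config", "group": "default-machine-group"},
    ],
)
for finding in findings:
    print(finding)
```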
The four inputs that make natural-language log search more stable
The original article gives a useful four-part pattern:
| Input | Meaning | Example |
|---|---|---|
| Region | Cloud region name or code | ap-guangzhou |
| Object | Topic name, service name, or log type | payment-topic |
| Time range | Last hour, today, last 7 days | Last 1 hour |
| Task | Search, analyze, inspect context, or diagnose | Error logs |
A complete request might look like this:
In the ap-guangzhou region, check error logs from payment-topic in the last hour, then count each error type.
The assistant can also start from a fuzzy request:
Check whether the log topic in Guangzhou has reported errors recently.
According to the source article, the assistant can fill in missing context, for example by listing topics in the region before checking recent errors.
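One way to think about the four-part pattern is as a small structure with optional fields, where anything left empty is what the assistant has to infer or ask about. The sketch below is an illustration only; the class and field names are hypothetical, and treating "recently" as a missing time range is a modeling choice, not something the source article specifies.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LogRequest:
    """The four inputs from the table above; None means 'not provided'."""
    region: Optional[str] = None
    target: Optional[str] = None      # the "Object" column: topic, service, or log type
    time_range: Optional[str] = None
    task: Optional[str] = None


def missing_inputs(request):
    """List the inputs the assistant would need to fill in."""
    return [f.name for f in fields(request) if getattr(request, f.name) is None]


complete = LogRequest(
    region="ap-guangzhou",
    target="payment-topic",
    time_range="last 1 hour",
    task="search error logs, then count each error type",
)
fuzzy = LogRequest(region="ap-guangzhou", task="check recent errors")

print(missing_inputs(complete))
print(missing_inputs(fuzzy))
```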
Setup workflow
First, open WorkBuddy, go to Skills, search for the Tencent Cloud CLS assistant, and install it.
Second, configure Tencent Cloud credentials. The source article gives a Zsh example for macOS and Linux:
echo 'export TENCENTCLOUD_SECRET_ID="your-secret-id"' >> ~/.zshrc
echo 'export TENCENTCLOUD_SECRET_KEY="your-secret-key"' >> ~/.zshrc
source ~/.zshrc
Third, test with a simple request:
Show my log topics in the Guangzhou region.
If the topic list appears, the assistant is ready for log troubleshooting tasks.
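If the test request fails, one quick sanity check is whether the credentials are visible to the current process at all. This small script is not part of the source article; it only checks that the two environment variables from the setup step are set and non-empty, and it never prints their values.

```python
import os

REQUIRED = ("TENCENTCLOUD_SECRET_ID", "TENCENTCLOUD_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Both credential variables are set.")
```

A common cause of a missing variable is running the check in a shell session that was opened before the profile file was updated.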
Practical FAQ
Does natural-language troubleshooting remove the need for context?
No. The source article recommends providing at least region, object, time range, and task. More complete input helps the assistant generate a more accurate search or diagnosis.
What should I ask when logs are missing?
Start with the collection path: ask the assistant to check the topic's machine group, agent status, collection configuration, and binding relationship.
What manual actions can this replace?
It can reduce repetitive console actions such as switching region, selecting topics, writing query syntax, changing time ranges, checking context, and inspecting collection configuration.
What is the main value for SRE teams?
The value is speed. The source article frames the improvement as compressing a common troubleshooting loop from 30 minutes to 3 minutes by turning intent into API-backed search, analysis, context lookup, and collection diagnosis.
Final takeaway
Natural-language log troubleshooting works best when it maps directly to reliable platform APIs. In this Tencent Cloud CLS and WorkBuddy workflow, the assistant is useful because it connects user intent to SearchLog, SQL-style analysis, DescribeLogContext, machine group checks, agent status checks, and collection configuration inspection.




