<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tencent Cloud -Cloud Log Service</title>
    <description>The latest articles on DEV Community by Tencent Cloud -Cloud Log Service (@tencentcloud-cls).</description>
    <link>https://dev.to/tencentcloud-cls</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3973507%2F9781ff0f-455c-4728-b0f1-03dd09ad55d4.png</url>
      <title>DEV Community: Tencent Cloud -Cloud Log Service</title>
      <link>https://dev.to/tencentcloud-cls</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tencentcloud-cls"/>
    <language>en</language>
    <item>
      <title>Query CLS Logs from an LLM with CLS MCP Server</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 08 Jun 2026 12:08:02 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/query-cls-logs-from-an-llm-with-cls-mcp-server-3953</link>
      <guid>https://dev.to/tencentcloud-cls/query-cls-logs-from-an-llm-with-cls-mcp-server-3953</guid>
      <description>&lt;p&gt;When log troubleshooting depends on complex query syntax, the slowest step is often turning an operational question into the right query. The source article introduces the Tencent Cloud Log Service (CLS) MCP Server as a way to connect a large language model to CLS log data through the Model Context Protocol.&lt;/p&gt;

&lt;p&gt;The practical goal is simple: an operator asks a natural-language question, the MCP Server generates or helps refine the query, and the query runs against CLS log topics as the data source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the CLS MCP Server is
&lt;/h2&gt;

&lt;p&gt;The source describes a Model Context Protocol (MCP) Server as a lightweight service program built on MCP. It connects an LLM with external resources such as databases, APIs, or files through a standardized interface.&lt;/p&gt;

&lt;p&gt;For CLS, that external resource is log data stored in CLS log topics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core capabilities
&lt;/h2&gt;

&lt;p&gt;The CLS MCP Server in the source article provides three capabilities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Log querying&lt;/td&gt;
&lt;td&gt;Queries log data stored in a CLS log topic according to a query statement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Natural-language query generation&lt;/td&gt;
&lt;td&gt;Turns everyday language into a CLS log query statement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context-aware helpers&lt;/td&gt;
&lt;td&gt;Looks up log topic IDs by topic name, maps region names to region abbreviations, and gets the current timestamp to reduce LLM hallucination during log troubleshooting.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
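
&lt;p&gt;As a rough illustration of why the context-aware helpers matter, the sketch below resolves a region name and a relative time window locally instead of leaving them for the model to guess. The function names and the small region table are assumptions made for this example; they are not the CLS MCP Server's actual tool names.&lt;/p&gt;

```python
import time

# Hypothetical lookup table; only ap-guangzhou appears in the source article.
REGION_CODES = {
    "guangzhou": "ap-guangzhou",
    "shanghai": "ap-shanghai",
}


def region_code(region_name):
    """Map a human-readable region name to its region abbreviation."""
    key = region_name.strip().lower()
    if key not in REGION_CODES:
        raise KeyError("unknown region: " + region_name)
    return REGION_CODES[key]


def last_minutes_window(minutes, now_ms=None):
    """Return (start_ms, end_ms) for the last N minutes, in epoch milliseconds."""
    end_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return end_ms - minutes * 60 * 1000, end_ms
```

&lt;p&gt;Supplying the region code and the timestamps as looked-up facts leaves less for the model to invent, which is the hallucination risk the table refers to.&lt;/p&gt;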

&lt;h2&gt;
  
  
  Typical operations scenarios
&lt;/h2&gt;

&lt;p&gt;The article gives three common use cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Example operator question or workflow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Incident troubleshooting&lt;/td&gt;
&lt;td&gt;Analyze current error logs when a system becomes abnormal, then locate the issue faster.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business insight&lt;/td&gt;
&lt;td&gt;Ask questions such as "today's failed user login count" to understand business state in real time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security analysis&lt;/td&gt;
&lt;td&gt;Ask for suspicious IP access records from the past 24 hours to support security auditing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Setup prerequisites
&lt;/h2&gt;

&lt;p&gt;The source setup starts with two requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install a Node.js runtime.&lt;/li&gt;
&lt;li&gt;Create a Tencent Cloud sub-account, grant the required CLS permissions, and obtain &lt;code&gt;SecretId&lt;/code&gt; and &lt;code&gt;SecretKey&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Use the following permission policy from the source article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"cls:SearchLog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"cls:DescribeTopics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"cls:ChatCompletions"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
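
&lt;p&gt;Before attaching a policy to the sub-account, it can help to confirm that it grants exactly the three actions and nothing more. The following is a minimal sketch, assuming the policy is available as JSON text in the same shape as above; the helper name is hypothetical.&lt;/p&gt;

```python
import json

# The three actions listed in the source policy.
REQUIRED_ACTIONS = {"cls:SearchLog", "cls:DescribeTopics", "cls:ChatCompletions"}


def extra_and_missing_actions(policy_text):
    """Return (extra, missing) action sets for a policy JSON document."""
    policy = json.loads(policy_text)
    granted = set()
    for statement in policy.get("statement", []):
        # Only "allow" statements grant anything.
        if statement.get("effect") == "allow":
            granted.update(statement.get("action", []))
    return granted - REQUIRED_ACTIONS, REQUIRED_ACTIONS - granted
```

&lt;p&gt;Two empty sets mean the policy matches the source article's list.&lt;/p&gt;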



&lt;h2&gt;
  
  
  Add the MCP configuration
&lt;/h2&gt;

&lt;p&gt;The article uses Cherry Studio as the configuration example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wcmta3nf7vh1nic2y0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wcmta3nf7vh1nic2y0i.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use this MCP server configuration and replace the placeholder credentials with your own values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cls-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"isActive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cls-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"registryUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="s2"&gt;"cls-mcp-server"&lt;/span&gt;&lt;span class="w"&gt;
           &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"TENCENTCLOUD_SECRET_ID"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_TENCENT_SECRET_ID"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"TENCENTCLOUD_SECRET_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_TENCENT_SECRET_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"TENCENTCLOUD_API_BASE_HOST"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tencentcloudapi.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"TENCENTCLOUD_REGION"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ap-guangzhou"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"MAX_LENGTH"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"15000"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
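
&lt;p&gt;A common slip is saving the configuration with the placeholder credentials still in place. The sketch below, assuming the configuration is available as JSON text in the shape shown above, lists any &lt;code&gt;env&lt;/code&gt; entries that are empty or still start with &lt;code&gt;YOUR_&lt;/code&gt;. The helper name is hypothetical and is not part of Cherry Studio or the MCP Server.&lt;/p&gt;

```python
import json

# Prefix used by the placeholder values in the sample configuration.
PLACEHOLDER_PREFIX = "YOUR_"


def unreplaced_env_keys(config_text, server_name="cls-mcp-server"):
    """Return env keys whose values are empty or still placeholders."""
    config = json.loads(config_text)
    env = config["mcpServers"][server_name].get("env", {})
    return sorted(
        key for key, value in env.items()
        if not value or value.startswith(PLACEHOLDER_PREFIX)
    )
```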



&lt;p&gt;The important operational detail is that the LLM is not limited to a static prompt. The MCP Server also provides CLS-related helper functions, such as topic lookup and region lookup, that can be called before a log query runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dify Tool vs MCP Server
&lt;/h2&gt;

&lt;p&gt;The source article also compares the existing CLS Dify Tool with the CLS MCP Server.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Dify Tool&lt;/th&gt;
&lt;th&gt;MCP Server&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query logs&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate log query statements from natural language&lt;/td&gt;
&lt;td&gt;Not supported. The article says it depends on prompt-based generation by the LLM, with lower accuracy.&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query log topic ID by log topic name&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query region abbreviation by region name&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get current timestamp&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Reusable implementation checklist
&lt;/h2&gt;

&lt;p&gt;Use this checklist when you adapt the setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the target CLS log topic and region.&lt;/li&gt;
&lt;li&gt;Create or select a Tencent Cloud sub-account.&lt;/li&gt;
&lt;li&gt;Grant only the actions shown in the source policy: &lt;code&gt;cls:SearchLog&lt;/code&gt;, &lt;code&gt;cls:DescribeTopics&lt;/code&gt;, and &lt;code&gt;cls:ChatCompletions&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add the MCP server configuration to the LLM client.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;TENCENTCLOUD_SECRET_ID&lt;/code&gt;, &lt;code&gt;TENCENTCLOUD_SECRET_KEY&lt;/code&gt;, and &lt;code&gt;TENCENTCLOUD_REGION&lt;/code&gt; to your own values.&lt;/li&gt;
&lt;li&gt;Ask a narrow operational question first, such as an error-log query or a failed-login query.&lt;/li&gt;
&lt;li&gt;Check whether the generated query and returned log topic context match the intended target.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What problem does the CLS MCP Server solve?
&lt;/h3&gt;

&lt;p&gt;It helps an LLM understand operational log-analysis requests by connecting the model to CLS log query capability and CLS context helpers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which CLS permissions are required in the source article?
&lt;/h3&gt;

&lt;p&gt;The source policy grants &lt;code&gt;cls:SearchLog&lt;/code&gt;, &lt;code&gt;cls:DescribeTopics&lt;/code&gt;, and &lt;code&gt;cls:ChatCompletions&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is topic and region lookup useful?
&lt;/h3&gt;

&lt;p&gt;The source article says these helpers make it faster to locate the required log topic and reduce LLM hallucination when generating or running log queries.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>ai</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Natural-Language Log Troubleshooting with WorkBuddy and Tencent Cloud CLS</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 08 Jun 2026 08:25:45 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/natural-language-log-troubleshooting-with-workbuddy-and-tencent-cloud-cls-49n6</link>
      <guid>https://dev.to/tencentcloud-cls/natural-language-log-troubleshooting-with-workbuddy-and-tencent-cloud-cls-49n6</guid>
      <description>&lt;p&gt;When an alert fires, engineers often repeat the same workflow: open the console, choose a region, find the log topic, write CQL or SQL, adjust the time range, inspect results, group errors, and then check surrounding context.&lt;/p&gt;

&lt;p&gt;The original Tencent Cloud CLS article shows a different interface: use WorkBuddy with the Tencent Cloud CLS assistant to turn natural-language troubleshooting requests into log search, statistical analysis, context lookup, and collection pipeline diagnosis.&lt;/p&gt;

&lt;p&gt;This post rewrites that workflow as a practical SRE runbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this assistant is trying to replace
&lt;/h2&gt;

&lt;p&gt;Natural-language log troubleshooting is not about replacing observability fundamentals. It is about shortening repetitive incident-response steps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engineer goal&lt;/th&gt;
&lt;th&gt;Natural-language request&lt;/th&gt;
&lt;th&gt;Underlying capability in the source article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search error logs&lt;/td&gt;
&lt;td&gt;Query error logs in the &lt;code&gt;default-topic&lt;/code&gt; topic in &lt;code&gt;ap-guangzhou&lt;/code&gt; at 6 PM on April 15&lt;/td&gt;
&lt;td&gt;Calls &lt;code&gt;SearchLog&lt;/code&gt; and uses CQL such as &lt;code&gt;level:ERROR&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add filters&lt;/td&gt;
&lt;td&gt;Show only timeout errors from &lt;code&gt;payment-service&lt;/code&gt; in the last 30 minutes&lt;/td&gt;
&lt;td&gt;Adds service, error type, and time range filters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analyze error distribution&lt;/td&gt;
&lt;td&gt;Count each error type and group by service&lt;/td&gt;
&lt;td&gt;Builds SQL analysis grouped by service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inspect context&lt;/td&gt;
&lt;td&gt;Expand the context around this &lt;code&gt;DB_CONNECTION_TIMEOUT&lt;/code&gt; log, 2 entries before and after&lt;/td&gt;
&lt;td&gt;Calls &lt;code&gt;DescribeLogContext&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diagnose collection&lt;/td&gt;
&lt;td&gt;Check machine groups and collection configs for this topic&lt;/td&gt;
&lt;td&gt;Queries machine groups, agents, configs, and bindings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Scenario 1: search error logs in about 30 seconds
&lt;/h2&gt;

&lt;p&gt;The base request from the source article is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query error logs in the default-topic topic in the ap-guangzhou region at 6 PM on April 15.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assistant calls the CLS &lt;code&gt;SearchLog&lt;/code&gt; API and uses CQL to search for error logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;ERROR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6z308t36wq41vfxa4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjf6z308t36wq41vfxa4b.png" alt=" " width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example result includes time, service, message, error or status code, related information, latency, and source machine. The screenshot shows issues such as a payment callback failure in &lt;code&gt;payment-service&lt;/code&gt; and upstream service unavailability in &lt;code&gt;api-gateway&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;More specific requests can narrow the result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show only timeout errors from payment-service in the last 30 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Find all requests with statusCode 500 and sort them by time descending.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful for first response after an alert, quick morning checks, and historical incident review.&lt;/p&gt;
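
&lt;p&gt;To make the base request concrete, the sketch below assembles the kind of parameters a &lt;code&gt;SearchLog&lt;/code&gt; call would need for it. Several things here are assumptions: the parameter names follow a reading of the public CLS API and should be checked against the current reference, the topic ID is a made-up placeholder, and because the request gives no year or time zone, 2026 and UTC+8 are assumed.&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Assumption: the operator means China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def search_log_params(topic_id, start, hours=1, query="level:ERROR", limit=100):
    """Build a parameter dict for a single-window error search."""
    end = start + timedelta(hours=hours)
    return {
        "TopicId": topic_id,
        "From": int(start.timestamp() * 1000),  # epoch milliseconds
        "To": int(end.timestamp() * 1000),
        "Query": query,
        "Limit": limit,
        "Sort": "desc",  # newest entries first
    }


# "6 PM on April 15": the year is not stated in the source, so 2026 is assumed.
params = search_log_params(
    "hypothetical-topic-id",
    datetime(2026, 4, 15, 18, 0, tzinfo=CST),
)
```

&lt;p&gt;Seeing the resolved time window and query side by side is also a quick way to check whether the assistant interpreted the request as intended.&lt;/p&gt;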

&lt;h2&gt;
  
  
  Scenario 2: analyze error distribution
&lt;/h2&gt;

&lt;p&gt;An error list tells you what happened. During troubleshooting, the next question is usually: which service has the most errors, which error code dominates, and how does the trend move over time?&lt;/p&gt;

&lt;p&gt;The source article uses this request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Count each type of error and group the result by service.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn46tyzd3tzqhcp62tyuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn46tyzd3tzqhcp62tyuy.png" alt=" " width="800" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example groups log counts by services such as &lt;code&gt;payment-service&lt;/code&gt;, &lt;code&gt;api-gateway&lt;/code&gt;, &lt;code&gt;order-service&lt;/code&gt;, and &lt;code&gt;user-service&lt;/code&gt;. It also summarizes that the main issue is concentrated around the &lt;code&gt;payment-service&lt;/code&gt; to &lt;code&gt;api-gateway&lt;/code&gt; path.&lt;/p&gt;

&lt;p&gt;Common analysis requests include:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Analysis goal&lt;/th&gt;
&lt;th&gt;Example request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Error category&lt;/td&gt;
&lt;td&gt;Group by error code and show the top 10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time trend&lt;/td&gt;
&lt;td&gt;Show the hourly error-rate curve for the last 24 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-dimensional analysis&lt;/td&gt;
&lt;td&gt;Group by service and error level&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
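
&lt;p&gt;For readers less familiar with grouped analysis, the sketch below shows locally what "group by service" and "group by service and error code" compute. It is a plain Python analogy rather than the SQL that CLS runs, and the sample records are invented for illustration.&lt;/p&gt;

```python
from collections import Counter

# Invented sample records for illustration; not data from the source article.
logs = [
    {"service": "payment-service", "error": "TIMEOUT"},
    {"service": "payment-service", "error": "TIMEOUT"},
    {"service": "api-gateway", "error": "UPSTREAM_UNAVAILABLE"},
    {"service": "order-service", "error": "DB_CONNECTION_TIMEOUT"},
]


def count_by(records, *fields):
    """Count records per unique combination of the given fields."""
    return Counter(tuple(record[field] for field in fields) for record in records)


by_service = count_by(logs, "service")
by_service_and_error = count_by(logs, "service", "error")
top = by_service.most_common(1)[0]
```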

&lt;h2&gt;
  
  
  Scenario 3: inspect log context
&lt;/h2&gt;

&lt;p&gt;A single log line rarely explains the whole failure chain. The source article uses this request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expand the context around this DB_CONNECTION_TIMEOUT log, 2 entries before and after.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assistant calls &lt;code&gt;DescribeLogContext&lt;/code&gt; and returns logs before and after the target log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff133jyxscg3vuu0ky8bk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff133jyxscg3vuu0ky8bk.png" alt=" " width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example, the context shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;order-service&lt;/code&gt; took 5.2 seconds to process an order and retried twice.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;payment-service&lt;/code&gt; hit a 30-second payment callback timeout.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;api-gateway&lt;/code&gt; eventually returned a cascading 502.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This puts the error back into an event sequence, which is usually more helpful than reading the target line alone.&lt;/p&gt;
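
&lt;p&gt;The idea of "2 entries before and after" can be sketched as a slice over an ordered list of log lines. This is only a local illustration of the concept, not how &lt;code&gt;DescribeLogContext&lt;/code&gt; is implemented; the function name is hypothetical.&lt;/p&gt;

```python
def context_window(logs, target_index, before=2, after=2):
    """Return the target entry plus its neighbours, clamped to the list bounds."""
    start = max(0, target_index - before)
    end = min(len(logs), target_index + after + 1)
    return logs[start:end]
```

&lt;p&gt;Clamping matters at the edges: a target near the start or end of the stream simply returns fewer neighbours instead of failing.&lt;/p&gt;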

&lt;h2&gt;
  
  
  Scenario 4: diagnose the collection pipeline
&lt;/h2&gt;

&lt;p&gt;Sometimes the problem is not inside the logs. The problem is that logs are missing.&lt;/p&gt;

&lt;p&gt;The source article uses this request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Help me check the machine group and collection configuration for this topic.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assistant then runs several checks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DescribeMachineGroups&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Query machine groups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DescribeMachines&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check each machine's agent online status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DescribeConfigs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inspect collection configs and bindings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DescribeMachineGroupConfigs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Confirm machine group and config binding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneumdzfys62pku0nadtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneumdzfys62pku0nadtt.png" alt=" " width="800" height="806"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example diagnosis includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topic: &lt;code&gt;default-topic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Collection configs: 2&lt;/li&gt;
&lt;li&gt;Log paths: &lt;code&gt;/data/log/**/1.log&lt;/code&gt;, &lt;code&gt;/data/log/**/2.log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Log types: JSON log and minimalist log&lt;/li&gt;
&lt;li&gt;Machine group: &lt;code&gt;default-machine-group&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Agent version: &lt;code&gt;3.6.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Status: offline&lt;/li&gt;
&lt;li&gt;Suggested checks: &lt;code&gt;loglistener&lt;/code&gt; process, network connectivity to the CLS server side, and local logs at &lt;code&gt;/var/log/loglistener/loglistener.log&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the right workflow when logs stop arriving, collection configs do not take effect, or machine group bindings look suspicious.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four inputs that make natural-language log search more stable
&lt;/h2&gt;

&lt;p&gt;The original article gives a useful four-part pattern:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Region&lt;/td&gt;
&lt;td&gt;Cloud region name or code&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ap-guangzhou&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Object&lt;/td&gt;
&lt;td&gt;Topic name, service name, or log type&lt;/td&gt;
&lt;td&gt;&lt;code&gt;payment-topic&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time range&lt;/td&gt;
&lt;td&gt;Last hour, today, last 7 days&lt;/td&gt;
&lt;td&gt;Last 1 hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task&lt;/td&gt;
&lt;td&gt;Search, analyze, inspect context, or diagnose&lt;/td&gt;
&lt;td&gt;Error logs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A complete request might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In the ap-guangzhou region, check error logs from payment-topic in the last hour, then count each error type.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The assistant can also start from a fuzzy request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Check whether the log topic in Guangzhou has reported errors recently.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;According to the source article, the assistant can fill in missing context, for example by listing topics in the region before checking recent errors.&lt;/p&gt;
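
&lt;p&gt;The four-part pattern can also be enforced before a request is sent. The sketch below is one possible way to do that, using a hypothetical &lt;code&gt;TroubleshootingRequest&lt;/code&gt; class that reports which inputs are missing and only builds the request text when all four are present.&lt;/p&gt;

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TroubleshootingRequest:
    region: Optional[str] = None
    target: Optional[str] = None      # the "Object" row: topic, service, or log type
    time_range: Optional[str] = None
    task: Optional[str] = None

    def missing(self):
        """List the inputs that have not been provided."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def to_prompt(self):
        """Build the request text, refusing if any input is missing."""
        gaps = self.missing()
        if gaps:
            raise ValueError("missing inputs: " + ", ".join(gaps))
        return "In the {} region, {} from {} in the {}.".format(
            self.region, self.task, self.target, self.time_range
        )


full = TroubleshootingRequest("ap-guangzhou", "payment-topic", "last hour", "check error logs")
fuzzy = TroubleshootingRequest(region="ap-guangzhou", task="check error logs")
```

&lt;p&gt;Here &lt;code&gt;fuzzy.missing()&lt;/code&gt; names the gaps an operator could fill in up front instead of leaving them for the assistant to resolve.&lt;/p&gt;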

&lt;h2&gt;
  
  
  Setup workflow
&lt;/h2&gt;

&lt;p&gt;First, open WorkBuddy, go to Skills, search for the Tencent Cloud CLS assistant, and install it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi73ftenxle2hnsr8xo9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi73ftenxle2hnsr8xo9i.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, configure Tencent Cloud credentials. The source article gives a Zsh example for macOS and Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export TENCENTCLOUD_SECRET_ID="your-secret-id"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export TENCENTCLOUD_SECRET_KEY="your-secret-key"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
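
&lt;p&gt;If the assistant later reports an authentication problem, a quick first check is whether both variables are visible to the process. This is a small sketch with a hypothetical helper name; it only confirms that the variables are set and does not validate them against Tencent Cloud.&lt;/p&gt;

```python
import os

# The two variables exported in the shell example.
REQUIRED = ("TENCENTCLOUD_SECRET_ID", "TENCENTCLOUD_SECRET_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

&lt;p&gt;Calling &lt;code&gt;missing_credentials()&lt;/code&gt; with no arguments inspects the current environment; an empty list means both variables are set.&lt;/p&gt;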



&lt;p&gt;Third, test with a simple request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show my log topics in the Guangzhou region.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the topic list appears, the assistant is ready for log troubleshooting tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does natural-language troubleshooting remove the need for context?
&lt;/h3&gt;

&lt;p&gt;No. The source article recommends providing at least region, object, time range, and task. More complete input helps the assistant generate a more accurate search or diagnosis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should I ask when logs are missing?
&lt;/h3&gt;

&lt;p&gt;Start with the collection path: ask the assistant to check the topic's machine group, agent status, collection configuration, and binding relationship.&lt;/p&gt;

&lt;h3&gt;
  
  
  What manual actions can this replace?
&lt;/h3&gt;

&lt;p&gt;It can reduce repetitive console actions such as switching region, selecting topics, writing query syntax, changing time ranges, checking context, and inspecting collection configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the main value for SRE teams?
&lt;/h3&gt;

&lt;p&gt;The value is speed. The source article frames the improvement as compressing a common troubleshooting loop from 30 minutes to 3 minutes by turning intent into API-backed search, analysis, context lookup, and collection diagnosis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;Natural-language log troubleshooting works best when it maps directly to reliable platform APIs. In this Tencent Cloud CLS and WorkBuddy workflow, the assistant is useful because it connects user intent to &lt;code&gt;SearchLog&lt;/code&gt;, SQL-style analysis, &lt;code&gt;DescribeLogContext&lt;/code&gt;, machine group checks, agent status checks, and collection configuration inspection.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>logging</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Log Collection in Practice: stdout, File Logs, CRDs, Audit Logs, and Tencent Cloud CLS</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 08 Jun 2026 08:08:19 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/kubernetes-log-collection-in-practice-stdout-file-logs-crds-audit-logs-and-tencent-cloud-cls-49j3</link>
      <guid>https://dev.to/tencentcloud-cls/kubernetes-log-collection-in-practice-stdout-file-logs-crds-audit-logs-and-tencent-cloud-cls-49j3</guid>
      <description>&lt;p&gt;Kubernetes log collection looks simple until production starts moving. Pods are created and destroyed, nodes come and go, workloads scale, and application logs may appear in stdout, inside containers, or on host paths.&lt;/p&gt;

&lt;p&gt;This article adapts a Tencent Cloud CLS practice article into a technical guide. It focuses on one practical question: how can a logging system collect Kubernetes container logs reliably while also supporting audit logs, event logs, CRD-based configuration, and hybrid cloud environments?&lt;/p&gt;

&lt;p&gt;Tencent Cloud Log Service, also known as CLS, approaches this with a Log-agent running as a DaemonSet, LogListener for real-time collection, and LogConfig CRDs for Kubernetes-native configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Kubernetes log collection harder than VM log collection?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Challenge&lt;/th&gt;
&lt;th&gt;Why it happens in Kubernetes&lt;/th&gt;
&lt;th&gt;What the logging system needs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multiple log forms&lt;/td&gt;
&lt;td&gt;Kubernetes has business logs, audit logs, and event logs. Business logs may be stdout or file-based.&lt;/td&gt;
&lt;td&gt;Multiple input types and parsing methods&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic environment&lt;/td&gt;
&lt;td&gt;Pods are created, destroyed, scaled, and rescheduled frequently.&lt;/td&gt;
&lt;td&gt;Dynamic discovery of Pods, containers, and log paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reliability pressure&lt;/td&gt;
&lt;td&gt;Machines, nodes, and Pods may disappear while logs are still being written.&lt;/td&gt;
&lt;td&gt;Real-time collection and reduced loss after file deletion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High collection density&lt;/td&gt;
&lt;td&gt;One node may contain dozens or hundreds of containers.&lt;/td&gt;
&lt;td&gt;High-performance collection with multi-threading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DevOps integration&lt;/td&gt;
&lt;td&gt;Log rules should move with Kubernetes automation.&lt;/td&gt;
&lt;td&gt;Declarative configuration through CRDs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The main point: Kubernetes log collection is not only file tailing. It is dynamic discovery plus reliable collection plus cloud-native configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The CLS container log collection architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994v10n5p7m567qhc7rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F994v10n5p7m567qhc7rb.png" alt=" " width="800" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the CLS architecture, the Log-agent runs inside the Kubernetes cluster as a DaemonSet. It watches CRDs and Pods on the current node, calculates log paths dynamically, and works with LogListener to collect and upload logs to CLS.&lt;/p&gt;

&lt;p&gt;The core components are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLS server side&lt;/td&gt;
&lt;td&gt;Creates machine groups and maintains log topics and collection configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Log-agent&lt;/td&gt;
&lt;td&gt;Watches cluster CRDs and local-node Pods, calculates collection paths, and syncs configuration changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LogListener&lt;/td&gt;
&lt;td&gt;Tencent Cloud CLS collection client that tails, parses, and uploads logs in real time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The operating flow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a machine group in CLS for the containers that need log collection.&lt;/li&gt;
&lt;li&gt;Generate LogListener configuration from the collection rule.&lt;/li&gt;
&lt;li&gt;Start LogListener and pull collection configuration from CLS.&lt;/li&gt;
&lt;li&gt;Create or update LogConfig CRDs as workloads change.&lt;/li&gt;
&lt;li&gt;Let Log-agent react to CRD changes and update the corresponding CLS log topic configuration.&lt;/li&gt;
&lt;li&gt;Let Log-agent watch CRDs and local Pods, then calculate updated log paths.&lt;/li&gt;
&lt;li&gt;Let LogListener collect, parse, and upload the matched logs.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What can be collected from TKE, EKS, and self-managed Kubernetes?
&lt;/h2&gt;

&lt;p&gt;CLS supports Tencent Kubernetes Engine (TKE), Elastic Kubernetes Service (EKS), and self-managed Kubernetes clusters. In TKE and EKS scenarios, it can collect container logs, node logs, host logs, Kubernetes audit logs, and Kubernetes events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek0n67aejkfpdc1hmswe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek0n67aejkfpdc1hmswe.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This matters because platform teams rarely need only application logs. During an incident, they may need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business logs from containers.&lt;/li&gt;
&lt;li&gt;Node or host logs.&lt;/li&gt;
&lt;li&gt;Audit records from kube-apiserver.&lt;/li&gt;
&lt;li&gt;Kubernetes Events that explain scheduling and resource changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Three ways to collect business logs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. stdout logs
&lt;/h3&gt;

&lt;p&gt;stdout collection works when the application writes logs to standard output and the container runtime, such as Docker or containerd, manages the log stream. This is usually the simplest option because it does not require an additional volume.&lt;/p&gt;

&lt;p&gt;Use stdout when new services already follow container-native logging conventions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Files inside containers
&lt;/h3&gt;

&lt;p&gt;Some applications still write logs to files inside the container. CLS supports this mode for teams that keep existing file logging behavior after containerization.&lt;/p&gt;

&lt;p&gt;Use container file collection when migration cost is a concern and the application has not moved to stdout yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Host files
&lt;/h3&gt;

&lt;p&gt;If the team wants raw log files to survive container termination, the application can write logs to a hostPath-mounted directory. CLS can then collect host files from the node.&lt;/p&gt;

&lt;p&gt;Use host file collection when log preservation after container shutdown is more important than strict container-native logging.&lt;/p&gt;
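
&lt;p&gt;As a rough sketch of the host file pattern, the Pod below mounts a node directory with a standard Kubernetes &lt;code&gt;hostPath&lt;/code&gt; volume so that log files stay on the node after the container exits. The Pod name, image, label, and the &lt;code&gt;/opt/logs&lt;/code&gt; path are hypothetical placeholders, not values taken from the source article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: the name, image, label, and paths are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
  labels:
    k8s-app: sample-app
spec:
  containers:
    - name: app
      image: registry.example.com/sample-app:1.0
      volumeMounts:
        # The application writes its log files into this directory.
        - name: app-logs
          mountPath: /opt/logs
  volumes:
    # hostPath keeps the files on the node after the container exits.
    - name: app-logs
      hostPath:
        path: /opt/logs
        type: DirectoryOrCreate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because every Pod scheduled to the same node shares this directory, file names or subdirectories usually need to be unique per workload, and someone has to own cleanup on the node.&lt;/p&gt;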

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Collection type&lt;/th&gt;
&lt;th&gt;Best fit&lt;/th&gt;
&lt;th&gt;Main consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;stdout&lt;/td&gt;
&lt;td&gt;New or standardized container apps&lt;/td&gt;
&lt;td&gt;Simplest path, runtime-managed logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;container file&lt;/td&gt;
&lt;td&gt;Existing apps that still write local files&lt;/td&gt;
&lt;td&gt;Watch for loss when Pods disappear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;host file&lt;/td&gt;
&lt;td&gt;Apps that need files retained on the node&lt;/td&gt;
&lt;td&gt;Requires path discipline and node-level cleanup rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Configure collection rules from the console
&lt;/h2&gt;

&lt;p&gt;CLS supports console-based collection rule configuration. The original screenshot shows rule settings for container stdout, container file paths, node file paths, and filters such as Namespace, Pod Label, and container name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk6a1jgalvtxv2lr9miz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk6a1jgalvtxv2lr9miz.png" alt=" " width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Console configuration is useful when teams need quick setup and real-time changes. For GitOps-style operations, CRDs are usually a better long-term fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use LogConfig CRD for declarative log collection
&lt;/h2&gt;

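&lt;p&gt;The source article shows a LogConfig CRD as a screenshot. It is reproduced here as YAML text so that it can be searched and copied. The example lists all three input blocks side by side for reference, while &lt;code&gt;inputDetail.type&lt;/code&gt; names the collection type in use:&lt;/p&gt;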

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cls.cloud.tencent.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LogConfig&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;clsDetail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# After a topic is specified, it cannot be modified.&lt;/span&gt;
    &lt;span class="c1"&gt;# When creating a topic automatically, specify both logset and topic names.&lt;/span&gt;
    &lt;span class="na"&gt;logsetName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;topicName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;

    &lt;span class="c1"&gt;# Use an existing topic when topicId is provided.&lt;/span&gt;
    &lt;span class="na"&gt;topicId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxxx-xx-xx-xx-xxxxxxxx&lt;/span&gt;
    &lt;span class="na"&gt;logType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;minimalist_log&lt;/span&gt;
    &lt;span class="na"&gt;extractRule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="s"&gt;...&lt;/span&gt;

  &lt;span class="na"&gt;inputDetail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;container_stdout&lt;/span&gt;

    &lt;span class="na"&gt;containerStdout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="na"&gt;allContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxx&lt;/span&gt;
      &lt;span class="na"&gt;includeLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxx&lt;/span&gt;
      &lt;span class="na"&gt;workloads&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-app&lt;/span&gt;
          &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment&lt;/span&gt;
          &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxx&lt;/span&gt;

    &lt;span class="na"&gt;containerFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxx&lt;/span&gt;
      &lt;span class="na"&gt;includeLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;k8s-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xxx&lt;/span&gt;
      &lt;span class="na"&gt;workload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-app&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deployment&lt;/span&gt;
      &lt;span class="na"&gt;logPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/logs&lt;/span&gt;
      &lt;span class="na"&gt;filePattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app_*.log&lt;/span&gt;

    &lt;span class="na"&gt;hostFile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;logPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/logs&lt;/span&gt;
      &lt;span class="na"&gt;filePattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app_*.log&lt;/span&gt;

    &lt;span class="na"&gt;customLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;k1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important fields:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;clsDetail&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CLS logset, topic, log type, and extraction rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inputDetail.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Collection type, such as &lt;code&gt;container_stdout&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;containerStdout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stdout collection configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;containerFile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;File collection inside containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;hostFile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;File collection on the host node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;container&lt;/code&gt;, &lt;code&gt;includeLabels&lt;/code&gt;, &lt;code&gt;workloads&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Scope filters for matching logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;logPath&lt;/code&gt;, &lt;code&gt;filePattern&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;File path and filename matching rules&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
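
&lt;p&gt;To make the field roles concrete, here is a trimmed variant that keeps only the stdout path. It reuses field names from the example above. The resource name, logset and topic names, and the label value are hypothetical, and whether this reduced field set is sufficient for a given CLS version is an assumption that should be checked against the official LogConfig reference.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: a stdout-only LogConfig built from the fields shown above.
apiVersion: cls.cloud.tencent.com/v1
kind: LogConfig
metadata:
  name: sample-app-stdout          # hypothetical name
spec:
  clsDetail:
    logsetName: sample-logset      # hypothetical
    topicName: sample-topic        # hypothetical
    logType: minimalist_log
  inputDetail:
    type: container_stdout
    containerStdout:
      namespace: default
      allContainers: false
      includeLabels:
        k8s-app: sample-app        # hypothetical label value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping one collection type per LogConfig object tends to make rules easier to review in Git, since each object maps to one workload and one topic.&lt;/p&gt;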

&lt;h2&gt;
  
  
  Use audit logs to answer "who changed what?"
&lt;/h2&gt;

&lt;p&gt;Kubernetes audit logs come from the auditing feature of kube-apiserver. They record requests to kube-apiserver as JSON log entries, and an audit policy controls which requests are recorded and in how much detail. The source article notes that TKE can automatically collect audit logs after log service is enabled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgbicoemy1h0cdrxhmlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgbicoemy1h0cdrxhmlw.png" alt=" " width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Audit dashboards are useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User operation counts.&lt;/li&gt;
&lt;li&gt;CRUD operation distribution.&lt;/li&gt;
&lt;li&gt;Resource type distribution.&lt;/li&gt;
&lt;li&gt;Active nodes.&lt;/li&gt;
&lt;li&gt;Abnormal access.&lt;/li&gt;
&lt;li&gt;Operation trends.&lt;/li&gt;
&lt;li&gt;Detailed operation lists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These logs are well suited for questions like: who changed a resource, which resources were modified frequently, and whether abnormal access appeared in the cluster.&lt;/p&gt;
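
&lt;p&gt;For readers who run their own control plane, the policy below is a small sketch of what "policy-controlled" means in upstream Kubernetes. It uses the standard &lt;code&gt;audit.k8s.io/v1&lt;/code&gt; Policy format. The choice of resources and levels is illustrative only and is not taken from the source article. On a managed service such as TKE, the audit policy may be managed by the platform instead.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: an illustrative upstream audit policy, not a CLS or TKE default.
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage to avoid a duplicate entry per request.
omitStages:
  - "RequestReceived"
rules:
  # Record only metadata for Secrets and ConfigMaps so payloads stay out of the log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record request bodies for changes to workloads, to answer "who changed what".
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets"]
  # Everything else: metadata only.
  - level: Metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rules are evaluated in order and the first match wins, so the catch-all rule belongs at the end.&lt;/p&gt;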

&lt;h2&gt;
  
  
  Use event logs to diagnose scheduling and resource issues
&lt;/h2&gt;

&lt;p&gt;Kubernetes Events record cluster runtime activity and resource scheduling status. The source article notes that after log service is enabled in TKE, event logs are reported to CLS by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcqeaxohgmk1ulgn6gtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcqeaxohgmk1ulgn6gtj.png" alt=" " width="800" height="926"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Event dashboards can help teams inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster event type distribution.&lt;/li&gt;
&lt;li&gt;Abnormal event reasons.&lt;/li&gt;
&lt;li&gt;Node anomalies.&lt;/li&gt;
&lt;li&gt;Abnormal event trends.&lt;/li&gt;
&lt;li&gt;Abnormal event lists.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These logs are useful for diagnosing Pending Pods, image pull failures, node issues, and resource state changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-managed Kubernetes and hybrid cloud access
&lt;/h2&gt;

&lt;p&gt;CLS can also work with Kubernetes clusters outside TKE and EKS, including self-managed clusters and other cloud environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj25ualiarljcbg54ayn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj25ualiarljcbg54ayn5.png" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The source article gives a four-step access path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install LogListener in the self-managed Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Define the LogConfig resource type (the CustomResourceDefinition).&lt;/li&gt;
&lt;li&gt;Define the LogConfig object in a YAML manifest.&lt;/li&gt;
&lt;li&gt;Create the LogConfig object in the cluster from that manifest.&lt;/li&gt;
&lt;/ol&gt;
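
&lt;p&gt;Step 2 registers LogConfig with the cluster as a custom resource type. The manifest below is a placeholder sketch of what such a registration looks like with the standard &lt;code&gt;apiextensions.k8s.io/v1&lt;/code&gt; API. The group, version, and kind follow the &lt;code&gt;apiVersion&lt;/code&gt; and &lt;code&gt;kind&lt;/code&gt; used in the LogConfig example. The scope, the plural and singular names, and the open schema are assumptions. In practice, use the CRD manifest published in the Tencent Cloud documentation rather than this sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: a placeholder CRD, not the manifest shipped by Tencent Cloud.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # Kubernetes requires this name to be the plural name plus the group.
  name: logconfigs.cls.cloud.tencent.com
spec:
  group: cls.cloud.tencent.com
  scope: Cluster                   # assumption
  names:
    kind: LogConfig
    plural: logconfigs             # assumption
    singular: logconfig            # assumption
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # Placeholder schema that accepts any fields.
          x-kubernetes-preserve-unknown-fields: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;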

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1walamq6in03mxzd9usx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1walamq6in03mxzd9usx.png" alt=" " width="800" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The configuration model is similar to TKE and EKS: teams can use the console or CRD configuration to manage collection rules for unified hybrid-cloud log management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why run the log collection agent as a DaemonSet?
&lt;/h3&gt;

&lt;p&gt;A DaemonSet keeps one agent on each node. That is a natural fit when the agent needs to watch local Pods and calculate node-local log paths.&lt;/p&gt;
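
&lt;p&gt;The general pattern can be sketched with a plain Kubernetes DaemonSet. This is not the actual CLS Log-agent manifest: the names and image are hypothetical, and the only mount shown is &lt;code&gt;/var/log/pods&lt;/code&gt;, the node directory where the kubelet places container log files.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: a generic node-level log agent, not the CLS manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent-sketch           # hypothetical
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: log-agent-sketch
  template:
    metadata:
      labels:
        app: log-agent-sketch
    spec:
      tolerations:
        # Also schedule onto tainted nodes so no node is left without an agent.
        - operator: Exists
      containers:
        - name: agent
          image: registry.example.com/log-agent:placeholder   # hypothetical image
          volumeMounts:
            - name: pod-logs
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        # Node-local directory where the kubelet places container log files.
        - name: pod-logs
          hostPath:
            path: /var/log/pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When a node joins the cluster, the DaemonSet controller schedules an agent Pod onto it automatically, which is what keeps collection coverage in step with node churn.&lt;/p&gt;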

&lt;h3&gt;
  
  
  How can a logging system reduce log loss after Pod deletion?
&lt;/h3&gt;

&lt;p&gt;The source article states that LogListener pre-reads files during collection, so it can keep collecting even after a file has been deleted. This reduces the risk of losing logs when Pods are destroyed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should Kubernetes applications write logs to stdout or files?
&lt;/h3&gt;

&lt;p&gt;If the application already writes logs to stdout, stdout collection is usually the simplest option. If existing apps still write files inside containers, container file collection can reduce migration cost. If raw files must remain after container termination, host file collection is the better fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the value of LogConfig CRD?
&lt;/h3&gt;

&lt;p&gt;LogConfig makes log collection rules declarative and Kubernetes-native. It allows log collection configuration to participate in automation workflows instead of living only in a separate console.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;Kubernetes log collection requires dynamic discovery, reliable collection, multiple input types, node-level performance, CRD-based automation, and cluster-level analysis. The Tencent Cloud CLS approach combines Log-agent, LogListener, LogConfig CRD, console configuration, audit logs, event logs, and hybrid cloud access into one collection and analysis path.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>logging</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Cloud-Native Logging Platform Modernization: What Tencent Cloud CLS Learned from Full Containerization</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 08 Jun 2026 07:55:26 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/cloud-native-logging-platform-modernization-what-tencent-cloud-cls-learned-from-full-pdc</link>
      <guid>https://dev.to/tencentcloud-cls/cloud-native-logging-platform-modernization-what-tencent-cloud-cls-learned-from-full-pdc</guid>
      <description>&lt;p&gt;Large-scale logging platforms are not ordinary web services. They absorb traffic spikes, support real-time search, feed alerting workflows, and often become the first place engineers look when production starts to shake.&lt;/p&gt;

&lt;p&gt;Tencent Cloud Log Service, also known as CLS, went through a full containerization and cloud-native modernization project for exactly that reason. The original article describes a platform that had grown from tens of millions of log records per day to the tens-of-trillions level, while also supporting search and analysis that return within seconds over very large data volumes. In some scenarios, the service needed to handle hundreds of thousands of QPS, sustain GB/s-level log ingestion, and keep log-to-search latency under 3 seconds.&lt;/p&gt;

&lt;p&gt;This post rewrites that story for engineers who are asking a more general question: how should a logging platform move from physical machines and virtual machines to Kubernetes without losing stability?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lqq6lp2094c2z3pj7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lqq6lp2094c2z3pj7c.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem is not packaging services into containers
&lt;/h2&gt;

&lt;p&gt;For a platform service, full containerization is a system redesign. The old CLS architecture faced several connected problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure was fragmented across physical machines, virtual machines, local IDC environments, and cloud environments.&lt;/li&gt;
&lt;li&gt;Capacity expansion was slow, especially during sudden traffic growth.&lt;/li&gt;
&lt;li&gt;Stateful services were harder to scale, replace, and recover.&lt;/li&gt;
&lt;li&gt;Configuration copies scattered across systems created configuration drift.&lt;/li&gt;
&lt;li&gt;Release and rollback required a safer migration strategy.&lt;/li&gt;
&lt;li&gt;Traffic protection had to cover both external access and internal dependencies.&lt;/li&gt;
&lt;li&gt;Observability could not depend on customer feedback as the first signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the modernization work covered infrastructure, application state, configuration governance, canary migration, HPA-based scaling, traffic protection, end-to-end observability, and CI/CD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddw1bwvzn4nnlnih826h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddw1bwvzn4nnlnih826h.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Move infrastructure toward the Kubernetes operating model
&lt;/h2&gt;

&lt;p&gt;In 2021, most CLS resources still ran on physical machines and virtual machines. This created different operating environments, longer expansion time, higher resource reservation cost, and inconsistent monitoring and alerting across local IDC and cloud systems.&lt;/p&gt;

&lt;p&gt;The migration path in the original article can be understood in three stages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Operating model&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Server mode&lt;/td&gt;
&lt;td&gt;Multiple business processes run on one physical or virtual server&lt;/td&gt;
&lt;td&gt;Simple to understand, but slow to scale and hard to standardize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rich container mode&lt;/td&gt;
&lt;td&gt;One container runs multiple processes and may start systemd-like process management inside the container&lt;/td&gt;
&lt;td&gt;Useful as a transition path because business and operations code can move faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sidecar container mode&lt;/td&gt;
&lt;td&gt;One container usually owns one process, and one application can be composed from multiple containers&lt;/td&gt;
&lt;td&gt;Closer to Kubernetes-native operations and lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rich containers helped CLS move from CVM-style operations into containers quickly, but the long-term target was the Kubernetes model: one container, one process, and independent lifecycle management.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve9xjx19oowwmio78h9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve9xjx19oowwmio78h9m.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Turn stateful services into stateless or near-stateless services
&lt;/h2&gt;

&lt;p&gt;The original CLS practice emphasizes a clear target: make as many applications stateless as possible.&lt;/p&gt;

&lt;p&gt;A stateless service instance can be scaled out, restarted, deleted, or replaced without binding a user request to a specific instance. Stateful services, by contrast, usually store session or business state locally, which makes scaling and failover harder.&lt;/p&gt;

&lt;p&gt;The article describes two practical directions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let multiple instances synchronize data so that any instance can be replaced by another equivalent instance.&lt;/li&gt;
&lt;li&gt;Move state into centralized storage, then let service instances pull data into a local cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a logging platform, this matters because ingestion, search, and internal control-plane services must tolerate instance churn during scaling and release operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvhac98oappwvceohn0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvhac98oappwvceohn0z.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Treat configuration as a governed system
&lt;/h2&gt;

&lt;p&gt;Cloud-native systems create a lot of configuration: routing, ports, load balancing, database settings, service middleware, deployment metadata, and feature behavior. If teams copy configuration into several places, configuration drift becomes almost inevitable.&lt;/p&gt;

&lt;p&gt;The CLS modernization approach included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single trusted source for configuration.&lt;/li&gt;
&lt;li&gt;Change history that records who changed what and why.&lt;/li&gt;
&lt;li&gt;Variables and generated configuration files to reduce manual copies.&lt;/li&gt;
&lt;li&gt;CI/CD pipelines that deploy configuration together with code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important lessons of the project: cloud-native modernization fails when configuration management remains pre-cloud-native.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn4zem66bn9cngsqr547.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn4zem66bn9cngsqr547.png" alt=" " width="799" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Use canary rollout when replacing the old architecture
&lt;/h2&gt;

&lt;p&gt;The architecture upgrade was designed to be invisible to customers. CLS used a canary strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with smaller regions and a subset of customers.&lt;/li&gt;
&lt;li&gt;Gradually switch more traffic after validation.&lt;/li&gt;
&lt;li&gt;Keep the old service for 2 weeks after the new service takes over.&lt;/li&gt;
&lt;li&gt;Keep rollback ready so traffic can be switched back if needed.&lt;/li&gt;
&lt;li&gt;Prepare compatibility checks, upgrade plans, and validation mechanisms before migration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the safer pattern for platform teams: do not treat migration as one big release. Treat it as a controlled traffic movement with observability and rollback.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8b1scnhq3piqwy6yhvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8b1scnhq3piqwy6yhvn.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Design HPA for sudden traffic, cost, and stability
&lt;/h2&gt;

&lt;p&gt;Logging traffic can be bursty and periodic at the same time. If a team reserves too much capacity up front, it wastes resources. If it scales in too fast, CPU utilization can climb again and trigger a new scaling cycle.&lt;/p&gt;

&lt;p&gt;The CLS approach can be summarized as three goals:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;What HPA needs to support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Absorb sudden traffic&lt;/td&gt;
&lt;td&gt;Scale out quickly beyond the normal traffic baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduce cost&lt;/td&gt;
&lt;td&gt;Avoid keeping peak capacity online all the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preserve stability&lt;/td&gt;
&lt;td&gt;Coordinate scaling across upstream and downstream services, and support custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original article highlights one practical rule: scale out fast, scale in slowly. That prevents short CPU fluctuations from creating unstable expansion and contraction loops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6yqoq680ewqmuu139i9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6yqoq680ewqmuu139i9.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Build traffic protection across the whole request path
&lt;/h2&gt;

&lt;p&gt;A logging platform receives traffic from many customers and also depends on internal systems. CLS combined several protection patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local client buffering.&lt;/li&gt;
&lt;li&gt;Backoff and retry.&lt;/li&gt;
&lt;li&gt;Exception reporting.&lt;/li&gt;
&lt;li&gt;End-to-end observation.&lt;/li&gt;
&lt;li&gt;DNS isolation by wildcard domain.&lt;/li&gt;
&lt;li&gt;Rate limiting, frequency limiting, isolation, and blacklist controls.&lt;/li&gt;
&lt;li&gt;Elastic internal capacity.&lt;/li&gt;
&lt;li&gt;Expansion to tens of thousands of CPU cores within minutes.&lt;/li&gt;
&lt;li&gt;Disaster recovery, degradation, and fallback for dependent systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The useful lesson is not any single mechanism. The lesson is that traffic protection must be layered across clients, access paths, internal dependencies, and recovery workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbptq8ck0ygksfca618fv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbptq8ck0ygksfca618fv.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Move observability from reactive support to proactive diagnosis
&lt;/h2&gt;

&lt;p&gt;The original article points out a familiar problem: if a platform only learns about failures from customer reports, incidents last longer, the impact scope is unclear, and engineering teams stay in firefighting mode.&lt;/p&gt;

&lt;p&gt;CLS built observability from several angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User perspective.&lt;/li&gt;
&lt;li&gt;Application behavior.&lt;/li&gt;
&lt;li&gt;Middleware systems.&lt;/li&gt;
&lt;li&gt;Infrastructure.&lt;/li&gt;
&lt;li&gt;Monitoring dashboards.&lt;/li&gt;
&lt;li&gt;Business analysis.&lt;/li&gt;
&lt;li&gt;Tracing.&lt;/li&gt;
&lt;li&gt;Intelligent operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a logging platform, observability is not only a product feature. It is also the operating system for the platform itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid54foafqj9xzqmndbou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid54foafqj9xzqmndbou.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Use CI/CD to reduce regression and release cost
&lt;/h2&gt;

&lt;p&gt;The modernization also covered engineering productivity. The original article mentions two concrete outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLS built more than 1,000 automated test cases in the CI pipeline, especially from historical issues, to improve compatibility and release stability.&lt;/li&gt;
&lt;li&gt;Cloud service products often need to release across dozens of regions. Before application orchestration, release work required 2 to 3 people every week. After orchestration, release efficiency improved and manual error risk decreased.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is an important modernization boundary: if the runtime becomes cloud-native but release operations stay manual, the platform is only half-modernized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3s2o1q5vs2knsh1sbg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3s2o1q5vs2knsh1sbg7.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results reported in the original case
&lt;/h2&gt;

&lt;p&gt;The full architecture evolution took nearly 1 year and went through three major stages. The original article reports these outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than 95% application containerization from a zero baseline.&lt;/li&gt;
&lt;li&gt;More than 20 million RMB saved per year in operating cost.&lt;/li&gt;
&lt;li&gt;Headcount (HC) reduced by more than 2.&lt;/li&gt;
&lt;li&gt;More than 100,000 CPU cores saved.&lt;/li&gt;
&lt;li&gt;Scaling time reduced by 90%.&lt;/li&gt;
&lt;li&gt;Resource utilization improved by more than 40%.&lt;/li&gt;
&lt;li&gt;Service stability reached 99.99%+.&lt;/li&gt;
&lt;li&gt;Elastic ingestion capacity for PB-level burst scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo92s72ynuk341al47pew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo92s72ynuk341al47pew.png" alt=" " width="799" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reusable checklist for platform teams
&lt;/h2&gt;

&lt;p&gt;If you are modernizing a logging platform or another high-throughput platform service, the CLS case suggests this checklist:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Question to answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;How will physical and virtual machine differences be removed?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container model&lt;/td&gt;
&lt;td&gt;Is rich container mode only a transition, or the final architecture?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application state&lt;/td&gt;
&lt;td&gt;Which services must become stateless or near-stateless before scaling safely?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Where is the single trusted source for configuration?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollout&lt;/td&gt;
&lt;td&gt;How will canary, rollback, and compatibility validation work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elastic scaling&lt;/td&gt;
&lt;td&gt;Which custom metrics should drive HPA beyond CPU?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic protection&lt;/td&gt;
&lt;td&gt;Where do buffering, retry, rate limiting, isolation, and degradation apply?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Which signals prove that the new architecture is healthy from user, service, middleware, and infrastructure views?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;Which historical issues should become automated regression cases?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is full containerization just a deployment change?
&lt;/h3&gt;

&lt;p&gt;No. In this CLS case, full containerization covered infrastructure, state management, configuration governance, canary migration, elastic scaling, traffic protection, observability, and CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is stateless design important for Kubernetes migration?
&lt;/h3&gt;

&lt;p&gt;Stateless or near-stateless services can be scaled, deleted, and replaced with less service impact. That is essential when a platform depends on HPA, rolling upgrades, and failure recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does HPA need both fast scale-out and slow scale-in?
&lt;/h3&gt;

&lt;p&gt;Fast scale-out protects service quality during sudden traffic. Slow scale-in avoids unstable capacity changes when CPU or traffic briefly drops and then rises again.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the main takeaway for logging platform modernization?
&lt;/h3&gt;

&lt;p&gt;Cloud-native modernization is an operating model change. Kubernetes is the foundation, but the durable value comes from configuration governance, safe rollout, elastic capacity, traffic protection, observability, and automated delivery.&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>observability</category>
      <category>devops</category>
    </item>
    <item>
      <title>Cloud-Native Logging Platform Modernization: What Tencent Cloud CLS Learned from Full Containerization</title>
      <dc:creator>Tencent Cloud -Cloud Log Service</dc:creator>
      <pubDate>Mon, 08 Jun 2026 07:45:34 +0000</pubDate>
      <link>https://dev.to/tencentcloud-cls/cloud-native-logging-platform-modernization-what-tencent-cloud-cls-learned-from-full-46ce</link>
      <guid>https://dev.to/tencentcloud-cls/cloud-native-logging-platform-modernization-what-tencent-cloud-cls-learned-from-full-46ce</guid>
      <description>&lt;p&gt;Large-scale logging platforms are not ordinary web services. They absorb traffic spikes, support real-time search, feed alerting workflows, and often become the first place engineers look when production starts to shake.&lt;/p&gt;

&lt;p&gt;Tencent Cloud Log Service (CLS) went through a full containerization and cloud-native modernization project for exactly that reason. The original article describes a platform that had grown from tens of millions of log records per day to tens of trillions, while also supporting search and analysis within seconds over very large data volumes. In some scenarios, the service needed to handle hundreds of thousands of QPS and GB/s-level log ingestion while keeping log-to-search latency under 3 seconds.&lt;/p&gt;

&lt;p&gt;This post retells that story for engineers who are asking a more general question: how should a logging platform move from physical machines and virtual machines to Kubernetes without losing stability?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lqq6lp2094c2z3pj7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F80lqq6lp2094c2z3pj7c.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem is not packaging services into containers
&lt;/h2&gt;

&lt;p&gt;For a platform service, full containerization is a system redesign. The old CLS architecture faced several connected problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure was fragmented across physical machines, virtual machines, local IDC environments, and cloud environments.&lt;/li&gt;
&lt;li&gt;Capacity expansion was slow, especially during sudden traffic growth.&lt;/li&gt;
&lt;li&gt;Stateful services were harder to scale, replace, and recover.&lt;/li&gt;
&lt;li&gt;Configuration copies scattered across systems created configuration drift.&lt;/li&gt;
&lt;li&gt;Release and rollback required a safer migration strategy.&lt;/li&gt;
&lt;li&gt;Traffic protection had to cover both external access and internal dependencies.&lt;/li&gt;
&lt;li&gt;Observability could not depend on customer feedback as the first signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the modernization work covered infrastructure, application state, configuration governance, canary migration, HPA-based scaling, traffic protection, end-to-end observability, and CI/CD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddw1bwvzn4nnlnih826h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddw1bwvzn4nnlnih826h.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Move infrastructure toward the Kubernetes operating model
&lt;/h2&gt;

&lt;p&gt;In 2021, most CLS resources still ran on physical machines and virtual machines. This created different operating environments, longer expansion time, higher resource reservation cost, and inconsistent monitoring and alerting across local IDC and cloud systems.&lt;/p&gt;

&lt;p&gt;The migration path in the original article can be understood in three stages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Operating model&lt;/th&gt;
&lt;th&gt;Why it matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Server mode&lt;/td&gt;
&lt;td&gt;Multiple business processes run on one physical or virtual server&lt;/td&gt;
&lt;td&gt;Simple to understand, but slow to scale and hard to standardize&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rich container mode&lt;/td&gt;
&lt;td&gt;One container runs multiple processes and may start systemd-like process management inside the container&lt;/td&gt;
&lt;td&gt;Useful as a transition path because business and operations code can move faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sidecar container mode&lt;/td&gt;
&lt;td&gt;One container usually owns one process, and one application can be composed from multiple containers&lt;/td&gt;
&lt;td&gt;Closer to Kubernetes-native operations and lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Rich containers helped CLS move from CVM-style operations into containers quickly, but the long-term target was the Kubernetes model: one container, one process, and independent lifecycle management.&lt;/p&gt;
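
&lt;p&gt;As a rough illustration of the sidecar container mode, the sketch below builds a Kubernetes Pod manifest as a Python dictionary: one container runs the service process and a second container ships its logs from a shared volume. The field names follow the Kubernetes Pod API, but the pod name, container names, images, and mount path are hypothetical placeholders and are not taken from the CLS case.&lt;/p&gt;

```python
import json

# Sidecar pattern sketch: two single-purpose containers sharing one log volume.
# All names, images, and paths below are hypothetical placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ingest-gateway"},
    "spec": {
        # An emptyDir volume lives as long as the pod and is visible to both containers.
        "volumes": [{"name": "app-logs", "emptyDir": {}}],
        "containers": [
            {
                # Main container: owns the business process only.
                "name": "gateway",
                "image": "registry.example.com/ingest-gateway:1.0",
                "volumeMounts": [{"name": "app-logs", "mountPath": "/var/log/app"}],
            },
            {
                # Sidecar container: owns log shipping, with read-only access.
                "name": "log-shipper",
                "image": "registry.example.com/log-shipper:1.0",
                "volumeMounts": [
                    {"name": "app-logs", "mountPath": "/var/log/app", "readOnly": True}
                ],
            },
        ],
    },
}

# kubectl accepts JSON manifests as well as YAML.
manifest = json.dumps(pod, indent=2)
```

&lt;p&gt;Because each container has a single responsibility, the sidecar image can be upgraded or restarted without rebuilding the service image, which is the lifecycle separation the table above describes.&lt;/p&gt;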

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve9xjx19oowwmio78h9m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve9xjx19oowwmio78h9m.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Turn stateful services into stateless or near-stateless services
&lt;/h2&gt;

&lt;p&gt;The original CLS practice emphasizes a clear target: make as many applications stateless as possible.&lt;/p&gt;

&lt;p&gt;A stateless service instance can be scaled out, restarted, deleted, or replaced without binding a user request to a specific instance. Stateful services, by contrast, usually store session or business state locally, which makes scaling and failover harder.&lt;/p&gt;

&lt;p&gt;The article describes two practical directions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let multiple instances synchronize data so that any instance can be replaced by another equivalent instance.&lt;/li&gt;
&lt;li&gt;Move state into centralized storage, then let service instances pull data into a local cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a logging platform, this matters because ingestion, search, and internal control-plane services must tolerate instance churn during scaling and release operations.&lt;/p&gt;
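
&lt;p&gt;The second direction, centralized state plus a local cache, can be sketched as follows. This is a minimal illustration under stated assumptions, not the CLS implementation: &lt;code&gt;CentralStore&lt;/code&gt; is an in-memory stand-in for a shared database or configuration service, and &lt;code&gt;CachedReader&lt;/code&gt;, its TTL, and the key names are hypothetical.&lt;/p&gt;

```python
import time


class CentralStore:
    """Stand-in for a shared store (a database or config service in practice)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class CachedReader:
    """Service-side reader: state lives in the store, the local cache is disposable."""

    def __init__(self, store, ttl_seconds=30.0, clock=time.monotonic):
        self._store = store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key: (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        # Serve from the local cache while the entry is still fresh.
        if entry is not None and entry[1] > now:
            return entry[0]
        # Otherwise pull from the central store and refresh the cache.
        value = self._store.get(key)
        self._cache[key] = (value, now + self._ttl)
        return value
```

&lt;p&gt;A replacement instance starts with an empty cache and simply pulls from the store again, so deleting or rescheduling an instance loses nothing that cannot be rebuilt. The trade-off is that readers may serve data up to one TTL old.&lt;/p&gt;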

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvhac98oappwvceohn0z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvhac98oappwvceohn0z.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Treat configuration as a governed system
&lt;/h2&gt;

&lt;p&gt;Cloud-native systems create a lot of configuration: routing, ports, load balancing, database settings, service middleware, deployment metadata, and feature behavior. If teams copy configuration into several places, configuration drift becomes almost inevitable.&lt;/p&gt;

&lt;p&gt;The CLS modernization approach included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A single trusted source for configuration.&lt;/li&gt;
&lt;li&gt;Change history that records who changed what and why.&lt;/li&gt;
&lt;li&gt;Variables and generated configuration files to reduce manual copies.&lt;/li&gt;
&lt;li&gt;CI/CD pipelines that deploy configuration together with code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important points in the case: cloud-native modernization fails when configuration management remains pre-cloud-native.&lt;/p&gt;
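
&lt;p&gt;One way to picture "variables and generated configuration files" is a template rendered from a single source of values. The sketch below uses Python's standard &lt;code&gt;string.Template&lt;/code&gt;. The keys, region names, and file layout are hypothetical and only illustrate the idea; the original article does not describe the tooling CLS used.&lt;/p&gt;

```python
from string import Template

# Single trusted source: shared defaults plus per-region overrides.
# All keys, values, and region names here are hypothetical.
DEFAULTS = {"listen_port": "8080", "log_level": "info"}
REGION_OVERRIDES = {
    "region-a": {"log_level": "debug"},
    "region-b": {},
}

CONFIG_TEMPLATE = Template(
    "region = $region\n"
    "listen_port = $listen_port\n"
    "log_level = $log_level\n"
)


def render_config(region):
    """Generate one region's config file text from the shared source."""
    values = {**DEFAULTS, **REGION_OVERRIDES[region], "region": region}
    # substitute() raises KeyError if a variable is missing, so gaps fail loudly.
    return CONFIG_TEMPLATE.substitute(values)
```

&lt;p&gt;Running &lt;code&gt;render_config&lt;/code&gt; inside the delivery pipeline means every region's file is derived from the same reviewed source, so a change is made once and its history is recorded in one place.&lt;/p&gt;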

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn4zem66bn9cngsqr547.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjn4zem66bn9cngsqr547.png" alt=" " width="799" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Use canary rollout when replacing the old architecture
&lt;/h2&gt;

&lt;p&gt;The architecture upgrade was designed to be invisible to customers. CLS used a canary strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with smaller regions and a subset of customers.&lt;/li&gt;
&lt;li&gt;Gradually switch more traffic after validation.&lt;/li&gt;
&lt;li&gt;Keep the old service for 2 weeks after the new service takes over.&lt;/li&gt;
&lt;li&gt;Keep rollback ready so traffic can be switched back if needed.&lt;/li&gt;
&lt;li&gt;Prepare compatibility checks, upgrade plans, and validation mechanisms before migration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the safer pattern for platform teams: do not treat migration as one big release. Treat it as a controlled traffic movement with observability and rollback.&lt;/p&gt;
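
&lt;p&gt;The "controlled traffic movement" idea can be sketched as a routing decision that depends on the region and on a stable per-customer bucket. This is an assumption-laden illustration rather than the CLS mechanism: the function names, the SHA-256 bucketing, and the percentage knob are all hypothetical.&lt;/p&gt;

```python
import hashlib


def canary_bucket(customer_id):
    """Map a customer to a stable bucket in 0..99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(customer_id, region, canary_regions, canary_percent):
    """Return "new" or "old" for one request."""
    # Regions that have not entered the canary stay fully on the old service.
    if region not in canary_regions:
        return "old"
    # range(percent) holds buckets 0..percent-1, so percent=0 sends nobody.
    if canary_bucket(customer_id) in range(canary_percent):
        return "new"
    return "old"
```

&lt;p&gt;Because the bucket is derived from the customer ID, a customer who was moved at 10% stays on the new service at 50%, and setting the percentage back to 0 acts as the rollback switch while the old service is still running.&lt;/p&gt;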

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8b1scnhq3piqwy6yhvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8b1scnhq3piqwy6yhvn.png" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Design HPA for sudden traffic, cost, and stability
&lt;/h2&gt;

&lt;p&gt;Logging traffic can be bursty and periodic at the same time. If a team reserves too much capacity up front, it wastes resources. If it scales in too fast, CPU utilization can climb again and trigger a new scaling cycle.&lt;/p&gt;

&lt;p&gt;The CLS approach can be summarized as three goals:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;What HPA needs to support&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Absorb sudden traffic&lt;/td&gt;
&lt;td&gt;Scale out quickly beyond the normal traffic baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reduce cost&lt;/td&gt;
&lt;td&gt;Avoid keeping peak capacity online all the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Preserve stability&lt;/td&gt;
&lt;td&gt;Coordinate scaling across upstream and downstream services, and support custom metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The original article highlights one practical rule: scale out fast, scale in slowly. That prevents short CPU fluctuations from creating unstable expansion and contraction loops.&lt;/p&gt;
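
&lt;p&gt;In Kubernetes, the "scale out fast, scale in slowly" rule maps naturally onto the &lt;code&gt;behavior&lt;/code&gt; section of an &lt;code&gt;autoscaling/v2&lt;/code&gt; HorizontalPodAutoscaler. The sketch below writes such a manifest as a Python dictionary. The field names come from the Kubernetes API, but the workload name &lt;code&gt;cls-ingest&lt;/code&gt;, the replica bounds, the CPU target, and every window and percentage are assumptions for illustration, not the values CLS used.&lt;/p&gt;

```python
import json

# Hypothetical HPA: all numbers and the target name are illustrative assumptions.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "cls-ingest"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "cls-ingest"},
        "minReplicas": 10,
        "maxReplicas": 200,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 60},
                },
            }
        ],
        "behavior": {
            # Scale out fast: no stabilization window, allow doubling every 15 seconds.
            "scaleUp": {
                "stabilizationWindowSeconds": 0,
                "selectPolicy": "Max",
                "policies": [{"type": "Percent", "value": 100, "periodSeconds": 15}],
            },
            # Scale in slowly: act on the highest recommendation from the last
            # 5 minutes and remove at most 10% of pods per minute.
            "scaleDown": {
                "stabilizationWindowSeconds": 300,
                "policies": [{"type": "Percent", "value": 10, "periodSeconds": 60}],
            },
        },
    },
}

# kubectl accepts JSON manifests as well as YAML.
manifest = json.dumps(hpa, indent=2)
```

&lt;p&gt;The scale-in stabilization window is what absorbs short CPU dips: a brief drop does not remove pods unless the lower recommendation holds for the whole window.&lt;/p&gt;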

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6yqoq680ewqmuu139i9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6yqoq680ewqmuu139i9.png" alt=" " width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Build traffic protection across the whole request path
&lt;/h2&gt;

&lt;p&gt;A logging platform receives traffic from many customers and also depends on internal systems. CLS combined several protection patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local client buffering.&lt;/li&gt;
&lt;li&gt;Backoff and retry.&lt;/li&gt;
&lt;li&gt;Exception reporting.&lt;/li&gt;
&lt;li&gt;End-to-end observation.&lt;/li&gt;
&lt;li&gt;DNS isolation by wildcard domain.&lt;/li&gt;
&lt;li&gt;Rate limiting, frequency limiting, isolation, and blacklist controls.&lt;/li&gt;
&lt;li&gt;Elastic internal capacity.&lt;/li&gt;
&lt;li&gt;Expansion to tens of thousands of CPU cores within minutes.&lt;/li&gt;
&lt;li&gt;Disaster recovery, degradation, and fallback for dependent systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The useful lesson is not any single mechanism. The lesson is that traffic protection must be layered across clients, access paths, internal dependencies, and recovery workflows.&lt;/p&gt;
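
&lt;p&gt;The client-side layer of that protection, backoff and retry, can be sketched with exponential backoff and full jitter. This is a generic pattern shown under stated assumptions, not the behavior of any CLS client: &lt;code&gt;send_with_retry&lt;/code&gt;, &lt;code&gt;backoff_delays&lt;/code&gt;, and their default values are hypothetical, and &lt;code&gt;send&lt;/code&gt; stands for whatever upload function the client uses.&lt;/p&gt;

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Yield one jittered delay per retry: random in [0, min(cap, base * 2**n)]."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def send_with_retry(send, batch, sleep=time.sleep, **backoff):
    """Try send(batch) once, then retry with backoff; re-raise the last error."""
    try:
        return send(batch)
    except ConnectionError as err:
        last_error = err
    for delay in backoff_delays(**backoff):
        # Wait before each retry so a struggling server is not hammered.
        sleep(delay)
        try:
            return send(batch)
        except ConnectionError as err:
            last_error = err
    # Retries exhausted: surface the failure so the caller can report it
    # and keep the batch in its local buffer.
    raise last_error
```

&lt;p&gt;The jitter matters as much as the exponent: if many clients fail at the same moment, randomized delays spread their retries out instead of producing a second synchronized spike.&lt;/p&gt;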

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbptq8ck0ygksfca618fv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbptq8ck0ygksfca618fv.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Move observability from reactive support to proactive diagnosis
&lt;/h2&gt;

&lt;p&gt;The original article points out a familiar problem: if a platform only learns about failures from customer reports, incidents last longer, the impact scope is unclear, and engineering teams stay in firefighting mode.&lt;/p&gt;

&lt;p&gt;CLS built observability from several angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User perspective.&lt;/li&gt;
&lt;li&gt;Application behavior.&lt;/li&gt;
&lt;li&gt;Middleware systems.&lt;/li&gt;
&lt;li&gt;Infrastructure.&lt;/li&gt;
&lt;li&gt;Monitoring dashboards.&lt;/li&gt;
&lt;li&gt;Business analysis.&lt;/li&gt;
&lt;li&gt;Tracing.&lt;/li&gt;
&lt;li&gt;Intelligent operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a logging platform, observability is not only a product feature. It is also the operating system for the platform itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid54foafqj9xzqmndbou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fid54foafqj9xzqmndbou.png" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Use CI/CD to reduce regression and release cost
&lt;/h2&gt;

&lt;p&gt;The modernization also covered engineering productivity. The original article mentions two concrete outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLS built more than 1,000 automated test cases in the CI pipeline, especially from historical issues, to improve compatibility and release stability.&lt;/li&gt;
&lt;li&gt;Cloud service products often need to release across dozens of regions. Before application orchestration, release work required 2 to 3 people every week. After orchestration, release efficiency improved and manual error risk decreased.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is an important modernization boundary: if the runtime becomes cloud-native but release operations stay manual, the platform is only half-modernized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3s2o1q5vs2knsh1sbg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3s2o1q5vs2knsh1sbg7.png" alt=" " width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results reported in the original case
&lt;/h2&gt;

&lt;p&gt;The full architecture evolution took nearly 1 year and went through three major stages. The original article reports these outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More than 95% application containerization from a zero baseline.&lt;/li&gt;
&lt;li&gt;More than 20 million RMB saved per year in operating cost.&lt;/li&gt;
&lt;li&gt;Headcount (HC) reduced by more than 2.&lt;/li&gt;
&lt;li&gt;More than 100,000 CPU cores saved.&lt;/li&gt;
&lt;li&gt;Scaling time reduced by 90%.&lt;/li&gt;
&lt;li&gt;Resource utilization improved by more than 40%.&lt;/li&gt;
&lt;li&gt;Service stability reached 99.99%+.&lt;/li&gt;
&lt;li&gt;Elastic ingestion capacity for PB-level burst scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo92s72ynuk341al47pew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo92s72ynuk341al47pew.png" alt=" " width="799" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Reusable checklist for platform teams
&lt;/h2&gt;

&lt;p&gt;If you are modernizing a logging platform or another high-throughput platform service, the CLS case suggests this checklist:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Question to answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;How will physical and virtual machine differences be removed?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container model&lt;/td&gt;
&lt;td&gt;Is rich container mode only a transition, or the final architecture?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application state&lt;/td&gt;
&lt;td&gt;Which services must become stateless or near-stateless before scaling safely?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration&lt;/td&gt;
&lt;td&gt;Where is the single trusted source for configuration?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rollout&lt;/td&gt;
&lt;td&gt;How will canary, rollback, and compatibility validation work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elastic scaling&lt;/td&gt;
&lt;td&gt;Which custom metrics should drive HPA beyond CPU?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic protection&lt;/td&gt;
&lt;td&gt;Where do buffering, retry, rate limiting, isolation, and degradation apply?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Which signals prove that the new architecture is healthy from user, service, middleware, and infrastructure views?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;Which historical issues should become automated regression cases?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is full containerization just a deployment change?
&lt;/h3&gt;

&lt;p&gt;No. In this CLS case, full containerization covered infrastructure, state management, configuration governance, canary migration, elastic scaling, traffic protection, observability, and CI/CD.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is stateless design important for Kubernetes migration?
&lt;/h3&gt;

&lt;p&gt;Stateless or near-stateless services can be scaled, deleted, and replaced with less service impact. That is essential when a platform depends on HPA, rolling upgrades, and failure recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does HPA need both fast scale-out and slow scale-in?
&lt;/h3&gt;

&lt;p&gt;Fast scale-out protects service quality during sudden traffic. Slow scale-in avoids unstable capacity changes when CPU or traffic briefly drops and then rises again.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the main takeaway for logging platform modernization?
&lt;/h3&gt;

&lt;p&gt;Cloud-native modernization is an operating model change. Kubernetes is the foundation, but the durable value comes from configuration governance, safe rollout, elastic capacity, traffic protection, observability, and automated delivery.&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>kubernetes</category>
      <category>observability</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
