Tencent Cloud -Cloud Log Service

Posted on Jun 8

Cloud-Native Logging Platform Modernization: What Tencent Cloud CLS Learned from Full Containerization

#cloudnative #kubernetes #observability #devops

Large-scale logging platforms are not ordinary web services. They absorb traffic spikes, support real-time search, feed alerting workflows, and often become the first place engineers look when production starts to shake.

Tencent Cloud Log Service (CLS) went through a full containerization and cloud-native modernization project for exactly that reason. The original article describes a platform that had grown from tens of millions of log records per day to tens of trillions, while still returning search and analysis results within seconds over very large data volumes. In some scenarios, the service had to handle hundreds of thousands of QPS, ingest logs at GB/s rates, and keep log-to-search latency under 3 seconds.

This post rewrites that story for engineers who are asking a more general question: how should a logging platform move from physical machines and virtual machines to Kubernetes without losing stability?

The real problem is not packaging services into containers

For a platform service, full containerization is a system redesign. The old CLS architecture faced several connected problems:

Infrastructure was fragmented across physical machines, virtual machines, local IDC environments, and cloud environments.
Capacity expansion was slow, especially during sudden traffic growth.
Stateful services were harder to scale, replace, and recover.
Configuration copies scattered across systems created configuration drift.
Release and rollback required a safer migration strategy.
Traffic protection had to cover both external access and internal dependencies.
Observability could not depend on customer feedback as the first signal.

That is why the modernization work covered infrastructure, application state, configuration governance, canary migration, HPA-based scaling, traffic protection, end-to-end observability, and CI/CD.

1. Move infrastructure toward the Kubernetes operating model

In 2021, most CLS resources still ran on physical machines and virtual machines. The result was a mix of operating environments, longer expansion times, higher resource reservation costs, and inconsistent monitoring and alerting across local IDC and cloud systems.

The migration path in the original article can be understood in three stages:

Stage	Operating model	Why it matters
Server mode	Multiple business processes run on one physical or virtual server	Simple to understand, but slow to scale and hard to standardize
Rich container mode	One container runs multiple processes and may start systemd-like process management inside the container	Useful as a transition path because business and operations code can move faster
Sidecar container mode	One container usually owns one process, and one application can be composed from multiple containers	Closer to Kubernetes-native operations and lifecycle management

Rich containers helped CLS move from CVM-style operations into containers quickly, but the long-term target was the Kubernetes model: one container, one process, and independent lifecycle management.
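To make the sidecar model concrete, here is a minimal sketch that builds a Kubernetes Pod manifest with one application container and one log-shipping sidecar sharing an `emptyDir` volume. The manifest fields are standard Pod fields; the application name, image names, and mount path are hypothetical placeholders, not CLS's actual layout.

```python
import json


def build_pod_manifest(app_name: str, app_image: str, sidecar_image: str) -> dict:
    """Return a Pod manifest: one process per container, composed via a shared volume."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": app_name, "labels": {"app": app_name}},
        "spec": {
            "containers": [
                {
                    # Main business process (hypothetical image).
                    "name": "app",
                    "image": app_image,
                    "volumeMounts": [{"name": "logs", "mountPath": "/var/log/app"}],
                },
                {
                    # Sidecar with its own lifecycle; reads what the app writes.
                    "name": "log-shipper",
                    "image": sidecar_image,
                    "volumeMounts": [
                        {"name": "logs", "mountPath": "/var/log/app", "readOnly": True}
                    ],
                },
            ],
            "volumes": [{"name": "logs", "emptyDir": {}}],
        },
    }


manifest = build_pod_manifest(
    "ingest-gateway",
    "registry.example.com/ingest-gateway:1.0",  # assumed image name
    "registry.example.com/log-shipper:1.0",     # assumed image name
)
print(json.dumps(manifest, indent=2))
```

Splitting the processes this way lets each container be upgraded or restarted on its own, which is the property the rich container mode lacks.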

2. Turn stateful services into stateless or near-stateless services

The original CLS practice emphasizes a clear target: make as many applications stateless as possible.

A stateless service instance can be scaled out, restarted, deleted, or replaced without binding a user request to a specific instance. Stateful services, by contrast, usually store session or business state locally, which makes scaling and failover harder.

The article describes two practical directions:

Let multiple instances synchronize data so that any instance can be replaced by another equivalent instance.
Move state into centralized storage, then let service instances pull data into a local cache.

For a logging platform, this matters because ingestion, search, and internal control-plane services must tolerate instance churn during scaling and release operations.
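The second direction (centralized state plus a local cache) can be sketched as follows. This is an illustration under stated assumptions, not the CLS implementation: `CachedStateReader`, the TTL policy, and the in-memory `central_store` dictionary standing in for real centralized storage are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachedStateReader:
    """Reads state from a central store and keeps a short-lived local copy."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh local copy, no remote call
        value = self._fetch(key)  # cache miss or expired: pull from the center
        self._cache[key] = (value, now)
        return value


# Stand-in for centralized storage (assumption: a real system would use a
# database or configuration service here).
central_store = {"topic-42/retention_days": "30"}
fetch_calls = []


def fetch_from_central(key: str) -> str:
    fetch_calls.append(key)
    return central_store[key]


# Two interchangeable instances: either one can serve the request, and either
# one can be deleted, because the source of truth lives outside the instance.
instance_a = CachedStateReader(fetch_from_central, ttl_seconds=5.0)
instance_b = CachedStateReader(fetch_from_central, ttl_seconds=5.0)
print(instance_a.get("topic-42/retention_days"))
print(instance_b.get("topic-42/retention_days"))
```

The trade-off is staleness: an instance may serve a value that is up to one TTL old, so the TTL has to be chosen per kind of state.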

3. Treat configuration as a governed system

Cloud-native systems create a lot of configuration: routing, ports, load balancing, database settings, service middleware, deployment metadata, and feature behavior. If teams copy configuration into several places, configuration drift becomes almost inevitable.

The CLS modernization approach included:

A single trusted source for configuration.
Change history that records who changed what and why.
Variables and generated configuration files to reduce manual copies.
CI/CD pipelines that deploy configuration together with code.
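A minimal sketch of the "variables and generated configuration files" idea, using Python's standard `string.Template`. The template keys, region names, ports, and the endpoint pattern are hypothetical; the point is only that per-region files are generated from one source instead of being copied by hand.

```python
from string import Template

# Single trusted source: one template plus one table of per-region overrides.
BASE_TEMPLATE = Template(
    "region=$region\n"
    "ingest_port=$ingest_port\n"
    "search_endpoint=search.$region.internal.example\n"
)

DEFAULTS = {"ingest_port": "8080"}
REGION_OVERRIDES = {
    "region-a": {},
    "region-b": {"ingest_port": "9090"},
}


def render_config(region: str) -> str:
    """Generate the config file content for one region."""
    values = {**DEFAULTS, **REGION_OVERRIDES[region], "region": region}
    # substitute() raises KeyError on a missing variable, so an incomplete
    # configuration fails at generation time rather than in production.
    return BASE_TEMPLATE.substitute(values)


for region in REGION_OVERRIDES:
    print(render_config(region))
```

Keeping the template and overrides in version control gives the change history for free, and a pipeline can run the generator at deploy time alongside the code.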

This is one of the most important points in the case: cloud-native modernization fails when configuration management remains pre-cloud-native.

4. Use canary rollout when replacing the old architecture

The architecture upgrade was designed to be invisible to customers. CLS used a canary strategy:

Start with smaller regions and a subset of customers.
Gradually switch more traffic after validation.
Keep the old service running for 2 weeks after the new service takes over.
Keep rollback ready so traffic can be switched back if needed.
Prepare compatibility checks, upgrade plans, and validation mechanisms before migration.

This is the safer pattern for platform teams: do not treat migration as one big release. Treat it as a controlled traffic movement with observability and rollback.
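One common way to implement "a subset of customers, then gradually more" is deterministic hash bucketing. The sketch below assumes that technique; the original article does not say how CLS selected canary traffic, and the function names and customer IDs are hypothetical.

```python
import hashlib


def bucket_for(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(customer_id: str, canary_percent: int) -> str:
    """Send the customer to the new architecture if their bucket is in range."""
    return "new" if bucket_for(customer_id) < canary_percent else "old"


customers = [f"customer-{i}" for i in range(1000)]
for percent in (0, 10, 50, 100):
    on_new = sum(route(c, percent) == "new" for c in customers)
    print(f"canary_percent={percent}: {on_new} of {len(customers)} customers on new")
```

Because the bucket depends only on the customer ID, raising the percentage never moves an already migrated customer back, and setting it to 0 acts as the rollback switch while the old service is still kept online.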

5. Design HPA for sudden traffic, cost, and stability

Logging traffic can be bursty and periodic at the same time. If a team reserves too much capacity up front, it wastes resources. If it scales down too fast, CPU utilization can climb again and trigger a new scaling cycle.

The CLS approach can be summarized as three goals:

Goal	What HPA needs to support
Absorb sudden traffic	Scale out quickly beyond the normal traffic baseline
Reduce cost	Avoid keeping peak capacity online all the time
Preserve stability	Coordinate scaling across upstream and downstream services, and support custom metrics

The original article highlights one practical rule: scale out fast, scale in slowly. That prevents short CPU fluctuations from creating unstable expansion and contraction loops.
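The rule can be illustrated with a small simulation. It uses the replica formula documented for the Kubernetes HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, and approximates a scale-down stabilization window by only shrinking to the highest recommendation seen recently. The window length, target, and CPU samples are made-up values, and real HPA behavior (tolerance, policies, pod readiness) is more involved than this sketch.

```python
import math
from collections import deque


def desired_replicas(current: int, cpu_utilization: float, target_utilization: float) -> int:
    """HPA-style recommendation from average per-pod CPU utilization."""
    return max(1, math.ceil(current * cpu_utilization / target_utilization))


class Autoscaler:
    """Scale out immediately; scale in only after a run of low recommendations."""

    def __init__(self, replicas: int, target_utilization: float, scale_in_window: int) -> None:
        self.replicas = replicas
        self.target = target_utilization
        # Keeps the last `scale_in_window` recommendations.
        self.recent = deque(maxlen=scale_in_window)

    def step(self, cpu_utilization: float) -> int:
        recommendation = desired_replicas(self.replicas, cpu_utilization, self.target)
        self.recent.append(recommendation)
        if recommendation > self.replicas:
            self.replicas = recommendation  # fast scale-out
        else:
            # Slow scale-in: never drop below the highest recent recommendation.
            self.replicas = min(self.replicas, max(self.recent))
        return self.replicas


scaler = Autoscaler(replicas=10, target_utilization=60.0, scale_in_window=3)
history = [scaler.step(cpu) for cpu in (90.0, 30.0, 30.0, 30.0)]
print(history)  # [15, 15, 15, 8]
```

The spike to 90% CPU grows the fleet in one step, while the drop to 30% only takes effect once the spike has aged out of the window.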

6. Build traffic protection across the whole request path

A logging platform receives traffic from many customers and also depends on internal systems. CLS combined several protection patterns:

Local client buffering.
Backoff and retry.
Exception reporting.
End-to-end observation.
DNS isolation by wildcard domain.
Rate limiting, frequency limiting, isolation, and blacklist controls.
Elastic internal capacity.
Minute-level expansion to tens of thousands of CPU cores.
Disaster recovery, degradation, and fallback for dependent systems.
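Two of the client-side patterns in this list, local buffering and backoff with retry, can be sketched together. The code below assumes a full-jitter exponential backoff and a bounded in-memory queue; `BufferedSender` and its parameters are hypothetical and do not describe the actual CLS client.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, List, Optional


def backoff_delays(base: float, cap: float, attempts: int, rng: random.Random) -> List[float]:
    """Full jitter: delay n is drawn uniformly from [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


class BufferedSender:
    """Buffers records locally and retries delivery with backoff."""

    def __init__(
        self,
        send: Callable[[str], None],
        max_buffered: int = 1000,
        max_attempts: int = 5,
        base_delay: float = 0.1,
        max_delay: float = 1.0,
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.send = send
        self.max_buffered = max_buffered
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.sleep = sleep
        self.rng = rng or random.Random()
        self.buffer: Deque[str] = deque()
        self.dropped = 0

    def submit(self, record: str) -> bool:
        """Queue a record; refuse it when the bounded buffer is full."""
        if len(self.buffer) >= self.max_buffered:
            self.dropped += 1  # counted so it can be reported as an exception
            return False
        self.buffer.append(record)
        return True

    def flush(self) -> int:
        """Send buffered records in order; stop at the first record that keeps failing."""
        sent = 0
        while self.buffer:
            if not self._send_with_retry(self.buffer[0]):
                break  # leave it buffered for the next flush
            self.buffer.popleft()
            sent += 1
        return sent

    def _send_with_retry(self, record: str) -> bool:
        delays = backoff_delays(self.base_delay, self.max_delay, self.max_attempts, self.rng)
        for delay in delays:
            try:
                self.send(record)
                return True
            except ConnectionError:
                self.sleep(delay)  # waits after every failed attempt
        return False
```

The jitter matters at scale: if many clients fail at once and retry on the same schedule, the retries themselves become a traffic spike.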

The useful lesson is not any single mechanism. The lesson is that traffic protection must be layered across clients, access paths, internal dependencies, and recovery workflows.
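On the access path, rate limiting with isolation can be sketched as one token bucket per tenant, so that a noisy customer exhausts only its own quota. Token bucket is an assumed technique here; the article does not name the algorithm CLS uses, and the rate and burst values are arbitrary.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `burst`, refilling at `rate_per_second`."""

    def __init__(
        self,
        rate_per_second: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerTenantLimiter:
    """Keeps an independent bucket per tenant for isolation."""

    def __init__(
        self,
        rate_per_second: float,
        burst: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.burst, self.clock)
        return bucket.allow()
```

A rejected request here pairs naturally with client-side backoff: the server sheds load, and the client waits before trying again instead of amplifying the spike.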

7. Move observability from reactive support to proactive diagnosis

The original article points out a familiar problem: if a platform only learns about failures from customer reports, incidents last longer, the impact scope is unclear, and engineering teams stay in firefighting mode.

CLS built observability from several angles:

User perspective.
Application behavior.
Middleware systems.
Infrastructure.
Monitoring dashboards.
Business analysis.
Tracing.
Intelligent operations.

For a logging platform, observability is not only a product feature. It is also the operating system for the platform itself.
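The user-perspective signal can be approximated with a synthetic probe: write a uniquely marked log record, poll search until it appears, and compare the elapsed time with the 3-second log-to-search target mentioned earlier. This is a sketch under assumptions: `write_log` and `search` are hypothetical callables standing in for real ingestion and search clients, and the polling interval is arbitrary.

```python
import time
import uuid
from typing import Callable, Optional


def probe_log_to_search(
    write_log: Callable[[str], None],
    search: Callable[[str], bool],
    budget_seconds: float = 3.0,
    poll_interval: float = 0.2,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the observed log-to-search latency, or None if the budget is exceeded."""
    marker = f"probe-{uuid.uuid4()}"  # unique so old probes cannot match
    started = clock()
    write_log(marker)
    while clock() - started <= budget_seconds:
        if search(marker):
            return clock() - started
        sleep(poll_interval)
    return None  # treat as an alertable event, before any customer reports it


# Demo against an in-memory fake that makes records searchable immediately.
indexed = set()
latency = probe_log_to_search(indexed.add, lambda marker: marker in indexed)
print("within budget" if latency is not None else "budget exceeded")
```

Running a probe like this on a schedule in every region turns "customers tell us search is slow" into a metric the platform team sees first.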

8. Use CI/CD to reduce regression and release cost

The modernization also covered engineering productivity. The original article mentions two concrete outcomes:

CLS built more than 1,000 automated test cases in the CI pipeline, especially from historical issues, to improve compatibility and release stability.
Cloud service products often need to release across dozens of regions. Before application orchestration, release work required 2 to 3 people every week. After orchestration, release efficiency improved and manual error risk decreased.

That is an important modernization boundary: if the runtime becomes cloud-native but release operations stay manual, the platform is only half-modernized.
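Turning historical issues into regression cases often takes the shape of a table-driven test. The sketch below shows the pattern with Python's `unittest`; the function under test, the issue labels, and the inputs are all hypothetical examples and are not taken from the CLS test suite.

```python
import unittest


def parse_kv_line(line: str) -> dict:
    """Hypothetical function under test: parse 'k=v k2=v2' into a dict."""
    result = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:  # skip tokens without '=' or with an empty key
            result[key] = value
    return result


# Each past incident becomes one row: (label, input, expected output).
HISTORICAL_CASES = [
    ("ISSUE-A: empty line", "", {}),
    ("ISSUE-B: value contains '='", "query=a=b", {"query": "a=b"}),
    ("ISSUE-C: token without '='", "level=info orphan", {"level": "info"}),
]


class HistoricalRegressionTest(unittest.TestCase):
    def test_historical_cases(self) -> None:
        for label, line, expected in HISTORICAL_CASES:
            # subTest reports each failing row separately instead of stopping
            # at the first one.
            with self.subTest(case=label):
                self.assertEqual(parse_kv_line(line), expected)
```

A CI pipeline can run this with `python -m unittest`; adding a regression case after an incident is then a one-line change to the table.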

Results reported in the original case

The full architecture evolution took nearly 1 year and went through three major stages. The original article reports these outcomes:

More than 95% application containerization from a zero baseline.
More than 20 million RMB saved per year in operating cost.
Headcount (HC) reduced by more than 2.
More than 100,000 CPU cores saved.
Scaling time reduced by 90%.
Resource utilization improved by more than 40%.
Service stability reached 99.99%+.
Elastic ingestion capacity for PB-level burst scenarios.

Reusable checklist for platform teams

If you are modernizing a logging platform or another high-throughput platform service, the CLS case suggests this checklist:

Area	Question to answer
Infrastructure	How will physical and virtual machine differences be removed?
Container model	Is rich container mode only a transition, or the final architecture?
Application state	Which services must become stateless or near-stateless before scaling safely?
Configuration	Where is the single trusted source for configuration?
Rollout	How will canary, rollback, and compatibility validation work?
Elastic scaling	Which custom metrics should drive HPA beyond CPU?
Traffic protection	Where do buffering, retry, rate limiting, isolation, and degradation apply?
Observability	Which signals prove that the new architecture is healthy from user, service, middleware, and infrastructure views?
CI/CD	Which historical issues should become automated regression cases?

FAQ

Is full containerization just a deployment change?

No. In this CLS case, full containerization covered infrastructure, state management, configuration governance, canary migration, elastic scaling, traffic protection, observability, and CI/CD.

Why is stateless design important for Kubernetes migration?

Stateless or near-stateless services can be scaled, deleted, and replaced with less service impact. That is essential when a platform depends on HPA, rolling upgrades, and failure recovery.

Why does HPA need both fast scale-out and slow scale-in?

Fast scale-out protects service quality during sudden traffic. Slow scale-in avoids unstable capacity changes when CPU or traffic briefly drops and then rises again.

What is the main takeaway for logging platform modernization?

Cloud-native modernization is an operating model change. Kubernetes is the foundation, but the durable value comes from configuration governance, safe rollout, elastic capacity, traffic protection, observability, and automated delivery.

DEV Community