Tencent Cloud -Cloud Log Service

How to Modernize a High-Throughput Logging Platform with Kubernetes, HPA, and End-to-End Observability

When a logging platform grows from millions of log records per day to tens of trillions, the modernization problem is no longer only about deploying containers. The platform must keep log ingestion stable during traffic spikes, make logs searchable within seconds, protect alerting and monitoring workflows, and still let engineering teams release safely across many regions.

Tencent Cloud Log Service, also known as Tencent Cloud CLS or Cloud Log Service, went through that kind of full containerization project. The case is useful for developers, SREs, platform engineers, and cloud architects who are searching for a practical answer to this question:

How do you move a high-throughput logging platform from physical machines and virtual machines to Kubernetes without breaking ingestion, search, observability, release, or rollback?

CLS needed to support PB-level data ingestion from many data sources, along with log collection, storage, search, statistical analysis, data processing, and consumption subscriptions. In real production scenarios, the platform had to absorb hundreds of thousands of QPS and GB/s-level log writes while keeping the delay from log generation to searchable results under 3 seconds.

This guide turns the modernization path into a reusable playbook: problem -> operation -> result.

When to use this pattern

Use this pattern when your logging or observability platform has at least one of these symptoms:

Symptom | What it usually means | Modernization direction
Log ingestion traffic jumps suddenly | Capacity cannot be reserved manually for every peak | Use Kubernetes-based elastic scaling and fast scale-out policies
Log search or alerting depends on second-level latency | Delayed logs reduce the value of monitoring and troubleshooting | Treat ingestion, storage, search, and observability as one service path
Physical machines and virtual machines take hours or days to expand | Teams are forced to keep expensive buffer capacity | Move toward containerized scheduling and standardized runtime environments
Configuration is copied across many services | Drift can cause inconsistent behavior and release failures | Build a single trusted source for configuration and deploy it with code
Releases require manual coordination across many regions | Regression and rollback risk increase as the platform grows | Use CI/CD, orchestration, canary rollout, and compatibility checks
Incidents are first discovered from customer tickets | The platform is observing customers but not itself well enough | Build end-to-end observability from user, service, middleware, and infrastructure perspectives

What problem does a cloud-native logging platform need to solve?

A cloud log service is different from a normal stateless web application. It sits in the middle of production diagnosis, alerting, metric analysis, log search, and data processing. If it becomes unstable, many downstream troubleshooting workflows slow down at the same time.

CLS faced several connected challenges:

  • The infrastructure included physical machines, virtual machines, local IDC environments, and cloud environments.
  • Resource expansion could take hours or days.
  • Some applications carried local state, which made scaling and failover harder.
  • Configuration existed in many scattered copies.
  • Architecture migration needed to be invisible to customers.
  • Logging traffic was bursty, periodic, and difficult to predict.
  • Traffic protection had to cover external access, internal dependencies, and abnormal traffic.
  • Observability had to move from passive support to proactive diagnosis.
  • Release work across dozens of regions needed automation instead of weekly manual coordination.

The modernization target was not "run the same system in containers." The target was a cloud-native operating model built around containers, Kubernetes, declarative APIs, elastic scaling, safe rollout, and automated delivery.

How should the infrastructure move from servers to Kubernetes?

The first operation is to normalize the runtime environment. Before full containerization, the server model created operational friction:

  • Different machine types, kernel versions, and system parameters could make the same application version behave differently.
  • Physical machine and CVM expansion could be slow.
  • Local IDC and cloud management systems did not share a consistent operations model.
  • Monitoring and alerting signals were not unified.

CLS used a staged infrastructure path.

Stage | How it runs | When it helps | Limitation
Server mode | Multiple business processes run on one physical or virtual server | Easy to understand and close to the old operating model | Slow expansion, inconsistent environments, harder lifecycle management
Rich container mode | One container can run multiple processes and use systemd-like process management | Useful for moving from CVM-style operations into containers quickly | It changes the runtime medium but not the application management model deeply enough
Sidecar container mode | One container usually owns one process, and one application can be composed from multiple containers | Better aligned with Kubernetes scheduling, independent lifecycle management, and cloud-native operations | Requires teams to rethink process boundaries and operations scripts

Problem -> operation -> result

Problem | Operation | Result
Mixed server environments create unpredictable behavior | Standardize service runtime through containers | Applications become easier to schedule, replace, and operate
Expansion is slow and expensive | Move workloads into Kubernetes-managed resources | Capacity can be adjusted more quickly than physical-machine planning
Old operations scripts assume a server-like process model | Use rich containers as a transition where needed, then move toward sidecar-style decomposition | Teams can migrate without forcing every service to be rewritten at once

How do you convert stateful services into stateless or near-stateless services?

Kubernetes can restart, delete, replace, or reschedule instances at any time. A logging platform cannot safely use that model if too much request or business state is bound to one instance.

CLS followed a clear direction: make as many applications stateless as possible.

Design point | Stateful service | Stateless or near-stateless service
Request handling | A request may depend on a specific instance | Any equivalent instance can handle the request
Scaling | More complex because state must move or sync | Easier horizontal scaling
Failure tolerance | Lower if local state is lost or hard to migrate | Higher because instances can be replaced
Kubernetes fit | Requires more careful orchestration | Better fit for HPA, rolling upgrades, and failure recovery

Two operations make this possible:

  1. Let multiple instances synchronize data so any instance can be treated as equivalent.
  2. Move state into centralized storage, then let service instances load the required data into a local cache.

For a log ingestion and search platform, this matters because instance churn should not break ingestion, search routing, control-plane behavior, or release rollback.
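
The second operation can be sketched as a read-through cache in front of a central store. This is a minimal illustration, not the CLS implementation: the `CachedReader` class, the `central_store` dictionary standing in for centralized storage, the key names, and the 30-second TTL are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class CachedReader:
    """Read-through cache in front of a centralized store (hypothetical)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # call into the central store
        self._ttl = ttl_seconds  # how long a local copy is trusted
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]        # fresh local copy
        value = self._fetch(key)  # miss or expired: reload from the store
        self._cache[key] = (now, value)
        return value


# A dictionary stands in for centralized storage in this sketch.
central_store = {"topic-42/route": "shard-a"}

# Two interchangeable instances: neither owns the state, both can serve it.
instance_a = CachedReader(central_store.__getitem__, ttl_seconds=30)
instance_b = CachedReader(central_store.__getitem__, ttl_seconds=30)
```

Because the cache is only a copy, a replacement instance started after a reschedule rebuilds it from the central store on first access. The trade-off is staleness: a change in the store is visible to an instance only after its cached entry expires.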

How should configuration management change during cloud-native migration?

Cloud-native systems create more configuration, not less. Network routing, ports, load balancing, database settings, middleware, deployment metadata, and service behavior all need to be controlled.

The unsafe pattern is to copy configuration from one server to another until there are hundreds of similar but not identical copies. That creates drift. Drift creates hidden release risk.

A safer configuration model needs these parts:

Configuration need | Practical implementation
Single trusted source | Keep configuration in one governed system rather than scattered copies
Change history | Record who changed what, when, and why
Variable-driven generation | Reduce manual differences between similar configuration files
CI/CD integration | Deploy configuration with code so changes reach the right components consistently
Compatibility plan | Make old services compatible with the new configuration model before full migration

This is a common failure mode: cloud-native migration often fails when the runtime moves to Kubernetes but configuration management remains pre-cloud-native.
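
Variable-driven generation can be illustrated with Python's standard `string.Template`. This is a sketch under stated assumptions: the template fields, region names, port values, and the `example.internal` hostname are invented for illustration and do not describe the CLS configuration system.

```python
from string import Template

# One template is the single source; regions differ only by variables.
TEMPLATE = Template(
    "region=$region\n"
    "ingest_port=$ingest_port\n"
    "storage_endpoint=storage.$region.example.internal\n"
)

# Hypothetical defaults and per-region overrides.
DEFAULTS = {"ingest_port": "8080"}
REGION_OVERRIDES = {
    "region-a": {},
    "region-b": {"ingest_port": "9090"},
}


def render(region: str) -> str:
    """Render one region's configuration from the shared template."""
    values = {**DEFAULTS, **REGION_OVERRIDES[region], "region": region}
    # substitute() raises KeyError if a template variable has no value,
    # so an incomplete configuration fails at render time, not at runtime.
    return TEMPLATE.substitute(values)
```

Rendering in the CI/CD pipeline means every region's file is derived from the same reviewed template, and the only differences between regions are the ones listed explicitly as overrides.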

How can you use canary rollout for architecture migration?

Architecture migration should not be a single large traffic switch. CLS used a canary rollout strategy to make the upgrade invisible to customers.

The rollout pattern was:

  1. Start with smaller regions and a subset of customers.
  2. Switch traffic gradually after validation.
  3. Keep the old service unchanged while the new service takes over.
  4. Keep the old service for 2 weeks after the switch.
  5. Keep rollback ready so traffic can be switched back if needed.
  6. Prepare compatibility checks, upgrade plans, and validation mechanisms before migration.

Problem -> operation -> result

Problem | Operation | Result
A full switch can expand blast radius | Start with smaller regions and selected customers | Migration risk is constrained
New and old service behavior may differ | Run compatibility checks before traffic movement | Rollback remains practical
Customers should not feel the architecture change | Keep traffic switching gradual and observable | Upgrade becomes a controlled migration instead of a high-risk release
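
One way to make gradual switching and rollback concrete is deterministic bucketing: each customer hashes to a stable bucket, and a per-region percentage decides which buckets go to the new service. The sketch below is an assumption about how such routing could look, not a description of how CLS routes traffic; `bucket`, `route`, and the region names are hypothetical.

```python
import hashlib
from typing import Dict


def bucket(customer_id: str) -> int:
    """Map a customer to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(customer_id: str, region: str,
          canary_percent: Dict[str, int]) -> str:
    """Return "new" or "old" for this customer in this region.

    Regions missing from canary_percent stay entirely on the old service.
    """
    percent = canary_percent.get(region, 0)
    return "new" if bucket(customer_id) < percent else "old"


# Start with a small region at 10%, leave every other region untouched.
rollout = {"small-region": 10}
print(route("customer-1", "small-region", rollout))
print(route("customer-1", "large-region", rollout))  # prints "old"
```

Because buckets are stable, raising the percentage only adds customers to the new service and never moves an already-migrated customer back. Rollback is a configuration change: setting a region's percentage to 0 returns all of its traffic to the old service, which is why the old service has to stay running during the observation period.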

How should HPA be designed for logging traffic?

Logging traffic is often bursty and periodic. A platform can waste large amounts of resources if it reserves peak capacity all the time, but it can also become unstable if it scales down too aggressively.

CLS summarized elastic scaling with three goals:

Goal | What HPA needs to do
Absorb sudden traffic | Scale out quickly beyond the normal traffic baseline
Reduce cost | Avoid keeping peak capacity online all the time
Preserve stability | Coordinate scaling across upstream and downstream services, and support custom metrics

The practical rule is simple:

  • Scale out fast so new Pods can carry sudden log ingestion traffic.
  • Scale in slowly so short CPU drops do not trigger unstable shrink-and-grow loops.

For logging platforms, HPA should not be treated as a CPU-only feature. The platform may need custom metrics because log ingestion QPS, GB/s writes, queue pressure, downstream capacity, and search latency can be more meaningful than CPU alone.
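
The "scale out fast, scale in slowly" rule can be simulated in a few lines. The replica formula below is the one documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`. The `SlowScaleIn` class is a simplified stand-in for the HPA scale-down stabilization window: it counts evaluations rather than seconds and ignores the HPA tolerance band. The per-Pod QPS metric and its values are assumptions for illustration.

```python
import math
from collections import deque
from typing import Deque


def desired_replicas(current: int, metric: float, target: float) -> int:
    """HPA-style recommendation from a per-Pod average metric."""
    return max(1, math.ceil(current * metric / target))


class SlowScaleIn:
    """Scale out immediately; scale in only after a quiet window."""

    def __init__(self, window: int) -> None:
        # Keep the last `window` recommendations.
        self._recent: Deque[int] = deque(maxlen=window)

    def decide(self, current: int, metric: float, target: float) -> int:
        rec = desired_replicas(current, metric, target)
        self._recent.append(rec)
        if rec >= current:
            return rec  # scale out (or hold) without delay
        # Scale in only as far as the highest recent recommendation,
        # so a brief dip cannot shrink the fleet.
        return min(current, max(self._recent))


scaler = SlowScaleIn(window=3)
print(scaler.decide(10, 2000.0, 1000.0))  # burst: 10 -> 20
print(scaler.decide(20, 400.0, 1000.0))   # brief dip: stays at 20
print(scaler.decide(20, 400.0, 1000.0))   # still inside the window: 20
print(scaler.decide(20, 400.0, 1000.0))   # burst has left the window: 8
```

Swapping CPU for a metric such as ingestion QPS per Pod changes only the `metric` and `target` inputs. The asymmetric policy stays the same.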

How do you protect traffic across the request path?

A log service receives external traffic from many customers and also depends on internal systems. Protection needs to exist across the whole path rather than at only one gateway.

CLS combined several layers:

Layer | Protection operation
Client side | Local buffering, backoff, retry, exception reporting
Access path | DNS isolation with wildcard domains, rate limiting, frequency limiting, isolation, blacklist controls
Internal capacity | Elastic capacity that can expand to tens of thousands of CPU cores within minutes in real scenarios
Dependency systems | Disaster recovery, degradation, fallback, and final recovery paths
Observation | End-to-end signals across the client, access layer, service layer, and dependency layer

The key result is not a single traffic-control feature. The result is a layered fault-tolerance model that can protect log ingestion and search even when traffic or dependencies behave abnormally.
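
The client-side layer (local buffering, backoff, retry) can be sketched as follows. This is an illustrative model, not the CLS client: `BufferedSender`, the drop-oldest policy when the buffer is full, the full-jitter backoff variant, and the default timings are all assumptions.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, List, Optional


def backoff_delays(base: float, cap: float, attempts: int,
                   rng: random.Random) -> List[float]:
    """Full-jitter backoff: attempt n waits in [0, min(cap, base * 2**n)]."""
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


class BufferedSender:
    """Hypothetical log client with a bounded local buffer and retries."""

    def __init__(self, send: Callable[[str], None], capacity: int,
                 max_attempts: int = 4, base: float = 0.1, cap: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep,
                 rng: Optional[random.Random] = None) -> None:
        self._send = send
        self._buffer: Deque[str] = deque(maxlen=capacity)
        self._max_attempts = max_attempts
        self._base = base
        self._cap = cap
        self._sleep = sleep
        self._rng = rng or random.Random()
        self.dropped = 0  # counter that could feed exception reporting

    def submit(self, record: str) -> None:
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1  # assumption: evict the oldest record when full
        self._buffer.append(record)

    def _try_send(self, record: str) -> bool:
        delays = backoff_delays(self._base, self._cap,
                                self._max_attempts, self._rng)
        for delay in delays:
            try:
                self._send(record)
                return True
            except ConnectionError:
                self._sleep(delay)  # wait before the next attempt
        return False

    def flush(self) -> int:
        """Send buffered records in order; return how many were delivered."""
        sent = 0
        while self._buffer:
            if not self._try_send(self._buffer[0]):
                break  # still failing: keep records for the next flush
            self._buffer.popleft()
            sent += 1
        return sent
```

Jitter spreads retries from many clients over time, so a recovering access layer is not hit by a synchronized retry wave. The bounded buffer keeps a long outage from exhausting client memory, at the cost of dropping the oldest records.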

How do you move observability from reactive support to proactive diagnosis?

A platform should not discover its own failures only after customers open tickets. CLS treated observability as part of the modernization architecture, not only as a product feature.

The reactive failure pattern looked like this:

  • Problems were discovered from customer feedback.
  • Incidents could last too long before being noticed.
  • Impact scope was hard to explain quickly.
  • Engineering teams stayed in firefighting mode.

The modernization direction was to collect and analyze signals from multiple perspectives:

Perspective | What to observe
User perspective | Whether the user-facing service path is healthy
Application behavior | Service errors, request behavior, and release impact
Middleware systems | Dependencies that can slow down or break the service path
Infrastructure | Runtime resource pressure, capacity, and scheduling health
Business analysis | Whether core product workflows are affected
Tracing and intelligent operations | How incidents propagate and where diagnosis should start

For developers operating a cloud log service, this creates a loop: the platform provides observability to customers, and the platform itself must be observable enough to keep that promise.
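
A user-perspective signal can be as simple as a synthetic probe that writes a marker log and measures how long it takes to become searchable. The sketch below only evaluates collected probe delays against the 3-second target mentioned earlier. Using the 99th percentile as the threshold statistic, and the nearest-rank method to compute it, are assumptions for illustration. The source states only the 3-second goal.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def freshness_ok(delays_seconds: Sequence[float],
                 target_seconds: float = 3.0, p: float = 99.0) -> bool:
    """True if the p-th percentile write-to-searchable delay meets the target."""
    return percentile(delays_seconds, p) <= target_seconds


# Hypothetical probe results: seconds from log write to first search hit.
probe_delays = [0.5] * 98 + [10.0] * 2
print(freshness_ok(probe_delays))  # prints False: p99 is 10.0 seconds
```

Alerting on a signal like this lets the platform detect a freshness regression from its own probes before a customer reports delayed search results.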

How does CI/CD reduce regression and release cost?

Runtime modernization is incomplete if release operations remain manual.

CLS used CI/CD and application orchestration to reduce regression and release cost:

  • More than 1,000 automated test cases were built into the CI pipeline, with an emphasis on cases derived from historical issues.
  • Regression checks helped protect compatibility and version stability.
  • Code analysis and unit-test coverage gates supported release quality.
  • Release work across dozens of regions became less dependent on manual effort, which had previously taken 2 to 3 people time every week.
  • Application orchestration reduced manual error risk and improved release efficiency.

This is the practical boundary of modernization: Kubernetes improves runtime operations, but CI/CD determines whether teams can release safely at platform scale.

What results can this pattern produce?

The CLS cloud-native modernization project took nearly 1 year and went through three major stages. The reported outcomes were:

Result area | Reported outcome
Containerization | From zero to more than 95% application containerization
Operating cost | More than 20 million RMB saved per year
Human effort | More than 2 HC reduced
Resource savings | More than 100,000 CPU cores saved
Scaling speed | Scaling time reduced by 90%
Resource utilization | Improved by more than 40%
Service stability | 99.99%+
Burst handling | Elastic ingestion capacity for PB-level customer scenarios

The broader result was a self-developed architecture around cloud-native technologies such as containers, Kubernetes, declarative APIs, and elastic scaling.

Common pitfalls

Treating containers as a packaging change

If the service remains stateful, configuration remains scattered, release remains manual, and observability remains reactive, the system has not really adopted a cloud-native operating model.

Scaling out without planning scale-in behavior

Fast scale-out protects against traffic spikes, but fast scale-in can create instability when CPU or traffic briefly drops and rises again. For logging platforms, slow scale-in is often safer.

Migrating traffic without rollback design

A canary plan is incomplete if the old service cannot be kept alive, compatibility cannot be validated, or traffic cannot be switched back.

Letting configuration drift survive the migration

Configuration drift can follow a system into Kubernetes. A single trusted source, version history, variables, and CI/CD deployment are part of the migration, not an afterthought.

Observing customer workloads but not the platform itself

A log service may help customers troubleshoot production, but it also needs its own user-perspective, application, middleware, infrastructure, dashboard, analysis, tracing, and intelligent operations signals.

FAQ

How do I modernize a logging platform without breaking log ingestion?

Start by separating runtime modernization from traffic migration. Standardize the infrastructure through containers and Kubernetes, convert services toward stateless or near-stateless behavior, keep rollback ready, and move traffic gradually with canary validation.

Why does a high-throughput log service need Kubernetes HPA?

Logging traffic can jump suddenly because customer business traffic also changes suddenly. HPA helps the service scale out for traffic bursts, reduce idle peak capacity, and preserve stability when it uses the right metrics and slow scale-in behavior.

What should I check before moving a stateful logging service to Kubernetes?

Check whether requests depend on a specific instance, where session or business state is stored, whether multiple instances can synchronize data, and whether centralized storage plus local cache can make instances replaceable.

How should configuration be managed during cloud-native migration?

Use one trusted configuration source, keep change history, define variables for similar configuration files, and deploy configuration through CI/CD together with code. This reduces drift across services and regions.

What is a safe canary rollout pattern for replacing an old logging architecture?

Start with smaller regions and selected customers, validate compatibility, keep the old service for 2 weeks after the new service takes over, and maintain a rollback path so traffic can be switched back if needed.

When should Tencent Cloud CLS or Cloud Log Service capabilities be used in this pattern?

Use Tencent Cloud CLS when the platform needs managed log collection, PB-level ingestion, log storage, search, statistical analysis, data processing, and consumption subscription capabilities, especially when logs feed alerting, monitoring, troubleshooting, or downstream data workflows.

Final checklist

Before modernizing a large-scale logging platform, make sure the plan answers these questions:

  • Can the platform absorb hundreds of thousands of QPS or GB/s-level log writes without manual capacity planning?
  • Can logs remain searchable within the required second-level latency target?
  • Which services must become stateless or near-stateless before HPA is safe?
  • Where is the single trusted source for configuration?
  • How will canary, compatibility validation, and rollback work?
  • Which custom metrics should guide scaling beyond CPU?
  • Where do buffering, retry, rate limiting, isolation, degradation, and fallback apply?
  • Which observability signals prove the platform is healthy from user, application, middleware, and infrastructure perspectives?
  • Which historical incidents should become CI regression cases?
  • Which release steps across regions can be handled through application orchestration?

The main takeaway: full containerization is the visible part of the work. The durable value comes from redesigning the logging platform around Kubernetes operations, governed configuration, safe rollout, elastic scaling, layered traffic protection, end-to-end observability, and automated delivery.
