How Beike Migrated a Large-Scale Observability Platform to CLS

#observability #logging #cloud #devops

Beike operates at a scale where observability is an operations requirement, not a dashboard convenience. To meet that requirement, Beike migrated from self-built operations systems to a cloud-based observability platform built on Tencent Cloud CLS (Cloud Log Service).

The migration problem had three parts:

Original constraint	Detail
Low data linkage	Logs, monitoring, tracing, and other observability data existed in many old systems, with limited connection between systems.
Performance pressure	During daily settlement, write volume could increase by more than 10x. Large business lines already wrote more than 10 billion records per day, and broad queries often timed out.
Data was hard to use	Self-built systems lacked systematic display, consistent formatting, aggregation functions such as IP geolocation, and convenient sharing for dashboard results.

The goal was to build a unified, high-performance, reliable observability platform without heavily invading business logic.

The platform diagram presents a full-stack observability architecture. At the top are data sources such as logs, tracing, metrics, business data, and cloud products. In the middle are data collection, data processing, storage, analysis, dashboards, and alerting. On the output side, the platform supports data sharing, operational dashboards, and AI analysis. This is not only a log search migration; it is a unification of operations data.

Data ingestion: reduce delay while keeping existing collection logic

The first pain point was write delay. During settlement peaks, delayed reporting was unacceptable because teams needed same-day data for verification and incident response.

The first assumption was that expanding cloud resources would solve the delay, but the effect was limited. Further analysis by the CLS and Beike teams found that the bottleneck was mainly in the rdkafka component used by the Fluentd Kafka output. Tuning rdkafka alone could no longer keep up with Beike's scale.

The CLS team then developed a dedicated Fluentd output plugin and published it to the community. With this plugin, data-reporting delay dropped from more than ten minutes to within one minute.

Peak write throughput reaches around 300 GB/min. This is the scale context for the ingestion redesign: the platform needed to absorb traffic bursts rather than only handle average write volume.
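To make the scale concrete, the quoted figures can be converted into per-second rates. This is only a back-of-the-envelope sketch: the constants come from the article, but applying the 10x settlement factor to the average rate of a single large business line is an assumption made here for illustration.

```python
# Back-of-the-envelope conversion of the figures quoted in the article.
PEAK_GB_PER_MIN = 300              # peak write throughput
RECORDS_PER_DAY = 10_000_000_000   # a large business line, records per day
BURST_FACTOR = 10                  # settlement-time write increase (assumed to apply to the average)

SECONDS_PER_DAY = 24 * 60 * 60

peak_gb_per_sec = PEAK_GB_PER_MIN / 60
avg_records_per_sec = RECORDS_PER_DAY / SECONDS_PER_DAY
burst_records_per_sec = avg_records_per_sec * BURST_FACTOR

print(f"peak write rate:     {peak_gb_per_sec:.1f} GB/s")
print(f"average record rate: {avg_records_per_sec:,.0f} records/s")
print(f"burst record rate:   {burst_records_per_sec:,.0f} records/s")
```

The peak works out to 5 GB/s, and a single large business line averages roughly 115,700 records per second before any burst, which is why sizing for the average alone was not sufficient.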

Multi-source ingestion without replacing every collector

Beike's environment included Prometheus-based metrics, SkyWalking-based tracing, and a mix of Elasticsearch- and Loki-style log systems for network, business, security, and other logs. Most environments had already moved to containers, and Fluentd was widely used for log collection, but each business department had its own collection logic.

The easiest migration path was to keep the existing collection method and change the target endpoint where possible.

The architecture uses five ingestion lanes:

business logs collected by Fluentd are written through the Kafka protocol;
security logs collected by Winlogbeat are written through the Kafka protocol;
tracing data from SkyWalking is written through an API path;
TKE audit logs are collected by LogListener through an agent path;
metrics written through SDKs are ingested as cloud-product log data.

This explains why the migration was low-intrusion: teams could preserve much of the existing collection stack while moving storage, search, and analysis into CLS.
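The five lanes can also be written down as a small lookup table, which makes it easy to see which collectors share an ingestion path. The sketch below is purely illustrative: the `Lane` type and `collectors_by_path` helper are hypothetical names, and the entries only restate the list above.

```python
from typing import NamedTuple


class Lane(NamedTuple):
    """One ingestion lane: what is collected, by which tool, over which path."""
    data: str
    collector: str
    path: str


# Entries restate the five lanes described in the article.
LANES = [
    Lane("business logs", "Fluentd", "Kafka protocol"),
    Lane("security logs", "Winlogbeat", "Kafka protocol"),
    Lane("tracing data", "SkyWalking", "API"),
    Lane("TKE audit logs", "LogListener", "agent"),
    Lane("metrics", "SDK", "cloud-product log data"),
]


def collectors_by_path(lanes):
    """Group collector names by the ingestion path they use."""
    grouped = {}
    for lane in lanes:
        grouped.setdefault(lane.path, []).append(lane.collector)
    return grouped


for path, collectors in collectors_by_path(LANES).items():
    print(f"{path}: {', '.join(collectors)}")
```

Grouping by path shows that the two highest-volume log collectors share the Kafka protocol lane, so only their output target had to change.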

Beike also configured traffic-change alerts for key business modules so that traffic shifts could be detected before they escalated into incidents.
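A traffic-change alert of this kind usually compares the current window against a baseline. The following is a minimal sketch of that idea, not the CLS alerting configuration: the function names and the 50% default threshold are assumptions chosen for illustration.

```python
def traffic_change_ratio(baseline: float, current: float) -> float:
    """Relative change of current traffic versus baseline (0.5 means +50%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def should_alert(baseline: float, current: float, threshold: float = 0.5) -> bool:
    """Alert when traffic rises or falls by at least `threshold` (hypothetical default)."""
    return abs(traffic_change_ratio(baseline, current)) >= threshold


# Example: log volume per minute for one module, baseline versus now.
print(should_alert(baseline=1000, current=1600))  # a 60% rise
print(should_alert(baseline=1000, current=1100))  # a 10% rise
print(should_alert(baseline=1000, current=400))   # a 60% drop
```

Using the absolute value means a sudden drop in traffic, which often signals a broken collector or upstream outage, is treated the same way as a spike.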

Data processing: structure raw logs before storage

Beike had many business departments, so log formats were inconsistent. A single central parser would not be enough; each business line needed its own configurable parsing rules.

The CLS data-processing canvas supports visual processing before logs are stored in a topic. In this example, business logs are first split by delimiters and then fields are extracted with regular expressions. The displayed data is simulated.
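The two-step flow described above (delimiter split, then regular-expression extraction) can be sketched outside the product as follows. This is an illustration of the technique only, not the CLS processing syntax: the log layout, the field names, and the `key=value` convention are all assumptions.

```python
import re

# Hypothetical layout: timestamp|service|level|free-text message
FIELD_NAMES = ("timestamp", "service", "level", "message")
# Hypothetical convention: the message carries key=value pairs.
KV_PATTERN = re.compile(r"(\w+)=(\S+)")


def parse_line(line: str, delimiter: str = "|") -> dict:
    """Step 1: split on the delimiter. Step 2: regex-extract key=value fields."""
    parts = line.rstrip("\n").split(delimiter, len(FIELD_NAMES) - 1)
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    record = dict(zip(FIELD_NAMES, parts))
    for key, value in KV_PATTERN.findall(record["message"]):
        # Do not let an extracted key overwrite a top-level field.
        record.setdefault(key, value)
    return record


sample = "2024-05-01 12:00:00|order-service|ERROR|uid=42 cost=135ms path=/api/pay"
print(parse_line(sample))
```

Keeping the delimiter and the pattern as parameters mirrors the point of the section: each business line supplies its own rules instead of relying on one central parser.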

Data analysis: make massive logs searchable and cheaper to retain

Two related problems appear at this scale: some logs must be stored for a long time due to compliance, while full-volume aggregation over very large datasets hurts analysis efficiency.

The solution combines:

Hybrid storage: short-term hot data supports analysis, and long-term cold data can move to low-frequency storage but remains queryable.
Scheduled SQL: complex raw logs are aggregated into business-level metrics and saved for long-term monitoring.

For Beike security logs, Windows event logs from employee office environments were collected into CLS. The security team configured more than one thousand SQL rules to aggregate by rule name, alert level, and host name. Scheduled SQL summarized results every minute, reducing complex logs into the indicators the business cared about.
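The per-minute summarization is conceptually a grouped count over a time bucket. The sketch below shows that reduction in plain Python under stated assumptions: the event field names (`time`, `rule_name`, `alert_level`, `host_name`) and the sample events are hypothetical, and the code does not use CLS SQL syntax.

```python
from collections import Counter
from datetime import datetime


def minute_bucket(ts: str) -> str:
    """Truncate an ISO-8601 timestamp to its minute."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:%M")


def aggregate(events):
    """Count events per (minute, rule name, alert level, host name)."""
    counts = Counter()
    for event in events:
        key = (
            minute_bucket(event["time"]),
            event["rule_name"],
            event["alert_level"],
            event["host_name"],
        )
        counts[key] += 1
    return counts


# Simulated events, for illustration only.
events = [
    {"time": "2024-05-01T12:00:05", "rule_name": "failed-login", "alert_level": "high", "host_name": "pc-001"},
    {"time": "2024-05-01T12:00:40", "rule_name": "failed-login", "alert_level": "high", "host_name": "pc-001"},
    {"time": "2024-05-01T12:01:10", "rule_name": "failed-login", "alert_level": "high", "host_name": "pc-001"},
]

for key, count in sorted(aggregate(events).items()):
    print(key, count)
```

Storing only these grouped counts is what lets the summarized metrics be retained far longer than the raw events they were derived from.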

After switching to CLS, real-time retrieval over more than 50 billion log records took about 10 seconds on average, and retrieval efficiency improved by more than 6x compared with the original system.

The operational view combines cards, charts, and log records: high-level indicators for scanability, charts for trend review, and raw records for drill-down.

Result display: dashboards and DataSight sharing

Before migration, Beike used open-source display components such as Grafana. Those systems had fixed presentation forms, required complex configuration, and were not convenient for sharing inside domestic office workflows.

After data was collected into CLS, Beike could configure multiple dashboards in the product console and share them to PC or mobile through the independent DataSight console.

This dashboard contains multi-dimensional charts such as traffic trend, distribution, and summary indicators. The displayed data is simulated, but the workflow is the real point: business teams can monitor operations through reusable dashboards instead of repeated ad hoc searches.

The summary visual reinforces the platform's role across real-time network dashboards, operations dashboards, multi-end sharing, and reporting. It connects the technical migration to daily operations usage.

Access control and smooth user migration

Beike already had more than one thousand independent R&D users in its internal operations platform, with permission boundaries by business area. Creating Tencent Cloud accounts for everyone was unrealistic.

CLS DataSight solved this through an embedded, independent console:

it can be embedded into the existing internal system;
it supports internal and external network access modes;
it provides an independent login entry point and customizable account-password login;
it can connect to the user's LDAP system and inherit existing permission logic.
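Inheriting existing permission logic typically means resolving a user's directory groups to the data they may see. The following is a minimal sketch of that mapping and not the DataSight API: the group names, topic names, and helper functions are all hypothetical, and a real deployment would read group membership from LDAP rather than a hard-coded dictionary.

```python
# Hypothetical mapping from directory groups to readable log topics.
GROUP_TOPICS = {
    "ops-network": {"network-logs"},
    "ops-security": {"security-logs", "audit-logs"},
    "biz-settlement": {"settlement-logs"},
}


def allowed_topics(user_groups):
    """Union of topics granted by every group the user belongs to."""
    topics = set()
    for group in user_groups:
        topics |= GROUP_TOPICS.get(group, set())
    return topics


def can_read(user_groups, topic: str) -> bool:
    """True if any of the user's groups grants access to the topic."""
    return topic in allowed_topics(user_groups)


print(can_read(["ops-security"], "audit-logs"))
print(can_read(["biz-settlement"], "security-logs"))
```

Because access is derived from groups that already exist, no per-user cloud account has to be created, which matches the constraint described at the start of this section.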

Reported results

The migration outcomes are:

more than one thousand business sections were connected to CLS in one person-day;
old and new systems switched smoothly without changing user habits;
under 10x peak write traffic, data-reporting delay dropped from more than ten minutes to within one minute;
overall business efficiency improved by 20x;
retrieval over tens of billions of logs moved from minute-level to second-level;
retrieval efficiency improved by 6x+;
dashboards and traffic-change alerts made operations more visible and proactive.

Reusable migration pattern

The Beike case suggests a practical sequence for large observability migrations:

identify whether the true bottleneck is storage, query, collector output, or parsing;
preserve existing collection protocols where possible;
route logs, tracing, audit, and metrics into one analysis platform;
structure logs before storage through visual processing rules;
use scheduled SQL to turn massive raw logs into long-lived metrics;
separate hot and cold storage to balance cost and query requirements;
expose dashboards through an access model that matches the organization's existing identity system.