From Data Silos to Intelligent Insights: Building a Future-oriented Operation Intelligence System
As the digital world runs around the clock, systems generate massive amounts of data at all times; we refer to this collectively as operation data. Operation data not only records system performance but also contains key insights that drive business growth, ensure system stability, and prevent security risks. More importantly, such data can be made both observable and intelligent: R&D engineers, O&M engineers, security experts, and business decision makers all mine the value within operation data and accumulate industry knowledge from it. In this article, we focus on the core characteristics of operation data, the challenges organizations face in leveraging it, and how a systematic approach can help build true operation intelligence capabilities.
Today, the key data of enterprises typically comes from three major dimensions, each playing a unique role:
● Technical data: the "electrocardiogram" of a system. This is the most fundamental and abundant source of data, encompassing logs, metrics, traces, and event alerts. Like a system's vital signs, it reflects real-time conditions such as cluster usage, database load, and service invocation stability. The primary goal of technical data is to ensure business stability. It is the core focus of O&M teams and forms the foundation of system observability.
● Business data: the "growth engine" of an enterprise. This includes data from scenarios such as user behaviors, transaction flows, marketing campaigns, and customer relationship management (CRM). Such data enables real-time evaluation of performance: how many new users a campaign acquired, how users adopt new features, and whether there is a risk of user churn. Business data directly correlates with commercial outcomes. It guides product iterations and validates the effectiveness of market strategies, serving as the driving force behind business growth.
● Security data: the "immune system" of an enterprise. This covers security logs, access control records, and intrusion detection alerts. Such data helps identify abnormal logins, suspicious operations, and potential attacks, allowing enterprises to detect data breaches or internal compliance risks at the earliest opportunity. The value of security data lies in proactive risk prevention, ensuring that enterprises grow steadily while maintaining compliance.
Despite the importance of each data category, we must recognize one thing: looking at a single category in isolation often leads to incomplete or even misleading insights. The real value lies in connecting data across dimensions so that the different sources can cross-validate one another. For example:
● When we identify key users, can we combine their business behaviors with their technical experience to offer service guarantees with a higher priority?
● During a marketing campaign, can we detect malicious operations by linking sudden spikes in traffic with abnormal login patterns?
● When a system experiences performance fluctuations, can we analyze infrastructure metrics, user reports and feedback, and security access logs together to locate the root cause?
Only by deeply integrating technical, business, and security data can we shift from reactive response to proactive prediction and truly enable data-driven intelligent decision making.
Looking back, the way enterprises handle data has gone through several typical stages:
● Manual era – "firefighting"-style O&M: In the early days, troubleshooting depended on manually logging in to jump servers, reviewing logs line by line, and verifying issues one by one. Each engineer was a "firefighter": responses were slow, efficiency was low, and results depended heavily on personal experience.
● Script era – early automation: As monitoring scripts emerged, teams began to automate parts of the process and proactively detect known issues. However, this led to a new problem: alert storms. Large volumes of low-value alerts buried critical ones, overwhelming O&M teams. More importantly, scripts could only detect known problems, offering no help for new ones.
● Platform era – physically aggregated but logically separated: With the rise of data platforms, enterprises began to aggregate data centrally, reducing data access challenges. However, most platforms achieved only physical aggregation, not semantic integration. Cross-domain analysis still required manual work by experts, making communication costly and analysis slow, which was far from supporting real-time decision making.
● AI-driven era – breaking cognitive barriers: Today, AI offers us countless possibilities. We are no longer satisfied with the pattern of humans finding problems, but expect AI to help humans find problems, or even AI to find what humans never thought of.
To achieve this transformation, we must face a key reality: Raw data is not naturally AI-ready. To make better use of AI, we need to deeply understand the core traits of each data category.
Although these three categories of data come from different sources, they share the same pain points: enormous volume but low information density, rapid changes that models cannot easily adapt to, and a lack of context that makes intent hard to judge from a single record. AI models, on the other hand, require highly correlated, semantically clear, and concise data, and most raw data today is far from meeting that standard. When we try to apply AI to operations scenarios, three major gaps often appear:
● Data gap: Raw data is fragmented, noisy, and unstructured. Over 99% of the data may be irrelevant, preventing AI from finding meaningful signals.
● Model gap: AI models are often seen as black boxes with opaque reasoning, and may produce hallucinations, which are plausible but false results.
● Engineering gap: Managing the collection, cleansing, storage, and computation of petabytes of data on a daily basis places huge demands on performance, cost, and security.
These challenges are deeply intertwined, creating breakpoints in AI value realization. Tackling them one by one rarely resolves the fundamental problem.
Breaking Through the Barriers: Building a Systematic "Data Alchemy" Framework
To bridge these gaps, we need a systematic methodology—a process we call "data alchemy". Just like extracting metal from ore, this process transforms low-density raw data into high-value intelligent signals. It consists of three key steps:
Unified Foundation: Building an Integrated Data Platform
All intelligent operations rely on a unified, reliable, and high-performance data storage and processing platform. We have built an observability infrastructure based on a distributed architecture that fully supports real-time ingestion and persistent storage of multimodal data such as logs, metrics, and traces. Since last year, we have upgraded all environments to a three-zone high-availability architecture, ensuring data reliability and security; the upgrade is completely transparent to users and comes at no extra cost.
This year, we are introducing the UModel modeling mechanism, which enables semantic correlation and unified modeling across different data domains, truly breaking down data silos.
Deep Refinement: Increasing Information Density
Raw data needs to go through multiple layers of processing before it can become high-quality signals available to AI:
● Structured extraction: uses pattern recognition and parsing technologies to extract key information such as entities, metrics, and events from unstructured logs.
● Context completion: combines domain knowledge to precisely map technical IDs (such as trace_id) to business IDs (such as user_id), enriching the data semantics.
● Semantic enhancement: uses embedding technology to generate vector representations that support natural-language semantic retrieval, so that newly written data becomes semantically searchable within 10 seconds.
Only when data carries complete context and semantic meaning does subsequent analysis become valuable.
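As a minimal illustration of the extraction and context-completion steps above, the Python sketch below parses a made-up log line and maps its trace_id to a user_id via a mock lookup table. The log format, field names, and mapping are all hypothetical, not the platform's actual schema.

```python
import re

# A made-up raw access-log line; format and field names are illustrative only.
raw_log = "2024-05-01T12:00:03Z level=ERROR trace_id=ab12cd34 path=/api/flights latency_ms=870"

# 1. Structured extraction: pull entities, metrics, and events out of unstructured text.
pattern = re.compile(
    r"(?P<ts>\S+)\s+level=(?P<level>\w+)\s+trace_id=(?P<trace_id>\w+)\s+"
    r"path=(?P<path>\S+)\s+latency_ms=(?P<latency_ms>\d+)"
)
event = pattern.match(raw_log).groupdict()
event["latency_ms"] = int(event["latency_ms"])

# 2. Context completion: map a technical ID (trace_id) to a business ID (user_id),
#    here via a mock in-memory lookup standing in for domain knowledge.
trace_to_user = {"ab12cd34": "user_42"}
event["user_id"] = trace_to_user.get(event["trace_id"])

print(event)
```

In practice the semantic-enhancement step would then embed the enriched record into a vector space; that part is omitted here because it depends on a specific embedding model.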
Intelligent Signal Generation: Unlocking the Potential of AI
After integration and refinement, the processed data becomes an ideal input for AI models. Whether for anomaly detection, root cause analysis, trend prediction, or risk warning, AI processing accuracy improves dramatically. We are now working to build a "far-sighted" capability framework, enabling systems not only to detect problems but also to predict problems, explain problems, and recommend solutions.
To support this vision, we have built an end-to-end, one-stop observability platform, Simple Log Service (SLS). Its core mission is to provide a "god's-eye view": all the data you need is available in one place, ready to be queried, correlated, and analyzed at any time. There is no need to switch between multiple systems or manually stitch data together. A single query can traverse logs, metrics, and traces to complete correlation analysis, while the platform automatically aligns context and unifies semantics, with no manual integration required.

This is not just a feature upgrade; it represents a fundamental shift in mindset. We are moving from a rule-driven era of automation to a data-driven, AI-enhanced era of intelligence. In this new paradigm, data is no longer a passive record of events but an active source of value creation. The future belongs to enterprises that can sense faster, understand deeper, and foresee earlier. What we are building is precisely the kind of intelligent infrastructure that enables organizations to develop a "digital sixth sense".
Building an Integrated Observability Platform: Making Data Truly Queryable, Connectable, and Analyzable
In the previous section, we explored the core idea behind operation intelligence: integrating technical data, business data, and security data to move from passive response to proactive insight. To realize this vision, a powerful, unified, and intelligent data foundation is required.
In this section, we will take a closer look at our practical work in platform capability development—how we are building a truly one-stop observability platform with a "god's-eye view".
Unified Storage: Building a High-availability, High-performance Data Foundation
Since last year, SLS has been fully upgraded to a three-zone high-availability architecture across all supported environments, ensuring cross-zone disaster recovery, durability, and reliability. The upgrade process is fully transparent to users and incurs no additional cost.

On top of this distributed storage foundation, we have achieved real-time collection and processing of multi-source data. SLS supports open-source agents, is compatible with various custom data formats, and ingests data into the platform within milliseconds. Different types of observability data, such as logs, metrics, traces, and events, are stored in tiers based on their characteristics and usage scenarios, ensuring on-demand storage and efficient utilization. This year, we are introducing a unified data modeling mechanism that supports semantic-level association and modeling across distributed data sources, laying a solid foundation for integrated analysis.

As we often say, "If data isn't stored well, it can't be retrieved fast, and no matter how powerful your computation is, its value won't be realized." In the face of heterogeneous, high-frequency, and massive data, a single storage solution cannot meet all business requirements. To address this, we have designed multiple storage and indexing strategies to adapt to different workloads:
● Inverted indexes: second-level retrieval across hundreds of billions of log entries. For unstructured log data, inverted indexing enables rapid keyword location, so even petabyte-scale log stores can return query responses within seconds (a toy sketch of the idea follows after this list).
● In-memory acceleration layer: designed for high-concurrency, low-latency analysis. For real-time aggregation and large-scale analytical workloads, hot data is cached in memory, dramatically improving the performance of complex queries.
● Vector indexes: natively support semantic search. The AI era places higher demands on data understanding. With built-in embedding capabilities, the system automatically vectorizes newly written data and stores the resulting vectors; within 10 seconds of being written, data becomes accessible through semantic queries, making natural language search a reality.
● Real-time engine optimization: handles sustained high-throughput workloads. For massive, long-running, high-throughput data streams, we have deeply optimized the underlying real-time engine with built-in compression, downsampling, and end-to-end performance tuning, significantly improving write throughput and read efficiency.
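To make the first item above concrete, here is a toy, in-memory sketch of how an inverted index answers keyword queries. It illustrates the general technique only, not SLS's actual distributed, compressed index; all names and data are made up.

```python
from collections import defaultdict

# Toy inverted index: maps each token to the set of log IDs that contain it.
logs = {
    1: "payment service timeout on db shard 3",
    2: "login success from new device",
    3: "payment retry succeeded after timeout",
}

index = defaultdict(set)
for log_id, text in logs.items():
    for token in text.split():
        index[token].add(log_id)

def search(*keywords):
    """Return IDs of logs containing all keywords (AND semantics)."""
    sets = [index.get(k, set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

print(search("payment", "timeout"))  # {1, 3}
```

The same intersection of posting lists generalizes to distributed shards, which is what makes second-level keyword retrieval over massive log volumes feasible.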
In addition, graph storage and context association provide the foundation for AI reasoning. Beyond traditional storage modes, we have introduced graph storage capabilities to address the problem of missing context. Only when different types of data are semantically connected can AI truly understand their underlying meaning. For example, by associating information such as a user's device, IP address, account, and behavior, we can link isolated security alerts or technical faults into a complete attack chain or fault propagation path. This greatly enhances the reasoning ability of AI in anomaly detection and root cause analysis. This is the essence of our data modeling capability: not just about putting data together, but about making data "know each other".
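To illustrate the context-association idea, the sketch below links a few hypothetical entities (a user, a device, an IP address, and several events) into one connected component using the open-source networkx library. It is a conceptual stand-in, not the platform's actual graph storage or schema.

```python
import networkx as nx

# A toy entity graph: user, device, IP, and events as nodes; relationships as edges.
# Node names and relations are illustrative only.
g = nx.Graph()
g.add_edge("user_42", "device_new", relation="logged_in_from")
g.add_edge("device_new", "ip_203.0.113.7", relation="connected_via")
g.add_edge("user_42", "alert_remote_login", relation="triggered")
g.add_edge("user_42", "api_flight_search_burst", relation="performed")
g.add_edge("user_42", "order_ticket_redeemed", relation="performed")

# Everything reachable from one security alert forms a candidate attack/fault chain.
chain = nx.node_connected_component(g, "alert_remote_login")
print(sorted(chain))
```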
Intelligent Compute Engine: Balancing Deep Analytics and Extreme Real-time Performance
With a robust storage foundation, the next step is to build a flexible and efficient compute engine. We face two typical types of requirements: deep analytical tasks that involve massive data volumes and complex computations requiring absolute precision, and real-time interactive tasks (such as dashboard displays) that demand low latency and high responsiveness. To meet these requirements, we have implemented several key upgrades at the compute layer:
● Fully accurate mode: breaking resource limits to ensure result integrity. In the past, to control resource consumption, queries were often subject to underlying restrictions that could cause missing records in query results. In deep analytical scenarios, however, even a single missing record can distort conclusions. We re-approached this problem by removing all prior resource constraints and keeping only an execution duration limit, which defaults to 10 minutes. Within this window, the system can split a task into thousands of subtasks, execute them as a distributed pipeline across large clusters, and ultimately return complete, accurate results (a minimal sketch of this splitting idea follows after this list).
● Automatic materialized views: boosting dashboard performance by orders of magnitude. For frequently accessed dashboards or fixed analysis templates, we introduced automatic materialized views. As data flows into the platform, the system proactively performs precomputations, generating intermediate results and storing them as materialized views. When a query is initiated, the engine intelligently determines whether existing results can be reused and dynamically merges reusable results with raw data (see the reuse sketch below). Users do not need to modify SQL statements or manually manage materialized tables: queries are still written against raw data, and the platform handles acceleration automatically. In real-world applications, some customers' dashboards contain over a dozen charts, each processing more than 20 billion data entries. These dashboards used to suffer from long load times and poor responsiveness; after materialized views were enabled, latency dropped by one to two orders of magnitude, significantly improving the user experience.
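As a rough illustration of the fully accurate mode in the first bullet, the sketch below splits a job into many subtasks, runs them in parallel, and merges every partial result under a single overall time limit. Everything here is hypothetical and in-process: a thread pool stands in for the real distributed pipeline, and scan_shard is a made-up function.

```python
import concurrent.futures as cf

DEADLINE_SECONDS = 600  # stands in for the default 10-minute execution limit

def scan_shard(shard_id: int) -> int:
    """Hypothetical stand-in for scanning one data shard and returning a partial count."""
    return shard_id % 7  # fake partial result

def run_fully_accurate(num_shards: int = 1000) -> int:
    total = 0
    with cf.ThreadPoolExecutor(max_workers=32) as pool:
        futures = [pool.submit(scan_shard, s) for s in range(num_shards)]
        # Only the overall duration is bounded; every subtask's result is merged,
        # so the final answer is complete rather than sampled or truncated.
        for fut in cf.as_completed(futures, timeout=DEADLINE_SECONDS):
            total += fut.result()
    return total

print(run_fully_accurate())
```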
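And here is a toy sketch of the materialized-view reuse logic from the second bullet: precomputed aggregates answer the historical part of a query, and only the still-open time range is read from raw data. The table and function names are illustrative, not the platform's actual mechanism.

```python
# Precomputed hourly sums maintained as data flows in (the "materialized view").
materialized_hourly = {"2024-05-01T10": 1200, "2024-05-01T11": 1350}

# Raw events for the still-open hour that has not been materialized yet.
raw_current_hour = [("2024-05-01T12", 30), ("2024-05-01T12", 45)]

def total_requests(hours):
    """Answer a dashboard query by merging materialized results with fresh raw data."""
    total = sum(materialized_hourly.get(h, 0) for h in hours)
    total += sum(v for h, v in raw_current_hour if h in hours)
    return total

print(total_requests(["2024-05-01T10", "2024-05-01T11", "2024-05-01T12"]))  # 2625
```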
Unified Modeling: Enabling AI to Understand the "Complete Story"
True intelligence lies not in identifying isolated anomalies, but in reconstructing the full logic behind an event. For example, consider the following sequence: a user fails multiple remote login attempts, successfully logs in from a previously unseen device, then repeatedly calls a flight search API operation, sharply increasing database CPU usage, and finally redeems a high-value flight ticket. Let us view these events separately:
● Security logs: repeated remote login attempts → suspicious operation?
● Technical metrics: API call surge and CPU spike → fault?
● Business data: reward redemption → normal transaction?
Individually, none of these events reveals the truth. But when we connect the dots through unified modeling, the full sequence emerges: a new device logs in → repeatedly scans flight data → rapidly redeems a high-value ticket. Together they expose a clear pattern of a mileage fraud attack. This is the essence of our unified modeling layer, which comprises three core components:
● Entity modeling: defines key objects such as users, devices, and sessions.
● Observability data modeling: describes objects such as logs, metrics, and traces generated by those entities.
● Relationship modeling: captures correlations among entities, including invocations, ownership, and behavioral sequences.
Although the models are designed manually, they primarily serve AI reasoning: they provide the foundation for context completion and enable cross-domain analysis.
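As a mental model of these three layers, the sketch below uses plain Python dataclasses with hypothetical names. It is not the actual UModel schema, only an illustration of how entities, their observability data, and their relationships fit together.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:          # entity modeling: key objects such as users, devices, sessions
    entity_type: str
    entity_id: str

@dataclass
class Observation:     # observability data modeling: logs/metrics/traces emitted by entities
    entity: Entity
    kind: str          # "log" | "metric" | "trace"
    payload: dict = field(default_factory=dict)

@dataclass
class Relationship:    # relationship modeling: invocations, ownership, behavioral sequences
    source: Entity
    target: Entity
    relation: str

user = Entity("user", "user_42")
device = Entity("device", "device_new")
login = Observation(user, "log", {"event": "login", "result": "success"})
owns = Relationship(user, device, "logged_in_from")
```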
Unified Query Language SPL: Bridging the Last Mile between Data Silos
Even with unified storage and modeling, analysis remains inefficient if users still need to switch between multiple syntaxes and systems. To address this, we introduced Search Processing Language (SPL). It serves as a unified entry point that abstracts away underlying differences and enables integrated queries across diverse data types, including logs, metrics, traces, graph data, and entity models.

More importantly, SPL is not just a query language; it is a capability that spans the entire data lifecycle. At the ingestion stage, SPL can perform field extraction, regular expression parsing, and structural transformation in advance, reducing the downstream processing load. During Flink consumption, SPL allows you to push data processing logic to the platform side, ensuring highly structured outputs. At the frontend, a single SPL query can generate various types of visualized charts, such as time series charts, heatmaps, and topology maps.

SPL also provides powerful extensibility. It can call external functions, integrate with LLMs for assisted judgment, and orchestrate advanced analytical workflows such as graph rendering and anomaly detection. Here is an example: a single SPL script automatically extracts key events from logs → generates time series curves → detects anomalies → renders them into visual charts, all in a fully automated pipeline.
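Since SPL syntax is platform-specific, the sketch below re-creates that example pipeline in plain Python to show the logic a single SPL script would orchestrate: extract key events from logs, build a time series, and flag anomalous buckets. All data and thresholds are made up.

```python
import re
from collections import Counter

# Made-up log lines; the pipeline mirrors: extract events -> time series -> detect anomalies.
logs = [
    "12:00 ERROR payment timeout", "12:00 INFO ok",
    "12:01 ERROR payment timeout", "12:01 INFO ok",
    "12:02 ERROR payment timeout", "12:02 ERROR payment timeout",
    "12:02 ERROR payment timeout", "12:02 ERROR payment timeout",
    "12:02 ERROR payment timeout",
]

# 1. Extract key events (ERROR lines) and bucket them by minute.
errors_per_minute = Counter(
    line.split()[0] for line in logs if re.search(r"\bERROR\b", line)
)

# 2. Build a simple time series and 3. flag buckets far above the average.
series = sorted(errors_per_minute.items())
mean = sum(count for _, count in series) / len(series)
anomalies = [(minute, count) for minute, count in series if count > 2 * mean]

print("series:", series)        # [('12:00', 1), ('12:01', 1), ('12:02', 5)]
print("anomalies:", anomalies)  # [('12:02', 5)]
```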
Let us revisit the earlier mileage fraud case to see how much more efficient analysis becomes on a unified platform. A single SPL query can drive the entire workflow: retrieve the suspect user's entity profile, including common devices and login patterns → correlate the current login behavior, API call frequency, and order operations → identify combinations of abnormal logins and high-risk operations → feed the structured behavior sequence into an LLM for assisted analysis and recommendation generation.
The entire workflow requires no cross-system operations or team handoffs. From detection and analysis to decision support, everything completes within minutes. Compared to the previous "firefighting" style of manual investigation that relied on multiple tools and teams, efficiency improves by orders of magnitude.
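To make the last step concrete, the sketch below shows how a structured behavior sequence assembled from the unified model could be handed to an LLM for assisted judgment. The call_llm function is a hypothetical placeholder rather than the platform's actual LLM integration, and all data is invented.

```python
import json

# Structured behavior sequence assembled from the unified model (values are illustrative).
behavior_sequence = [
    {"step": 1, "source": "security", "event": "failed remote logins", "count": 5},
    {"step": 2, "source": "security", "event": "login from unseen device"},
    {"step": 3, "source": "technical", "event": "flight-search API burst", "qps": 400},
    {"step": 4, "source": "business", "event": "high-value ticket redeemed"},
]

prompt = (
    "You are a risk analyst. Given this user's behavior sequence, assess whether it "
    "indicates mileage fraud and recommend an action:\n"
    + json.dumps(behavior_sequence, indent=2)
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for the platform's LLM integration."""
    return "Likely mileage fraud: freeze redemption and require re-authentication."

print(call_llm(prompt))
```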
Conclusion: From Tool Integration to Capability Convergence
What we have built today is no longer just a log platform or a monitoring system. It is a one-stop observability platform that integrates storage, compute, modeling, querying, and intelligent analytics. Its core value lies in three pillars:
● Unified: All data is on one platform.
● Connected: All events can be correlated.
● Intelligent: All analyses can be accelerated.
We believe the future of O&M lies not in finding problems, but in foreseeing them. The future of decision-making will no longer rely on experience, but on a complete data-driven context. When data becomes truly visible, connected, and understandable, an enterprise's digital perception is truly awakened.