UModel Data Governance: Practice of Building an O&M World Model

This article introduces UModel, Alibaba Cloud's ontology that transforms observability into a unified, model-driven digital twin of IT systems.

Start Afresh from Observability
Looking back at the evolution of observability over the past few decades, we see fundamental changes in how we understand complex systems.

In the early days, we built monitoring systems that were centered on single data types, such as CPU, memory, and disk. In those systems, isolated metrics showed what went wrong. As system complexity increased, we began to collect multiple types of data in parallel, including logs, metrics, and traces, as we attempted to observe a single system from different angles.

The next step was recognizing that these data types are correlated. This led to correlation analysis across logs, metrics, and traces, the so-called three pillars of observability. However, these correlations were often stitched together along the time, resource, or call chain dimension and lacked deep semantic connections, so they remained shallow.

All of these progressive paths share an implicit assumption: collect data first, and then infer the true status of the relevant system from the data. As distributed systems grow more complex, this cognitive path of "from data to world" reveals fundamental limits.

The core challenge is that inferring real-world system status from massive, heterogeneous, and constantly changing data is essentially a very hard reverse engineering problem. Data represents observable phenomena, but wide cognitive gaps exist between phenomena and reality.

If we review this issue from first principles, a more effective path emerges, which is to build a model for the real or IT world first, and then collect data in a focused manner. This represents a shift from data-driven to model-driven observability, changing from observing surface symptoms to modeling underlying essence.

Why Modeling Becomes Essential in the AI Era
Large language models (LLMs) give us powerful reasoning capabilities, but in practice, we have found that directly connecting these models as "brains" to raw observability data streams yields limited results and can even be counterproductive. Without a structured understanding of the specific domain, AI systems can rarely make accurate judgments when fed data that lacks context and relationships. This limit is especially pronounced in observability.

Drawing on years of experience in observability and especially artificial intelligence for IT operations (AIOps), the Alibaba Cloud observability team began to model observability data in 2019, continuously iterating and refining the approach. This evolution went through the following key stages:

1. Data standardization: addressed inconsistent data formats by adopting the OpenTelemetry open standard, enabling interoperability across different observability data types.
2. Entities and relationships: introduced the concept of entity sets to describe core objects in the IT world, and used links to describe relationships such as invocation, dependency, and ownership, forming a topology for IT systems.
3. Knowledge representation: distilled expertise in different domains and expressed it by using semantic sets, so that a system contained not only data, but also context and methods for analyzing data.
4. Actionability: modeled operational actions, capturing the knowledge of "how to respond" as part of the knowledge base, so that a system could support a closed loop from observation to analysis to action.

These stages mirror our experimentation and refinement of AIOps. As AI rapidly progressed from theory to implementation to real-world impact, we released UModel, a structured model for the O&M world that connects data, logic, and actions. UModel provides AI systems with a knowledge graph that they can reason over and interact with.

UModel: A Generalizable Digital Ontology for O&M
UModel is a unified modeling framework for the IT world, built up from the observability domain and grounded in ontological thinking. It is not just an abstraction of data, but a complete system that integrates data, knowledge, and actions into a whole.

This modeling approach is similar to that of industry-leading Palantir, which started in the defense and security domain and gradually became a leader in general-purpose AI applications, with its market value growing rapidly. If we take a close look at Palantir's success, we can see that its core value does not lie in AI algorithms, but in the practical implementation of an underlying ontology.

The preceding figure shows an ontology similar to that of Palantir, illustrating the capabilities of an observability-driven digital O&M platform centered on UModel.

Inspired by the design philosophy of Palantir, UModel uses a graph model to unify the three core elements of the IT world:

1. Data layer: covers global entities and observability data, with a strong focus on various types of relationships.
2. Knowledge layer: accumulates domain expertise, including O&M knowledge, analysis templates, and domain-specific models.
3. Action layer: defines actions that support automated decision execution.
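
As a rough illustration of how a single graph can hold all three layers, the sketch below tags each node with a layer and groups an entity's direct neighbors by layer. The node names, the `LAYERS` tuple, and the `context_for` helper are hypothetical and are not part of UModel.

```python
# Illustrative only: a tiny graph whose nodes each belong to one of three layers.
LAYERS = ("data", "knowledge", "action")

# Node name -> layer. All names are invented for this example.
nodes = {
    "order-service": "data",          # an entity
    "order-service-logs": "data",     # observability data for the entity
    "latency-runbook": "knowledge",   # O&M knowledge
    "restart-instance": "action",     # an executable action
}

# Directed edges (source, target) connecting the entity to the other nodes.
edges = [
    ("order-service", "order-service-logs"),
    ("order-service", "latency-runbook"),
    ("order-service", "restart-instance"),
]


def context_for(entity):
    """Group everything directly linked from `entity` by layer."""
    context = {layer: [] for layer in LAYERS}
    for source, target in edges:
        if source == entity:
            context[nodes[target]].append(target)
    return context


print(context_for("order-service"))
```

Starting from one entity, a caller gets its data, the knowledge for interpreting that data, and the actions available, which is the kind of closed loop the three layers are meant to support.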

Why UModel Is Practical and Sustainable
The idea of modeling is not new. Almost any enterprise can propose a vision of a unified data model. However, when such an idea is applied in real enterprise environments and expected to deliver long-term value, it often falls into common traps: technology-first mindset, project-driven delivery, treating data lakes as data capability, and focusing on analysis while neglecting actions. These traps may eventually lead to issues such as data silos and cognitive gaps, insights disconnected from actions, and intelligent technologies detached from real business needs.

UModel could be released and adopted because it systematically implements the full chain from concept to practice.

● Business value-oriented system design: The Alibaba Cloud observability team serves hundreds of thousands of customers, and the observability domain itself is highly fragmented. From the beginning, UModel was designed based on the core pain points of real observability scenarios. Modeling was carried out to resolve concrete issues such as O&M efficiency, troubleshooting, and system optimization, rather than just finishing the modeling job.

● Building of an ecosystem that can continuously evolve: UModel is not a "one-time delivery" project. Our goal is to build an ecosystem with evolution capabilities. The development stages described earlier are the results of this evolution. Backed by a meta model architecture, new entity types, data types, and relationship types can be smoothly extended without impacting existing systems.

● A dual flywheel that delivers both customer value and platform capabilities: The team focuses on building a closed loop for customers that covers data retrieval, value extraction, decision-making, execution, and validation. At the same time, the team continuously enhances the observability platform capabilities across UModel, agents, and domain-specific models, forming a reinforced dual flywheel that delivers both customer value and platform capabilities.

UModel Overview: Build the IT World with Sets and Links
The design of UModel is aligned with the concept of ontology in information science: it uses graphs to describe objects and relationships, and fields to constrain semantics and data quality.

Core Set Types

Core Link Types

Typical Example: An Application Calls a Database
1. Define two entity sets, Service and DB, and create a calls relationship of Service → DB.
2. Define Telemetry data sets, including ServiceLogs, ServiceMetrics, and Traces, to describe service data and data generated by calls.
3. Create data links between the Telemetry data sets and entity sets.
4. Define concrete instances and their storage and storage links.
5. Define runbooks for Service and DB, together with the related analysis templates and domain knowledge.

Based on this modeling, we can directly identify the Service and DB instances in the system, the data they generate, the storage locations and field semantics, and the correlated runbooks, analysis templates, and domain knowledge. This enables unified analysis by humans, programs, and LLMs.
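
The five steps above can be sketched with a toy in-memory registry. This is a minimal sketch, not the UModel schema or API: the `Model` class, the link kind names (`calls`, `data_link`, `storage_link`, `runbook_link`), the link directions, and the store and runbook names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Toy registry of sets and links; not the real UModel schema or API."""
    sets: dict = field(default_factory=dict)   # set name -> kind
    links: list = field(default_factory=list)  # (link kind, source, target)

    def add_set(self, kind, name):
        self.sets[name] = kind

    def add_link(self, kind, source, target):
        # Reject links whose endpoints were never defined as sets.
        for endpoint in (source, target):
            if endpoint not in self.sets:
                raise KeyError(f"unknown set: {endpoint}")
        self.links.append((kind, source, target))

    def targets(self, kind, source):
        """Names of sets linked from `source` by links of the given kind."""
        return [t for k, s, t in self.links if k == kind and s == source]


model = Model()

# Step 1: entity sets and the calls relationship Service -> DB.
model.add_set("entity_set", "Service")
model.add_set("entity_set", "DB")
model.add_link("calls", "Service", "DB")

# Step 2: telemetry data sets.
for name in ("ServiceLogs", "ServiceMetrics", "Traces"):
    model.add_set("telemetry_data_set", name)

# Step 3: data links from entity sets to telemetry data sets.
for name in ("ServiceLogs", "ServiceMetrics", "Traces"):
    model.add_link("data_link", "Service", name)
model.add_link("data_link", "DB", "Traces")

# Step 4: storage and storage links (store names are made up).
for telemetry, store in (("ServiceLogs", "log-store"),
                         ("ServiceMetrics", "metric-store"),
                         ("Traces", "trace-store")):
    model.add_set("storage", store)
    model.add_link("storage_link", telemetry, store)

# Step 5: runbooks attached to the entity sets.
for entity in ("Service", "DB"):
    runbook = f"{entity}Runbook"
    model.add_set("runbook", runbook)
    model.add_link("runbook_link", entity, runbook)


def describe(entity):
    """Collect what the model knows about one entity set."""
    data_sets = model.targets("data_link", entity)
    return {
        "calls": model.targets("calls", entity),
        "data": data_sets,
        "storage": [s for d in data_sets
                    for s in model.targets("storage_link", d)],
        "runbooks": model.targets("runbook_link", entity),
    }


print(describe("Service"))
```

A single call such as `describe("Service")` returns the dependencies, data sets, storage locations, and runbooks for an entity set, so a human, a program, or an LLM can all start from the same structured answer.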

UModel as an Engineering System: From Concepts to System Engineering
UModel is not just a static definition, but a complete engineering system.

● Meta model architecture: UModel does not directly define concepts such as field, data set, or entity link. Instead, it provides a meta model. Based on the meta model, basic fields are defined and associated by using combination, reference, and inheritance to generate concrete UModel definitions. This ensures scalability. The overall evolution follows the idea of "The Tao gives birth to One. One gives birth to Two. Two gives birth to Three. Three gives birth to all things."

● Standardization and automation: Standardized UModel definitions for observability are provided to greatly lower the barrier for adoption. An internal common schema governance mechanism ensures high-quality model evolution, and automation capabilities are provided for entity relationship generation, data synchronization, and metadata updates, improving operability.

● Technical infrastructure: Storage and compute engines are specifically designed for entities and relationships in observability scenarios. The engines support fast backtracking to system status at any point in time and enable high-performance graph queries, relationship computation, and multidimensional analysis. A pipeline-based, high-performance Structured Process Language (SPL) unified computing framework is also provided to jointly process UModel data, entities, relationships, and observability data.

● Object-oriented analysis: Analysis evolves from traditional data-oriented syntax to an object-oriented, programmatic pattern, supporting advanced features such as runtime polymorphism and dynamic method invocation. This makes analysis closer to object interaction patterns in the real world.

● Artificial general intelligence (AGI)-oriented capabilities: UModel integrates AI-oriented core capabilities such as semantic search and graph retrieval-augmented generation (GraphRAG). At upper layers, Model Context Protocol (MCP) and PaaS APIs are provided, with LLM-friendly designs considered from all aspects.
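
To make the object-oriented analysis idea in the list above more concrete, here is a small Python sketch of runtime polymorphism and dynamic method invocation over entity types. The class names, check names, and returned strings are invented for illustration; UModel's actual object-oriented query syntax is not shown in this article.

```python
class Entity:
    """Base class for all modeled entities (hypothetical)."""

    def __init__(self, name):
        self.name = name

    def checks(self):
        # Names of the analysis methods that apply to this entity type.
        return ["check_alerts"]

    def check_alerts(self):
        return f"{self.name}: review active alerts"


class Service(Entity):
    def checks(self):
        # Extend the inherited checks with a service-specific one.
        return super().checks() + ["check_error_rate"]

    def check_error_rate(self):
        return f"{self.name}: compare error rate with baseline"


class Database(Entity):
    def checks(self):
        # Extend the inherited checks with a database-specific one.
        return super().checks() + ["check_slow_queries"]

    def check_slow_queries(self):
        return f"{self.name}: inspect slow query log"


def analyze(entity):
    """Run every check the entity declares, resolving methods by name at run time."""
    return [getattr(entity, check)() for check in entity.checks()]


for entity in (Service("order-service"), Database("orders-db")):
    for line in analyze(entity):
        print(line)
```

The caller of `analyze` never branches on the entity type: each type decides which checks apply, and supporting a new type does not require changing the analysis code.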

In practical implementation, we further enhance usability and production capabilities.

● Common schema: Common data structures are accumulated for Alibaba Cloud services, Kubernetes, and application performance management (APM), and compatibility with standards such as OpenTelemetry is maintained.

● Automatic entity relationship generation: Based on prior knowledge and computing frameworks, cross-domain relationships are automatically generated from application and business views to resource and cloud service views.

● Analysis best practices: Best practices are attached to Telemetry data sets, including out-of-the-box dashboards and alerts.
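
One simple way to picture automatic entity relationship generation is as rule-based matching on shared field values, where the rules encode prior knowledge. The sketch below assumes that approach; the rule format, entity fields, and `runs_on` relation name are hypothetical, and the production mechanism may work quite differently.

```python
def generate_relations(entities, rules):
    """Derive (source_id, relation, target_id) triples by matching field values.

    Each rule is (source_type, source_field, relation, target_type, target_field).
    """
    relations = []
    for source_type, source_field, relation, target_type, target_field in rules:
        # Index target entities by the value of the field being matched.
        index = {}
        for entity in entities:
            if entity["type"] == target_type and entity.get(target_field) is not None:
                index.setdefault(entity[target_field], []).append(entity["id"])
        # Look up each source entity's field value in that index.
        for entity in entities:
            if entity["type"] != source_type:
                continue
            key = entity.get(source_field)
            for target_id in index.get(key, []):
                relations.append((entity["id"], relation, target_id))
    return relations


# Made-up entities spanning the application view and the resource view.
entities = [
    {"type": "service_instance", "id": "order-service-1", "pod_name": "order-7f9c"},
    {"type": "pod", "id": "pod/order-7f9c", "name": "order-7f9c", "node_name": "node-1"},
    {"type": "node", "id": "node/node-1", "name": "node-1"},
]

# Prior knowledge expressed as matching rules.
rules = [
    ("service_instance", "pod_name", "runs_on", "pod", "name"),
    ("pod", "node_name", "runs_on", "node", "name"),
]

for relation in generate_relations(entities, rules):
    print(relation)
```

Chaining the two generated relations connects an application-layer service instance to the node it ultimately runs on, without anyone declaring that relationship by hand.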

Observability Applications and Practices Based on UModel
The preceding figure shows a typical UModel configuration. In the system, a series of entities are defined, including application-layer services and service instances, Kubernetes-related entities, underlying databases and hosts, and jobs, repositories, and operators related to continuous integration and continuous delivery (CI/CD). At the same time, the observability data corresponding to these entities is also modeled separately. (Due to space limits, related explorers are not described in this article. In practice, standard explorers such as Kubernetes, MySQL, and service dashboards are also defined for quick data access, along with alert configurations.) All of these entities and instances are connected in the form of a graph.

Based on this UModel configuration, we can construct a concrete entity topology, as shown in the following figure (only some sample entities are displayed due to space limits). When an alert is triggered, multiple troubleshooting methods are available:

1. Step-by-step tracing based on entity relationships: Each entity type has corresponding observability data, allowing analysis by following the graph.
2. Focused troubleshooting based on prior knowledge: For example, when a service becomes abnormal, we can first check the calls to its dependent services, databases, and middleware.
3. Change-based analysis: As most incidents are caused by changes, recent service changes can be examined first, prioritizing the services involved in those changes.
4. Alert correlation across entities: During incidents, alerts are usually generated by multiple entities. We can correlate all alerting entities and draw the minimal connecting subgraph to help with troubleshooting.

All of these troubleshooting methods, regardless of whether they are performed by humans, programs, or LLMs, can work smoothly based on the standardized definition provided by UModel, fully unlocking the real value of observability data.
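
The alert correlation method (item 4 above) can be approximated with a simple graph heuristic. Finding a truly minimal connecting subgraph is the Steiner tree problem, which is NP-hard in general, so the sketch below instead takes the union of breadth-first shortest paths from one alerting entity to the others. The topology and entity names are made up, and the result is connected but not guaranteed to be minimal.

```python
from collections import deque


def shortest_path(adj, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor in sorted(adj.get(node, ())):  # sorted for determinism
            if neighbor not in prev:
                prev[neighbor] = node
                queue.append(neighbor)
    return None


def connecting_subgraph(edges, alerting):
    """Heuristic: union of shortest paths from one alerting node to the rest."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    alerting = sorted(alerting)
    if not alerting:
        return [], []
    nodes, kept = {alerting[0]}, set()
    for target in alerting[1:]:
        path = shortest_path(adj, alerting[0], target)
        if path is None:
            nodes.add(target)  # unreachable: keep it as an isolated node
            continue
        nodes.update(path)
        kept.update(frozenset(pair) for pair in zip(path, path[1:]))
    return sorted(nodes), sorted(tuple(sorted(edge)) for edge in kept)


# Hypothetical entity topology (undirected for correlation purposes).
edges = [
    ("gateway", "order-service"),
    ("order-service", "mysql"),
    ("order-service", "redis"),
    ("order-service", "pod-a"),
    ("pod-a", "node-1"),
    ("mysql", "host-1"),
]

nodes, kept_edges = connecting_subgraph(edges, {"gateway", "mysql", "node-1"})
print(nodes)
print(kept_edges)
```

In this example the entities that connect the three alerting entities (`order-service` and `pod-a`) appear in the result, while unrelated neighbors such as `redis` and `host-1` are left out, which narrows where to look first.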

Observability Applications Built on UModel
Similar to the model in the model-view-controller (MVC) pattern, UModel plays the role of the model in our observability stack. On top of it, we have fully integrated the existing observability applications of Cloud Monitor, Application Real-Time Monitoring Service (ARMS), and Simple Log Service (SLS), forming a next-generation "Observability 2.0" that provides the following benefits:

● Covers all data types from the frontend to the gateway and the backend, from applications to middleware and infrastructure, and from O&M data to security and business data.

● Abstracts and models different types of entities and their relationships, and generates cross-domain entity relationships to fully connect all observability entities.

● Enables access to the observability "internet" from any page and in any scenario by using the unified UI of UModel.

● Decouples algorithms and models from specific data sources. Any data described by UModel is supported.

More than UModel: Toward AGI-driven Digital O&M
At its core, UModel represents a paradigm shift from passively collecting data to actively modeling the world. As AGI matures, O&M will evolve from the mode of humans managing machines to a new paradigm in which AI understands and optimizes the world. We expect this paradigm to bring the following capabilities:

● Full contextual awareness: instant understanding of the complete context of a system in any state.

● Proactive system optimization: shift from reactive alert handling to continuous, proactive optimization.

● Predictive architecture evolution: planning ahead based on business trends and automatically completing complex technical migrations.

We believe this vision will ultimately become reality.

"The best way to predict the future is to create it."— Peter Drucker